The big data revolution will require employment lawyers who can get “under the hood” of claims driven by big data analytics. Here are 10 questions that can help uncover error and bias in the work of data scientists.
Solon Barocas & Andrew D. Selbst argue (in a draft paper called Big Data’s Disparate Impact):
Where data is used predictively to assist decision-making, it can affect the fortunes of whole classes of people in consistently unfavorable ways. Sorting and selecting for the best or most profitable candidates means generating a model with winners and losers. If data miners are not careful, that sorting might create disproportionately adverse results concentrated within historically disadvantaged groups in ways that look a lot like discrimination.
Put differently by James Grimmelmann in “Discrimination by Database” (a summary of the Barocas & Selbst piece), “data mining requires human craftwork at every step.” And where there is human craftwork, bias and bigotry, whether intentional or not, can creep in.
The Ten Questions
(1) What question was asked? Finding out the exact question that was asked of the data can help understand how the answer was derived from the data. What was being sought and was it the best way to pose the question?
(2) Who was asking the questions? Was the data scientist tenacious (or creative) enough to pursue and craft the right questions? Josh Sullivan: “Fundamentally, what sets a great data scientist apart is fierce curiosity – it’s the X factor. You can teach the math and the analytical tools, but not the tenacity to experiment and keep working to arrive at the best question – which is virtually never the one you started out with.” (HBR’s Get the Right Data Scientists Asking the “Wrong” Questions.) Put another way (in Wired): “researchers have the ability to pick whatever statistics confirm their beliefs (or show good results) … and then ditch the rest. Big-data researchers have the option to stop doing their research once they have the right result.”
(3) How were the dataset(s) originally created, and for what purpose? What omissions or errors (including discriminatory assumptions) might have crept into the initial dataset in ways that could alter the outcome of the data mining process? Grimmelmann: “almost every interesting dataset is tainted by the effects of past discrimination.”
(4) How were the datasets selected? Who selected them? On what basis? What sets were ignored? What was the initial hypothesis being tested that drove the assembly of the particular datasets?
(5) Do the datasets accurately represent the population being studied? Barocas and Selbst: “If a sample includes a disproportionate representation of a particular class (more or less than its actual incidence in the overall population), the results of an analysis of that sample may skew in favor or against the over or under-represented class.”
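The skew Barocas and Selbst describe can be checked directly. The sketch below (with hypothetical data and a made-up helper name) compares each class’s share of a sample against its known rate in the overall population:

```python
# Hypothetical sketch: flag over- or under-representation of a class
# in a training sample relative to known population rates.
from collections import Counter

def representation_gap(sample_labels, population_rates):
    """Return each class's sample share minus its population rate."""
    counts = Counter(sample_labels)
    total = len(sample_labels)
    return {cls: counts.get(cls, 0) / total - rate
            for cls, rate in population_rates.items()}

# Hypothetical example: women are 50% of the relevant labor pool
# but only 30% of the dataset the model was trained on.
sample = ["M"] * 70 + ["F"] * 30
gaps = representation_gap(sample, {"M": 0.50, "F": 0.50})
print(gaps)  # positive gap = over-represented, negative = under-represented
```

A gap of -0.20 for a protected class, as here, is exactly the kind of under-representation that can tilt a model’s results against that class.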
(6) What data was missing? Was data missing from the datasets, whether through loss or non-collection, and how did the data scientist handle its absence? Put differently: “It is important to understand what variables are more or less likely to be missing, to define a priori an acceptable percent of missing data for key data elements required for analysis, and to be aware of the efforts an organization takes to minimize the amount of missing information.” Consider also Ray Poynter’s perspective:
One of the issues about the scale of Big Data is that it can blind people to what is not being measured. For example, a project might collect a respondent’s location through every moment of the day, their online connections, their purchases, and their exposure to advertising. Surely that is enough to estimate their behavior? Not if their behavior is dependent on things such as their childhood experiences, their genes, conversations overheard, behavior seen etc.
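The “define an acceptable percent of missing data a priori” step quoted above is straightforward to operationalize. A minimal sketch, with hypothetical field names, records, and threshold:

```python
# Hypothetical sketch: measure per-field missingness against a threshold
# that was fixed before the analysis began.
ACCEPTABLE_MISSING = 0.05  # assumed 5% threshold, set a priori

records = [
    {"zip": "10001", "salary": 52000, "tenure": 3},
    {"zip": None,    "salary": 61000, "tenure": None},
    {"zip": "10002", "salary": None,  "tenure": 7},
    {"zip": "10003", "salary": 58000, "tenure": 2},
]

for field in ("zip", "salary", "tenure"):
    missing = sum(1 for r in records if r[field] is None)
    rate = missing / len(records)
    flag = "REVIEW" if rate > ACCEPTABLE_MISSING else "ok"
    print(f"{field}: {rate:.0%} missing ({flag})")
```

Counsel can then ask why any field flagged for review was nonetheless relied on, and how (or whether) the missing values were imputed.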
(7) What data was discarded? Managing outliers, ignoring discrepancies, and discarding data points are common issues: data scientists should be asked why any data points were discarded, to determine whether preconceived notions about the patterns in the data led them to ignore an important divergence in the data values. See Dan Power’s analysis.
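One way to make discarded points defensible is to flag them with a stated, conventional rule rather than ad hoc judgment. The sketch below (hypothetical scores) applies the common 1.5×IQR fence so that every excluded point can be listed and justified:

```python
# Hypothetical sketch: flag outliers with the 1.5 * IQR rule so that
# discarded points are documented, not silently dropped.
import statistics

def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

scores = [71, 74, 75, 76, 78, 79, 80, 81, 83, 140]
print(iqr_outliers(scores))  # → [140]
```

The point for counsel is not the particular rule but that *some* rule was fixed in advance; an expert who cannot name one may have discarded whichever points spoiled the expected pattern.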
(8) Were proper proxies used? For instance, we know that zip code is a poor proxy for ability to pay a mortgage and that a timed run is a bad proxy for being a good firefighter (while being required to drag a 125-pound dummy approximately 30 feet along a zigzag course to a designated area in 36 seconds while crawling probably is a good one).
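A basic sanity check on any proxy is whether it actually tracks the outcome it stands in for. A sketch with toy, hypothetical numbers (a real analysis would also test whether the proxy tracks protected-class membership):

```python
# Hypothetical sketch: Pearson correlation between a candidate proxy
# and the outcome it is supposed to predict.
import statistics

def pearson_r(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

# Toy data: timed-run seconds vs. on-the-job firefighting ratings.
run_times = [58, 61, 64, 66, 70, 75]
job_ratings = [88, 90, 71, 85, 79, 83]
print(round(pearson_r(run_times, job_ratings), 2))
```

A proxy with correlation near zero is doing little predictive work, and its inclusion deserves scrutiny: it may be smuggling in something else, such as class membership.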
(9) Was the model trained properly?
(a) Was the model used to examine the data trained on the correct data points? Grimmelmann: “to learn who is a good employee, an algorithm needs to train on a dataset in which a human has flagged employees as ‘good’ or ‘bad,’ but that flagging process in a very real sense defines what it means to be a ‘good’ employee.”
(b) Was the tested data sufficiently close to the training data? Microsoft: “By using similar data for training and testing, you can minimize the effects of data discrepancies and better understand the characteristics of the model.”
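The practice question 9(b) describes is the ordinary train/test split: the model is fit on one portion of the data and evaluated on another, and the two portions should look statistically similar. A minimal sketch with stand-in data:

```python
# Hypothetical sketch: a reproducible 80/20 train/test split,
# plus one quick check that the two splits look alike.
import random

random.seed(0)                 # fixed seed so the split can be reproduced
data = list(range(100))        # stand-in for 100 labeled records
random.shuffle(data)

cut = int(len(data) * 0.8)
train, test = data[:cut], data[cut:]

def mean(xs):
    return sum(xs) / len(xs)

print(len(train), len(test))   # 80 20
print(round(mean(train), 1), round(mean(test), 1))  # should be close
```

If an expert cannot reproduce the split, or the test set differs markedly from the training set, the reported model performance may not be trustworthy.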
(10) What alternative techniques could have been applied to the datasets? Dan Power: “There are many data mining tools and each tool serves a somewhat different but often complementary purpose.” Given the availability of multiple products, tools and techniques, the data scientist should explain what she used and whether other approaches would have yielded different results.
Bonus 11th Question: Is the data quality high enough? Is it relevant, complete, correct, well-structured, sound, and timely?
- Relevant: Is the data well-targeted to answer the question? Were there improper extrapolations (that is, was data collected for one purpose used for another)?
- Complete: Is all the needed data there? Are there any key gaps?
- Correct/Accurate: Is the data accurate, in that it captures the real-world values it is supposed to represent (see the discussion of proxies above)? Is the data granular enough, that is, sufficiently detailed to make the point asserted? Are other factors or details missing that could provide an alternative explanation of the results? Is any of the data internally inconsistent, and if so, what explains it?
- Well-Structured: Are there any structural ambiguities in the data?
- Sound: Is the data trustworthy? Was it rigorously assembled?
- Timely: Is the data up-to-date? Does it adequately represent the history of the matter studied (is there adequate retention)? Are there any gaps in the record?
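Checklists like the one above can be turned into executable validation rules, which is how data-quality audits are typically run. A sketch in which the field names, ranges, and retention window are all hypothetical:

```python
# Hypothetical sketch: completeness, correctness, and timeliness checks
# on a single personnel record. Field names and rules are invented.
import datetime

AS_OF = datetime.date(2015, 1, 1)  # assumed audit date

def quality_report(record):
    issues = []
    # Complete: every required field is present and non-empty.
    for field in ("employee_id", "hire_date", "performance_score"):
        if not record.get(field):
            issues.append(f"missing {field}")
    # Correct: values fall in a plausible range.
    score = record.get("performance_score")
    if score is not None and not (0 <= score <= 100):
        issues.append("performance_score out of range")
    # Timely: the record was updated within the last year.
    updated = record.get("last_updated")
    if updated and (AS_OF - updated).days > 365:
        issues.append("stale record")
    return issues

rec = {"employee_id": "E17", "hire_date": "2012-03-01",
       "performance_score": 104, "last_updated": datetime.date(2013, 6, 1)}
print(quality_report(rec))  # → ['performance_score out of range', 'stale record']
```

Asking for the equivalent of such a report, and for what was done with records that failed it, gives counsel a concrete handle on the “quality” question.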
See Doug Vucevic, Wayne Yaddow, Testing the Data Warehouse Practicum: Assuring Data Content, Data Structures and Quality.
These eleven items are clearly not the only questions one can or should ask. But they are a starting point. I welcome suggestions for additional questions.