Takeaways from the White House Big Data Reports

On May 1, the White House released its two eagerly awaited reports on “big data” resulting from the 90-day study President Obama announced on January 17: one by a team led by Presidential Counselor John Podesta, and a complementary study by the President’s Council of Advisors on Science and Technology (PCAST).  The reports contain valuable detail about the uses of big data in both the public and private sectors.  At the risk of oversimplifying, I see three major takeaways from the reports.

First, the reports recognize big data’s enormous benefits and potential.  Indeed, the Podesta report starts out by observing that “properly implemented, big data will become an historic driver of progress.”  It adds, “Unprecedented computational power and sophistication make possible unexpected discoveries, innovations, and advancements in our quality of life.”  The report is filled with examples of the value of big data in medical research and health care delivery, education, homeland security, and fraud detection; in improving efficiency and reducing costs across the economy; and in providing targeted information to consumers and the raw material for the advertising-supported internet ecosystem.  The report states that the “Administration remains committed to supporting the digital economy and the free flow of data that drives its innovation.”

Second, neither report provides any actual evidence of harms from big data.  While the reports provide concrete examples of beneficial uses of big data, the harmful uses are hypothetical.  Perhaps the most publicized conclusion of the Podesta report concerns the possibility of discrimination—that “big data analytics have the potential to [italics added] eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace.”  However, the two examples of discrimination cited turn out to be almost non-examples.

The first example involves StreetBump, a mobile application developed to collect information about potholes and other road conditions in Boston.  Even before its launch, the city recognized that this app, by itself, would be biased toward identifying problems in wealthier neighborhoods, because wealthier individuals would be more likely to own smartphones and use the app.  As a result, the city adjusted accordingly to ensure that reporting of road conditions was accurate and consistent throughout the city.

The second example involves the E-Verify program used by employers to check whether employees are eligible to work legally in the United States.  The report cites a study that “found the rate at which U.S. citizens have their authorization to work initially erroneously unconfirmed by the system was 0.3 percent, compared to 2.1 percent for non-citizens.  However, after a few days many of these workers’ status was confirmed.”  It seems almost inevitable that the error rate for citizens would be lower, since citizens are automatically eligible to work, whereas additional information (i.e., evidence of some sort of work permit) is needed to confirm eligibility for non-citizens.  Hence, it is not clear this is an example of discrimination.

It is notable that both these examples are of government activities.  The reports do not present examples of commercial uses of big data that discriminate against particular groups.  To the contrary, the PCAST report notes the private-sector use of big data to help underserved individuals with loan and credit-building alternatives.

Finally, and perhaps most importantly, both reports indicate that the Fair Information Practice Principles (FIPPs) that focus on limiting data collection are increasingly irrelevant and, indeed, harmful in a big data world.  The Podesta report observes that “these trends may require us to look closely at the notice and consent framework that has been a central pillar of how privacy practices have been organized for more than four decades.”  The PCAST report notes, “The beneficial uses of near-ubiquitous data collection are large, and they fuel an increasingly important set of economic activities.  Taken together, these considerations suggest that a policy focus on limiting data collection will not be a broadly applicable or scalable strategy—nor one likely to achieve the right balance between beneficial results and unintended negative consequences (such as inhibiting economic growth).”  The Podesta report suggests examining “whether a greater focus on how data is used and reused would be a more productive basis for managing privacy rights in a big data environment.”  The PCAST report is even clearer:

Policy attention should focus more on the actual uses of big data and less on its collection and analysis.  By actual uses, we mean the specific events where something happens that can cause an adverse consequence or harm to an individual or class of individuals….  By contrast, PCAST judges that policies focused on the regulation of data collection, storage, retention, a priori limitations on applications, and analysis…are unlikely to yield effective strategies for improving privacy.  Such policies would be unlikely to be scalable over time, or to be enforceable by other than severe and economically damaging measures.

In sum, there is much to like in the two reports:  their acknowledgement of the importance and widespread use of big data and their attempt, particularly in the PCAST report, to refocus the policy discussion in a more productive direction.  The reports suffer, however, from a lack of evidence to substantiate their claims of harm.
