There is an inherent tension between providing statistical access to data and protecting individual privacy.
The computing community has been working on this problem for years and came up with differential privacy as a solution, which is being implemented in the 2020 census, as described in this Computing Community Consortium (CCC) white paper on Privacy-Preserving Data Analysis for the Federal Statistical Agencies, and this recent NY Times article. The CCC is now working on similar issues in fairness with a workshop on Fair Representations and Fair Interactive Learning. See the corresponding workshop report here.
The NY Times article, however, perhaps portrays a confusing picture with its title “To Reduce Privacy Risks, the Census Plans to Report Less Accurate Data.” Readers could take the story to mean that the Federal Government is ‘making up fake data or results’. However, we in the community know that there is an essential trade-off between privacy and accuracy, as the research on differential privacy shows. Later in the article, the authors do a good job describing it: “[differential privacy] determines the amount of randomness — “noise” — that needs to be added to a data set before it is released, and sets up a balancing act between accuracy and privacy. Too much noise would mean the data would not be accurate enough to be useful — in redistricting, in enforcing the Voting Rights Act or in conducting academic research. But too little, and someone’s personal data could be revealed.”
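To make the noise-versus-accuracy balancing act concrete, here is a minimal sketch of the Laplace mechanism, the classic building block of differential privacy. This is an illustration only, not the Census Bureau's actual disclosure avoidance system, which is far more sophisticated; the function names and the choice of a simple counting query are assumptions made for the example.

```python
import math
import random

def laplace_noise(scale):
    # Sample from a Laplace(0, scale) distribution using the
    # inverse-CDF method on a uniform draw in (-0.5, 0.5).
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one
    person changes the count by at most 1), so the noise scale
    is 1 / epsilon. Smaller epsilon means stronger privacy but
    noisier, less accurate published statistics.
    """
    return true_count + laplace_noise(1.0 / epsilon)
```

The single parameter `epsilon` is exactly the trade-off the article describes: lowering it adds more noise (protecting individuals), while raising it yields more accurate published counts (protecting utility for redistricting or research).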
Cynthia Dwork of Harvard University, a former CCC Council member and one of the inventors of differential privacy, and John Abowd, the Associate Director for Research Methodology at the Census Bureau and a co-author of the related CCC white paper, are both quoted in the article.
Dwork said that differential privacy is “tailored to the statistical analysis of large data sets.” Abowd explained in the article that the bureau will announce the trade-off it has chosen for data publications from the 2018 End-to-End Census Test it conducted in Rhode Island, the only dress rehearsal before the actual census in 2020.
Trade-offs between two laudable but contradictory goals are common in computer science, of course, as are outcomes that can sometimes seem counter-intuitive, especially to those coming from outside the field. Differential privacy is an important step forward in an era of big data where many of our daily activities can be captured and mined, and provides an educational opportunity for the computing research community as the general public begins to become aware of its use in applications such as the upcoming Census.