🔬 Research summary by Laird Gallagher, an editor and public policy researcher currently pursuing a master’s degree in applied social research at the City University of New York-Hunter College.
[Original paper by danah boyd]
Overview: Due to advancements in computational power and the increased availability of commercial data, the traditional privacy protections used by the U.S. Census Bureau are no longer effective in preventing the mass reconstruction and reidentification of confidential data. In this paper, danah boyd explores the bureau’s response for the 2020 Census: a new disclosure avoidance system built on “differential privacy,” a framework that imposes a mathematical trade-off between data utility and privacy. But the opaque manner in which the bureau has rolled out the changes risks undermining trust between the bureau and the diverse stakeholders who use Census data in policymaking, research, and advocacy.
Introduction
Even before COVID-19 had taken hold in the United States, the 2020 Census was off to a rocky start, with a majority of U.S. adults mistakenly believing the form contained a citizenship question. Then, the pandemic upended the Census Bureau’s normal operations and complicated efforts to ensure an accurate count.
However, barriers to enumeration are not the only challenges faced by the Census this year. Changes in the data and computing landscape over the past decade have made it much easier to reconstruct and reidentify confidential information from Census data products. To respond to those threats, the Census has implemented an entirely new “disclosure avoidance system” (DAS). The system works by introducing noise, or mathematical randomness, into the calculations used to generate data products.
But where and how much noise you inject matters. As danah boyd documents in Balancing Data Utility and Confidentiality in the 2020 US Census, the DAS requires a system-wide balance of privacy risk, which means that making certain statistical tables more accurate in turn requires others to include more noise. These trade-offs have widespread implications for the utility of data that stakeholders in government, academia, and the nonprofit and business sectors have come to rely on.
Key Insights
How the Census constructs data products
Since 1790, every ten years the U.S. government has conducted a census of all people living in the country. This decennial count determines the apportionment of legislative representation and the fair allocation of federal funding and resources. But the process also generates powerful data products used by policymakers, social science researchers, and others. In order to protect individual privacy, the Census does not release the full underlying data for 72 years. Instead, it releases aggregated and anonymized data products that ensure confidentiality while still providing valuable demographic information.
Through self-response and follow-up operations, the Census collects basic data about households: the type of housing unit; its ownership status; and the name, date of birth, sex, race, and Hispanic origin of everyone living there. After resolving addresses, this data becomes the “Census Unedited File” (CUF), which is used to calculate the population of each state and thus determine apportionment in the U.S. House of Representatives. Afterward, the Census resolves missing and conflicting demographic data, using statistical models to fill every cell with a value and produce the “Census Edited File” (CEF).
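To make that editing step concrete, here is a minimal sketch of hot-deck imputation, one standard way statistical agencies fill missing values. The bureau’s actual edit-and-imputation models are more involved; the records and field names below are hypothetical.

```python
import random

# Hypothetical household records; one respondent left age blank.
records = [
    {"age": 34, "sex": "F", "race": "White"},
    {"age": 61, "sex": "M", "race": "Black"},
    {"age": None, "sex": "F", "race": "Asian"},  # missing after collection
]

def hot_deck_impute(records, field):
    """Fill missing values for `field` by borrowing from a reporting record."""
    donors = [r[field] for r in records if r[field] is not None]
    for r in records:
        if r[field] is None:
            r[field] = random.choice(donors)  # a randomly chosen donor value
    return records

hot_deck_impute(records, "age")  # every cell now holds a value
```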
Then come the measures to avoid disclosure. Before this year, the Census would swap households from one location to another to obscure whether a record matches its real location. In addition, the Census would simply suppress certain information about subpopulations that would disclose too much detail. After recoding and quality assurance, these privacy-protected tabulations (the “Hundred-percent Detail File”) would be released to the public as a series of data products. But swapping and suppressing are no longer enough.
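A rough sketch of the swapping idea, with illustrative fields and a made-up selection rule (the bureau’s real pairing criteria and swap rates are not public):

```python
import random

households = [
    {"id": 1, "block": "A", "size": 3},
    {"id": 2, "block": "B", "size": 3},
    {"id": 3, "block": "A", "size": 5},
    {"id": 4, "block": "C", "size": 5},
]

def swap_households(households, swap_rate=0.5):
    """Randomly exchange locations between similar households."""
    by_size = {}
    for h in households:
        by_size.setdefault(h["size"], []).append(h)  # pair like with like
    for group in by_size.values():
        random.shuffle(group)
        for a, b in zip(group[::2], group[1::2]):
            # Swap geographies so a matched record may not be where it seems.
            if a["block"] != b["block"] and random.random() < swap_rate:
                a["block"], b["block"] = b["block"], a["block"]
    return households
```

Because only the geography moves, tabulations at higher levels stay exact while block-level matches become unreliable.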
Why a new system to protect privacy?
Due to increases in computing power, it is now much easier for attackers to rebuild individual records out of aggregate data. They do this by triangulating across statistical tables to determine which attributes likely belong to which individuals, yielding a reconstructed list of individuals matched to attributes like race, sex, and Census block. From there, an attacker can use external data sources, including widely available commercial data, to link these anonymized yet reconstructed records and re-identify individuals by name and other characteristics.
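A toy illustration of the triangulation step, assuming one tiny block and a handful of hypothetical published statistics. Real attacks solve the same kind of constraints at scale, typically as integer programs over thousands of table cells.

```python
from itertools import combinations_with_replacement, product

# Every possible (age, sex) record for residents of the block.
domain = list(product(range(80), "FM"))

# Hypothetical published statistics for a 3-person block, each expressed
# as a constraint the true microdata must satisfy. (The female-median
# check relies on the 2-female constraint already having been applied.)
published = [
    ("mean age 30.0",          lambda db: sum(a for a, _ in db) == 90),
    ("median age 25",          lambda db: sorted(a for a, _ in db)[1] == 25),
    ("2 female residents",     lambda db: sum(s == "F" for _, s in db) == 2),
    ("1 resident under 18",    lambda db: sum(a < 18 for a, _ in db) == 1),
    ("female median age 21.0", lambda db: sum(a for a, s in db if s == "F") == 42),
]

# Intersect the constraints: ~700,000 candidate databases collapse to one.
candidates = combinations_with_replacement(domain, 3)
for name, consistent in published:
    candidates = [db for db in candidates if consistent(db)]
    print(f"after '{name}': {len(candidates)} candidate databases")

print(candidates)  # [((17, 'F'), (25, 'F'), (48, 'M'))]
```

Each additional table eliminates candidates; with enough tables, only the true records remain, and no name was needed to recover them.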
boyd explains that while reconstruction, linkage, and reidentification attacks were once theoretical, they are no longer. “Using the published available statistical tables from only the 2010 decennial census, researchers at the bureau reconstructed a complete set of individual records that could effectively serve as a complete microdata file down to the block level,” she writes. Due to swapping and other measures, the reconstructed set did not fully match the unprotected, edited files, but fully 46 percent of individual records were perfect matches. Allowing the age variable to vary by +/- one year pushed the match rate to 71 percent. From this reconstructed data, bureau researchers were able to re-identify (and confirm) 17 percent of individual records: tens of millions of U.S. residents.
And that was with 2010 data. Since then, the availability of commercial data has grown dramatically. It became clear to bureau researchers that far more than 17 percent of records could be exposed if they stuck to their standard practices of swapping and suppressing. They would either have to release far fewer data products or implement a new system of privacy protection.
Balancing confidentiality and accuracy
For the 2020 Census, the bureau has decided to implement a new “Disclosure Avoidance System” built on the principles of differential privacy. According to boyd:
“Differential privacy works to prevent accurate reconstruction attacks while making certain that the data are still useful for statistical analyses. It does this by injecting a controlled amount of uncertainty in the form of mathematical randomness, also called noise, into the calculations that are used to produce data products. The range of noise can be shared publicly because an attacker cannot know exactly how much noise was introduced into any particular table. With differential privacy it is still possible to reconstruct a database, but the database that is reconstructed will include privacy-ensuring noise. In other words, the individual records become synthetic byproducts of the statistical system.”
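To see the core move in miniature, here is a sketch of the classic Laplace mechanism for a single count query. The 2020 DAS actually uses a more elaborate top-down algorithm with discrete noise distributions, so treat this only as an illustration of the principle.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def laplace_count(true_count, epsilon):
    """Release a count under epsilon-differential privacy.

    Adding or removing one person changes any count by at most 1 (the
    query's sensitivity), so Laplace noise with scale 1/epsilon suffices
    for this single release.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon means more noise: more privacy, less accuracy.
for eps in (0.1, 1.0, 10.0):
    print(eps, round(laplace_count(1_000, eps), 1))
```

As boyd’s quote notes, the noise scale can be published without harm; what must stay secret is the particular draw added to any given count.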
The problem is that this system involves choices about where to introduce noise, and how much. To stay within a fixed privacy-loss budget, designers must allocate noise throughout the data, prioritizing the accuracy of certain statistical tables over others: accuracy gained in one table is paid for with noise in another. And as a consequence of how this top-down algorithmic approach works, it would, without additional processing, create undesirable outcomes like geographic inconsistencies, partial people, and negative people. The need to perform a post-processing cleanup is primarily political, according to boyd. Laws around redistricting require the Census to prioritize making block-level data consistent and to ensure the data consists only of non-negative integers (no negative or fractional counts of people). But this post-processing generates all sorts of statistical oddities entirely unrelated to privacy.
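A crude sketch of that cleanup, assuming hypothetical noisy block counts that must become non-negative integers summing to a fixed tract total. The real DAS solves this as a constrained optimization problem at every level of the geographic hierarchy.

```python
import numpy as np

# Hypothetical noisy block counts for one tract whose (already-noised)
# total has been fixed at 100 higher up the hierarchy.
noisy_blocks = np.array([41.7, -2.3, 38.9, 24.1])
tract_total = 100

# Step 1: clip away "negative people".
nonneg = np.clip(noisy_blocks, 0, None)

# Step 2: rescale to the tract total, then round away "fractional people",
# handing any leftover units to the blocks with the largest remainders.
scaled = nonneg * tract_total / nonneg.sum()
blocks = np.floor(scaled).astype(int)
shortfall = tract_total - blocks.sum()
blocks[np.argsort(scaled - np.floor(scaled))[::-1][:shortfall]] += 1

print(blocks, blocks.sum())  # [40  0 37 23] 100
```

Even this toy version shows where the oddities come from: clipping negative counts to zero systematically shifts population toward small blocks, a bias introduced by the legal constraints rather than by the privacy noise itself.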
Communication breakdown
The Census Bureau’s announcement of a new disclosure avoidance system in late 2018 caught many data users and advocates by surprise. The lack of education on how differential privacy works and why it is necessary left many stakeholders confused and frustrated. The new approach to protecting confidentiality required all data uses to be determined in advance so that the noise could be best allocated throughout the statistical tables, but most Census data users had never approached their work in this way. In addition, users didn’t always understand why a new approach to privacy was even needed. And unlike the computer scientists who devised the disclosure avoidance system, they often lacked the skill set to analyze and comment on it. Data users became increasingly worried as they explored how the injection of noise could affect the reliability of their own scientific work.
Possible solutions
How can the bureau best maximize data utility while minimizing privacy loss? boyd recommends several ways to relax the constraints on the data.
First is to reduce geographic precision. If the Census stopped publishing block-level data, more of the privacy budget could be spent elsewhere. Unfortunately, federal law dictates that the Census must produce redistricting files with block-level counts.
That fix is out of the question without an unlikely congressional intervention, so boyd instead suggests publishing “pre-post-processed data”, the noisy counts before cleanup, so that users can get acclimated to negative counts, fractional people, and more. Because the protection comes from the noise itself rather than the cleanup, doing so wouldn’t jeopardize privacy.
In addition, the Census might also look to reduce the dimensions of certain variables and withhold publishing block-level data below a certain population threshold.
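Both of those mitigations are simple data operations. A hedged sketch, where the column names, brackets, and the threshold of three are all assumptions for illustration:

```python
import pandas as pd

# Illustrative person-level records.
df = pd.DataFrame({
    "block": ["A", "A", "A", "B", "C"],
    "age":   [4, 37, 66, 82, 19],
})

# Reduce dimensionality: ~100 single-year age values become 3 brackets.
df["age_bracket"] = pd.cut(df["age"], bins=[0, 18, 65, 120],
                           labels=["0-17", "18-64", "65+"], right=False)

# Withhold block-level tables where the population falls below a threshold.
counts = df.groupby("block")["age"].count()
publishable = counts[counts >= 3].index       # only block A qualifies
table = df[df["block"].isin(publishable)]
```

Coarser variables and fewer published geographies mean fewer constraints for an attacker to triangulate, which frees privacy budget for the tables that remain.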
Between the lines
People like me, researchers for whom analyzing trends in Census data is a secondary aspect of our work, have by and large not even considered the effects of this sea change in the Census approach to privacy protection. We didn’t see the 2018 notice, didn’t attend any meetings, and didn’t look at the demonstration data. We haven’t had the time. And now, we might not be able to use the Census as we did before. Luckily, differential privacy won’t be applied to the American Community Survey until 2025, which buys us some time to understand this new reality. But the Census is in a challenging place. There is a major threat to public trust in Census data collection that requires these new privacy measures. If data collection suffers, the data products will suffer, too. But there’s also a threat to the utility of the data, data that is important not just for advancing knowledge but also for public policy advocacy and more. Indeed, boyd is right in her warning: “What’s at stake is not simply the availability of the data; it is the legitimacy of the census.”