
How Racial Data Gets 'Cleaned' in the U.S. Census

The national survey offers more identity choices than ever—until those choices get scrubbed away.

In the years between censuses, researchers, activists, politicians, and interest groups lobby for the rewording of a label, the addition (or elimination) of a category, or the disaggregation of a broad category such as Asian or American Indian or Alaska Native. In 2000, for example, “Hispanic or Latino, or Spanish origins” was reclassified from racial to ethnic data. That year, for the first time, respondents were also allowed to select multiple boxes to reflect multiracial heritage. Additional changes that affect how the racial makeup of the country is represented are underway, including the creation of a separate category for people of Middle Eastern and North African descent (referred to as MENA).

 
Shifts in racial classifications raise questions about what exactly is being counted, how people interpret the same questions differently, and what to do about people’s changing perceptions of their racial background. In 2015, the Pew Research Center reported that at least 9.8 million people gave a different racial or ethnic background in the 2010 census than they had in 2000. When someone appears to “change” races, the resulting data is sometimes construed as erroneous.

The statistical accounting used to correct such errors is commonly referred to as “data cleaning” or “data cleansing.” This process involves identifying and then editing data already collected (through modification, enhancement, or deletion of responses) when it does not conform to predetermined rules that standardize the data set. Ostensibly, the goal is to improve data quality by correcting measurement errors introduced by the people who complete the questionnaires or enter responses into the database. Data cleaning also aims to make a final data set comparable to other, related ones, such as other national censuses and the American Community Survey.
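
To make the mechanics concrete, here is a minimal sketch of rule-based cleaning in Python. It is an illustration only: the category list, the rules, the flag names, and the `clean` function are all invented for this example and do not represent the Census Bureau's actual editing procedures. It shows one response being checked against fixed rules, with deletion, enhancement (filling a blank from an earlier answer), and flagging each recorded.

```python
from dataclasses import dataclass, field

# Hypothetical category list for illustration; not an official coding scheme.
ALLOWED = {"White", "Black", "Asian", "AIAN", "NHPI", "Some other race"}


@dataclass
class Response:
    person_id: int
    races: list[str]
    flags: list[str] = field(default_factory=list)


def clean(response: Response, prior: dict[int, list[str]]) -> Response:
    """Apply made-up conformity rules and record which ones fired."""
    flags = []

    # Deletion: drop any entry that is not in the allowed category list.
    kept = [r for r in response.races if r in ALLOWED]
    if len(kept) < len(response.races):
        flags.append("unknown_category")

    previous = prior.get(response.person_id)

    # Enhancement: fill an empty answer from the person's earlier response.
    if not kept and previous:
        kept = list(previous)
        flags.append("imputed_from_prior")

    # Flagging: more than one box checked.
    if len(kept) > 1:
        flags.append("multiple_boxes")

    # Flagging: answer differs from the earlier response.
    if previous is not None and set(kept) != set(previous):
        flags.append("changed_since_prior")

    return Response(response.person_id, kept, flags)


# Earlier answers, keyed by a hypothetical person identifier.
prior = {1: ["White"], 2: ["Black"], 3: ["Asian"]}

same = clean(Response(1, ["White"]), prior)
multi = clean(Response(2, ["Black", "White"]), prior)
blank = clean(Response(3, []), prior)
```

Under these invented rules, the respondent who gives the same single answer both times passes untouched, the respondent who checks two boxes is flagged twice, and the blank response is silently filled in with an answer from a decade earlier.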
 
Errors in reporting and recording certainly do happen. But if racial data must be cleaned, then some data is dirty. And that dirtiness is undeniably political, because some responses are more likely than others to be diagnosed as dirty. Given the goal of creating information that is comparable from one national census to the next, the responses most under suspicion are those that correspond to the categories most in flux: people who checked more than one box, for example, or those who saw themselves as members of different racial or ethnic groups at different times.