Two of the more interesting sessions at last week's AAPOR conference featured the US Census Bureau. The shared theme was the Bureau's initiative to reengineer its data collection process in an era of declining cooperation and ever-tightening budgets. The two underpinnings of their strategy are (1) a new data collection approach called adaptive design and (2) big data.
Adaptive design is an enhancement to an earlier strategy called responsive design, which replaced the traditional strategy of pursuing the highest possible response rate until either the money or time runs out. Adaptive design essentially says that the quality of the estimates is a better indicator of overall data quality than the response rate. To simplify it for the blogosphere, responsive design says that it makes no sense to continue to pursue interviews with certain types of people (say, a specific demographic group) if getting those data is not going to improve the survey's estimates, or at least the most important estimates. Adaptive design takes things a couple of steps further by saying that I can make decisions about which lines in my sample to pursue by using what I already know about them. Some of that information might come from close monitoring of the field effort on the survey I'm running, and some might come from other sources. That's where big data comes in.
The Census Bureau executes over 100 different surveys of households and businesses every year. Throw in the Decennial Census, and they have tens of thousands of field representatives visiting or calling millions of homes and businesses, and learning at least a little something about each of them, whether they complete an interview or not. Putting all of this together in a systematic way will make it possible to separate the easier-to-get respondents from the really hard ones. Bringing in data from the administrative records of other government agencies can enrich the database even further, sharpening the Bureau's ability to prioritize the data collection effort. (I'm one of those people who believe that much of the Decennial Census might be done from these administrative records, but that's another post.) In theory, the survey effort becomes more efficient, can be completed more quickly, and will cost less.
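To make the idea concrete, here is a minimal sketch of what that kind of case prioritization could look like. This is my own hypothetical illustration, not the Bureau's actual system: the names (`SampleCase`, `priority`, `cases_to_pursue`) and the scoring formula are assumptions I've made up for the example. The gist is that each open sample line gets a score combining how likely an interview is with how much that interview would improve the key estimates, relative to what another contact attempt costs.

```python
# Hypothetical adaptive-design sketch (not the Census Bureau's method):
# score each open sample case and spend a fixed attempt budget on the
# cases that promise the most estimate improvement per dollar.

from dataclasses import dataclass


@dataclass
class SampleCase:
    case_id: str
    response_propensity: float  # 0-1: chance another attempt yields an interview
    estimate_impact: float      # expected improvement in key estimates if interviewed
    cost_per_attempt: float     # dollars for one more contact attempt


def priority(case: SampleCase) -> float:
    # Expected quality gain per dollar: favor cases where an interview is
    # both likely and informative, relative to what pursuing it costs.
    return (case.response_propensity * case.estimate_impact) / case.cost_per_attempt


def cases_to_pursue(cases: list[SampleCase], budget: float) -> list[str]:
    """Greedily pick cases, highest priority first, until the budget runs out."""
    selected = []
    for case in sorted(cases, key=priority, reverse=True):
        if case.cost_per_attempt <= budget:
            selected.append(case.case_id)
            budget -= case.cost_per_attempt
    return selected
```

In this toy version, an easy-to-reach case that adds little to the estimates can rank below a harder case whose data would matter more, which is exactly the shift away from chasing response rate for its own sake.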
But the Bureau faces the same challenge all users of big data must face: potential limits due to privacy protections. In their case it may come down to their ability to use data collected for another purpose. But unlike many of those other users, the Bureau approaches these issues with the utmost seriousness. Confidentiality protection is an obsession. The bar is significantly higher than simply what is legal. And so, they have an aggressive survey program designed to measure public attitudes toward an approach like what I've just described.
The jury is still out on all of this, but here's hoping they can make this work.