Authors
Daniel Hopkins, Gary King
Publication date
2010
Journal
American Journal of Political Science
Volume
54
Issue
1
Pages
229–247
Publisher
http://gking.harvard.edu/files/words.pdf
Description
The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible. Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents, whereas social scientists instead want generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even a method with a high percent of individual documents correctly classified can be hugely biased when estimating category proportions. By directly optimizing for this social science goal, we develop a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly. We illustrate with diverse data sets, including the daily expressed opinions of thousands of …
Total citations
[Chart: citations per year, 2007–2024]
Scholar articles
D Hopkins, G King - Manuscript available at http://gking.harvard.edu/files …, 2007