Demographic Bias Correction for Social Media Data

General Information

Title
Demographic Bias Correction for Social Media Data
Author
Christopher Wienberg
Publication Type
Dissertation (Bachelor/Master/PhD)
Outlet
University of Southern California
Year
2017
Abstract
For generations, people have been keeping records of their everyday lives. The web is now a popular place for people to document their personal lives, replacing the journals and diaries that were popular decades ago. The popularity of weblogs and social media has provided a unique opportunity to study people at a massive scale. Social media researchers have seized this chance to use social media data to predict and measure social phenomena, such as elections, economic activity, and public health. While these researchers’ work has shown promise, they frequently highlight a challenge with web data: web users, as a group, are dissimilar (e.g., younger, wealthier) from most offline populations. Demographic representativity is an issue that economists and other social scientists deal with regularly. They have found that reweighting survey samples based on demographic variables like age and gender can improve the accuracy of survey results. They take advantage of this by asking survey respondents to provide their demographic background. In contrast, social media analysts do not have immediate access to these demographic variables. This dissertation proposes and evaluates a practical approach for making social predictions from social media data while contending with demographic representativity issues. It describes the collection and analysis of reliable data describing a population of web users. Social predictions are drawn from this population, with various bias correction approaches evaluated by comparing to gold standard data from traditionally collected surveys. Special attention is paid to important practical considerations, such as errors introduced by automated methods to characterize the demographic and other attributes of individual users and their impact on predictions for the broader population.