Assessing Bias in LLM-Generated Synthetic Datasets: The Case of German Voter Behavior

Authors

Leah von der Heyde, Anna-Carolina Haensch, Alexander Wenz

Publication date

2023/11

Source

SocArXiv Center for Open Science

Report number

DOI: 10.31219/osf.io/97r8s

Pages

1-7

Publisher

SocArXiv

Description

The rise of large language models (LLMs) like GPT-3 has sparked interest in their potential for creating synthetic datasets, particularly in the realm of privacy research. This study critically evaluates the use of LLMs in generating synthetic public opinion data, pointing out the biases inherent in the data generation process. While LLMs, trained on vast internet datasets, can mimic societal attitudes and behaviors, their application in synthesizing data poses significant privacy and accuracy challenges. We investigate these issues using the case of vote choice prediction in the 2017 German federal election.

Employing GPT-3, we construct synthetic personas based on the German Longitudinal Election Study and prompt the LLM to predict each persona's voting behavior. Our analysis compares these LLM-generated predictions with actual survey data, focusing on the implications of using such synthetic data and the biases it may contain. The results show that GPT-3 tends to predict vote choice inaccurately, with biases that favor certain political groups and voters with more predictable profiles. This outcome raises critical questions about the reliability and ethical use of LLMs in generating synthetic data.
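
The persona-and-compare workflow described above can be illustrated with a minimal Python sketch. It is not the authors' code: the persona fields (`age`, `gender`, `state`, `education`), the prompt wording, and the function names are all hypothetical, and the call to the LLM itself is left out. The sketch only shows how a prompt might be assembled from survey attributes and how predicted votes might be compared with reported ones.

```python
from collections import Counter


def build_persona_prompt(respondent: dict) -> str:
    """Turn one respondent's attributes into a first-person persona prompt.

    The field names and wording are assumptions for illustration; the
    survey variables actually used in the study are not listed here.
    """
    return (
        f"I am {respondent['age']} years old and {respondent['gender']}. "
        f"I live in {respondent['state']}. "
        f"My highest level of education is {respondent['education']}. "
        "In the 2017 German federal election, I voted for the party"
    )


def vote_shares(votes: list[str]) -> dict[str, float]:
    """Return each party's share of the given list of votes."""
    counts = Counter(votes)
    total = len(votes)
    return {party: n / total for party, n in counts.items()}


def compare(predicted: list[str], reported: list[str]) -> tuple[float, dict[str, float]]:
    """Return individual-level accuracy and per-party share differences.

    A positive difference means the party is over-predicted relative to
    the survey; a negative one means it is under-predicted.
    """
    if len(predicted) != len(reported):
        raise ValueError("predicted and reported must have the same length")
    accuracy = sum(p == r for p, r in zip(predicted, reported)) / len(reported)
    pred, rep = vote_shares(predicted), vote_shares(reported)
    parties = sorted(set(pred) | set(rep))
    bias = {party: pred.get(party, 0.0) - rep.get(party, 0.0) for party in parties}
    return accuracy, bias
```

Given a list of LLM completions mapped to party labels and the matching survey responses, `compare` returns one accuracy figure and one signed share difference per party, which is the kind of aggregate comparison the description refers to.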

Total citations

Cited by 2

2024: 2
