you ask an LLM to simulate 6,000 American households answering questions on inflation? Recent papers find that large language models can replicate the average responses of major household surveys to within a percentage point (Zarifhonarvar, 2026). In 2020, the Survey of Consumer Expectations (SCE) reported a one-year-ahead median inflation rate of about 3%. The median produced by a prompted LLM with realistic personas and a knowledge-cutoff instruction: also about 3%. Close enough that LLMs have been pitched as a low-cost, high-frequency complement to the SCE, Michigan, and Survey of Professional Forecasters surveys.
In a recent paper, Can LLMs Mimic Household Surveys?, co-authored with Ami Dalloul from the University of Duisburg-Essen, we look at the second moment, the part of a probability distribution that tells you whether the model represents one opinion or a thousand. It’s here that the apparent success of LLM-based surveys disappears. The same Llama-3 model that hits the SCE median to within a percentage point places 95% of its simulated respondents within a two-percentage-point window. The actual 2020 SCE responses range from roughly minus 25 to plus 27 percent. In short, the average is right, but the population behind it doesn’t exist. So running a simulation with several thousand LLM personas boils down to one representative agent.
Figure 1: Dispersion of Real-World and Synthetic Survey Populations
Note: The left panel plots the dispersion of individual 2020 SCE respondents around their mean. Diffuse scatter reflects heterogeneous beliefs across respondents. The middle panel applies the same construction to synthetic responses from a Llama-3.1-8B-Instruct model prompted with personas matching the SCE demographic distribution. The scatter collapses to a near-point. The model recovers the mean and discards everything else. The right panel uses the same Llama model unlearned with gradient ascent (GA). The unlearned model achieves a more realistic dispersion and doesn’t collapse around the mode.
Mode collapse
We benchmarked five LLMs (Llama-3-8B, Llama-3-70B, Claude-3.7-Sonnet, DeepSeek-V3, GPT-4o) against the SCE, Michigan Survey, and Survey of Professional Forecasters. In the human surveys, 44 to 70% of respondents give answers more than 3 percentage points away from the modal answer; in the LLM samples, that share is essentially zero.
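The tail share quoted above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `tail_share`, the rounding used to find the mode, and the toy samples are all assumptions made for the example.

```python
from collections import Counter

def tail_share(responses, threshold=3.0, decimals=0):
    """Percent of responses more than `threshold` points from the modal answer.

    The mode is taken over responses rounded to `decimals` places
    (an assumption; the rounding rule is not stated in the text).
    """
    rounded = [round(r, decimals) for r in responses]
    mode = Counter(rounded).most_common(1)[0][0]
    far = sum(1 for r in responses if abs(r - mode) > threshold)
    return 100.0 * far / len(responses)

# Toy samples (hypothetical, not survey data).
human_like = [-5.0, 0.0, 2.0, 3.0, 3.0, 3.0, 4.0, 8.0, 10.0, 25.0]
llm_like = [3.0, 3.0, 3.0, 3.0, 3.1, 2.9, 3.0, 3.0, 3.2, 3.0]

human_tail = tail_share(human_like)  # dispersed sample: sizeable tail
llm_tail = tail_share(llm_like)      # collapsed sample: no tail
```

A dispersed sample yields a large share, while a sample clustered on one value yields zero, which is the pattern described for the LLM outputs.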
The standard remedies from the survey-simulation literature do not fix this problem. Census-derived personas with complex and varying characteristics, zero-shot knowledge-cutoff instructions (“you do not know events after June 2018”), and explicit “don’t look up statistics” prompts all default to the same narrow distribution. The likely cause is that the LLMs see CPI tables, news coverage of FRBNY survey releases, and academic replications in their training corpora. Asked for the median 2020 inflation expectation, the model is doing retrieval against memorized data. The weight of that training data overpowers whatever the prompt instructions ask it to do.
Unlearning the LLMs
If memorized statistics are the problem, a possible fix is to remove them from the weights rather than ask the model to look away. We applied two unlearning methods to Llama-3.1-8B-Instruct, an open-source model that allows us to modify its weights:
- Gradient Ascent (GA) maximizes prediction loss on a forget set of CPI series and survey aggregates, with a retain loss on micro-survey reasoning so general capability survives.
- Negative Preference Optimization (NPO) treats the forget set as dispreferred completions and minimizes a bounded preference loss against a reference model.
The data we ask the model to forget is the official inflation record itself: monthly CPI series and published mean inflation expectations from the FRBNY SCE and Michigan surveys. The unlearning effect on the response distribution is in Table 1.
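To make the two objectives concrete, here is a rough sketch that operates on per-sequence log-probabilities. It is not the training code from the paper. The function names, the `retain_weight` and `beta` parameters, and the exact NPO formula (the form commonly given in the NPO literature) are assumptions, and the NPO sketch omits any retain term.

```python
import math

def ga_loss(forget_logps, retain_logps, retain_weight=1.0):
    """Gradient-ascent objective written as a loss to minimize.

    Minimizing the negated forget NLL pushes the forget-set loss up;
    the retain NLL term keeps loss low on data that should survive.
    """
    forget_nll = -sum(forget_logps) / len(forget_logps)
    retain_nll = -sum(retain_logps) / len(retain_logps)
    return -forget_nll + retain_weight * retain_nll

def npo_loss(policy_logps, ref_logps, beta=0.1):
    """NPO-style loss: (2 / beta) * mean(log(1 + (pi_theta / pi_ref) ** beta)).

    Bounded below by zero, so driving forget-set probability down
    yields diminishing returns, unlike unbounded gradient ascent.
    """
    terms = [math.log1p(math.exp(beta * (p - r)))
             for p, r in zip(policy_logps, ref_logps)]
    return (2.0 / beta) * sum(terms) / len(terms)

# Hypothetical log-probabilities for illustration only.
before = ga_loss([-1.0, -3.0], [-0.5, -1.5])
after = ga_loss([-5.0, -5.0], [-0.5, -1.5])  # forget set became less likely
```

In this toy case the GA loss falls as the forget-set sequences become less likely, while the NPO loss shrinks toward zero rather than diverging.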
Table 1: Tail Accuracy with Different Unlearning Strategies

Note: Unlearning strategies to mitigate mode collapse. Gradient ascent (GA) is a targeted unlearning method where the model is fine-tuned to maximize loss on a dataset of official CPI statistics while minimizing loss, or retaining (RT), on a dataset of micro-survey data. Negative preference optimization (NPO) treats official statistics as negative samples to penalize their generation while treating retaining (RT) samples as positive. Synthetic survey replies of inflation expectations are reported as percentage deviations from the mode and mean (in brackets) within bins of exact matches, ± 1, and > 3 % deviations. Tail Acc. measures closeness to the FRBNY tail dispersion benchmark (> ± 3.0 = 44.38).
The baseline Llama-3 (which includes prompt-based unlearning) produces an exact mode match on 92% of replies and zero replies more than 3pp away. Tail accuracy against the SCE benchmark of 44% is therefore zero. After GA, exact matches drop to 24%, and 43% of replies move beyond ±3pp; tail accuracy reaches 97%. NPO is similar at 37% and 43%, with 98% tail accuracy. In other words, both unlearning methods appear to recover a more realistic distribution.
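The text does not spell out how tail accuracy is computed. One reading that is consistent with the reported figures (a 0% tail share gives 0, a share of about 43% against the 44.38 benchmark gives about 97%) is a relative-closeness score. The sketch below encodes that assumed formula; the function name and the clipping at zero are hypothetical and may differ from the definition used in the paper.

```python
FRBNY_TAIL_BENCHMARK = 44.38  # share of SCE replies beyond +/- 3pp, from the table note

def tail_accuracy(tail_share_pct, benchmark=FRBNY_TAIL_BENCHMARK):
    """Assumed score: 100 * (1 - |share - benchmark| / benchmark), floored at 0."""
    closeness = 1.0 - abs(tail_share_pct - benchmark) / benchmark
    return 100.0 * max(0.0, closeness)

baseline_score = tail_accuracy(0.0)    # collapsed model, no replies in the tail
unlearned_score = tail_accuracy(43.0)  # rounded tail share reported after unlearning
```

Because the inputs here are the rounded shares from the text, the sketch reproduces the reported scores only approximately.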
Figure 2: Dispersion of LLMs vs. Unlearning Models

Note: The left-hand side plots kernel density estimates of 2020 inflation expectations from the FRBNY SCE and two Llama-3 variants trained with unlearning methods, gradient ascent (GA) and negative preference optimization (NPO). Both unlearning variants cover the range where FRBNY SCE places probability mass, though they still remain more concentrated than the human benchmark and slightly skewed to higher means. The right-hand side compares the KDEs of prompted LLM-generated expectations (GPT-4o, Llama-3, etc.) to FRBNY SCE in 2020. The LLM curves (left axis) are tightly clustered around a narrow region, while the FRBNY SCE curve remains much broader. The LLMs can match central tendency but fail to reproduce the cross-sectional spread of survey micro-data. Bandwidth = 0.5 for all KDEs.
The kernel densities (Figure 2) show that off-the-shelf models pile probability mass into a thin spike near the mean. The unlearned variants spread mass across the range where the human respondents of the SCE put it.
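For readers who want to reproduce this kind of comparison, the following is a minimal fixed-bandwidth Gaussian KDE in plain Python. It uses the bandwidth of 0.5 stated in the figure note, but the function name, the grid, and the toy samples are assumptions for illustration and do not come from the replication files.

```python
import math

def gaussian_kde(samples, grid, bandwidth=0.5):
    """Evaluate a Gaussian kernel density estimate at each grid point.

    `bandwidth` is the kernel standard deviation in the data's own units
    (percentage points here), not a scaling factor.
    """
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]

# Evaluation grid from -40 to 40 percent in steps of 0.1.
grid = [i / 10 for i in range(-400, 401)]

# Toy samples (hypothetical): one collapsed, one dispersed.
collapsed = [3.0] * 8 + [2.9, 3.1]
spread = [-10.0, -2.0, 0.0, 2.0, 3.0, 3.0, 4.0, 6.0, 12.0, 25.0]

collapsed_density = gaussian_kde(collapsed, grid)
spread_density = gaussian_kde(spread, grid)
```

With the same bandwidth, the collapsed sample produces a single tall spike and the dispersed sample a much lower, wider curve, which is the contrast the figure draws between prompted LLMs and the SCE.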
Simulating a randomized controlled trial
A wider distribution is necessary but not sufficient for the application that motivated our paper: replicating survey RCTs with synthetic variations. RCTs are expensive. After data collection ends, a researcher can’t go back to test a theory that emerged later or vary a treatment. Synthetic agents would let us do exactly that, if their behavior matches what real respondents produce.
To test this, we replicate a real-world RCT by Coibion, Gorodnichenko, and Weber (2022). Respondents are randomly assigned to one of several groups: a control group sees no information, several treatment groups each receive a different piece of economic information (the actual past inflation rate, the Fed’s 2% target, etc.), and a placebo group is shown content unrelated to inflation. All respondents first report a prior inflation expectation, then see whatever their group is assigned, and then report a new posterior expectation. The difference between posterior and prior is the respondent’s revision.
A treatment works if its revisions differ visibly from the control group’s, and if the direction of the shift matches what economic theory expects: downward revisions from FOMC communication, upward revisions from news of higher gasoline prices. The check for our synthetic agents is whether their revisions separate the same way the human respondents’ did.
We constructed 30,000 synthetic personas with Census-derived demographics, and estimated the average treatment effect on each of the three LLMs, including our unlearned ones. The first check is on the priors themselves: the inflation expectations agents report before they see any information. Figure 3 plots the mean and standard deviation of these priors across demographic subgroups for the human benchmark and the three LLMs. One unlearning model (Llama-GA) comes close to the human aggregate in both level and dispersion. While one unlearning method worked (GA), the other did not (NPO). So unlearning may not be a one-size-fits-all remedy.
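The revision logic can be sketched as a simple difference in mean revisions between each arm and the control group. This is an illustrative simplification rather than the estimator used in the paper; the record fields (`arm`, `prior`, `posterior`), the arm labels, and the numbers are hypothetical.

```python
from statistics import mean

def revisions_by_arm(records):
    """Group each respondent's revision (posterior minus prior) by assigned arm."""
    by_arm = {}
    for r in records:
        by_arm.setdefault(r["arm"], []).append(r["posterior"] - r["prior"])
    return by_arm

def treatment_effects(records, control="control"):
    """Mean revision in each non-control arm minus the mean control revision."""
    by_arm = revisions_by_arm(records)
    baseline = mean(by_arm[control])
    return {arm: mean(vals) - baseline
            for arm, vals in by_arm.items() if arm != control}

# Hypothetical respondents, two per arm.
records = [
    {"arm": "control", "prior": 5.0, "posterior": 5.0},
    {"arm": "control", "prior": 4.0, "posterior": 4.5},
    {"arm": "fed_target", "prior": 6.0, "posterior": 4.0},
    {"arm": "fed_target", "prior": 5.0, "posterior": 3.5},
    {"arm": "placebo", "prior": 5.0, "posterior": 5.5},
    {"arm": "placebo", "prior": 3.0, "posterior": 3.0},
]

effects = treatment_effects(records)
```

In this toy data the Fed-target arm revises downward relative to control while the placebo arm does not separate from it, which is the kind of pattern the check looks for.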
Figure 3: Model Estimates of Perceived Inflation

Note: Each panel plots by demographic subgroup for the human benchmark (Coibion et al., 2022), the baseline Llama-3, and its two unlearned variants (GA, NPO). The dashed line marks the human “All” value. Left-hand side: Llama-3 and Llama-NPO are essentially flat across demographic characteristics; Llama-GA tracks the human level on average but doesn’t reproduce the within-demographic ordering (e.g. predicting the highest mean for “college or more” and “Inc T3,” contrary to the human pattern). Right-hand side: the unlearned GA model recovers most of the dispersion collapsed by the base model.
The next check is on how the priors get updated after the information treatment. In the baseline Llama-3 and Llama-NPO models, revisions are essentially identical across every treatment and the models don’t register a treatment effect at all. Llama-GA is the only one where the treatments separate, and within its largest subgroup of agents (80% of the sample) the four monetary-policy treatments (past inflation, Fed target, FOMC forecast, FOMC statement) produce negative and significant revisions of the same sign and rough magnitude as the human respondents in Coibion et al.
What to take from this
For researchers and practitioners deciding whether to use LLMs to conduct surveys, the summary is:
- LLMs are unable to mimic different personas. Simulating surveys comes down to one agent answering the same question thousands of times, hitting something very close to the mean every time, sometimes up to four decimal places.
- Targeted unlearning recovers most of the dispersion and a fair share of the treatment effects in an RCT with human respondents. However, unlearning methods achieve different levels of success.
- The gap between mean accuracy and distributional accuracy is large enough that any paper using synthetic respondents should report the second.
Future work should treat distributional accuracy and data leakage as joint constraints rather than secondary concerns. Progress will depend on methods that account for both what models know and how their outputs are evaluated, with greater attention paid to dispersion, tails, and belief updating rather than averages alone.
References
Coibion, O., Y. Gorodnichenko, and M. Weber (2022). Monetary policy communications and their effects on household inflation expectations. Journal of Political Economy 130(6), 1537–1584.
Dalloul, A., Pfeifer, M. (2026). Can LLMs Mimic Household Surveys?: From Representative Agents to Population Distributions. SSRN preprint.
Zarifhonarvar, A. (2026). Generating inflation expectations with large language models. Journal of Monetary Economics 157, 103859.
Replication Data
Dalloul, A., Pfeifer, M. (2026). Replication Data for: “Can LLMs Mimic Household Surveys?: From Representative Agents to Population Distributions”, https://doi.org/10.7910/DVN/CRIRVJ, Harvard Dataverse, V1.

