    Can LLMs Replace Survey Respondents?

    By Editor Times Featured, May 20, 2026


    What happens when you ask an LLM to simulate 6,000 American households answering questions on inflation? Recent papers find that large language models can replicate the average responses of major household surveys to within a percentage point (Zarifhonarvar, 2026). In 2020, the Survey of Consumer Expectations (SCE) reported a one-year-ahead median inflation rate of about 3%. The median produced by a prompted LLM with realistic personas and a knowledge-cutoff instruction: also about 3%. That is close enough that LLMs have been pitched as a low-cost, high-frequency complement to the SCE, the Michigan Survey, and the Survey of Professional Forecasters.

    In a recent paper, Can LLMs Mimic Household Surveys?, co-authored with Ami Dalloul from the University of Duisburg-Essen, we look at the second moment, the part of a probability distribution that tells you whether the model represents one opinion or a thousand. It is here that the apparent success of LLM-based surveys disappears. The same Llama-3 model that hits the SCE median to within a percentage point places 95% of its simulated respondents within a two-percentage-point window. The real 2020 SCE responses range from roughly minus 25 to plus 27 percent. In short, the average is right, but the population behind it doesn’t exist. Running a simulation with several thousand LLM personas boils down to one representative agent.
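
    The gap between a matching median and a missing population can be made concrete with a small sketch. The two samples below are hypothetical draws, not survey or model output; the spreads (0.3 and 8.0) and the helper name `share_within_window` are assumptions chosen only to illustrate the check.

```python
import random
import statistics

def share_within_window(values, center, width=2.0):
    """Fraction of values inside a window of total `width` centered on `center`."""
    half = width / 2.0
    return sum(1 for v in values if abs(v - center) <= half) / len(values)

rng = random.Random(0)
# Hypothetical samples, both centered near 3%: one tight, one widely dispersed.
tight = [rng.gauss(3.0, 0.3) for _ in range(6000)]
wide = [rng.gauss(3.0, 8.0) for _ in range(6000)]

for name, sample in [("tight", tight), ("wide", wide)]:
    med = statistics.median(sample)
    sd = statistics.stdev(sample)
    share = share_within_window(sample, med)
    print(f"{name}: median={med:.2f} sd={sd:.2f} share_in_2pp_window={share:.2f}")
```

    Both samples report nearly the same median, so a comparison of central tendency alone cannot tell them apart; only the standard deviation and the window share do.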

    Figure 1: Dispersion of Real-World and Synthetic Survey Populations

    Note: The left panel plots the dispersion of individual 2020 SCE respondents around their mean. Diffuse radiation reflects heterogeneous beliefs across respondents. The middle panel applies the same construction to synthetic responses from a Llama-3.1-8B-Instruct model prompted with personas matching the SCE demographic distribution. The scatter collapses to a near-point: the model recovers the mean and discards everything else. The right panel uses the same Llama model unlearned with gradient ascent (GA). The unlearned model achieves a more realistic dispersion and does not collapse around the mode.

    Mode collapse

    We benchmarked five LLMs (Llama-3-8B, Llama-3-70B, Claude-3.7-Sonnet, DeepSeek-V3, GPT-4o) against the SCE, the Michigan Survey, and the Survey of Professional Forecasters. In the human surveys, 44 to 70% of respondents give answers more than 3 percentage points away from the modal answer; in the LLM samples, that share is essentially zero.
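
    The share of answers far from the mode is simple to compute. The sketch below is one possible reading of that statistic: it assumes the modal answer is found after rounding to whole percentage points, which the text does not specify, and the two toy samples are invented for illustration.

```python
from collections import Counter

def modal_answer(responses, ndigits=0):
    """Most common answer after rounding to `ndigits` decimals."""
    rounded = [round(r, ndigits) for r in responses]
    return Counter(rounded).most_common(1)[0][0]

def tail_share(responses, threshold=3.0):
    """Percent of responses more than `threshold` points away from the mode."""
    mode = modal_answer(responses)
    far = sum(1 for r in responses if abs(r - mode) > threshold)
    return 100.0 * far / len(responses)

# Hypothetical toy samples, not survey data.
human_like = [3, 3, 3, 2, 5, 10, -4, 8, 0, 15]
llm_like = [3, 3, 3, 3, 3, 3, 3.1, 2.9, 3, 3]

print(tail_share(human_like), tail_share(llm_like))
```

    On these toy inputs the dispersed sample has a tail share of 40% and the collapsed sample has a tail share of 0%.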

    The standard remedies from the survey-simulation literature do not fix this problem. Census-derived personas with complex and varying characteristics, zero-shot knowledge-cutoff instructions (“you do not know events after June 2018”), and explicit “do not look up statistics” prompts all default to the same narrow distribution. The likely cause is that the LLMs see CPI tables, news coverage of FRBNY survey releases, and academic replications in their training corpora. Asked for the median 2020 inflation expectation, the model is doing retrieval against memorized data. The weight of that training data overpowers whatever the prompt instructions ask it to do.
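
    For readers unfamiliar with these prompt-based remedies, the sketch below shows roughly what such a prompt might look like. The persona field names, the survey question wording, and the function `build_persona_prompt` are hypothetical; only the two quoted instructions are taken from the text above. This is the kind of prompt that, per the results here, does not restore dispersion.

```python
def build_persona_prompt(persona, cutoff="June 2018"):
    """Assemble a survey prompt from a persona dict (hypothetical field names)."""
    lines = [
        f"You are a {persona['age']}-year-old {persona['education']} respondent "
        f"living in {persona['region']} with household income {persona['income']}.",
        f"You do not know events after {cutoff}.",
        "Do not look up statistics; answer from your own experience.",
        "What do you expect the rate of inflation to be over the next 12 months? "
        "Reply with a single number in percent.",
    ]
    return "\n".join(lines)

# Hypothetical persona; real personas would be drawn from Census-derived margins.
persona = {
    "age": 42,
    "education": "college-educated",
    "region": "the Midwest",
    "income": "between $50,000 and $100,000",
}
prompt = build_persona_prompt(persona)
print(prompt)
```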

    Unlearning the LLMs

    If memorized statistics are the problem, a potential fix is to remove them from the weights rather than ask the model to look away. We applied two unlearning methods to Llama-3.1-8B-Instruct, an open-weight model that allows us to modify its weights:

    • Gradient Ascent (GA) maximizes prediction loss on a forget set of CPI series and survey aggregates, with a retain loss on micro-survey reasoning so general capability survives.
    • Negative Preference Optimization (NPO) treats the forget set as dispreferred completions and minimizes a bounded preference loss against a reference model.
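
    The two objectives can be sketched on plain numbers to show why one is unbounded and the other is not. This is a minimal illustration, not the training code used in the paper: `ga_objective` assumes a simple weighted sum with a hypothetical `retain_weight`, and `npo_loss` uses the commonly published NPO form, (2/β) · mean(log(1 + (π_θ/π_ref)^β)), with β as an assumed hyperparameter. Inputs are per-sequence negative log-likelihoods and log-probabilities that a real pipeline would obtain from the model.

```python
import math

def ga_objective(nll_forget, nll_retain, retain_weight=1.0):
    """Quantity to minimize: negated forget loss plus weighted retain loss."""
    return -nll_forget + retain_weight * nll_retain

def npo_loss(logp_model, logp_ref, beta=0.1):
    """Mean NPO loss over forget-set sequences, from sequence log-probabilities."""
    total = 0.0
    for lp, lp_ref in zip(logp_model, logp_ref):
        z = beta * (lp - lp_ref)
        # softplus(z) = log(1 + exp(z)) = -log(sigmoid(-z)), computed stably
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return (2.0 / beta) * total / len(logp_model)

# Before unlearning the model equals the reference, so the log-ratio is zero.
print(npo_loss([-5.0], [-5.0], beta=1.0))   # equals 2 * ln(2)
# As the model's log-probability on the forget set falls, the loss tends to zero.
print(npo_loss([-500.0], [-5.0], beta=0.1))
```

    Minimizing `ga_objective` keeps rewarding a larger forget loss without limit, while `npo_loss` flattens out near zero once the forget-set completions are already unlikely. That floor is what "bounded" refers to.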

    The data we ask the model to forget is the official inflation record itself: monthly CPI series and published mean inflation expectations from the FRBNY SCE and Michigan surveys. The effect of unlearning on the response distribution is shown in Table 1.

    Table 1: Tail Accuracy with Different Unlearning Strategies

    Note: Unlearning strategies to mitigate mode collapse. Gradient ascent (GA) is a targeted unlearning method where the model is fine-tuned to maximize loss on a dataset of official CPI statistics while minimizing loss, or retaining (RT), on a dataset of micro-survey data. Negative preference optimization (NPO) treats official statistics as negative samples to penalize their generation while treating retaining (RT) samples as positive. Synthetic survey replies of inflation expectations are reported as percentage deviations from the mode and mean (in brackets) within bins of exact matches, ±1, and > 3% deviations. Tail Acc. measures closeness to the FRBNY tail dispersion benchmark (> ±3.0 = 44.38).

    The baseline Llama-3 (which includes prompt-based unlearning) produces an exact mode match on 92% of replies and zero replies more than 3pp away. Tail accuracy against the SCE benchmark of 44% is therefore zero. After GA, exact matches drop to 24%, and 43% of replies move beyond ±3pp; tail accuracy reaches 97%. NPO is similar at 37% and 43%, with 98% tail accuracy. In other words, both unlearning methods appear to recover a more realistic distribution.
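
    The exact formula behind tail accuracy is not spelled out in this article. One plausible reading, used in the sketch below purely as an assumption, is one minus the relative gap between a model's tail share and the 44.38 benchmark. It reproduces the baseline (0) and the rounded GA figure (about 97), but because the reported shares are rounded it cannot be verified against the table.

```python
def tail_accuracy(tail_share, benchmark=44.38):
    """Percent closeness of a model's tail share to the benchmark (assumed formula)."""
    return max(0.0, 100.0 * (1.0 - abs(tail_share - benchmark) / benchmark))

for label, share in [("baseline", 0.0), ("GA (rounded share)", 43.0)]:
    print(f"{label}: tail share {share:.1f}% -> tail accuracy {tail_accuracy(share):.1f}%")
```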

    Figure 2: Dispersion of LLMs vs. Unlearning Models

    Note: The left-hand side plots kernel density estimates of 2020 inflation expectations from the FRBNY SCE and two Llama-3 variants trained with unlearning methods, gradient ascent (GA) and negative preference optimization (NPO). Both unlearning variants cover the range where the FRBNY SCE places probability mass, though they remain more concentrated than the human benchmark and slightly skewed to higher means. The right-hand side compares the KDEs of prompted LLM-generated expectations (GPT-4o, Llama-3, etc.) to the FRBNY SCE in 2020. The LLM curves (left axis) are tightly clustered around a narrow region, while the FRBNY SCE curve remains much broader. The LLMs can match central tendency but fail to reproduce the cross-sectional spread of survey micro-data. Bandwidth = 0.5 for all KDEs.

    The kernel densities (Figure 2) show that off-the-shelf models pile probability mass into a thin spike near the mean. The unlearned variants spread mass across the range where the human respondents of the SCE put it.
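
    A kernel density estimate of this kind can be written in a few lines. The sketch assumes a Gaussian kernel and treats the bandwidth of 0.5 as an absolute value in percentage points; the figure note gives the bandwidth but not the kernel, so both choices are assumptions, and the two samples are invented.

```python
import math

def gaussian_kde(sample, grid, bandwidth=0.5):
    """Gaussian kernel density estimate of `sample` evaluated at each grid point."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
        for x in grid
    ]

step = 0.25
grid = [i * step for i in range(-120, 121)]   # -30% to +30% in 0.25pp steps
spike = [3.0, 3.0, 3.1, 2.9, 3.0, 3.0]        # hypothetical collapsed sample
spread = [-10.0, -2.0, 1.0, 3.0, 6.0, 12.0]   # hypothetical dispersed sample

density_spike = gaussian_kde(spike, grid)
density_spread = gaussian_kde(spread, grid)
print(max(density_spike), max(density_spread))
```

    With the same bandwidth, the collapsed sample produces a single tall peak while the dispersed sample produces a low, wide curve. That contrast is the one the figure draws between prompted models and the survey.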

    Simulating a randomized controlled trial

    A wider distribution is necessary but not sufficient for the application that motivated our paper: replicating survey RCTs with synthetic variations. RCTs are expensive. After data collection ends, a researcher can’t go back to test a theory that emerged later or vary a treatment. Synthetic agents would let us do exactly that, if their behavior matches what real respondents produce.

    To test this, we replicate a real-world RCT by Coibion, Gorodnichenko, and Weber (2022). Respondents are randomly assigned to one of several groups: a control group sees no information, several treatment groups each receive a different piece of economic information (the actual past inflation rate, the Fed’s 2% target, etc.), and a placebo group is shown content unrelated to inflation. All respondents first report a prior inflation expectation, then see whatever their group is assigned, and then report a new posterior expectation. The difference between posterior and prior is the respondent’s revision.
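
    The design described above reduces to two operations: random assignment and a difference. The sketch below assumes equal assignment probabilities and uses an illustrative subset of group labels; the original study's arms and assignment shares may differ.

```python
import random

GROUPS = ["control", "past_inflation", "fed_target", "placebo"]  # illustrative subset

def assign_groups(respondent_ids, groups=GROUPS, seed=0):
    """Randomly assign each respondent to one group with equal probability."""
    rng = random.Random(seed)
    return {rid: rng.choice(groups) for rid in respondent_ids}

def revision(prior, posterior):
    """Revision is the posterior expectation minus the prior expectation."""
    return posterior - prior

assignment = assign_groups(range(1000))
# A respondent who expected 4% and reports 2.5% after treatment revised by -1.5.
print(assignment[0], revision(4.0, 2.5))
```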

    A treatment works if its revisions differ visibly from the control group’s, and if the direction of the shift matches what economic theory expects: downward revisions from FOMC communication, upward revisions from news of higher gasoline prices. The check for our synthetic agents is whether their revisions separate the same way the human respondents’ did.
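
    The comparison against the control group can be sketched as a difference in mean revisions. This is a simplified stand-in for the estimation in the paper, which is not described here; it omits standard errors and covariates, and the records are invented toy values.

```python
import statistics

def treatment_effects(records, control="control"):
    """Mean revision of each group minus the control group's mean revision.

    `records` is a list of (group, prior, posterior) tuples.
    """
    revisions = {}
    for group, prior, posterior in records:
        revisions.setdefault(group, []).append(posterior - prior)
    base = statistics.mean(revisions[control])
    return {
        g: statistics.mean(r) - base
        for g, r in revisions.items()
        if g != control
    }

# Hypothetical toy records, not data from the study.
records = [
    ("control", 4.0, 4.0), ("control", 3.0, 3.5),
    ("fomc_statement", 5.0, 3.0), ("fomc_statement", 4.0, 3.0),
    ("gas_prices", 3.0, 4.5), ("gas_prices", 2.0, 3.0),
]
effects = treatment_effects(records)
print(effects)
```

    In this toy example the FOMC arm shows a negative effect and the gasoline arm a positive one, the sign pattern theory predicts. A model whose effects are all near zero is not registering the treatments.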

    We constructed 30,000 synthetic personas with Census-derived demographics and estimated the average treatment effect for each of the three LLMs (the baseline Llama-3 and its two unlearned variants). The first check is on the priors themselves: the inflation expectations agents report before they see any information. Figure 3 plots the mean and standard deviation of these priors across demographic subgroups for the human benchmark and the three LLMs. One unlearned model (Llama-GA) comes close to the human aggregate in both level and dispersion. One unlearning method worked (GA); the other (NPO) did not. So unlearning may not be a one-size-fits-all remedy.
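
    The check on priors amounts to a grouped mean and standard deviation. The sketch below shows that summary on invented rows; the subgroup labels are placeholders, and the sample standard deviation is an assumed choice.

```python
import statistics
from collections import defaultdict

def prior_summary(rows):
    """Mean and standard deviation of priors per subgroup.

    `rows` is a list of (subgroup, prior) pairs; each subgroup needs >= 2 rows.
    """
    by_group = defaultdict(list)
    for subgroup, prior in rows:
        by_group[subgroup].append(prior)
    return {
        g: (statistics.mean(v), statistics.stdev(v))
        for g, v in by_group.items()
    }

# Hypothetical toy rows, not model output.
rows = [
    ("college or more", 2.0), ("college or more", 4.0),
    ("high school", 3.0), ("high school", 9.0),
]
summary = prior_summary(rows)
for subgroup, (mean, sd) in summary.items():
    print(f"{subgroup}: mean={mean:.2f} sd={sd:.2f}")
```

    Running the same summary on human and synthetic priors, subgroup by subgroup, makes a flat profile across demographics easy to spot.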

    Figure 3: Model Estimates of Perceived Inflation

    Note: Each panel plots by demographic subgroup for the human benchmark (Coibion et al., 2022), the baseline Llama-3, and its two unlearned variants (GA, NPO). The dashed line marks the human “All” value. Left-hand side: Llama-3 and Llama-NPO are essentially flat across demographic characteristics; Llama-GA tracks the human level on average but does not reproduce the within-demographic ordering (e.g. predicting the highest mean for “college or more” and “Inc T3,” contrary to the human pattern). Right-hand side: the unlearned GA model recovers most of the dispersion collapsed by the base model.

    The next check is on how the priors get updated after the information treatment. In the baseline Llama-3 and Llama-NPO models, revisions are essentially identical across every treatment, and the models do not register a treatment effect at all. Llama-GA is the only one where the treatments separate, and within its largest subgroup of agents (80% of the sample) the four monetary-policy treatments (past inflation, Fed target, FOMC forecast, FOMC statement) produce negative and significant revisions of the same sign and rough magnitude as the human respondents in Coibion et al. (2022).

    What to take from this

    For researchers and practitioners deciding whether to use LLMs to conduct surveys, the summary is:

    • Off-the-shelf LLMs are unable to imitate different personas. Simulating surveys comes down to one agent answering the same question thousands of times, hitting something very close to the mean every time, sometimes up to four decimal places.
    • Targeted unlearning recovers most of the dispersion and a fair share of the treatment effects in an RCT with human respondents. However, unlearning methods achieve different levels of success.
    • The gap between mean accuracy and distributional accuracy is large enough that any paper using synthetic respondents should report the latter.

    Future work should treat distributional accuracy and data leakage as joint constraints rather than secondary concerns. Progress will depend on methods that account for both what models know and how their outputs are evaluated, with greater attention paid to dispersion, tails, and belief updating rather than averages alone.

    References

    Coibion, O., Gorodnichenko, Y., and Weber, M. (2022). Monetary policy communications and their effects on household inflation expectations. Journal of Political Economy 130(6), 1537–1584.

    Dalloul, A., and Pfeifer, M. (2026). Can LLMs Mimic Household Surveys?: From Representative Agents to Population Distributions. SSRN preprint.

    Zarifhonarvar, A. (2026). Generating inflation expectations with large language models. Journal of Monetary Economics 157, 103859.

    Replication Data

    Dalloul, A., and Pfeifer, M. (2026). Replication Data for: “Can LLMs Mimic Household Surveys?: From Representative Agents to Population Distributions”, https://doi.org/10.7910/DVN/CRIRVJ, Harvard Dataverse, V1.


