Can AI write your code? | Towards Data Science

The question is not whether AI can write code, but whether we can trust the code it writes.

Over the past few years, ChatGPT and other large language models have become increasingly common in the daily workflow of students, analysts, researchers, and data scientists. Many of us have already used AI tools to generate a Python function, debug an error message, automate a repetitive task, or quickly translate code from one language to another.

But there is a major difference between asking ChatGPT to write a small helper function and asking it to implement a complex econometric method.

Can ChatGPT correctly code a Difference-in-Differences model? Can it implement Inverse Probability Treatment Weighting? Can it reproduce a Regression Discontinuity analysis? Can it do this not only in Python, but also in R and Stata?

That is why the article “Can AI write your code? A case study of ChatGPT’s statistical coding capabilities for quantitative research” by Winberg et al. immediately caught my attention. The paper was published online on January 22, 2026, in Health Economics Review. The authors evaluate ChatGPT-4.0 Professional’s ability to generate code for causal inference tasks in Python, R, and Stata, using benchmark solutions from Causal Inference: The Mixtape by Scott Cunningham.

Most articles I had previously read on this topic focused on relatively simple programming tasks: small automations, descriptive statistics, data cleaning, basic data analysis, or code generation in languages such as Python, R, and SAS. This study goes further. It asks whether ChatGPT can support quantitative research in more demanding settings, where the code is not just technical but also methodological.

The authors focus on three widely used causal inference methods:

Difference-in-Differences, also called Diff-in-Diff;
Inverse Probability Treatment Weighting, or IPTW;
Regression Discontinuity, or RD.

In this article, I’ll walk through the study in a structured way. First, we’ll present what makes this study different for quantitative researchers. Second, we’ll review the methodology used by the authors. Third, we’ll look at how ChatGPT’s performance was evaluated. Finally, we’ll discuss how the rise of LLMs has changed my own way of working.

What Makes This Study Different?

Many previous studies have evaluated ChatGPT’s coding ability using subjective assessment. In other words, researchers looked at the generated code and judged whether it appeared correct.

That approach is useful, but it has a limitation: it depends heavily on the evaluator’s judgment.

Winberg et al. take a more structured approach. They compare ChatGPT-generated code against standardized reference code and benchmark outputs from Causal Inference: The Mixtape. This allows them to evaluate the code not only based on appearance, but also based on whether it reproduces expected results.

Another important contribution is that the study includes Stata.

This matters because many empirical researchers, especially in economics, public policy, and health economics, still use Stata extensively. However, discussions about AI coding assistants often focus mainly on Python and R. By including Stata, the authors evaluate ChatGPT in a language that is highly relevant for applied econometric research but less frequently analyzed in AI coding studies.

The Methodology Used in the Study

The authors evaluate ChatGPT-4.0 Professional, the paid version of ChatGPT available at the time of the study. Their goal is to measure how well it performs when asked to code causal inference analyses in Python, R, and Stata.

They use publicly available data and problem sets from Causal Inference: The Mixtape. This textbook is widely known in applied econometrics and provides examples with code in R, Stata, and Python. According to the study, the reference environments were R 3.6.0, Stata 18, and Python 3.13.

The authors focus on three causal inference methods:

Difference-in-Differences;
Inverse Probability Treatment Weighting;
Regression Discontinuity.

These methods were chosen because they are commonly used in empirical research and require more than simple syntax generation. They require proper data preparation, model specification, and interpretation of outputs.

The study follows a three-step process.

Prompting ChatGPT With Econometric Problem Sets

The first step is to give ChatGPT problem sets and ask it to generate code for the relevant econometric analyses.

For example, one of the problem sets focuses on Difference-in-Differences. The context is the legalization of abortion in five U.S. states before the nationwide legalization following Roe v. Wade in 1973. The task is to estimate whether early abortion legalization affected gonorrhea incidence among adolescent females aged 15–19.

Instead of using only a simple post-treatment indicator, the prompt asks ChatGPT to use year-by-treatment interactions to capture dynamic treatment effects over time.

This kind of prompt is more complex than asking for a basic regression. It requires the model to understand the policy context, identify the treatment indicator, structure the interaction terms, and generate appropriate code.
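To make the year-by-treatment idea concrete, here is a minimal Python sketch on made-up numbers. It is not the Mixtape data and not the code the authors or ChatGPT produced: the function name `event_study_effects`, the toy records, and the choice of the first period as the reference are all assumptions. With only a group indicator, period indicators, and their interactions, each interaction coefficient equals the treated-minus-control gap in that period minus the same gap in the reference period, which is what the function computes directly from cell means.

```python
from collections import defaultdict
from statistics import mean


def event_study_effects(records, base_period):
    """Return {period: effect} for every period except base_period.

    records: iterable of (period, treated, outcome), with treated in {0, 1}.
    effect = (treated mean - control mean) in that period
             minus the same gap in base_period.
    """
    cells = defaultdict(list)
    for period, treated, outcome in records:
        cells[(period, treated)].append(outcome)

    def gap(period):
        # Treated-minus-control difference in mean outcomes for one period.
        return mean(cells[(period, 1)]) - mean(cells[(period, 0)])

    periods = sorted({period for period, _ in cells})
    base_gap = gap(base_period)
    return {p: gap(p) - base_gap for p in periods if p != base_period}


# Hypothetical toy panel: (period, treated_group, outcome).
toy_records = [
    (1, 1, 10.0), (1, 1, 12.0), (1, 0, 8.0), (1, 0, 10.0),
    (2, 1, 9.0), (2, 1, 11.0), (2, 0, 8.0), (2, 0, 10.0),
    (3, 1, 7.0), (3, 1, 9.0), (3, 0, 8.5), (3, 0, 9.5),
]
effects = event_study_effects(toy_records, base_period=1)
print(effects)
```

A real analysis would estimate these terms by regression, with standard errors and any additional controls the problem set calls for. The sketch only shows what the interaction terms measure.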

The authors define similar problem sets for IPTW and RD.
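For readers less familiar with IPTW, the core idea fits in a short Python sketch: each treated unit is weighted by 1/p and each untreated unit by 1/(1 - p), where p is the estimated probability of treatment. Everything below is illustrative. The function name `iptw_difference` and the toy rows are hypothetical, and the propensity score is estimated as the treated share within a single categorical stratum, which is a simplifying assumption (a logistic regression on several covariates is more typical in practice).

```python
from collections import defaultdict


def iptw_difference(rows):
    """Weighted difference in mean outcomes, treated minus untreated.

    rows: iterable of (stratum, treated, outcome), with treated in {0, 1}.
    Assumes every stratum contains both treated and untreated units;
    otherwise the weights are undefined (division by zero).
    """
    rows = list(rows)
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_untreated, n_treated]
    for stratum, treated, _ in rows:
        counts[stratum][treated] += 1

    weighted_sum = {0: 0.0, 1: 0.0}  # weighted outcome total per arm
    weight_total = {0: 0.0, 1: 0.0}  # sum of weights per arm
    for stratum, treated, outcome in rows:
        n_untreated, n_treated = counts[stratum]
        p = n_treated / (n_untreated + n_treated)  # estimated propensity score
        weight = 1 / p if treated else 1 / (1 - p)
        weighted_sum[treated] += weight * outcome
        weight_total[treated] += weight
    return weighted_sum[1] / weight_total[1] - weighted_sum[0] / weight_total[0]


# Hypothetical data: treatment is more common in stratum "A", and outcomes
# are higher in "A" regardless of treatment. The effect is 2 in both strata.
toy_rows = [
    ("A", 1, 6.0), ("A", 1, 6.0), ("A", 1, 6.0), ("A", 0, 4.0),
    ("B", 1, 3.0), ("B", 0, 1.0), ("B", 0, 1.0), ("B", 0, 1.0),
]
naive = (6.0 * 3 + 3.0) / 4 - (4.0 + 1.0 * 3) / 4  # unweighted difference
weighted = iptw_difference(toy_rows)
print(naive, weighted)
```

On this toy data the unweighted comparison gives 3.5 because the treated group is concentrated in the high-outcome stratum, while the weighted comparison recovers the within-stratum difference of 2.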

Asking for Full Coding Workflows

In the second step, the authors provide more comprehensive prompts. These prompts ask ChatGPT to reproduce fuller coding tasks from The Mixtape, including data management, econometric analysis, and figure generation.

This is important because real research workflows are rarely limited to one model command. A researcher usually has to import data, clean variables, create indicators, estimate models, generate tables, produce plots, and compare results.

By testing full workflows, the authors evaluate whether ChatGPT can handle the practical complexity of applied quantitative work.

Running the Code and Comparing Outputs

In the third step, the generated code is executed in the corresponding programming environment: Python, R, or Stata.

The authors then compare the outputs produced by ChatGPT-generated code with the benchmark outputs from The Mixtape.

How the Prompts Were Generated

One of the most interesting aspects of the study is the way the prompts were designed.

The authors recruited four researchers with advanced expertise in econometric methods. Two held PhDs, and two were PhD candidates. Three researchers were assigned to work with one language each: Python, R, or Stata. The fourth researcher replicated the entire process across all three languages to validate the results and assess consistency.

This design is useful because it reflects how researchers might use ChatGPT in practice. Each researcher interacts with the model, generates code, runs it, observes errors, and gives feedback.

However, this also creates a risk. If each researcher writes prompts independently, the results may reflect differences in prompting style rather than differences in ChatGPT’s coding ability.

To reduce this bias, the authors standardized the prompts. They collaboratively developed prompts that were clear, structured, and general enough to apply across tasks. The goal was to give ChatGPT enough information to solve the problem without overfitting the prompt to one specific task.

The quality of the output depends heavily on the quality of the prompt. If the prompt is vague, the model may produce generic or incorrect code. If the prompt is too specific, it may perform well on one task but fail to generalize.

A good prompt should provide context, specify the expected method, define the relevant variables, describe the desired output, and clarify any assumptions.
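One way to keep prompts structured is to assemble them from named fields. The Python sketch below is only an illustration of that idea. The class `AnalysisPrompt`, its field names, and the example variable names (`treated_state`, `gonorrhea_rate`) are hypothetical and are not the prompts used in the study.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisPrompt:
    """Container for the five elements of a structured analysis prompt."""
    context: str
    method: str
    variables: dict
    desired_output: str
    assumptions: list = field(default_factory=list)

    def render(self) -> str:
        """Turn the fields into a plain-text prompt, one element per block."""
        lines = [
            f"Context: {self.context}",
            f"Method: {self.method}",
            "Variables:",
        ]
        lines += [f"- {name}: {meaning}" for name, meaning in self.variables.items()]
        lines.append(f"Desired output: {self.desired_output}")
        if self.assumptions:
            lines.append("Assumptions:")
            lines += [f"- {item}" for item in self.assumptions]
        return "\n".join(lines)


# Hypothetical example; the variable names are placeholders.
prompt = AnalysisPrompt(
    context="Early abortion legalization in five U.S. states before 1973.",
    method="Difference-in-Differences with year-by-treatment interactions",
    variables={
        "treated_state": "1 if the state legalized early, else 0",
        "year": "calendar year of the observation",
        "gonorrhea_rate": "incidence among females aged 15-19",
    },
    desired_output="Python code that prints the interaction coefficients",
    assumptions=["Standard errors clustered by state"],
)
print(prompt.render())
```

Keeping the fields separate makes it easier to reuse the same template across tasks and to see at a glance which element is missing when the generated code goes wrong.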

The Five Performance Indicators

The authors evaluate ChatGPT’s performance using five main outcomes: accuracy, efficiency, error output, editing, and consistency.

Accuracy is measured by comparing the results generated by the ChatGPT-written code with the benchmark outputs from The Mixtape.

The evaluation is binary: if the result matches the benchmark, it is considered accurate. If it does not, it is considered inaccurate.
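A binary check of this kind is easy to automate. The sketch below is one possible way to do it in Python, not the procedure used in the paper: the function name `matches_benchmark`, the coefficient names, the numbers, and the relative tolerance are all assumptions, since how close two numbers must be to count as a match is a choice the evaluator has to make.

```python
import math


def matches_benchmark(generated, benchmark, rel_tol=1e-6):
    """Return True only if every benchmark estimate is reproduced.

    generated, benchmark: dicts mapping an estimate name to a number.
    A missing or extra estimate counts as a mismatch.
    """
    if generated.keys() != benchmark.keys():
        return False
    return all(
        math.isclose(generated[name], benchmark[name], rel_tol=rel_tol)
        for name in benchmark
    )


# Hypothetical estimates, for illustration only.
benchmark = {"treat_x_period2": -1.0, "treat_x_period3": -3.0}
same = {"treat_x_period2": -1.0000000001, "treat_x_period3": -3.0}
different = {"treat_x_period2": -1.2, "treat_x_period3": -3.0}

print(matches_benchmark(same, benchmark))
print(matches_benchmark(different, benchmark))
```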

Efficiency is measured by comparing the number of commands used in the ChatGPT-generated code with the number of commands in the standard reference code.

This is not a perfect measure of efficiency, but it provides a useful approximation.
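As a rough illustration of such a count, the Python sketch below treats every non-blank, non-comment line as one command. That rule is my assumption, not the counting rule used in the paper, and it will miscount statements that span several lines. The function names and the two Stata snippets are hypothetical.

```python
# Line prefixes treated as comments, per language (a simplification).
COMMENT_PREFIXES = {"python": ("#",), "r": ("#",), "stata": ("*", "//")}


def count_commands(code, language):
    """Count non-blank lines that do not start with a comment prefix."""
    prefixes = COMMENT_PREFIXES[language]
    return sum(
        1
        for line in code.splitlines()
        if line.strip() and not line.strip().startswith(prefixes)
    )


def command_ratio(generated, reference, language):
    """Ratio above 1 means the generated code uses more commands."""
    return count_commands(generated, language) / count_commands(reference, language)


# Hypothetical Stata snippets, for illustration only.
reference_code = """
use mydata, clear
regress y x
"""
generated_code = """
* load the data
use mydata, clear
generate x_sq = x^2
regress y x
"""
print(command_ratio(generated_code, reference_code, "stata"))
```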

For error output, the authors document whether the ChatGPT-generated code produces execution errors.

This is one of the most practical indicators. When code fails to run, the user must debug it. If the user does not understand the method or the programming language, this can become a major problem.
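For Python snippets, recording execution errors can be sketched as follows. This is a minimal illustration with a hypothetical helper name, `run_and_record`. It runs the snippet in the current process, so it should only be used on code you are willing to execute locally. R and Stata code would need to be run through their own interpreters instead.

```python
def run_and_record(code):
    """Execute a Python snippet and report whether it raised an error."""
    try:
        # compile() surfaces syntax errors; exec() runs in a fresh namespace.
        exec(compile(code, "<generated>", "exec"), {})
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return {"ok": True, "error": None}


# Hypothetical snippets standing in for generated code.
good = run_and_record("total = sum([1, 2, 3])")
bad = run_and_record("total = undefined_name + 1")
broken = run_and_record("def f(:")
print(good, bad, broken, sep="\n")
```

A snippet that returns `ok` here has only passed the weakest test: it ran. Whether its results are correct is a separate question.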

Editing refers to cases where the code does not produce an execution error but still requires clarification, additional context, or manual adjustment to obtain the correct output.

This is particularly important because not all errors are visible. A code block can run without crashing but still produce an incorrect model, a wrong variable transformation, or a misleading figure.

Consistency is assessed through replication. A fourth researcher repeats the tasks using the same prompts across Python, R, and Stata, with a new ChatGPT account and no prior conversation history.

The goal is to determine whether ChatGPT produces similar logic and structure when different users submit the same prompts.

This matters because reproducibility is central to research. If the same prompt produces very different code across sessions, researchers need to document and validate outputs carefully.
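The study assesses consistency through a human replication, not through an automated score. Still, when documenting outputs across sessions, a crude text similarity can flag pairs of scripts that deserve a closer look. The Python sketch below uses the standard library `difflib` for that purpose. The function name `code_similarity` and the two snippets are hypothetical, and a line-level similarity says nothing about whether two scripts are statistically equivalent.

```python
import difflib


def code_similarity(code_a, code_b):
    """Share of matching lines between two scripts, from 0.0 to 1.0.

    Blank lines and leading or trailing whitespace are ignored.
    """
    lines_a = [line.strip() for line in code_a.splitlines() if line.strip()]
    lines_b = [line.strip() for line in code_b.splitlines() if line.strip()]
    return difflib.SequenceMatcher(None, lines_a, lines_b).ratio()


# Hypothetical outputs from two sessions given the same prompt.
session_1 = """
import statistics
values = [1, 2, 3]
print(statistics.mean(values))
"""
session_2 = """
import statistics
values = [1, 2, 3]
print(sum(values) / len(values))
"""
similarity = code_similarity(session_1, session_2)
print(round(similarity, 3))
```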

What Did the Study Find?

The overall conclusion is balanced. Here is a table that summarizes the results.

Based on the study, ChatGPT performed better in Python and R than in Stata. The authors state that ChatGPT generated accurate code and results in R and Python for most tasks, while Stata was less reliable.

This result is not entirely surprising.

Python and R are widely used in data science, statistics, and machine learning. They also have large online communities, extensive documentation, and many publicly available code examples. Since large language models learn from large-scale text and code data, it is reasonable to expect them to perform better in languages with more abundant public examples.

That said, this interpretation should be treated carefully. The study is not a large-scale benchmark across thousands of tasks. It is a case study based on selected econometric problem sets. Therefore, we should not conclude that ChatGPT is universally better at Python or R than Stata in all contexts.

A more careful conclusion is this:

For the causal inference tasks tested in this study, ChatGPT appeared more reliable in Python and R than in Stata.

What the Rise of LLMs Has Changed in My Own Way of Working

What makes this study particularly interesting to me is that it does not address only a theoretical question. It directly connects with what I observe in my own work, both at home and in a professional setting. We used ChatGPT Professional 4.0 in the past, and today we use ChatGPT Professional 5.5. In this section, I want to explain how the adoption of these models has changed the way I work.

In the past, when I had to conduct a quantitative study or develop a statistical method, a large part of the work was spent on literature review. I had to identify the right scientific papers, understand the methods used, compare different approaches, and then decide how to apply them to our own data.

Today, with ChatGPT, this exploratory phase is much faster. It does not replace the critical reading of scientific papers, but it helps structure the initial research, identify key concepts more quickly, and formulate methodological questions more clearly.

The change has been even more visible in the workplace, especially in the way we use programming languages.

Previously, we mainly used SAS for data extraction, preparation, and processing. SAS remains a very efficient tool for handling large volumes of data in a professional environment. However, for statistical modeling, we often relied on R, which was more convenient for estimation, visualization, and methodological experimentation.

With the rise of LLMs, we gradually decided to move a large part of our work to Python. This decision was not only driven by the fact that Python is simple and widely used. It also came from a very practical observation: in our experience, tools like ChatGPT often provide better answers in Python, with fewer errors and more reusable examples.

We did not conduct a scientific study as structured as the one by Winberg et al., but we reached this conclusion through the feedback of the modelers in our team and as part of a long-term strategic choice. In practice, AI has influenced not only the way we write code but also the infrastructure we use. We moved from an environment centered on SAS Studio and RStudio to a workflow more oriented toward VS Code, because it integrates more easily with tools such as ChatGPT, Claude, and GitHub Copilot.

This shift may look technical, but it is actually quite deep. AI does not only improve productivity. It also influences the languages we choose, the tools we use, and the way we organize our workflows.

Another concrete example is the collection of external data. In our work, we sometimes need publicly available datasets: INSEE data, climate data, IPCC data, NGFS scenarios for climate stress testing, or other datasets used in ESG risk modeling.

In the past, this kind of task could take several days, sometimes even several weeks. We had to find the right source, understand the structure of the data, download the data, clean it, reformat it, and make it usable for our models. Today, with LLMs, this process can be significantly accelerated.

Recently, for example, I had to retrieve NAF codes from the INSEE website, together with their labels, in a format that could be used directly. In the past, this task would probably have taken me several hours. With a few well-structured prompts, I quickly obtained a script that retrieved the data, cleaned the codes, removed the dots, and produced an Excel file ready to use. This is not only a time gain. It also reduces the friction between an idea and its execution.
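The cleaning part of such a script can be sketched in a few lines of Python. This is not the script I obtained: the download step is left out, the function names and sample rows are placeholders, and the output is CSV because writing a real Excel file requires a third-party library such as openpyxl.

```python
import csv
import io


def clean_naf_code(code):
    """Remove dots and surrounding spaces from a NAF code, e.g. '62.01Z'."""
    return code.replace(".", "").strip().upper()


def naf_rows_to_csv(rows):
    """Return CSV text with a header and one cleaned (code, label) per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["code", "label"])
    for code, label in rows:
        writer.writerow([clean_naf_code(code), label.strip()])
    return buffer.getvalue()


# Placeholder rows standing in for the downloaded INSEE table.
sample_rows = [
    (" 62.01Z", "Programmation informatique "),
    ("01.11z", "Label placeholder"),
]
csv_text = naf_rows_to_csv(sample_rows)
print(csv_text)
```

Even for a small script like this, it is worth spot-checking a few cleaned codes against the source table before using the file in a model.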

In my opinion, this is one of the most important contributions of LLMs for statisticians and quantitative analysts. They are very useful for data processing, statistical modeling, mathematical programming, reporting, and formatting results.

They have also become valuable for producing deliverables: structuring documents, improving explanations, formatting tables, describing figures, and interpreting results. Earlier versions of ChatGPT still made many errors in these tasks, especially in technical reasoning and references. Recent models are much better, although they still require careful validation.

In my work, I see them more as very fast research assistants than as autonomous experts. They can do in a few hours what we might previously have assigned to a research assistant for several days: explore a method, propose code, generate a first version of a chart, rewrite an interpretation, or automate part of a report.

But this speed comes with one condition: human supervision and validation remain essential.

The risk of hallucination is not theoretical. A recent example made this very clear: according to the Financial Times, EY Canada withdrew a study used to promote its cybersecurity services after it was found to contain fabricated data, misattributed citations, and even a reference to a McKinsey report that did not exist.

This is exactly why I find the study by Winberg et al. interesting. It does not simply ask whether ChatGPT can write code. It points to a more important question: under what conditions can we trust AI-generated code?

For me, the answer is clear. We can use LLMs to work faster, but not to remove the responsibility of the researcher. The researcher still needs to check the assumptions, validate the data, test the code, compare the results with benchmarks, and make sure the interpretation is correct.

In other words, AI is deeply changing the way we work, but it does not remove the need for expertise. In fact, it makes expertise even more important. The more powerful the tool becomes, the more critical it is to know when to trust it and when not to.

Finally, the adoption of AI tools will continue to transform the way we work. Some processes will become more efficient, others will disappear, and more sophisticated workflows will emerge. To remain competitive, we need to keep learning, keep working, and be ready to integrate these tools into our professional lives.

At the same time, AI will also change the way knowledge is produced and shared. Because these tools improve productivity, an article that once required a month of work can now sometimes be completed in a week. This is a good thing in many ways: it lowers the barrier to writing, helps more people share ideas, and accelerates the circulation of knowledge.

But it also creates a new challenge. If everyone can produce more content faster, the internet will become even more crowded. The reach of each article may not be the same as before. Some writers may feel discouraged, especially if their work receives less visibility despite the effort behind it.

In my opinion, this will create a new form of inequality between those who know how to use AI effectively and those who do not, but also between those who write only to produce content and those who write because they truly care about the subject.

In the long run, I believe the people who remain will be those who are genuinely passionate, those who want to learn, think deeply, and share knowledge with others. AI can make writing faster, but it will not replace curiosity, discipline, and the desire to contribute something meaningful.

References

Winberg, D., Tsai, E., Tang, T., Xuan, D., Marchi, N., & Shi, L. (2026). Can AI write your code? A case study of ChatGPT’s statistical coding capabilities for quantitative research. Health Economics Review.
