ChatGPT a typical month-long internship drawback within the knowledge area. The issue in some sense bought “solved” however I’m unsure it means what I assumed it could. For knowledge and AI practitioners, that is now a really sensible query. Many groups use interns or analysis spikes to discover concepts: is AI ok now? Are these initiatives solely in regards to the closing artifact?
Interns as explorers
Constructing a tech roadmap at an early-stage knowledge startup just isn’t that completely different from a typical online game map:
The roadmap is way larger not solely than what you are able to do, but in addition what you may see. If solely we might peek over “the (product) horizon” by sending an explorer to clear the map, then we’d achieve some consciousness of what waits for you when you get there (the explorer could die, so the analogy is nice up to a degree).
Bauplan (which I co-founded in 2024) has made the bizarre (for its measurement) selection of operating summer time internships from high establishments (Columbia College, CMU, College of Wisconsin–Madison) to peek over the horizon. It has labored very nicely to date. Apart from a greater hiring funnel, group standing, and a few social clout, explorations made their way into our product, and can be strategic property as the corporate grows.
As I ship out internship provides for the summer time of 2026, half of my X feed is telling me that I’m going about all of it mistaken. Removed from being an hypothetical drawback, at completely different levels, sizes and constraints, all knowledge and AI groups in the present day are going through the identical query: is there now a greater method to do analysis spikes with brokers? If sure, what is an efficient, examined AI setup that’s simple to adapt?
Within the hope our expertise and perspective can be worthwhile for a lot of knowledge practitioners, that is our setup and the teachings we discovered from an actual analysis spike carried out by pairing with ChatGPT.
First, they got here for the consultants, and I didn’t communicate out…
At a time when AI threatens data staff, the junior positions appear to be those falling first. Why ought to McKinsey rent an Ivy League analyst when a $200 subscription produces extra studies, quicker? Recently, my feed appears to point that AI could also be coming after researchers too, with teachers attempting to automate themselves – “fully autonomous research from concept to paper” – and professors debating whether or not to hire assistants anymore.
There are apparent arguments to withstand the development. We will assault the consequence, and argue that the tech remains to be buggy so the promised “Ivy League” parity is simply not there. We will argue that the social contract is being damaged: certain, younger researchers have all the time been (in some sense) a “burden”, however that burden was each a method to pay it ahead and an funding within the subsequent era. We will additionally spotlight the potential long-term harm of changing a well-understood thought course of with a brand new, untested workflow.
Whereas there may be weight behind all of those arguments, parallel arguments might be superficially construed for the invention of vehicles or what-have-you. There’s all the time a time and place for these debates, however my curiosity in the present day is far more localized and private: what wouldn’t it really feel like if I have been to ditch my interns for a $200 subscription?
So (not not like this physics experiment I lately found) I attempted to squeeze the month of an intern right into a weekend with ChatGPT.
Whereas the precise drawback just isn’t terribly essential, scoping the internship could also be helpful to get a sense of the kind of issues interns do at Bauplan (be at liberty to skip!). Bauplan is a branching knowledge platform: brokers and people can open Git-like branches on their tables. Because of this, the identical desk could have completely different variations in several branches. In our motivating instance, Acme Inc. is a web-based retailer wherein a swarm of knowledge brokers is tasked with operating completely different predictions on tomorrow’s gross sales:

Ideally, a human would confirm the work, evaluate and distinction the findings, after which merge the predictions desk as the canonical knowledge illustration. However what if any individual asks a query earlier than this occurs?
Current programs would simply refuse to reply, even when this feels intuitively wasteful: two brokers computing month-to-month income could disagree on the precise determine, but each agree that income grew >10% quarter-over-quarter. In different phrases, even when there is no such thing as a system-wide agreed-upon model of knowledge, we might nonetheless reply many fascinating questions.
Our internship aim is then to construct a prototype of such a system. It requires studying about branching, selecting up new math, designing an answer on high of Bauplan, and constructing each a text-to-SQL module (easy-ish) and a customized question path (hard-ish).
The AI setup
Bauplaners lately had the privilege of seeing the one and solely Wes McKinney giving a reside demo of his setup, so I made a decision to undertake it (with some minor tweaks):
- ChatGPT 5.2 to plan and determine on methods (i.e. the way to design a benchmark highlighting the distinction between engineering approaches);
- Claude Code inside Visible Studio to do the precise improvement loop;
- Roborev to domestically evaluation commits adversarially. Powered by Codex, the critiques spotlight potential points and counsel enhancements;
- Roborev critiques to tame challenge complexity each 10 commits or so: these critiques take an architectural perspective and assist with slicing the bloat.
The true treasure was the buddies we made alongside the best way
Since I can not bear the concept of getting AI write for me (as a result of, to be sincere, I additionally can not bear the concept of interns doing it), I did the ultimate writing myself. As internships usually finish by sharing outcomes with the group, I ended up having sufficient issues for an ACM SAO paper, “Querying Everything Everywhere All at Once”.
By some metrics, the X crowd was proper: even granting a high quality mismatch, I babysat the AI for 48 hours to do, say, 80% of what would have taken weeks. Apparently, babysitting is of a unique nature: the AI is so desirous to please that it typically finally ends up “dishonest” to realize surface-level outcomes by hardcoded shortcuts. Whereas many knowledge and AI issues are superficially easy-to-verify, in our expertise they’re additionally easy-to-game: that is very true every time the interpretation of the experimental setup is nuanced, or the ultimate metric just isn’t easy: you need to triple verify in case your AI brokers is hill-climbing or simply pretending.
Alternatively, the AI doesn’t must be taught about Tarski’s fashions or fact gluts, as attaching just a few papers is sufficient to hit the bottom operating. The outcomes have been additionally “tangible”: I’ve a handsome internet app with out having to choose up D3.js once more (10 years after my final time!), and a demo script simulating agentic pipelines and enterprise questions over branches. In the event you imagine (as I do) that prototypes generally beat PowerPoint (or papers), there is no such thing as a doubt the AI stack delivered one thing.
What’s tougher to place into phrases is what was not delivered, or, to place it extra exactly, what I misplaced within the course of. For all the joy in regards to the unbelievable chart and the stunning benchmark, none of it actually produced extra understanding. I’m not wiser for having gone by the analysis movement: I’ve a bit extra instinct than earlier than (e.g. the way to higher immediate for good SQL translations), however my psychological fashions have largely the identical decision as after I began. Working with interns could also be time-consuming and typically even irritating, but it surely all the time produces higher ideas, in them and myself: by explaining and mentoring them, additionally they clarify and mentor me again in some sense.
If I now get outcomes with out studying a lot, I really feel uneasy largely as a result of it isn’t clear if that ought to matter. I don’t imply if it issues on a world, big-brain scale: in fact if our kids don’t be taught anymore and our scientists offload pondering to a chat, that’s dangerous. I’m now simply modestly targeted on this: does it matter for me, for my firm, for my buyers?
The native, private reply – until you’ve got a really inflated sense of your self – is much less clear-cut. I understand how to code and I might most likely nonetheless educate some mathematical logic, so in a way none of this challenge is breaking new floor anyway: maybe, there may be not that a lot to be taught right here (except for the feasibility of all of it, which in fact I suspected within the first place), and the uneasiness I really feel is the legacy of a previous mindset. Or, maybe, there is no such thing as a process too humble to change into a barely higher model of myself: doing the nitty-gritty work of connecting our APIs to a chart, failing to compile DataFusion 13 occasions, going forwards and backwards on the way to choose queries for a convincing benchmark the place no different system can categorical – not to mention compute – our question path. I really feel uneasy as a result of real-world initiatives for real-world, not-too-ego-inflated folks have a really giant floor of points which aren’t clearly first-principles pondering, nor apparent implementation particulars.
I’ve no drawback in the present day (tomorrow, we’ll see…) with the simplistic view that people ought to do the pondering and LLMs ought to repair matplotlib syntax. However I battle with the big gray space in between, and the internal voice whispering that by treating all the things as an implementation element, my ideas quickly received’t be sharp anymore. Are we turning into like these VCs who “sample match” and lose all of the nuances? Is the purpose of a proof proving a theorem (nonetheless alien-looking the proof may be), or giving us novel understanding?
The long run can wait (a bit)
Observing my choices (and never my emotions) for the summer time of 2026 does certainly reveal the results of this experiment. Bauplan has employed two (human) interns, two younger, proficient, motivated pc scientists in command of exploring the sting of our product map with regard to end-to-end AI optimization (expertise evolution with GEPA) and scaling git-for-data. From a sensible perspective, I made the identical resolution I’d have made earlier than this challenge. Nevertheless, I don’t imagine I bought out unchanged and unscathed from it: my emotions will sooner or later crystallize in new ideas after which affect my choices.
On the one hand, as an enormous fan of the Little Prince, it isn’t misplaced on me that it was the time wasted on that rose that made it essential: spending time with my interns this summer time will (I imagine) make them and our challenge collectively extra essential. On the opposite, this solely partially captures my vibe today. I needed to dig into the Web Archive to get well one thing I lately remembered from 2006 (mathematical logic just isn’t the one factor I keep in mind from my 20s, apparently). That is the #1 entry in Blender’s “50 Worst Things to Happen to Music”:
#01. KIDS TODAY
Again in our day, we didn’t have any of yer fancy iPods and ringtones and downloads. We didn’t have the luxurious and comfort of your scrotum-rings and your World Broad Net logs. After we wished to steal the brand new URIAH HEEP album, we couldn’t simply troll the Internets for it, we needed to do it the old style method—by mountaineering to the shop (uphill, each methods) and shoving 12″ of vinyl below our sweaters (which we needed to knit ourselves). That’s why you sniveling whipper-snappers don’t respect the true worth of music. Or Uriah Heep. Now get the hell off our garden!
Will we nonetheless respect the “actual worth of issues” if we are able to now “steal them” from the consolation of our laptops?
See you, agentic cowboys
Because of Luca, Colin, Ethan for his or her feedback on a earlier draft of this text.
If you wish to be a Bauplan intern and do cool data-and-AI stuff (like this or this or this), I nonetheless settle for human candidates: get in contact!

