on a project where I needed to transform 100 messy compliance PDFs into structured JSON rules.
The brute-force approach was obvious: give the agent the source text, explain the task, provide examples, and ask it to generate the rules. Because it was the lowest-hanging fruit, I tried it first.
At a glance, the output looked fine. The output JSON was valid and matched what I expected.
But as I was manually sampling the results to check for accuracy, the cracks appeared. Some rules were too broad, others were missed. Some rules failed to preserve the nuances of the original text. I tried using another agent to catch and fix the errors, but with such a large corpus, it was impossible to confidently verify the output.
That was the frustrating part. The errors weren’t obvious. This was far too fragile an implementation to scale.
Although I can’t share the exact implementation details, what I can share are the architectural lessons I learnt and how I eventually implemented it. Hopefully, these insights will be useful if you’re building AI systems that need to scale, stay reliable, and deal with messy data. And if you have better ways of doing things, do reach out to chat!
Okay, let’s get to it.
The problem
The 100 PDFs I worked with had already been parsed and chunked before they reached me. But the raw content was still messy. There were bullet points, tables, OCR artefacts, translated sections, semi-structured headings, footers, headers, inconsistent formatting and document-specific quirks.
I chose to use an agent because deciding what mattered required semantic judgement. The documents didn’t follow one consistent pattern, so relevance couldn’t be determined through simple rules alone.
You had to understand the surrounding context. None of this was difficult when done on a small chunk of data. The challenge was doing it reliably at scale.
The generated rules were then passed to a downstream system to be evaluated deterministically.
What eventually worked
After a few experiments, I realised the biggest improvement didn’t come from a better prompt, a new tool, an MCP server, or a more sophisticated agent harness.
It came from changing the shape of the problem.
Instead of trying to make the agent smarter, I made the agent’s job smaller.
The first change was to prepare the source data upfront. Instead of asking the agent to query a database, retrieve information, decide whether it had the right inputs, and then perform the extraction, I gave it a more controlled starting point.
In my case, that meant temporarily storing the relevant raw data locally.
This may not always be practical. But the underlying principle is to reduce the amount of retrieval uncertainty the agent has to deal with. If the agent’s job is to reason over content, don’t also make it responsible for figuring out whether it has found the right content.
Another option would be to prepare the query upfront.
I also used a script to strip away unnecessary metadata and fields before passing the raw content to the agent. Less irrelevant context meant fewer distractions, fewer chances for the agent to latch onto the wrong details and a cleaner reasoning task overall.
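To make the idea concrete, here is a minimal sketch of that kind of pre-processing step. It is not the actual script from the project: the chunk layout and the field names (chunk_id, heading, text, embedding, parser_version, page_bbox) are hypothetical, and the allow-list approach is just one reasonable way to do it.

```python
from typing import Any

# Hypothetical allow-list: the only fields the agent needs to see.
KEEP_FIELDS = ("chunk_id", "heading", "text")


def strip_chunk(chunk: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the chunk containing only allow-listed fields."""
    return {key: chunk[key] for key in KEEP_FIELDS if key in chunk}


def prepare_document(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Strip every chunk and drop chunks that have no usable text."""
    stripped = (strip_chunk(chunk) for chunk in chunks)
    return [chunk for chunk in stripped if chunk.get("text", "").strip()]


# Illustrative input: parser metadata the agent has no use for.
raw_chunks = [
    {
        "chunk_id": "c1",
        "heading": "Scope",
        "text": "Applies to all vendors.",
        "embedding": [0.1, 0.2],
        "parser_version": "1.4",
        "page_bbox": [0, 0, 1, 1],
    },
    {"chunk_id": "c2", "heading": "", "text": "   ", "parser_version": "1.4"},
]

clean_chunks = prepare_document(raw_chunks)
print(clean_chunks)
```

An allow-list tends to be safer than a deny-list here, because any new metadata field added upstream is excluded by default instead of leaking into the agent's context.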
But the most important change was the unit of work.
Instead of processing everything at once, I worked iteratively and processed one document at a time.
That made each task smaller, easier to inspect, easier to retry, and easier to audit. I spun up 5 subagents to process documents in parallel, with each agent logging its progress to a file.
If one document failed, I could retry only that document. If one output had formatting issues, I could fix that specific case without rerunning the whole batch. If the pipeline stopped halfway, the cached progress meant it could resume from the last successful checkpoint.
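The retry and resume behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the real pipeline: run_batch, load_done, the progress.jsonl file name and the fake_agent stand-in are all hypothetical, and a thread pool stands in for whatever mechanism actually launches the subagents.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def load_done(progress_file: Path) -> set[str]:
    """Read the IDs of documents that already finished successfully."""
    if not progress_file.exists():
        return set()
    lines = progress_file.read_text(encoding="utf-8").splitlines()
    return {json.loads(line)["doc_id"] for line in lines if line.strip()}


def run_batch(
    doc_ids: list[str],
    process_one: Callable[[str], list[dict]],
    out_dir: Path,
    workers: int = 5,
) -> list[str]:
    """Process pending documents in parallel; return the IDs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    progress_file = out_dir / "progress.jsonl"
    done = load_done(progress_file)
    pending = [doc_id for doc_id in doc_ids if doc_id not in done]
    failed: list[str] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_one, doc_id): doc_id for doc_id in pending}
        for future in as_completed(futures):
            doc_id = futures[future]
            try:
                rules = future.result()
            except Exception:
                # One bad document must not take down the batch.
                failed.append(doc_id)
                continue
            # Files are written from this single loop, so appends do not race.
            (out_dir / f"{doc_id}.json").write_text(json.dumps(rules), encoding="utf-8")
            with progress_file.open("a", encoding="utf-8") as handle:
                handle.write(json.dumps({"doc_id": doc_id}) + "\n")
    return sorted(failed)


def fake_agent(doc_id: str) -> list[dict]:
    """Stand-in for the real agent call; fails for one document."""
    if doc_id == "doc_3":
        raise RuntimeError("simulated agent failure")
    return [{"doc_id": doc_id, "rule": "placeholder"}]


work_dir = Path(tempfile.mkdtemp())
doc_ids = [f"doc_{i}" for i in range(1, 7)]
first_failed = run_batch(doc_ids, fake_agent, work_dir)
completed = load_done(work_dir / "progress.jsonl")
print("failed:", first_failed, "completed:", sorted(completed))
```

Calling run_batch again with the same output directory skips every document already recorded in the progress file, so only the failed or unprocessed ones are sent to the agent.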
This was also where the separation of responsibilities became clearer.
The agent handled the semantic work: understanding the content, identifying the relevant parts and writing the JSON output.
The surrounding code handled the mechanical parts: parallelising jobs, enforcing the schema, generating IDs, writing files, caching progress, validating references, and checking whether the output could be traced back to the original source.
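As an illustration of two of those mechanical parts, here is a small sketch of schema enforcement and ID generation done in code instead of by the agent. The rule fields (rule_text, source_chunk_id, source_quote), the hash-based ID scheme and the function names are assumptions made for the example, not the real schema.

```python
import hashlib
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"rule_text": str, "source_chunk_id": str, "source_quote": str}


def validate_rule(rule: object) -> list[str]:
    """Return a list of schema problems; an empty list means the rule is valid."""
    if not isinstance(rule, dict):
        return ["rule is not an object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        value = rule.get(field)
        if not isinstance(value, expected):
            errors.append(f"{field}: expected {expected.__name__}")
        elif not value.strip():
            errors.append(f"{field}: empty")
    return errors


def make_rule_id(doc_id: str, rule: dict) -> str:
    """Derive a stable ID in code, so the agent never has to invent one."""
    key = f"{doc_id}|{rule['source_chunk_id']}|{rule['rule_text']}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{doc_id}-{digest[:12]}"


def finalise(doc_id: str, agent_output: str) -> tuple[list[dict], list[dict]]:
    """Parse the agent's JSON and split it into accepted and rejected rules."""
    parsed = json.loads(agent_output)  # invalid JSON raises; the caller can retry
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array of rules")
    accepted, rejected = [], []
    for rule in parsed:
        errors = validate_rule(rule)
        if errors:
            rejected.append({"rule": rule, "errors": errors})
        else:
            accepted.append({"rule_id": make_rule_id(doc_id, rule), **rule})
    return accepted, rejected


agent_output = json.dumps([
    {
        "rule_text": "Vendors must retain records for five years.",
        "source_chunk_id": "c1",
        "source_quote": "retain records for five years",
    },
    {"rule_text": "Rule with no source reference."},
])
accepted, rejected = finalise("doc_1", agent_output)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Because the ID is a hash of the document, chunk and rule text, rerunning the same document yields the same IDs, which keeps retries from producing duplicates downstream.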
I also had an orchestrator watch over the progress of the script.
Making the output auditable
A useful design decision was adding reference IDs to every generated rule. This meant that each output item pointed back to a specific source.
This made the output easier to audit. Instead of asking, “Does this generated rule look right?”, I could ask more precise questions such as: does the referenced source chunk exist? Is the quoted source text actually present in that chunk?
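Those two questions are cheap to check in code. Here is a minimal sketch, assuming each rule carries a hypothetical source_chunk_id and source_quote and that the chunks are available as a mapping from ID to text. The whitespace and case normalisation is an added assumption to tolerate line wraps, not something the original workflow is known to do.

```python
import re


def normalise(text: str) -> str:
    """Collapse whitespace and case so line wraps do not cause false alarms."""
    return re.sub(r"\s+", " ", text).strip().lower()


def audit_rule(rule: dict, chunks: dict[str, str]) -> list[str]:
    """Return the audit problems for one rule; an empty list means it passed."""
    chunk_text = chunks.get(rule["source_chunk_id"])
    if chunk_text is None:
        return [f"unknown chunk: {rule['source_chunk_id']}"]
    quote = normalise(rule["source_quote"])
    if not quote:
        return ["empty quote"]
    if quote not in normalise(chunk_text):
        return ["quote not found in referenced chunk"]
    return []


# Illustrative data only.
chunks = {"c1": "Vendors must retain\n   records for five years."}
rules = [
    {"rule_id": "r1", "source_chunk_id": "c1", "source_quote": "retain records for five years"},
    {"rule_id": "r2", "source_chunk_id": "c9", "source_quote": "retain records"},
    {"rule_id": "r3", "source_chunk_id": "c1", "source_quote": "retain records for ten years"},
]

report = {rule["rule_id"]: audit_rule(rule, chunks) for rule in rules}
for rule_id, problems in report.items():
    print(rule_id, "OK" if not problems else problems)
```

A check like this only proves that the quote exists where the rule says it does. Whether the rule interprets the quote faithfully still needs a human or a second agent.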
I could also get another agent to selectively run audits on larger and more complex documents to ensure that important nuances were preserved.
On top of that, I did a lightweight version of evals. I ran a small batch of raw documents through the workflow and manually reviewed the results for coverage and accuracy. A full golden dataset was not practical for the scope of this task, but I still needed a way to prove to myself that the workflow was working.
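One simple way to turn a manual review like this into numbers is to tally the reviewer's verdicts per document. The sketch below is illustrative only: the DocReview fields, the sample counts and the exact definitions of accuracy and coverage are assumptions, not the metrics actually used.

```python
from dataclasses import dataclass


@dataclass
class DocReview:
    doc_id: str
    correct: int    # generated rules the reviewer judged faithful to the source
    incorrect: int  # generated rules judged too broad, wrong, or lossy
    missed: int     # rules present in the source but absent from the output


def summarise(reviews: list[DocReview]) -> dict[str, float]:
    """Aggregate manual review tallies into accuracy and coverage."""
    correct = sum(review.correct for review in reviews)
    incorrect = sum(review.incorrect for review in reviews)
    missed = sum(review.missed for review in reviews)
    generated = correct + incorrect  # everything the workflow produced
    expected = correct + missed      # everything the reviewer expected to see
    return {
        "accuracy": correct / generated if generated else 0.0,
        "coverage": correct / expected if expected else 0.0,
    }


# Made-up tallies for two reviewed documents.
reviews = [
    DocReview("doc_1", correct=8, incorrect=1, missed=1),
    DocReview("doc_2", correct=10, incorrect=1, missed=3),
]
summary = summarise(reviews)
print(summary)
```

Tracking the two numbers separately matters because they fail differently: overly broad rules hurt accuracy, while skipped rules hurt coverage.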
My goal was not to build a perfect benchmark but to make the system auditable enough that I could inspect the outputs, catch failures, and iterate towards a higher accuracy bar.
If you’ve got ideas on how I could have done this better, let me know!
My biggest takeaway
The pattern that worked was to stop treating the LLM as the whole system.
The system became more reliable not because the agent became perfect, but because the workflow made its outputs easier to trace, validate, and recover from.
Coincidentally, I was building this shortly before attending the inaugural AI Engineer Singapore conference, held from 15–17 May 2026.
On the last day, JJ Geewax, Director of Applied AI at Google DeepMind, shared a framing that captured what I had been learning the hard way: we need to stop using LLMs like giant problem solvers.
That resonated with me because it’s such an easy trap to fall into. It’s easy to just give the model the data, schema, business rules, edge cases, and the responsibility to verify itself, and then get frustrated when the result is inconsistent.
But for reliable production systems, the better pattern is usually a hybrid. Let the agent handle the parts that require semantic judgement, and let code handle the parts that require structure, validation, and control.
I’ll be sharing more reflections from AI Engineer Singapore and the workshops I attended. A YouTube snippet of JJ’s talk is here.
That’s all from me. I hope this helped, and see you in the next article 🙂

