As data professionals, we’re comfortable with tabular data…
We can also handle words, JSON, XML feeds, and pictures of cats. But what about a cardboard box full of things like this?

The data on this receipt wants so badly to be in a tabular database somewhere. Wouldn’t it be great if we could scan all these, run them through an LLM, and save the results in a table?
Lucky for us, we live in the era of Document AI. Document AI combines OCR with LLMs and allows us to build a bridge between the paper world and the digital database world.
All the major cloud vendors have some version of this…
Here I’ll share my thoughts on Snowflake’s Document AI. Aside from using Snowflake at work, I have no affiliation with Snowflake. They didn’t commission me to write this piece and I’m not part of any ambassador program. All of that is to say I can write an unbiased review of Snowflake’s Document AI.
What is Document AI?
Document AI allows users to quickly extract information from digital documents. When we say “documents” we mean pictures with words. Don’t confuse this with niche NoSQL things.
The product combines OCR and LLM models so that a user can create a set of prompts and execute those prompts against a large collection of documents all at once.

LLMs and OCR both have room for error. Snowflake solved this by (1) banging their heads against OCR until it’s sharp (I see you, Snowflake developer) and (2) letting me fine-tune my LLM.
Fine-tuning the Snowflake LLM feels a lot more like glamping than some rugged outdoor adventure. I review 20+ documents, hit the “train model” button, then rinse and repeat until performance is satisfactory. Am I even a data scientist anymore?
Once the model is trained, I can run my prompts on 1000 documents at a time. I like to save the results to a table, but you could do whatever you want with the results in real time.
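To make “save the results to a table” a bit more concrete, here is a minimal Python sketch of flattening one extraction result into a single row. The JSON shape used here (one key per prompt holding a list of candidate answers with a `score` and an optional `value`, plus a `__documentMetadata` entry carrying an OCR score) is my assumption about what the model returns, and the prompt names `patient_name` and `expiration_date` are hypothetical.

```python
import json


def flatten_result(file_name, raw_json):
    """Flatten one extraction result into a dict, keeping the top-scoring answer per prompt."""
    parsed = json.loads(raw_json)
    row = {"file_name": file_name}
    for key, answers in parsed.items():
        if key == "__documentMetadata":  # assumed metadata key
            row["ocr_score"] = answers.get("ocrScore")
            continue
        # Each prompt is assumed to return a list of {"score": ..., "value": ...} candidates.
        best = max(answers, key=lambda a: a.get("score", 0.0), default=None)
        row[key] = best.get("value") if best else None
        row[f"{key}_score"] = best.get("score") if best else None
    return row


# Made-up example payload; the second prompt found no value on the page.
example = json.dumps({
    "__documentMetadata": {"ocrScore": 0.93},
    "patient_name": [{"score": 0.98, "value": "Jane Doe"}],
    "expiration_date": [{"score": 0.41}],
})
row = flatten_result("flu_shot_001.pdf", example)
```

Keeping the score next to each value means the downstream logic can decide for itself how much to trust an answer.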
Why does it matter?
This product is cool for several reasons.
- You can build a bridge between the paper and digital world. I never thought the big box of paper invoices under my desk would make it into my cloud data warehouse, but now it can. Scan the paper invoice, upload it to Snowflake, run my Document AI model, and wham! I have my desired information parsed into a tidy table.
- It’s frighteningly convenient to invoke a machine-learning model via SQL. Why didn’t we think of this sooner? In the past this was a few hundred lines of code to load the raw data (SQL >> python/spark/etc.), clean it, engineer features, train/test split, train a model, make predictions, and then often write the predictions back into SQL.
- To build this in-house would be a major undertaking. Yes, OCR has been around a long time but can still be finicky. Fine-tuning an LLM obviously hasn’t been around too long, but is getting easier by the week. To piece these together in a way that achieves high accuracy for a variety of documents could take a long time to hack on your own. Months and months of polish.
Of course some elements are still built in house. Once I extract information from the document I have to figure out what to do with that information. That’s relatively quick work, though.
Our Use Case: Bring on Flu Season
I work at a company called IntelyCare. We operate in the healthcare staffing space, which means we help hospitals, nursing homes, and rehab centers find quality clinicians for individual shifts, extended contracts, or full-time/part-time engagements.
Many of our facilities require clinicians to have an up-to-date flu shot. Last year, our clinicians submitted over 10,000 flu shots in addition to hundreds of thousands of other documents. We manually reviewed all of these to ensure validity. Part of the joy of working in the healthcare staffing world!
Spoiler alert: Using Document AI, we were able to reduce the number of flu-shot documents needing manual review by ~50%, and all in just a couple of weeks.
To pull this off, we did the following:
- Uploaded a pile of flu-shot documents to Snowflake.
- Massaged the prompts, trained the model, massaged the prompts some more, retrained the model some more…
- Built out the logic to compare the model output against the clinician’s profile (e.g. do the names match?). Definitely some trial and error here with formatting names, dates, etc.
- Built out the “decision logic” to either approve the document or send it back to the humans.
- Tested the full pipeline on a bigger pile of manually reviewed documents. Took a close look at any false positives.
- Repeated until our confusion matrix was satisfactory.
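The matching and decision steps in the list above can be sketched roughly as follows. This is not our production code: the field names, the list of date formats, the `min_score` threshold, and the `full_name` profile key are all hypothetical, and real names need more care than this normalization gives them.

```python
import re
import unicodedata
from datetime import date, datetime

# Hypothetical set of formats; extend it as new date styles show up in testing.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%B %d, %Y", "%b %d, %Y")


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and sort the parts so 'Doe, Jane' equals 'Jane Doe'."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return " ".join(sorted(re.findall(r"[a-z]+", ascii_name.lower())))


def parse_date(text):
    """Try each known format; return a date, or None if nothing fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def decide(extracted, profile, today, min_score=0.9):
    """Approve only when every check passes; anything doubtful goes back to the humans."""
    name = extracted.get("patient_name")
    expires = parse_date(extracted.get("expiration_date") or "")
    scores_ok = all(
        (extracted.get(f"{field}_score") or 0.0) >= min_score
        for field in ("patient_name", "expiration_date")
    )
    name_ok = bool(name) and normalize_name(name) == normalize_name(profile["full_name"])
    date_ok = expires is not None and expires >= today
    return "approve" if (scores_ok and name_ok and date_ok) else "human_review"


# Made-up example record and profile.
extracted = {
    "patient_name": "DOE, Jane", "patient_name_score": 0.97,
    "expiration_date": "06/30/2025", "expiration_date_score": 0.95,
}
decision = decide(extracted, {"full_name": "Jane Doe"}, today=date(2025, 1, 15))
```

Defaulting to `human_review` whenever a check fails or a value is missing is deliberate: a wrong approval is the expensive mistake, while an unnecessary human review costs nothing new.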
For this project, false positives pose a business risk. We don’t want to approve a document that’s expired or missing key information. We kept iterating until the false-positive rate hit zero. We’ll have some false positives eventually, but fewer than what we have now with a human review process.
False negatives, however, are harmless. If our pipeline doesn’t like a flu shot, it simply routes the document to the human team for review. If they go on to approve the document, it’s business as usual.
The model does well with the clean/easy documents, which account for ~50% of all flu shots. If it’s messy or confusing, it goes back to the humans as before.
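For anyone who wants to track the same numbers, here is a small Python sketch of tallying a confusion matrix from a pile of manually reviewed documents. It assumes each document has been reduced to a pair of booleans (what the pipeline decided, what the human reviewer decided); the sample counts are made up for illustration and are not our results.

```python
from collections import Counter


def confusion_counts(pairs):
    """Tally (pipeline_approved, human_approved) pairs into tp/fp/fn/tn counts."""
    counts = Counter(tp=0, fp=0, fn=0, tn=0)
    for pipeline_approved, human_approved in pairs:
        if pipeline_approved:
            counts["tp" if human_approved else "fp"] += 1
        else:
            counts["fn" if human_approved else "tn"] += 1
    return counts


def false_positive_rate(counts):
    """Share of documents the humans rejected that the pipeline approved anyway."""
    negatives = counts["fp"] + counts["tn"]
    return counts["fp"] / negatives if negatives else 0.0


def auto_approved_share(counts):
    """Share of all documents the pipeline approved without human review."""
    total = sum(counts.values())
    return (counts["tp"] + counts["fp"]) / total if total else 0.0


# Made-up sample: 5 correct approvals, 3 valid documents sent to humans, 2 correct rejections.
sample = [(True, True)] * 5 + [(False, True)] * 3 + [(False, False)] * 2
counts = confusion_counts(sample)
```

In this framing the `fn` bucket is just extra work for the human team, while anything landing in `fp` deserves a close look.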
Things we learned along the way
- The model does best at reading the document, not making decisions or doing math based on the document.
Initially, our prompts tried to determine the validity of the document.
Bad: Is the document already expired?
We found it far more effective to limit our prompts to questions that could be answered by looking at the document. The LLM doesn’t determine anything. It just grabs the relevant data points off the page.
Good: What is the expiration date?
Save the results and do the math downstream.
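As a small illustration of “doing the math downstream”, the sketch below takes the answer to the “good” prompt and works out how many days remain. It assumes the extracted date has already been cleaned up into ISO format (`YYYY-MM-DD`), which will not be true of raw model output.

```python
from datetime import date


def days_until_expiry(expiration_text, today):
    """Return days remaining (negative if already expired), or None if the text is not a date."""
    try:
        expires = date.fromisoformat(expiration_text.strip())
    except (ValueError, AttributeError):
        # Unparseable or missing answer: let a human look at it.
        return None
    return (expires - today).days


remaining = days_until_expiry("2025-06-30", today=date(2025, 6, 1))
```

The model only ever answers “what does the page say?”; whether that date is in the past is plain arithmetic that is easy to test.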
- You still need to be thoughtful about training data
We had a few duplicate flu shots from one clinician in our training data. Call this clinician Ben. One of our prompts was, “What is the patient’s name?” Because “Ben” was in the training data multiple times, any remotely unclear document would return with “Ben” as the patient name.
So overfitting is still a thing. Over/under sampling is still a thing. We tried again with a more thoughtful collection of training documents and things went much better.
Document AI is pretty magical, but not that magical. Fundamentals still matter.
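A quick sanity check along these lines might have caught our “Ben” problem before training. This is a sketch only: it assumes you can list the labeled value for one prompt across the training set, and the `max_share` threshold is an arbitrary number you would tune.

```python
from collections import Counter


def overrepresented_values(labels, max_share=0.1):
    """Return label values that make up more than max_share of the training set."""
    counts = Counter(label.strip().lower() for label in labels if label)
    total = sum(counts.values())
    return {value: n for value, n in counts.items() if total and n / total > max_share}


# Made-up training labels: one clinician appears four times out of twenty documents.
names = ["Ben"] * 4 + [f"Clinician {i}" for i in range(16)]
suspects = overrepresented_values(names)
```

Anything the check flags is a candidate for deduplication or down-sampling before the next training run.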
- The model could be fooled by writing on a napkin.
To my knowledge, Snowflake doesn’t have a way to render the document image as an embedding. You can create an embedding from the extracted text, but that won’t tell you if the text was written by hand or not. As long as the text is valid, the model and downstream logic will give it a green light.
You could fix this pretty easily by comparing image embeddings of submitted documents to the embeddings of accepted documents. Any document with an embedding way out in left field is sent back for human review. This is straightforward work, but you’ll have to do it outside Snowflake for now.
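Here is a rough Python sketch of that idea, which we have not built. It assumes you already have image embeddings from some model outside Snowflake (the three-dimensional vectors below are toy stand-ins), compares each new document to the centroid of accepted ones using cosine similarity, and uses an arbitrary `min_similarity` threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def needs_human_review(embedding, accepted_embeddings, min_similarity=0.8):
    """Flag documents whose embedding sits far from the typical accepted document."""
    return cosine_similarity(embedding, centroid(accepted_embeddings)) < min_similarity


# Toy vectors standing in for real image embeddings.
accepted = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
typical_doc = [0.85, 0.15, 0.05]
napkin = [0.0, 0.1, 1.0]
flags = (needs_human_review(typical_doc, accepted), needs_human_review(napkin, accepted))
```

A single centroid is the simplest possible baseline; if accepted documents come in several distinct layouts, comparing against the nearest few accepted embeddings would probably work better.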
- Not as expensive as I was expecting
Snowflake has a reputation for being spendy. And for HIPAA compliance reasons we run a higher-tier Snowflake account for this project. I tend to worry about running up a Snowflake tab.
In the end we had to try extra hard to spend more than $100/week while training the model. We ran thousands of documents through the model every few days to measure its accuracy while iterating on the model, but never managed to break the budget.
Better still, we’re saving money on the manual review process. The cost of the AI reviewing 1000 documents (of which it approves ~500) is ~20% of what we spend on humans reviewing the remaining 500. All in, a 40% reduction in costs for reviewing flu shots.
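The 40% figure follows from the numbers above, and the arithmetic is spelled out in the Python sketch below. It normalizes the cost of humans reviewing every document to 1 and assumes every document costs the same to review by hand, which is a simplification.

```python
def cost_reduction(auto_approved_share, ai_cost_vs_remaining_human_cost):
    """Fractional savings versus all-human review, with the all-human cost normalized to 1."""
    remaining_human_cost = 1.0 - auto_approved_share
    ai_cost = ai_cost_vs_remaining_human_cost * remaining_human_cost
    return 1.0 - (remaining_human_cost + ai_cost)


# ~50% auto-approved, AI cost ~20% of the remaining human review cost.
savings = cost_reduction(0.5, 0.2)
```

Humans still cost 0.5 for the half the model sends back, the AI adds 0.2 × 0.5 = 0.1, and the total of 0.6 is 40% below the original cost.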
Summing up
I’ve been impressed with how quickly we could complete a project of this scope using Document AI. We’ve gone from months to days. I give it 4 stars out of 5, and am open to giving it a fifth star if Snowflake ever gives us access to image embeddings.
Since flu shots, we’ve deployed similar models for other documents with similar or better results. And with all this prep work, instead of dreading the upcoming flu season, we’re ready to bring it on.