Vision language models are powerful models that take images as inputs, instead of text like traditional LLMs. This opens up a lot of possibilities, since we can directly process the contents of a document, instead of using OCR to extract text and then feeding that text into an LLM.
In this article, I'll discuss how you can apply vision language models (VLMs) for long-context document understanding tasks. This means applying VLMs either to very long documents of over 100 pages, or to very dense documents that contain a lot of information, such as drawings. I'll discuss what to consider when applying VLMs, and what kinds of tasks you can perform with them.
Why do we need VLMs?
I've discussed VLMs a lot in my earlier articles, and covered why they're so important for understanding the contents of some documents. The main reason VLMs are required is that a lot of the information in documents requires visual input to understand.
The alternative to VLMs is to use OCR, and then use an LLM. The problem here is that you're only extracting the text from the document, and not including the visual information, such as:
- Where different text is located relative to other text
- Non-text information (essentially everything that isn't a letter, such as symbols or drawings)
- Where text is located relative to other information
This information is often critical to truly understand the document, so you're often better off using VLMs directly: you feed in the image itself, and the model can therefore also interpret the visual information.
For long documents, using VLMs is a challenge, since you need a lot of tokens to represent visual information. Processing hundreds of pages is thus a big challenge. However, with many recent advances in VLM technology, the models have become better and better at compressing visual information into reasonable context lengths, making it feasible to apply VLMs to long documents for document understanding tasks.

OCR using VLMs
One good option for processing long documents while still including the visual information is to use VLMs to perform OCR. Traditional OCR engines like Tesseract only extract the text directly from documents, together with the bounding box of the text. However, VLMs are also trained to perform OCR, and can perform more advanced text extraction, such as:
- Extracting Markdown
- Explaining purely visual information (i.e., if there's a drawing, explain the drawing in text)
- Adding missing information (i.e., if there's a box labeled Date followed by a blank space, you can tell the OCR to mark the Date field as empty)
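To make this concrete, here is a minimal sketch of how such an OCR request could be built for an OpenAI-compatible chat endpoint. The model name, the prompt wording, and the exact message schema are assumptions on my part; adapt them to whichever VLM and API you actually use:

```python
import base64


def build_vlm_ocr_request(image_bytes: bytes, model: str = "qwen3-vl") -> dict:
    """Build a chat-completions style request asking a VLM to act as OCR.

    A sketch under assumed API conventions: the image is sent inline as a
    base64 data URL, and the instruction asks for Markdown output, visual
    descriptions, and explicit markers for empty fields.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    instruction = (
        "Extract all text from this page as Markdown. "
        "Preserve headers, tables, and bold text. "
        "If a region is purely visual (e.g. a drawing), describe it in one sentence. "
        "If a labeled field is blank, mark it as [EMPTY]."
    )
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

The returned dictionary can then be sent as the JSON body of a chat-completions call, one request per page image.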
Recently, DeepSeek released a powerful VLM-based OCR model, which has gotten a lot of attention and traction lately, making VLMs for OCR more popular.
Markdown
Markdown is very powerful, since you extract formatted text. This allows the model to:
- Show headers and subheaders
- Represent tables accurately
- Mark bold text
This allows the model to extract more representative text, which more accurately depicts the text contents of the documents. If you then apply LLMs to this text, the LLMs will perform far better than if you applied them to plain text extracted with traditional OCR.
LLMs perform better on formatted text like Markdown than on plain text extracted using traditional OCR.
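The table case shows the difference most clearly. Below is the same small (made-up) table as traditional OCR might flatten it, and as a VLM can extract it in Markdown; the row and column structure survives only in the second form:

```python
# Traditional OCR: words in reading order, table structure lost.
flat_ocr = "Model Accuracy GPT-5 0.91 Qwen 3 VL 0.89"

# VLM OCR asked for Markdown: rows and columns are preserved,
# so a downstream LLM can tell which score belongs to which model.
markdown_ocr = """\
| Model     | Accuracy |
|-----------|----------|
| GPT-5     | 0.91     |
| Qwen 3 VL | 0.89     |
"""
```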
Clarify visible info
Another thing you can use VLM OCR for is to explain visual information. For example, if you have a drawing with no text in it, traditional OCR wouldn't extract any information, since it's only trained to extract text characters. However, you can use VLMs to explain the visual contents of the image.
Imagine you have the following document:
This is the introduction text of the document
[an image of the Eiffel Tower]
This is the conclusion of the document
If you applied traditional OCR like Tesseract, you would get the following output:
This is the introduction text of the document
This is the conclusion of the document
This is clearly an issue, since you're not including information about the image showing the Eiffel Tower. Instead, you should use VLMs, which could output something like:
This is the introduction text of the document
This image depicts the Eiffel Tower during the day
This is the conclusion of the document
If you used an LLM on the first text, it of course wouldn't know that the document contains an image of the Eiffel Tower. However, if you used an LLM on the second text, extracted with a VLM, the LLM would naturally be better at answering questions about the document.
Add missing information
You can also prompt VLMs to output content when there is missing information. To understand this concept, look at the image below:

If you applied traditional OCR to this image, you would get:
Address Road 1
Date
Company Google
However, it would be more representative if you used a VLM, which, if instructed, could output:
Address Road 1
Date [EMPTY]
Company Google
This is more informative, because we're informing any downstream model that the date field is empty. If we don't provide this information, it's impossible to know later whether the date is simply missing, the OCR wasn't able to extract it, or something else went wrong.
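One way to implement this is to instruct the model to emit an explicit marker for blank fields, and then keep that marker when parsing the output downstream. The prompt wording and the `[EMPTY]` marker below are my own choices for illustration, not a standard:

```python
FORM_PROMPT = (
    "Extract every labeled field on this form as `label: value` lines. "
    "If a field is blank, write `label: [EMPTY]` instead of skipping it."
)


def parse_fields(ocr_output: str) -> dict:
    """Parse `label: value` lines from the model's output, keeping the
    explicit [EMPTY] markers so downstream code can distinguish a blank
    field from a missing one. A minimal sketch; real form output will
    need more robust parsing.
    """
    fields = {}
    for line in ocr_output.splitlines():
        if ":" in line:
            label, value = line.split(":", 1)
            fields[label.strip()] = value.strip()
    return fields
```

With this convention, `fields["Date"] == "[EMPTY]"` tells you the form had a Date box that was left blank, rather than no Date box at all.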
However, OCR using VLMs still suffers from some of the issues that traditional OCR struggles with, because the downstream model is not processing the visual information directly. You've probably heard the saying that an image is worth a thousand words, which often holds true for processing visual information in documents. Yes, you can produce a text description of a drawing with a VLM as OCR, but this text will never be as descriptive as the drawing itself. Thus, I argue that in many cases you're better off processing the documents directly with VLMs, as I'll cover in the following sections.
Open source vs closed source models
There are a lot of VLMs available. I follow the Hugging Face VLM leaderboard to stay aware of any new high-performing models. According to this leaderboard, you should go for either Gemini 2.5 Pro or GPT-5 if you want to use closed-source models via an API. From my experience, these are great options that work well for long document understanding and for handling complex documents.
However, you may also want to use open-source models, due to privacy, cost, or to have more control over your own application. In this case, SenseNova-V6-5-Pro tops the leaderboard. I haven't tried this model personally, but I have used Qwen 3 VL a lot, and my experience with it is good. Qwen has also released a cookbook specifically for long document understanding.
VLMs on long documents
In this section, I'll talk about applying VLMs to long documents, and the considerations you have to make when doing so.
Processing power considerations
If you're running an open-source model, one of your main considerations is how big a model you can run, and how long it takes. You depend on access to a larger GPU, usually at least an A100. Luckily this is widely available, and relatively cheap (often priced at 1.5 to 2 USD per hour at many cloud providers now). However, you also need to consider the latency you can accept. Running VLMs requires a lot of processing, and you have to consider the following factors:
- How long is it acceptable to spend processing one request?
- Which image resolution do you need?
- How many pages do you have to process?
If you have a live chat, for example, you need fast processing; however, if you're simply processing in the background, you can allow for longer processing times.
Image resolution is also an important consideration. If you need to be able to read the text in documents, you need high-resolution images, typically over 2048×2048, though it naturally depends on the document. Detailed drawings with small text in them, for example, will require even higher resolution. Increasing resolution greatly increases processing time, so this is an important trade-off. You should aim for the lowest possible resolution that still allows you to perform all the tasks you want to perform. The number of pages is a similar consideration. Adding more pages is often necessary to have access to all the information in a document. However, the most important information is often contained early in the document, so you can get away with only processing the first 10 pages, for example.
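In practice, the resolution trade-off can be handled with a simple cap on the longest side of each page image before sending it to the model. The 2048-pixel default below mirrors the rule of thumb above and is a starting point, not a hard rule:

```python
def capped_size(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Compute a page size whose longest side is at most `max_side` pixels,
    preserving aspect ratio.

    Pages already below the cap are left untouched; dense drawings with
    small text may need a larger `max_side` to stay readable.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)
```

You would apply this per page before encoding, e.g. resize a 4000×3000 scan to the returned size with your imaging library of choice.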
Answer-dependent processing
One thing you can try in order to minimize the required processing power is to start off simple, and only advance to heavier processing if you don't get the desired answers.
For example, you might start off only looking at the first 10 pages, and see whether you're able to properly solve the task at hand, such as extracting a piece of information from a document. Only if you're not able to extract that piece of information do you add more pages. You can apply the same concept to the resolution of your images, starting with lower-resolution images, and moving to higher resolution if required.
This kind of hierarchical processing reduces the required processing power, since most tasks can be solved by looking only at the first 10 pages, or by using lower-resolution images. Then, only if necessary, we move on to processing more pages, or higher-resolution images.
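A minimal sketch of this escalation loop is below. Here `ask_vlm` is a stand-in for your actual model call, and the tier schedule (10 pages at low resolution, then the same pages at higher resolution, then all pages) is just one example of many possible schedules:

```python
def answer_with_escalation(ask_vlm, pages,
                           tiers=((10, 1024), (10, 2048), (None, 2048))):
    """Try cheap configurations first, escalating only on failure.

    `ask_vlm(pages, resolution)` should return the answer, or None when
    the model could not find it. Each tier is (page_limit, resolution);
    a page_limit of None means "all pages".
    """
    for page_limit, resolution in tiers:
        subset = pages if page_limit is None else pages[:page_limit]
        answer = ask_vlm(subset, resolution)
        if answer is not None:
            return answer
    return None  # no tier produced an answer
```

Most requests stop at the first, cheapest tier, so the average cost per document drops even though the worst case still processes everything.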
Cost
Cost is an important consideration when using VLMs. I've processed a lot of documents, and I typically see around a 10x increase in the number of tokens when using images (VLMs) instead of text (LLMs). Since input tokens are usually the driver of costs in long-document tasks, using VLMs usually increases cost significantly. Note that for OCR, the point about input tokens outnumbering output tokens doesn't apply, since OCR naturally produces a lot of output tokens when outputting all the text in the images.
Thus, when using VLMs, it's extremely important to maximize your usage of cached tokens, a topic I discussed in my recent article about optimizing LLMs for cost and latency.
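To get a feel for why caching matters at this scale, here is an illustrative back-of-the-envelope cost model. All the numbers are assumptions: roughly 600 text tokens per page times the ~10x image multiplier I see in my own documents, a made-up price per million input tokens, and an assumed 90% discount on cached tokens:

```python
def vlm_input_cost_usd(n_pages: int, tokens_per_page: int = 6_000,
                       price_per_m_tokens: float = 1.25,
                       cached_fraction: float = 0.0,
                       cache_discount: float = 0.9) -> float:
    """Rough input-token cost for a VLM document run.

    `tokens_per_page` assumes ~600 text tokens per page with a ~10x
    image overhead; `cache_discount` is the assumed price reduction
    on the cached fraction of input tokens.
    """
    tokens = n_pages * tokens_per_page
    cached = tokens * cached_fraction
    fresh = tokens - cached
    billed_equivalent = fresh + cached * (1 - cache_discount)
    return billed_equivalent * price_per_m_tokens / 1_000_000
```

Under these made-up numbers, a 100-page document costs 0.75 USD in input tokens per request without caching, but 0.21 USD when 80% of the tokens hit the cache, which is why repeated queries over the same document should reuse a cached prefix.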
Conclusion
In this article, I discussed how you can apply vision language models (VLMs) to long documents to handle complex document understanding tasks. I discussed why VLMs are so important, and covered approaches to using VLMs on long documents. You can, for example, use VLMs for more advanced OCR, or apply VLMs directly to long documents, though with precautions around required processing power, cost, and latency. I think VLMs are becoming more and more important, as highlighted by the recent release of DeepSeek OCR. I thus believe VLM-based document understanding is a topic you should get involved with, and that you should learn to use VLMs for document processing applications.
👉 Find me on socials:
🧑💻 Get in touch
✍️ Medium
You can also read my other articles:

