
    How to Apply Vision Language Models to Long Documents

By Editor Times Featured · November 4, 2025 · 11 min read


Vision language models (VLMs) are powerful models that take images as input, instead of text like traditional LLMs. This opens up a lot of possibilities, since we can directly process the contents of a document instead of using OCR to extract text and then feeding that text into an LLM.

In this article, I'll discuss how you can apply vision language models (VLMs) to long-context document understanding tasks. This means applying VLMs either to very long documents of over 100 pages, or to very dense documents that contain a lot of information, such as drawings. I'll discuss what to consider when applying VLMs, and what kinds of tasks you can perform with them.

This infographic highlights the main contents of this article. I'll cover why VLMs are so important, and how you can apply them to long documents. You can, for example, use VLMs for more advanced OCR, incorporating more of the document's information into the extracted text. Alternatively, you can apply VLMs directly to the images of a document, though you have to consider the required processing power, cost, and latency. Image by ChatGPT.

Why do we need VLMs?

I've discussed VLMs a lot in my previous articles, and covered why they're so important for understanding the contents of some documents. The main reason VLMs are required is that a lot of the information in documents requires visual input to understand.

The alternative to VLMs is to use OCR, and then use an LLM. The problem here is that you're only extracting the text from the document, and not including the visual information, such as:

    • Where different text is located relative to other text
    • Non-text information (essentially everything that isn't a letter, such as symbols or drawings)
    • Where text is located relative to other information

This information is often critical to truly understand the document, and you're thus often better off using VLMs directly: you feed in the image itself, and the model can therefore also interpret the visual information.

For long documents, using VLMs is a challenge, since you need a lot of tokens to represent visual information. Processing hundreds of pages is thus a big challenge. However, with many recent advancements in VLM technology, the models have become better and better at compressing visual information into reasonable context lengths, making it feasible to apply VLMs to long documents for document understanding tasks.

This figure highlights the OCR + LLM approach you can utilize. You take your document and apply OCR to get the document text. You then feed this text, together with a user query, into an LLM, which responds with an answer to the question given the document text. If you instead use VLMs, you can skip the OCR step completely and answer the user's query directly from the document. Image by the author.
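The OCR + LLM baseline from the figure can be sketched as a small pipeline. This is a minimal sketch, not the author's implementation: the OCR engine and LLM client are injected as plain callables (in practice these might be `pytesseract.image_to_string` and a chat-completion API call).

```python
# Minimal sketch of the OCR + LLM baseline. The `ocr` and `llm` callables
# are injected so the pipeline stays independent of any specific engine.

def answer_from_document(pages, query, ocr, llm):
    """OCR every page image, concatenate the text, and ask the LLM."""
    document_text = "\n\n".join(ocr(page) for page in pages)
    prompt = (
        "Answer the question using only the document below.\n\n"
        f"Document:\n{document_text}\n\n"
        f"Question: {query}"
    )
    return llm(prompt)
```

A VLM-based pipeline would instead pass the page images straight to the model, dropping the `ocr` step and the text-only prompt entirely.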

OCR using VLMs

One good option for processing long documents while still including the visual information is to use VLMs to perform OCR. Traditional OCR engines like Tesseract only extract the text directly from documents, together with the bounding boxes of the text. VLMs, however, are also trained to perform OCR, and can perform more advanced text extraction, such as:

    • Extracting Markdown
    • Explaining purely visual information (i.e., if there is a drawing, describing the drawing with text)
    • Adding missing information (i.e., if there is a box labeled Date with a blank space after it, you can tell the OCR to extract the Date field together with an explicit marker that it is empty)

Recently, DeepSeek released a powerful VLM-based OCR model, which has received a lot of attention and traction lately, making VLMs for OCR more popular.
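To make this concrete, here is a hedged sketch of how you might request Markdown OCR from a VLM through an OpenAI-style chat API. The message schema follows the common vision-chat format (text part plus base64 image part); the model name and the exact prompt wording are placeholder assumptions, not something from the original article.

```python
import base64

# Placeholder prompt asking the VLM to act as an advanced OCR engine.
OCR_PROMPT = (
    "Transcribe this page to Markdown. Preserve headers and tables, "
    "describe purely visual elements in brackets, and explicitly mark "
    "empty form fields."
)

def build_vlm_ocr_request(image_bytes, model="gpt-5"):
    """Build an OpenAI-style chat payload for VLM-based OCR of one page."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # placeholder; use whichever VLM you deploy
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": OCR_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
```

The returned dictionary would then be sent to your provider's chat-completions endpoint; the response text is your structured OCR output.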

    Markdown

Markdown is very powerful, because you extract formatted text. This allows the model to:

    • Show headers and subheaders
    • Represent tables accurately
    • Mark bold text

This allows the model to extract more representative text, which more accurately depicts the text contents of the document. If you then apply LLMs to this text, they will perform far better than if you applied them to plain text extracted with traditional OCR.

LLMs perform better on formatted text like Markdown than on plain text extracted using traditional OCR.

Explain visual information

Another thing you can use VLM OCR for is explaining visual information. For example, if you have a drawing with no text in it, traditional OCR wouldn't extract any information, since it's only trained to extract text characters. However, you can use a VLM to describe the visual contents of the image.

Imagine you have the following document:

    This is the introduction text of the document

    [image of the Eiffel Tower]

    This is the conclusion of the document

If you applied traditional OCR like Tesseract, you would get the following output:

    This is the introduction text of the document

    This is the conclusion of the document

This is clearly an issue, since you're not including any information about the image showing the Eiffel Tower. Instead, you should use a VLM, which could output something like:

    This is the introduction text of the document

    This image depicts the Eiffel Tower during the day

    This is the conclusion of the document

If you used an LLM on the first text, it of course wouldn't know that the document contains an image of the Eiffel Tower. However, if you used an LLM on the second text, extracted with a VLM, the LLM would naturally be better at answering questions about the document.

Add missing information

You can also prompt VLMs to output content when information is missing. To understand this concept, look at the image below:

This figure shows a typical example of how information is represented in a document. Image by the author.

If you applied traditional OCR to this image, you would get:

    Address Street 1
    Date
    Company Google

However, it would be more representative if you used a VLM, which, if instructed, could output:

    Address Street 1
    Date <empty>
    Company Google

This is more informative, because we're informing any downstream model that the date field is empty. If we don't provide this information, it's impossible to know later whether the date is simply missing, the OCR wasn't able to extract it, or something else went wrong.
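Downstream code can then pick up these markers mechanically. The sketch below assumes the VLM was prompted to emit `Field <empty>` for blank form fields, as in the example above; the marker convention itself is an assumption you define in your OCR prompt.

```python
import re

def find_empty_fields(extracted_text):
    """Return the names of fields the VLM marked as empty.

    Assumes the OCR prompt instructed the model to emit lines of the
    form 'FieldName <empty>' for blank form fields.
    """
    return re.findall(
        r"^([A-Za-z][\w ]*?)\s*<empty>\s*$",
        extracted_text,
        re.MULTILINE,
    )
```

With the VLM output from the example, this returns `["Date"]`, letting a downstream pipeline distinguish "the document left this blank" from "the OCR failed to read it".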


However, OCR using VLMs still suffers from some of the issues that traditional OCR struggles with, because the downstream model is not processing the visual information directly. You've probably heard the saying that an image is worth a thousand words, and it often holds true for processing visual information in documents. Yes, you can produce a text description of a drawing with a VLM as OCR, but this text will never be as descriptive as the drawing itself. Thus, I argue that in many cases you're better off processing the documents directly with VLMs, as I'll cover in the following sections.

Open-source vs closed-source models

There are a lot of VLMs available. I follow the HuggingFace VLM leaderboard to stay aware of any new high-performing models. According to this leaderboard, you should go for either Gemini 2.5 Pro or GPT-5 if you want to use closed-source models via an API. From my experience, these are great options that work well for long-document understanding and handling complex documents.

However, you may also want to use open-source models, due to privacy, cost, or to have more control over your own application. In this case, SenseNova-V6-5-Pro tops the leaderboard. I haven't tried this model personally, but I have used Qwen 3 VL a lot, with which I have good experience. Qwen has also released a specific cookbook for long document understanding.

VLMs on long documents

In this section I'll talk about applying VLMs to long documents, and the considerations you have to make when doing so.

Processing power considerations

If you're running an open-source model, one of your main considerations is how large a model you can run, and how long it takes. You're dependent on access to a larger GPU, usually at least an A100. Luckily, these are widely available and relatively cheap (often costing 1.5–2 USD per hour at many cloud providers now). However, you also need to consider the latency you can accept. Running VLMs requires a lot of processing, and you have to consider the following factors:

    • How long is acceptable to spend processing one request?
    • Which image resolution do you need?
    • How many pages do you have to process?

If you have a live chat, for example, you need fast processing; however, if you're simply processing in the background, you can allow for longer processing times.

Image resolution is also an important consideration. If you need to be able to read the text in documents, you need high-resolution images, often over 2048×2048, though it naturally depends on the document. Detailed drawings with small text in them, for example, will require even higher resolution. Increasing resolution greatly increases processing time, which makes it an important consideration: you should aim for the lowest possible resolution that still allows you to perform all the tasks you want to perform. The number of pages is a similar consideration. Adding more pages is often necessary to have access to all the information in a document. Often, however, the most important information is contained early in the document, so you may get away with only processing the first 10 pages, for example.
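A rough token budget makes the resolution and page-count trade-off tangible. The sketch below uses a simplified tiling cost model (a fixed base cost plus a fixed cost per 512×512 tile, similar in spirit to how some vision APIs bill images); the constants are illustrative assumptions, and real providers also downscale large images first, so treat this as a back-of-the-envelope estimate only.

```python
import math

# Illustrative constants, NOT any provider's actual pricing scheme.
BASE_TOKENS = 85        # assumed fixed overhead per image
TOKENS_PER_TILE = 170   # assumed cost per 512x512 tile

def image_tokens(width, height, tile=512):
    """Estimate input tokens for one page image under the tile model."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles

def document_tokens(pages, width, height):
    """Estimate total input tokens for a document of identical pages."""
    return pages * image_tokens(width, height)
```

Under these assumptions, a single 2048×2048 page costs 85 + 170 × 16 = 2,805 tokens, so a 100-page document at that resolution is roughly 280k tokens, while halving the resolution to 1024×1024 drops each page to 765 tokens.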

Answer-dependent processing

Something you can try in order to lower the required processing power is to start off simple, and only advance to heavier processing if you don't get the desired answers.

For example, you might start off only looking at the first 10 pages, and see whether you're able to properly solve the task at hand, such as extracting a piece of information from a document. Only if we're not able to extract that piece of information do we add more pages. You can apply the same concept to the resolution of your images, starting with lower-resolution images and moving to higher resolution if required.

This kind of hierarchical processing reduces the required processing power, since most tasks can be solved by only looking at the first 10 pages, or by using lower-resolution images. Then, only if necessary, do we move on to processing more pages, or higher-resolution images.
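The escalation strategy described above can be sketched as a small loop over increasingly expensive configurations. This is a hedged sketch: the `ask_vlm` callable, the ladder values, and the convention that it returns `None` when the answer is not found are all assumptions for illustration.

```python
# Cheapest configuration first; escalate pages/resolution only on failure.
# The concrete page counts and resolutions here are illustrative.
ESCALATION_LADDER = [
    {"pages": 10, "resolution": 1024},    # cheap first pass
    {"pages": 10, "resolution": 2048},    # same pages, sharper images
    {"pages": 100, "resolution": 2048},   # full document, high resolution
]

def answer_with_escalation(document, query, ask_vlm, ladder=ESCALATION_LADDER):
    """Return (answer, config used), or (None, None) if every step fails.

    `ask_vlm(document, query, pages=..., resolution=...)` is assumed to
    return an answer string, or None when the information was not found.
    """
    for step in ladder:
        answer = ask_vlm(document, query,
                         pages=step["pages"],
                         resolution=step["resolution"])
        if answer is not None:
            return answer, step
    return None, None
```

Since most queries succeed on the first rung, the expensive full-document pass only runs for the hard minority of requests.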

Cost

Cost is an important consideration when using VLMs. I've processed a lot of documents, and I typically see around a 10x increase in the number of tokens when using images (VLMs) instead of text (LLMs). Since input tokens are usually the driver of costs in long-document tasks, using VLMs usually increases cost significantly. Note that for OCR, the point about input tokens dominating output tokens doesn't apply, since OCR naturally produces a lot of output tokens when transcribing all the text in the images.

Thus, when using VLMs, it is extremely important to maximize your usage of cached tokens, a topic I discussed in my recent article about optimizing LLMs for cost and latency.
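The ~10x token blow-up translates directly into dollars. The tiny calculator below makes that comparison explicit; the per-million-token price is a placeholder assumption (real prices vary widely by model and provider), and the 10x factor is the rough ratio observed in the article.

```python
PRICE_PER_M_INPUT = 2.50   # USD per million input tokens (assumed placeholder)
IMAGE_BLOWUP = 10          # ~10x more tokens for image input than text

def input_cost_usd(text_tokens, as_images=False):
    """Estimate input cost for a document of `text_tokens` worth of content,
    either fed as extracted text or as page images (~10x the tokens)."""
    tokens = text_tokens * (IMAGE_BLOWUP if as_images else 1)
    return tokens / 1_000_000 * PRICE_PER_M_INPUT
```

For a document worth 50k text tokens, that is about $0.125 per request as text versus about $1.25 as images, which is why prompt caching of the (identical, repeated) image prefix matters so much.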

    Conclusion

In this article I discussed how you can apply vision language models (VLMs) to long documents to handle complex document understanding tasks. I covered why VLMs are so important, and approaches to using them on long documents. You can, for example, use VLMs for more advanced OCR, or apply VLMs directly to long documents, though with precautions regarding the required processing power, cost, and latency. I think VLMs are becoming more and more important, highlighted by the recent release of DeepSeek OCR. VLMs for document understanding is thus a topic you should get involved with, and you should learn how to use VLMs for document processing applications.

