
    Docling: The Document Alchemist | Towards Data Science

    By Editor Times Featured | September 12, 2025 | 15 Mins Read


    Why do we still wrestle with documents in 2025?

    Step into any data-driven organisation, and you'll encounter a host of PDFs, Word files, PowerPoints, half-scanned images, handwritten notes, and the occasional surprise CSV lurking in a SharePoint folder. Business and data analysts waste hours converting, splitting, and cajoling these formats into something their Python pipelines will accept. Even the latest generative-AI stacks can choke when the underlying text is wrapped inside graphics or sprinkled across irregular table grids.

    Docling was born to solve exactly that pain. Released as an open-source project by IBM Research Zurich and now hosted under the Linux Foundation AI & Data Foundation, the library abstracts parsing, layout understanding, OCR, table reconstruction, multimodal export, and even audio transcription behind one reasonably simple API and CLI command.

    Although Docling supports HTML, MS Office files, image formats and others, we'll mostly be looking at using it to process PDF files.

    As a data scientist or ML engineer, why should I care about Docling?

    Often, the real bottleneck isn't building the model; it's feeding it. We spend a large share of our time on data wrangling, and nothing kills productivity faster than being handed a critical dataset locked inside a 100-page PDF. This is precisely the problem Docling solves, acting as a bridge from the world of unstructured documents directly to the structured sanity of Markdown, JSON, or a Pandas DataFrame.

    But its power extends beyond just data extraction, directly into the world of modern, AI-assisted development. Imagine pointing Docling at an HTML page of API specs; it translates that complex web layout into clean, structured Markdown, which is the ideal context to feed directly into AI coding assistants like Cursor, ChatGPT, or Claude.

    Where Docling came from

    The project originated inside IBM's Deep Search team, which was developing retrieval-augmented generation (RAG) pipelines for long patent PDFs. They open-sourced the core under an MIT license in late 2024 and have been shipping weekly releases ever since. A vibrant community quickly formed around its unified DoclingDocument model, a Pydantic object that keeps text, images, tables, formulas, and layout metadata together so downstream tools like LangChain, LlamaIndex, or Haystack don't have to guess a page's reading order.

    Today, Docling integrates visual-language models (VLMs), such as SmolDocling, for figure captioning. It also supports Tesseract, EasyOCR, and RapidOCR for text extraction and ships recipes for chunking, serialisation, and vector-store ingestion. In other words: you point it at a folder, and you get Markdown, HTML, CSV, PNGs, JSON, or just a ready-to-embed Python object, with no extra scaffolding code required.

    What we'll do

    To showcase Docling, we'll first install it and then use it in three different examples that demonstrate its versatility and usefulness as a document parser and processor. Please note that Docling is quite computationally intensive, so it will help if you have access to a GPU on your system.

    However, before we start coding, we need to set up a development environment.

    Setting up a development environment

    I've started using the uv package manager for this now, but feel free to use whichever tools you're most comfortable with. Note also that I'll be working under WSL2 Ubuntu for Windows and running my code in a Jupyter Notebook.

    Note that, even using uv, the commands below took a couple of minutes to complete on my system, as it's a pretty hefty set of library installs.

    $ uv init docling
    Initialized project `docling` at `/home/tom/docling`
    $ cd docling
    $ uv venv
    Using CPython 3.11.10 interpreter at: /home/tom/miniconda3/bin/python
    Creating virtual environment at: .venv
    Activate with: source .venv/bin/activate
    $ source .venv/bin/activate
    (docling) $ uv pip install docling pandas jupyter

    Now type in the command,

    (docling) $ jupyter notebook

    And you should see a notebook open in your browser. If that doesn't happen automatically, you'll likely see a screenful of information after running the jupyter notebook command. Near the bottom, you will find a URL to copy and paste into your browser to launch the Jupyter Notebook.

    Your URL will be different to mine, but it should look something like this:

    http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69d

    Example 1: Convert any PDF or DOCX to Markdown or JSON

    The simplest use case is also the one you'll use a large share of the time: turn a document's text into Markdown.

    For most of our examples, our input PDF will be one I've used several times before for various tests. It's a copy of Tesla's 10-Q SEC filing document from September 2023. It's approximately fifty pages long and consists primarily of financial information related to Tesla. The full document is publicly available on the Securities & Exchange Commission (SEC) website and can be viewed/downloaded using this link.

    Here is an image of the first page of that document for your reference.

    Image from Tesla 10-Q PDF

    Let's review the Docling code we need to convert the document into Markdown. It sets up the file path for the input PDF, creates a DocumentConverter and runs its convert method on the file, and then exports the parsed result to Markdown so that the content can be more easily read, edited, or analysed.

    from docling.document_converter import DocumentConverter
    from pathlib import Path

    inpath = "/mnt/d//tesla"
    infile = "tesla_q10_sept_23.pdf"

    data_folder = Path(inpath)

    doc_path = data_folder / infile

    # Parse the PDF into Docling's structured document representation
    converter = DocumentConverter()
    result = converter.convert(doc_path)     # → ConversionResult

    # Export the parsed document to Markdown and display it
    markdown_text = result.document.export_to_markdown()
    print(markdown_text)

    This is the output we get from running the above code (just the first page).

    ## UNITED STATES SECURITIES AND EXCHANGE COMMISSION
    
    Washington, D.C. 20549 FORM 10-Q
    
    (Mark One)
    
    - x QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
    
    For the quarterly period ended September 30, 2023

    OR

    - o TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934

    For the transition period from _________ to _________

    Commission File Number: 001-34756

    ## Tesla, Inc.

    (Exact name of registrant as specified in its charter)

    Delaware

    (State or other jurisdiction of incorporation or organization)

    1 Tesla Road Austin, Texas

    (Address of principal executive offices)

    ## (512) 516-8177

    (Registrant's telephone number, including area code)

    ## Securities registered pursuant to Section 12(b) of the Act:

    | Title of each class   | Trading Symbol(s)   | Name of each exchange on which registered   |
    |-----------------------|---------------------|---------------------------------------------|
    | Common stock          | TSLA                | The Nasdaq Global Select Market             |

    Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 ('Exchange Act') during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. Yes x No o

    Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes x No o

    Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of 'large accelerated filer,' 'accelerated filer,' 'smaller reporting company' and 'emerging growth company' in Rule 12b-2 of the Exchange Act:

    Large accelerated filer

    x

    Accelerated filer

    Non-accelerated filer

    o

    Smaller reporting company

    Emerging growth company

    o

    If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. o

    Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes o No x

    As of October 16, 2023, there were 3,178,921,391 shares of the registrant's common stock outstanding.

    With the rise of AI code editors and the use of LLMs in general, this technique has become significantly more valuable and relevant. The efficacy of LLMs and code editors can be greatly enhanced by providing them with appropriate context. Often this will entail supplying them with the textual representation of a particular tool or framework's documentation, API and coding examples.
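    When the exported Markdown is too long to hand to an assistant in one piece, one option is to split it at its headings and pass along only the relevant sections. The helper below is a minimal sketch using only the standard library; the name split_markdown_sections is my own and is not part of Docling, and it assumes ATX-style headings (lines starting with #). It does not special-case fenced code, so a comment line inside a code fence would be mistaken for a heading.

```python
import re


def split_markdown_sections(markdown_text: str) -> list[dict]:
    """Split Markdown into sections, one per ATX heading (#, ##, ...)."""
    sections = []
    current = {"heading": None, "lines": []}

    def has_content(section):
        # A section counts if it has a heading or any non-blank line.
        return section["heading"] is not None or any(
            line.strip() for line in section["lines"]
        )

    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s+\S", line):
            # A new heading closes whatever section was being collected.
            if has_content(current):
                sections.append(current)
            current = {"heading": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        sections.append(current)

    return [
        {"heading": s["heading"], "body": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


# A short sample in the same shape as the export shown earlier.
sample = """## Tesla, Inc.

(Exact name of registrant as specified in its charter)

## (512) 516-8177

(Registrant's telephone number, including area code)
"""

for section in split_markdown_sections(sample):
    print(section["heading"], "->", len(section["body"]), "characters")
```

    Text that appears before the first heading is returned as a section whose heading is None, so nothing is silently dropped.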

    Converting the output of PDFs to JSON format is also simple. Just add these two lines of code. The JSON output can be very large, so adjust the slice in the print statement to suit.

    json_blob = result.document.model_dump_json(indent=2)

    # Print only the first 10,000 characters of the JSON
    print(json_blob[:10000], "…")
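    Before printing a large JSON string, it can help to look at its top-level shape first. The following is a rough sketch with the standard json module; preview_json and top_level_summary are hypothetical helper names, and the small json_blob built here is an invented stand-in rather than real Docling output. It assumes the top level of the JSON is an object.

```python
import json


def preview_json(json_text: str, max_chars: int = 1000) -> str:
    """Return at most max_chars characters, marking any truncation."""
    if len(json_text) <= max_chars:
        return json_text
    return json_text[:max_chars] + "…"


def top_level_summary(json_text: str) -> dict:
    """Map each top-level key to its value's type name (with size for containers)."""
    data = json.loads(json_text)
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__}[{len(value)}]"
        else:
            summary[key] = type(value).__name__
    return summary


# Stand-in for the real export; these keys are invented for illustration.
json_blob = json.dumps(
    {"name": "example", "items": [1, 2, 3], "meta": {"a": 1}}, indent=2
)

print(preview_json(json_blob, max_chars=40))
print(top_level_summary(json_blob))
```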

    Example 2: Extract complex tables from a PDF

    Many PDFs store tables as isolated text chunks or, worse, as flattened images. Docling's table-structure model reassembles rows, columns, and spanning cells, giving you either a Pandas DataFrame or a ready-to-save CSV. Our test input PDF has many tables. Look, for example, at page 11 of the PDF, and we can see the table below.

    Image from Tesla 10-Q PDF

    Let's see if we can extract that data. It's slightly more complex code than in our first example, but it's doing more work. The PDF is converted again using Docling's DocumentConverter, producing a structured document representation. Then, for each table detected, it retrieves the table's page number from the document's provenance metadata. If the table comes from page 11, it converts the table into a Pandas DataFrame, prints it out in Markdown format, and then breaks the loop (so only the first matching table is shown).

    import pandas as pd
    from docling.document_converter import DocumentConverter
    from time import time
    from pathlib import Path

    inpath = "/mnt/d//tesla"
    infile = "tesla_q10_sept_23.pdf"
    data_folder = Path(inpath)
    input_doc_path = data_folder / infile

    doc_converter = DocumentConverter()
    start_time = time()
    conv_res = doc_converter.convert(input_doc_path)

    # Export the first table found on page 11
    for table_ix, table in enumerate(conv_res.document.tables):
        # The page number lives in the table's provenance metadata
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        if page_number == 11:
            table_df: pd.DataFrame = table.export_to_dataframe()
            print(f"## Table {table_ix} (Page {page_number})")
            print(table_df.to_markdown())
            break

    # Elapsed time covers both the conversion and the table export
    elapsed = time() - start_time
    print(f"Document converted and tables exported in {elapsed:.2f} seconds.")

    And the output isn't too shabby.

    ## Table 10 (Page 11)
    |    |                                        | Three Months Ended September 30,.2023   | Three Months Ended September 30,.2022   | Nine Months Ended September 30,.2023   | Nine Months Ended September 30,.2022   |
    |---:|:---------------------------------------|:----------------------------------------|:----------------------------------------|:---------------------------------------|:---------------------------------------|
    |  0 | Automotive sales                       | $ 18,582                                | $ 17,785                                | $ 57,879                               | $ 46,969                               |
    |  1 | Automotive regulatory credits          | 554                                     | 286                                     | 1,357                                  | 1,309                                  |
    |  2 | Energy generation and storage sales    | 1,416                                   | 966                                     | 4,188                                  | 2,186                                  |
    |  3 | Services and other                     | 2,166                                   | 1,645                                   | 6,153                                  | 4,390                                  |
    |  4 | Total revenues from sales and services | 22,718                                  | 20,682                                  | 69,577                                 | 54,854                                 |
    |  5 | Automotive leasing                     | 489                                     | 621                                     | 1,620                                  | 1,877                                  |
    |  6 | Energy generation and storage leasing  | 143                                     | 151                                     | 409                                    | 413                                    |
    |  7 | Total revenues                         | $ 23,350                                | $ 21,454                                | $ 71,606                               | $ 57,144                               |
    Document converted and tables exported in 33.43 seconds.

    To retrieve ALL the tables from a PDF, you would need to remove the if page_number == 11 check and the break statement from my code, so that every table in the loop is exported.
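    When working with every table rather than a single one, it can be handy to know which tables sit on which page before exporting them. The sketch below groups table indices by page using the same prov[0].page_no lookup as the loop above. The function name tables_by_page is my own, and the SimpleNamespace objects are hypothetical stand-ins that mimic only that one attribute path, so the snippet runs without Docling installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def tables_by_page(tables):
    """Group table indices by the page number in each table's first provenance entry."""
    pages = defaultdict(list)
    for table_ix, table in enumerate(tables):
        # Same fallback as the loop above when no provenance is recorded.
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        pages[page_number].append(table_ix)
    return dict(pages)


# Hypothetical stand-ins; real table items carry far more than this.
fake_tables = [
    SimpleNamespace(prov=[SimpleNamespace(page_no=4)]),
    SimpleNamespace(prov=[SimpleNamespace(page_no=11)]),
    SimpleNamespace(prov=[SimpleNamespace(page_no=11)]),
    SimpleNamespace(prov=[]),
]

print(tables_by_page(fake_tables))
```

    With a real conversion, you would pass in the list of tables from the converted document instead of fake_tables.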

    One thing I've noticed with Docling is that it's not fast. As shown above, it took almost 34 seconds to convert the 50-page PDF and extract that single table.
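    To see where the time actually goes, the conversion and the export can be timed separately. This is a small sketch built on time.perf_counter and contextlib; the timed helper and the timings dictionary are my own names, and the two workloads are placeholders standing in for the convert call and the table-export loop.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Placeholder workloads; in practice these would wrap the convert call
# and the table-export loop respectively.
with timed("convert"):
    time.sleep(0.05)
with timed("export"):
    sum(range(1000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.2f} seconds")
```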

    Example 3: Perform OCR on an image

    For this example, I scanned a random page from the Tesla 10-Q PDF and saved it as a PNG file. Let's see how Docling copes with reading that image and converting what it finds into Markdown. Here is my scanned image.

    Image from Tesla 10-Q PDF

    And our code. We use Tesseract as our OCR engine (others are available).

    from pathlib import Path
    import time
    import pandas as pd

    from docling.document_converter import DocumentConverter, ImageFormatOption
    from docling.models.tesseract_ocr_cli_model import TesseractCliOcrOptions


    def main():
        inpath = "/mnt/d//tesla"
        infile = "10q-image.png"

        input_doc_path = Path(inpath) / infile

        # Configure OCR for image input
        image_options = ImageFormatOption(
            ocr_options=TesseractCliOcrOptions(force_full_page_ocr=True),
            do_table_structure=True,
            table_structure_options={"do_cell_matching": True},
        )

        converter = DocumentConverter(
            format_options={"image": image_options}
        )

        start_time = time.time()

        # Keep only the parsed document from the conversion result
        conv_res = converter.convert(input_doc_path).document

        # Print all tables as Markdown
        for table_ix, table in enumerate(conv_res.tables):
            table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
            page_number = table.prov[0].page_no if table.prov else "Unknown"
            print(f"\n--- Table {table_ix+1} (Page {page_number}) ---")
            print(table_df.to_markdown(index=False))

        # Print full document text as Markdown
        print("\n--- Full Document (Markdown) ---")
        print(conv_res.export_to_markdown())

        elapsed = time.time() - start_time
        print(f"\nProcessing completed in {elapsed:.2f} seconds")


    if __name__ == "__main__":
        main()
    

    Here is our output.

    --- Table 1 (Page 1) ---
    |                          |   Three Months Ended September J0,. | Three Months Ended September J0,.2022   | Nine Months Ended September J0,.2023   | Nine Months Ended September J0,.2022   |
    |:-------------------------|------------------------------------:|:----------------------------------------|:---------------------------------------|:---------------------------------------|
    | Cost ol revenves         |                                 181 | 150                                     | 554                                    | 424                                    |
    | Research an0 developrent |                                 189 | 124                                     | 491                                    | 389                                    |
    |                          |                                  95 |                                         | 2B3                                    | 328                                    |
    | Total                    |                                 465 | 362                                     | 1,328                                  | 1,141                                  |

    --- Full Document (Markdown) ---
    ## Note 8 Equity Incentive Plans

    ## Other Pertormance-Based Grants

    ("RSUs") und stock optlons unrecognized stock-based compensatian

    ## Summary Stock-Based Compensation Information

    |                          | Three Months Ended September J0,   | Three Months Ended September J0,   | Nine Months Ended September J0,   | Nine Months Ended September J0,   |
    |--------------------------|------------------------------------|------------------------------------|-----------------------------------|-----------------------------------|
    |                          |                                    | 2022                               | 2023                              | 2022                              |
    | Cost ol revenves         | 181                                | 150                                | 554                               | 424                               |
    | Research an0 developrent | 189                                | 124                                | 491                               | 389                               |
    |                          | 95                                 |                                    | 2B3                               | 328                               |
    | Total                    | 465                                | 362                                | 1,328                             | 1,141                             |

    ## Note 9 Commitments and Contingencies

    ## Operating Lease Arrangements In Buffalo, New York and Shanghai, China

    ## Legal Proceedings

    Between september 1 which 2021 pald has

    Processing completed in 7.64 seconds

    If you compare this output to the original image, the results are disappointing. A lot of the text in the image was simply missed or garbled. This is where a product like AWS Textract comes into its own, as it excels at extracting text from a wide range of sources.

    However, Docling does provide various options for OCR, so if you get poor results from one engine, you can always switch to another.
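    If you do try several OCR engines, a rough similarity score makes the comparison less subjective. The sketch below uses the ratio from difflib.SequenceMatcher in the standard library, which is a quick string-similarity measure rather than a formal OCR metric such as character error rate. The garbled words come from the output above; the reference spellings are my assumption about what the scanned page says.

```python
from difflib import SequenceMatcher


def similarity(ocr_text: str, reference_text: str) -> float:
    """Return a similarity ratio between 0 and 1 for OCR output versus a reference."""
    return SequenceMatcher(None, ocr_text.lower(), reference_text.lower()).ratio()


# (OCR output, assumed correct spelling)
pairs = [
    ("revenves", "revenues"),
    ("developrent", "development"),
    ("Pertormance", "Performance"),
    ("optlons", "options"),
    ("compensatian", "compensation"),
]

for ocr_text, reference_text in pairs:
    print(f"{ocr_text!r} vs {reference_text!r}: {similarity(ocr_text, reference_text):.2f}")
```

    Averaging the score over a page of known text gives a single number per engine, which is enough to tell whether switching engines made any real difference.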

    I attempted the same task using EasyOCR, but the results weren't significantly different from those obtained with Tesseract. If you'd like to try it out, here is the code.

    from pathlib import Path
    import time
    import pandas as pd

    from docling.document_converter import DocumentConverter, ImageFormatOption
    from docling.models.easyocr_model import EasyOcrOptions  # Import EasyOCR options


    def main():
        inpath = "/mnt/d//tesla"
        infile = "10q-image.png"

        input_doc_path = Path(inpath) / infile

        # Configure image pipeline with EasyOCR
        image_options = ImageFormatOption(
            ocr_options=EasyOcrOptions(force_full_page_ocr=True),  # use EasyOCR
            do_table_structure=True,
            table_structure_options={"do_cell_matching": True},
        )

        converter = DocumentConverter(
            format_options={"image": image_options}
        )

        start_time = time.time()

        # Keep only the parsed document from the conversion result
        conv_res = converter.convert(input_doc_path).document

        # Print all tables as Markdown
        for table_ix, table in enumerate(conv_res.tables):
            table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
            page_number = table.prov[0].page_no if table.prov else "Unknown"
            print(f"\n--- Table {table_ix+1} (Page {page_number}) ---")
            print(table_df.to_markdown(index=False))

        # Print full document text as Markdown
        print("\n--- Full Document (Markdown) ---")
        print(conv_res.export_to_markdown())

        elapsed = time.time() - start_time
        print(f"\nProcessing completed in {elapsed:.2f} seconds")


    if __name__ == "__main__":
        main()
    

    Summary

    The generative-AI boom re-ignited an old truth: garbage in, garbage out. LLMs hallucinate less only when they ingest semantically and spatially coherent input. Docling provides that coherence (most of the time) across the many source formats your stakeholders can present, and does so locally and reproducibly.

    Docling has its uses beyond the AI world, though. Consider the vast number of documents stored in places such as bank vaults, solicitors' offices, and insurance companies worldwide. If these are to be digitised, Docling may provide some of the solutions for that.

    Its biggest weakness is probably the Optical Character Recognition of text within images. I tried using Tesseract and EasyOCR, and the results from both were disappointing. You'll probably need to use a commercial product like AWS Textract if you want to reliably reproduce text from these types of sources.

    It can also be slow. I have a fairly high-spec desktop PC with a GPU, and it took some time on most tasks I set it. However, if your input documents are primarily PDFs, Docling could be a valuable addition to your text processing toolbox.

    I've only scratched the surface of what Docling is capable of, and I encourage you to visit their homepage, which can be accessed using the following link to learn more.


