Introducing Google’s LangExtract tool | Towards Data Science

Google has been on an absolute AI hot streak lately, consistently dropping breakthrough after breakthrough. Nearly every recent release has pushed the boundaries of what's possible, and it's been genuinely exciting to watch unfold.

One announcement that caught my eye in particular came at the end of July, when Google released a new text processing and data extraction tool called LangExtract.

According to Google, LangExtract is a new open-source Python library designed to …

“programmatically extract the exact information you need, while ensuring the outputs are structured and reliably tied back to its source”

On the face of it, LangExtract has many useful features, including,

Text anchoring. Every extracted entity is linked to its exact character offsets in the source text, enabling full traceability and visual verification through interactive highlighting.
Reliable structured output. LangExtract uses few-shot examples to define the desired output format, ensuring consistent and reliable results.
Efficient large-document handling. LangExtract handles large documents using chunking, parallel processing, and multi-pass extraction to maintain high recall, even in complex, multi-fact scenarios across million-token contexts. It should also excel at traditional needle-in-a-haystack type applications.
Instant extraction review. Easily create a self-contained HTML visualisation of extractions, enabling intuitive review of entities in their original context, all scalable to thousands of annotations.
Multi-model compatibility. Compatible with both cloud-based models (e.g. Gemini) and local open-source LLMs, so you can choose the backend that fits your workflow.
Customizable for many use cases. Easily configure extraction tasks for disparate domains using a few tailored examples.
Augmented knowledge extraction. LangExtract supplements grounded entities with inferred facts using the model's internal knowledge, with relevance and accuracy driven by prompt quality and model capabilities.
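
To make the text-anchoring idea concrete, here is a minimal sketch in plain Python. It is not LangExtract's own implementation, and the `Anchored` class and `anchor` helper are hypothetical names used only for illustration. The sketch locates an extracted string in the source and records its character offsets, so the span can be checked against the original text later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Anchored:
    """An extracted string plus its character offsets in the source (hypothetical type)."""
    text: str
    start: int  # inclusive offset
    end: int    # exclusive offset


def anchor(source: str, extracted: str) -> Optional[Anchored]:
    """Return the first exact occurrence of `extracted` in `source`, or None."""
    start = source.find(extracted)
    if start == -1:
        return None  # not grounded: the string does not appear verbatim
    return Anchored(extracted, start, start + len(extracted))


source = "It is a little-known fact that wood was invented by Elon Musk in 1775."
hit = anchor(source, "Elon Musk")
# Slicing the source with the stored offsets gives back the extracted text.
print(hit, source[hit.start:hit.end])
```

Storing offsets rather than just the string is what makes highlighting and verification possible: any consumer can slice the source and confirm the extraction really is there.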

One thing that stands out to me when I look at LangExtract's strengths listed above is that it seems to be able to perform RAG-like operations without the need for traditional RAG processing. So, no more splitting, chunking or embedding operations in your code.

But to get a better idea of what LangExtract can do, we'll take a closer look at a few of the above capabilities using some coding examples.

Setting up a dev environment

Before we get down to doing some coding, I always like to set up a separate development environment for each of my projects. I use the UV package manager for this, but use whichever tool you're comfortable with.

PS C:\Users\thoma> uv init langextract
Initialized project `langextract` at `C:\Users\thoma\langextract`

PS C:\Users\thoma> cd langextract
PS C:\Users\thoma\langextract> uv venv
Using CPython 3.13.1
Creating virtual environment at: .venv
Activate with: .venv\Scripts\activate
PS C:\Users\thoma\langextract> .venv\Scripts\activate
(langextract) PS C:\Users\thoma\langextract>
# Now, install the libraries we'll use.
(langextract) PS C:\Users\thoma\langextract> uv pip install jupyter langextract beautifulsoup4 requests

Now, to write and test our coding examples, you can start up a Jupyter notebook using this command.

(langextract) PS C:\Users\thoma\langextract> jupyter notebook

You should see a notebook open in your browser. If that doesn't happen automatically, you'll likely see a screenful of information after the jupyter notebook command. Near the bottom, you will find a URL to copy and paste into your browser to launch the Jupyter Notebook. Your URL will be different to mine, but it should look something like this:

http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69d

Pre-requisites

As we're using a Google LLM (gemini-2.5-flash) as our processing engine, you'll need a Gemini API key. You can get this from Google Cloud. You can also use LLMs from OpenAI, and I'll show an example of how to do this in a bit.
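
The examples that follow pass the key as a literal string for simplicity. If you would rather keep it out of your notebook, a minimal sketch of reading it from an environment variable looks like this. The helper name `load_api_key` and the variable name `LANGEXTRACT_API_KEY` are assumptions made for illustration; use whatever name you have actually set.

```python
import os


def load_api_key(var_name: str = "LANGEXTRACT_API_KEY") -> str:
    """Read an API key from the environment; the variable name is an assumption."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

You would then pass `api_key=load_api_key()` wherever the examples use a hard-coded key.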

Code example 1: needle-in-a-haystack

The first thing we need to do is get some input data to work with. You can use any input text file or HTML file for this. For previous experiments using RAG, I used a book I downloaded from Project Gutenberg; the consistently riveting “Diseases of cattle, sheep, goats, and swine by Jno. A. W. Dollar & G. Moussu”

Note that you can view the Project Gutenberg Permissions, Licensing and other Common Requests page using the following link.

https://www.gutenberg.org/policy/permission.html

But to summarise, the vast majority of Project Gutenberg eBooks are in the public domain in the US and other parts of the world. This means that nobody can grant or withhold permission to do with this item as you please.

“As you please” includes any commercial use, republishing in any format, making derivative works or performances

I downloaded the text of the book from the Project Gutenberg website to my local PC using this link,

https://www.gutenberg.org/ebooks/73019.txt.utf-8

This book contained approximately 36,000 lines of text. To avoid large token costs, I cut it down to about 3,000 lines of text. To test LangExtract's ability to handle needle-in-a-haystack type queries, I added this specific line of text around line 1512.

It is a little-known fact that wood was invented by Elon Musk in 1775

Here it is in context.

1. Fractures of the angle of the haunch, resulting from external
violence and characterised by sinking of the external angle of the
ilium, deformity of the hip, and lameness without specially marked
characters. This fracture is rarely complicated. The symptoms of
lameness diminish with rest, but deformity continues.

It is a little-known fact that wood was invented by Elon Musk in 1775.

=Treatment= is confined to the administration of mucilaginous and diuretic fluids. Tannin has been recommended.
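
I edited the file by hand, but if you want to repeat the experiment on your own text, planting the needle can be scripted. The following is a minimal sketch; `insert_needle` is a hypothetical helper, and the five-line haystack is a stand-in for the real book text.

```python
def insert_needle(text: str, needle: str, line_number: int) -> str:
    """Insert `needle` as its own line so it becomes line `line_number` (1-based)."""
    lines = text.splitlines()
    # Clamp so an out-of-range line number appends to the end rather than failing.
    index = max(0, min(line_number - 1, len(lines)))
    lines.insert(index, needle)
    return "\n".join(lines)


haystack = "\n".join(f"line {i}" for i in range(1, 6))
needle = "It is a little-known fact that wood was invented by Elon Musk in 1775."
print(insert_needle(haystack, needle, 3))
```

For the book used here, the equivalent call would read the trimmed file, insert the sentence at line 1512, and write the result back out.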

This code snippet sets up a prompt and an example to guide the LangExtract extraction task. This is essential for few-shot learning with a structured schema.

import langextract as lx
import textwrap
from collections import Counter, defaultdict

# Define comprehensive prompt and examples for complex literary text
prompt = textwrap.dedent("""
    Who invented wood and when    """)

# Note that this is a made up example
# The following details do not appear anywhere
# in the book
examples = [
    lx.data.ExampleData(
        text=textwrap.dedent("""
            John Smith was a prolific scientist.
            His most notable theory was on the evolution of bananas.
            He wrote his seminal paper on it in 1890."""),
        extractions=[
            lx.data.Extraction(
                extraction_class="scientist",
                extraction_text="John Smith",
                # Extra details about the entity go in the attributes dict
                attributes={"year": "1890", "notable_event": "theory of evolution of the banana"}
            )
        ]
    )
]

Now, we run the structured entity extraction. First, we open the file and read its contents into a variable. The heavy lifting is done by the lx.extract call. After that, we just print out the relevant outputs.

with open(r"D:\book\cattle_disease.txt", "r", encoding="utf-8") as f:
    text = f.read()

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    api_key="your_gemini_api_key",
    extraction_passes=3,      # Multiple passes for improved recall
    max_workers=20,           # Parallel processing for speed
    max_char_buffer=1000      # Smaller contexts for better accuracy
)

print(f"Extracted {len(result.extractions)} entities from {len(result.text):,} characters")

for extraction in result.extractions:
    if not extraction.attributes:
        continue  # Skip this extraction entirely

    print("Name:", extraction.extraction_text)
    print("Notable event:", extraction.attributes.get("notable_event"))
    print("Year:", extraction.attributes.get("year"))
    print()

And here are our outputs.

LangExtract: model=gemini-2.5-flash, current=7,086 chars, processed=156,201 chars:  [00:43]
✓ Extraction processing complete

✓ Extracted 1 entities (1 unique types)
  • Time: 126.68s
  • Speed: 1,239 chars/sec
  • Chunks: 157
Extracted 1 entities from 156,918 characters

Name: Elon Musk
Notable event: invention of wood
Year: 1775

Not too shabby.
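
The chunk count reported in the run above lines up with the max_char_buffer setting. As a rough sanity check, and only that, the sketch below splits a text of the same length into fixed 1,000-character pieces. The `split_fixed` helper is hypothetical and ignores sentence boundaries; LangExtract's own chunker may split differently, so matching counts are not guaranteed in general.

```python
import math


def split_fixed(text: str, max_chars: int) -> list[str]:
    """Naive fixed-width split; ignores sentence boundaries."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


total_chars = 156_918   # character count reported in the run above
max_char_buffer = 1000  # value passed to lx.extract

print(math.ceil(total_chars / max_char_buffer))              # 157
print(len(split_fixed("x" * total_chars, max_char_buffer)))  # 157
```

The practical point is that a smaller buffer means more chunks, and so more model calls, which is worth keeping in mind when weighing accuracy against cost and run time.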

Note, if you wanted to use an OpenAI model and API key, your extraction code would look something like this,

...
...

import os
from langextract.inference import OpenAILanguageModel

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    language_model_type=OpenAILanguageModel,
    model_id="gpt-4o",
    api_key=os.environ.get('OPENAI_API_KEY'),
    fence_output=True,
    use_schema_constraints=False
)
...
...

Code example 2: extraction visual validation

LangExtract provides a visualisation of how it extracted the text. It's not particularly useful in this example, but it gives you an idea of what's possible.

Just add this little snippet of code to the end of your existing code. This will create an HTML file that you can open in a browser window. From there, you can scroll up and down your input text and “play” back the steps that LangExtract took to get its outputs.

# Save annotated results
lx.io.save_annotated_documents([result], output_name="cattle_disease.jsonl", output_dir="d:/book")

html_obj = lx.visualize("d:/book/cattle_disease.jsonl")
html_string = html_obj.data  # Extract raw HTML string

# Save to file
with open("d:/book/cattle_disease_visualization.html", "w", encoding="utf-8") as f:
    f.write(html_string)

print("Interactive visualization saved to d:/book/cattle_disease_visualization.html")

Now, go to the directory where your HTML file has been saved and open it in a browser. This is what I see.

Code example 3: retrieving multiple structured outputs

In this example, we'll take some unstructured input text, an article from Wikipedia on OpenAI, and try to retrieve the names of all the different large language models mentioned in the article, together with their release dates. The link to the article is,

https://en.wikipedia.org/wiki/OpenAI

Note: Most text in Wikipedia, excluding quotations, has been released under the Creative Commons Attribution-Sharealike 4.0 International License (CC-BY-SA) and the GNU Free Documentation License (GFDL). In short, this means that you are free:

to Share: copy and redistribute the material in any medium or format

to Adapt: remix, transform, and build upon the material

for any purpose, even commercially.

Our code is pretty similar to our first example. This time, though, we're looking for any mentions in the article of LLMs and their release dates. One other step we have to do is clean up the HTML of the article first to ensure that LangExtract has the best chance of reading it. We use the BeautifulSoup library for this.

import langextract as lx
import textwrap
import requests
from bs4 import BeautifulSoup

# Define the prompt and a single worked example for the extraction task
prompt = textwrap.dedent("""Your task is to extract the LLM or AI model names and their release date or year from the input text
        Do not paraphrase or overlap entities.
     """)

examples = [
    lx.data.ExampleData(
        text=textwrap.dedent("""
            Similar to Mistral's previous open models, Mixtral 8x22B was released via a BitTorrent link April 10, 2024
            """),
        extractions=[
            lx.data.Extraction(
                extraction_class="model",
                extraction_text="Mixtral 8x22B",
                attributes={"date": "April 10, 2024"}
            )
        ]
    )
]

# Cleanup our HTML

# Step 1: Download and clean Wikipedia article
url = "https://en.wikipedia.org/wiki/OpenAI"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Get only the visible text
text = soup.get_text(separator="\n", strip=True)

# Optional: remove references, footers, etc.
lines = text.splitlines()
filtered_lines = [line for line in lines if not line.strip().startswith("[") and line.strip()]
clean_text = "\n".join(filtered_lines)

# Do the extraction
result = lx.extract(
    text_or_documents=clean_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    api_key="YOUR_API_KEY",
    extraction_passes=3,    # Improves recall through multiple passes
    max_workers=20,         # Parallel processing for speed
    max_char_buffer=1000    # Smaller contexts for better accuracy
)

# Print our outputs

for extraction in result.extractions:
    if not extraction.attributes:
        continue  # Skip this extraction entirely

    print("Model:", extraction.extraction_text)
    print("Release Date:", extraction.attributes.get("date"))
    print()

This is a cut-down sample of the output I got.

Model: ChatGPT
Release Date: 2020

Model: DALL-E
Release Date: 2020

Model: Sora
Release Date: 2024

Model: ChatGPT
Release Date: November 2022

Model: GPT-2
Release Date: February 2019

Model: GPT-3
Release Date: 2020

Model: DALL-E
Release Date: 2021

Model: ChatGPT
Release Date: December 2022

Model: GPT-4
Release Date: March 14, 2023

Model: Microsoft Copilot
Release Date: September 21, 2023

Model: MS-Copilot
Release Date: December 2023

Model: Microsoft Copilot app
Release Date: December 2023

Model: GPTs
Release Date: November 6, 2023

Model: Sora (text-to-video model)
Release Date: February 2024

Model: o1
Release Date: September 2024

Model: Sora
Release Date: December 2024

Model: DeepSeek-R1
Release Date: January 20, 2025

Model: Operator
Release Date: January 23, 2025

Model: deep research agent
Release Date: February 2, 2025

Model: GPT-2
Release Date: 2019

Model: Whisper
Release Date: 2021

Model: ChatGPT
Release Date: June 2025

...
...
...

Model: ChatGPT Pro
Release Date: December 5, 2024

Model: ChatGPT's agent
Release Date: February 3, 2025

Model: GPT-4.5
Release Date: February 20, 2025

Model: GPT-5
Release Date: February 20, 2025

Model: Chat GPT
Release Date: November 22, 2023
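
The same model shows up several times in the list above with different dates, presumably because each mention in the article is extracted on its own. One way to make the list easier to scan is to group the dates by model name. This is a minimal sketch using stand-in (model, date) pairs copied from the sample output; with real results you would build the pairs from `extraction.extraction_text` and `extraction.attributes.get("date")`.

```python
from collections import defaultdict

# Stand-in (model, date) pairs copied from the sample output above.
rows = [
    ("ChatGPT", "2020"),
    ("DALL-E", "2020"),
    ("ChatGPT", "November 2022"),
    ("GPT-2", "February 2019"),
    ("DALL-E", "2021"),
    ("ChatGPT", "December 2022"),
    ("GPT-2", "2019"),
]

dates_by_model = defaultdict(list)
for model, date in rows:
    if date not in dates_by_model[model]:  # drop exact repeats, keep first-seen order
        dates_by_model[model].append(date)

for model, dates in dates_by_model.items():
    print(f"{model}: {', '.join(dates)}")
```

Grouping does not decide which date is correct; it only puts the candidates side by side so conflicting values are easy to spot.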

Let's double-check a couple of these. One of the outputs from our code was this.

Model: Operator
Release Date: January 23, 2025

And from the Wikipedia article …

“On January 23, OpenAI released Operator, an AI agent and web automation tool for accessing websites to execute goals defined by users. The feature was only available to Pro users in the United States.[113][114]”

So on that occasion, it may have hallucinated the year as being 2025 when no year was given. Remember, though, that LangExtract can use its internal knowledge of the world to supplement its outputs, and it may have got the year from that or from other context surrounding the extracted entity. In any case, I think it would be pretty easy to tweak the input prompt or the output to ignore model release date information that did not include a year.
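
As a sketch of the output-side tweak, the check below flags an extracted date whose year does not literally appear in the source passage it came from. The `year_is_grounded` helper and the regular expression are my own illustration, not part of LangExtract, and in practice you would have to supply the surrounding passage yourself, for example by slicing the source text around the extraction's character offsets.

```python
import re

# Matches four-digit years from 1800 to 2099.
YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def year_is_grounded(extracted_date: str, source_passage: str) -> bool:
    """True if every 4-digit year in the extracted date also appears in the passage."""
    years = YEAR.findall(extracted_date)
    if not years:
        return False  # no year at all in the extracted date
    passage_years = set(YEAR.findall(source_passage))
    return all(year in passage_years for year in years)


passage = "On January 23, OpenAI released Operator, an AI agent and web automation tool."
print(year_is_grounded("January 23, 2025", passage))  # False: 2025 is not in the passage
print(year_is_grounded("March 14, 2023", "GPT-4 was released on March 14, 2023."))  # True
```

A False result does not mean the date is wrong, only that the year was inferred rather than read from the passage, which is exactly the case you may want to review or discard.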

Another output was this.

Model: ChatGPT Pro
Release Date: December 5, 2024

I can see two references to ChatGPT Pro in the original article.

Franzen, Carl (December 5, 2024). “OpenAI launches full o1 model with image uploads and analysis, debuts ChatGPT Pro”. VentureBeat. Archived from the original on December 7, 2024. Retrieved December 11, 2024.

And

In December 2024, during the “12 Days of OpenAI” event, the company launched the Sora model for ChatGPT Plus and Pro users,[105][106] It also launched the advanced OpenAI o1 reasoning model[107][108] Additionally, ChatGPT Pro, a $200/month subscription service offering unlimited o1 access and enhanced voice features, was introduced, and preliminary benchmark results for the upcoming OpenAI o3 models were shared

So I think LangExtract was pretty accurate with this extraction.

Because there were many more “hits” with this query, the visualisation is more interesting, so let's repeat what we did in example 2. Here is the code you'll need.

from pathlib import Path
import builtins
import io
import json
import langextract as lx

jsonl_path = Path("models.jsonl")

# Note: serialize_annotated_document is a helper that is not defined in this snippet
with jsonl_path.open("w", encoding="utf-8") as f:
    json.dump(serialize_annotated_document(result), f, ensure_ascii=False)
    f.write("\n")

html_path = Path("models.html")

# 1) Monkey-patch builtins.open so our JSONL is read as UTF-8
orig_open = builtins.open
def open_utf8(path, mode='r', *args, **kwargs):
    if Path(path) == jsonl_path and 'r' in mode:
        return orig_open(path, mode, encoding='utf-8', *args, **kwargs)
    return orig_open(path, mode, *args, **kwargs)

builtins.open = open_utf8

try:
    # 2) Generate the visualization
    html_obj = lx.visualize(str(jsonl_path))
    html_string = html_obj.data
finally:
    # 3) Restore the original open, even if visualize fails
    builtins.open = orig_open

# 4) Save the HTML out as UTF-8
with html_path.open("w", encoding="utf-8") as f:
    f.write(html_string)

print(f"Interactive visualization saved to: {html_path}")

Run the above code and then open the models.html file in your browser. This time, you should be able to click the Play/Next/Previous buttons and see a better visualisation of the LangExtract text processing in action.

For more details on LangExtract, check out Google's GitHub repo.

Summary

In this article, I introduced you to LangExtract, a new Python library and framework from Google that allows you to extract structured output from unstructured input.

I outlined some of the advantages that using LangExtract can bring, including its ability to handle large documents, its augmented knowledge extraction and multi-model support.

I took you through the installation process, a simple pip install, and then, through some example code, showed how to use LangExtract to perform needle-in-the-haystack type queries on a large body of unstructured text.

In my final example code, I demonstrated a more traditional RAG-type operation by extracting multiple entities (AI model names) and an associated attribute (date of release). For both my main examples, I also showed you how to code a visual representation of how LangExtract works in action that you can open and play back in a browser window.
