
    GliNER2: Extracting Structured Information from Text

    By Editor Times Featured | January 6, 2026


    Before LLMs took over, we had SpaCy, the de facto NLP library for both beginners and advanced users. It made it easy to dip your toes into NLP, even if you weren’t a deep learning expert. However, with the rise of ChatGPT and other LLMs, it seems to have been pushed aside.

    While LLMs like Claude or Gemini can do all sorts of NLP tasks automagically, you don’t always want to bring a rocket launcher to a fist fight. GliNER is spearheading the return of smaller, focused models for classic NLP techniques like entity and relationship extraction. It’s lightweight enough to run on a CPU, yet powerful enough to have built a thriving community around it.

    Released earlier this year, GliNER2 is a big leap forward. Where the original GliNER focused on entity recognition (spawning numerous spin-offs like GLiREL for relations and GLiClass for classification), GliNER2 unifies named entity recognition, text classification, relation extraction, and structured data extraction into a single framework.

    The core shift in GliNER2 is its schema-driven approach, which lets you define extraction requirements declaratively and execute multiple tasks in a single inference call. Despite these expanded capabilities, the model remains CPU-efficient, making it a great solution for transforming messy, unstructured text into clean data without the overhead of a large language model.
    As a knowledge graph enthusiast at Neo4j, I’ve been particularly drawn to the newly added structured data extraction via the extract_json method. While entity and relation extraction are valuable on their own, the ability to define a schema and pull structured JSON directly from text is what really excites me. It’s a natural fit for knowledge graph ingestion, where structured, consistent output is essential.

    Constructing knowledge graphs with GliNER2. Image by author.

    In this blog post, we’ll evaluate GliNER2’s capabilities, specifically the model fastino/gliner2-large-v1, with a focus on how well it can help us build clean, structured knowledge graphs.

    The code is available on GitHub.

    Dataset selection

    We’re not running formal benchmarks here, just a quick vibe check to see what GliNER2 can do. Here’s our test text, pulled from the Ada Lovelace Wikipedia page:

    Augusta Ada King, Countess of Lovelace (10 December 1815–27 November 1852), known as Ada Lovelace, was an English mathematician and writer chiefly known for her work on Charles Babbage’s proposed mechanical general-purpose computer, the analytical engine. She was the first to recognise that the machine had applications beyond pure calculation. Lovelace is often considered the first computer programmer. Lovelace was the only legitimate child of poet Lord Byron and reformer Anne Isabella Milbanke. All her half-siblings, Lord Byron’s other children, were born out of wedlock to other women. Lord Byron separated from his wife a month after Ada was born and left England forever. He died in Greece during the Greek War of Independence, when she was eight. Lady Byron was anxious about her daughter’s upbringing and promoted Lovelace’s interest in mathematics and logic, to prevent her developing her father’s perceived insanity. Despite this, Lovelace remained interested in her father, naming one son Byron and the other, for her father’s middle name, Gordon. Lovelace was buried next to her father at her request. Although often ill in childhood, Lovelace pursued her studies assiduously. She married William King in 1835. King was a Baron, and was created Viscount Ockham and 1st Earl of Lovelace in 1838. The title Lovelace was chosen because Ada was descended from the extinct Barons Lovelace. The title given to her husband thus made Ada the Countess of Lovelace.

    At 322 tokens, it’s a solid chunk of text to work with. Let’s dive in.

    Entity extraction

    Let’s start with entity extraction. At its core, entity extraction is the process of automatically identifying and categorizing key entities within text, such as people, locations, organizations, or technical concepts. The original GliNER already handled this well, but GliNER2 takes it further by letting you add descriptions to entity types, giving you finer control over what gets extracted.

    # extractor is a GLiNER2 model (fastino/gliner2-large-v1) loaded beforehand
    entities = extractor.extract_entities(
        text,
        {
            "Person": "Names of people, including nobility titles.",
            "Location": "Countries, cities, or geographic places.",
            "Invention": "Machines, devices, or technological creations.",
            "Event": "Historical events, wars, or conflicts."
        }
    )

    The results are the following:

    Entity extraction results. Image by author.

    Providing custom descriptions for each entity type helps resolve ambiguity and improves extraction accuracy. This is especially useful for broad categories like Event, where, on its own, the model might not know whether to include wars, ceremonies, or personal milestones. Adding "historical events, wars, or conflicts" clarifies the intended scope.

    Relation extraction

    Relation extraction identifies relationships between pairs of entities in text. For example, in the sentence "Steve Jobs founded Apple", a relation extraction model would identify the relationship Founded between the entities Steve Jobs and Apple.

    With GLiNER2, you define only the relation types you want to extract; you can’t constrain which entity types are allowed as the head or tail of each relation. This simplifies the interface but may require post-processing to filter unwanted pairings.
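    A minimal sketch of such post-processing: the function name and the allowed-type map are my own assumptions, and the data shapes (entities as a label-to-names dict, relations as lists of [head, tail] pairs) mirror the combined-extraction output used in the Cypher query later in this post.

```python
# Hypothetical post-processing: GLiNER2 can't constrain which entity
# types appear as the head or tail of a relation, so we filter the
# extracted pairs against the entity lists afterwards.

def filter_relations(relations, entities, allowed):
    """Keep only pairs whose head/tail entity types match `allowed`.

    allowed maps a relation type to a (head_label, tail_label) tuple.
    """
    by_label = {label: set(names) for label, names in entities.items()}
    filtered = {}
    for rel_type, pairs in relations.items():
        head_label, tail_label = allowed.get(rel_type, (None, None))
        if head_label is None:
            filtered[rel_type] = list(pairs)  # no constraint defined
            continue
        filtered[rel_type] = [
            [h, t] for h, t in pairs
            if h in by_label.get(head_label, set())
            and t in by_label.get(tail_label, set())
        ]
    return filtered

entities = {"Person": ["Ada Lovelace", "Charles Babbage"],
            "Invention": ["analytical engine"]}
relations = {"invented": [["Charles Babbage", "analytical engine"],
                          ["analytical engine", "Ada Lovelace"]]}
allowed = {"invented": ("Person", "Invention")}
print(filter_relations(relations, entities, allowed))
# → {'invented': [['Charles Babbage', 'analytical engine']]}
```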

    Here, I added a simple experiment by including both the alias and the same_as relationship definitions.

    relations = extractor.extract_relations(
        text,
        {
            "parent_of": "A person is the parent of another person",
            "married_to": "A person is married to another person",
            "worked_on": "A person contributed to or worked on an invention",
            "invented": "A person created or proposed an invention",
            "alias": "Entity is an alias, nickname, title, or alternate reference for another entity",
            "same_as": "Entity is an alias, nickname, title, or alternate reference for another entity"
        }
    )

    The results are the following:

    Relation extraction results. Image by author.

    The extraction correctly identified key relationships: Lord Byron and Anne Isabella Milbanke as Ada’s parents, her marriage to William King, Babbage as inventor of the analytical engine, and Ada’s work on it. Notably, the model detected Augusta Ada King as an alias of Ada Lovelace, but same_as wasn’t captured despite having an identical description. The choice doesn’t seem random, as the model always populates the alias relationship but never same_as. This highlights how sensitive relation extraction is to label naming, not just descriptions.

    Conveniently, GLiNER2 allows combining multiple extraction types in a single call, so you can get entity types alongside relation types in one pass. However, the operations are independent: entity extraction doesn’t filter or constrain which entities appear in relation extraction, and vice versa. Think of it as running both extractions in parallel rather than as a pipeline.

    schema = (extractor.create_schema()
        .entities({
            "Person": "Names of people, including nobility titles.",
            "Location": "Countries, cities, or geographic places.",
            "Invention": "Machines, devices, or technological creations.",
            "Event": "Historical events, wars, or conflicts."
        })
        .relations({
            "parent_of": "A person is the parent of another person",
            "married_to": "A person is married to another person",
            "worked_on": "A person contributed to or worked on an invention",
            "invented": "A person created or proposed an invention",
            "alias": "Entity is an alias, nickname, title, or alternate reference for another entity"
        })
    )
    
    results = extractor.extract(text, schema)

    The results are the following:

    Combined entity and relation extraction results. Image by author.

    The combined extraction now gives us entity types, which are distinguished by color. However, several nodes appear isolated (Greece, England, Greek War of Independence), since not every extracted entity participates in a detected relationship.
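    Those isolated nodes are easy to spot programmatically. A small helper sketch (the function name is my own; the shapes again follow the combined-extraction output assumed throughout this post):

```python
# Find entities that were extracted but never appear in any relation,
# i.e. the isolated nodes in the graph visualisation.

def isolated_entities(entities, relations):
    all_names = {name for names in entities.values() for name in names}
    connected = {name
                 for pairs in relations.values()
                 for pair in pairs
                 for name in pair}
    return sorted(all_names - connected)

entities = {"Person": ["Ada Lovelace", "Lord Byron"],
            "Location": ["Greece", "England"],
            "Event": ["Greek War of Independence"]}
relations = {"parent_of": [["Lord Byron", "Ada Lovelace"]]}
print(isolated_entities(entities, relations))
# → ['England', 'Greece', 'Greek War of Independence']
```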

    Structured JSON extraction

    Perhaps the most powerful feature is structured data extraction via extract_json. This mimics the structured output functionality of LLMs like ChatGPT or Gemini but runs entirely on CPU. Unlike entity and relation extraction, this lets you define arbitrary fields and pull them into structured records. The syntax follows a field_name::type::description pattern, where type is str or list.

    results = extractor.extract_json(
        text,
        {
            "person": [
                "name::str",
                "gender::str::male or female",
                "alias::str::brief summary of included information about the person",
                "description::str",
                "birth_date::str",
                "death_date::str",
                "parent_of::str",
                "married_to::str"
            ]
        }
    )

    Here we’re experimenting with some overlap: alias, parent_of, and married_to could also be modeled as relations. It’s worth exploring which approach works better for your use case. One interesting addition is the description field, which pushes the boundaries a bit: it’s closer to summary generation than pure extraction.

    The results are the following:

    {
      "person": [
        {
          "name": "Augusta Ada King",
          "gender": null,
          "alias": "Ada Lovelace",
          "description": "English mathematician and writer",
          "birth_date": "10 December 1815",
          "death_date": "27 November 1852",
          "parent_of": "Ada Lovelace",
          "married_to": "William King"
        },
        {
          "name": "Charles Babbage",
          "gender": null,
          "alias": null,
          "description": null,
          "birth_date": null,
          "death_date": null,
          "parent_of": "Ada Lovelace",
          "married_to": null
        },
        {
          "name": "Lord Byron",
          "gender": null,
          "alias": null,
          "description": "reformer",
          "birth_date": null,
          "death_date": null,
          "parent_of": "Ada Lovelace",
          "married_to": null
        },
        {
          "name": "Anne Isabella Milbanke",
          "gender": null,
          "alias": null,
          "description": "reformer",
          "birth_date": null,
          "death_date": null,
          "parent_of": "Ada Lovelace",
          "married_to": null
        },
        {
          "name": "William King",
          "gender": null,
          "alias": null,
          "description": null,
          "birth_date": null,
          "death_date": null,
          "parent_of": "Ada Lovelace",
          "married_to": null
        }
      ]
    }

    The results reveal some limitations. All gender fields are null: even though Ada is explicitly referred to as a daughter, the model doesn’t infer that she’s female. The description field captures only surface-level phrases ("English mathematician and writer", "reformer") rather than producing meaningful summaries, which isn’t helpful for workflows like Microsoft’s GraphRAG that rely on richer entity descriptions. There are also clear errors: Charles Babbage and William King are incorrectly marked as parent_of Ada, and Lord Byron is labeled a reformer (that’s Anne Isabella). These parent_of errors didn’t come up during relation extraction, so perhaps that’s the better method here. Overall, the results suggest the model excels at extraction but struggles with reasoning or inference, likely a tradeoff of its compact size.
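    Since the parent_of errors show up only in extract_json, one way to catch them is to cross-check the JSON records against the extract_relations output and flag disagreements. A sketch under my own naming, with the shapes matching the examples in this post:

```python
# Cross-validate parent_of claims from extract_json against the
# [head, tail] pairs returned by extract_relations.

def check_parent_of(persons, relations):
    """Split the JSON parent_of claims into (confirmed, suspect) lists."""
    confirmed_pairs = {tuple(pair) for pair in relations.get("parent_of", [])}
    confirmed, suspect = [], []
    for p in persons:
        if not p.get("parent_of"):
            continue  # no claim to check
        claim = (p["name"], p["parent_of"])
        (confirmed if claim in confirmed_pairs else suspect).append(claim)
    return confirmed, suspect

persons = [
    {"name": "Lord Byron", "parent_of": "Ada Lovelace"},
    {"name": "Charles Babbage", "parent_of": "Ada Lovelace"},  # JSON error
]
relations = {"parent_of": [["Lord Byron", "Ada Lovelace"],
                           ["Anne Isabella Milbanke", "Ada Lovelace"]]}
confirmed, suspect = check_parent_of(persons, relations)
print(suspect)
# → [('Charles Babbage', 'Ada Lovelace')]
```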

    Additionally, all attributes are optional, which makes sense and simplifies things. However, you have to be careful, as sometimes the name attribute can be null, making the record invalid. Finally, we could use something like Pydantic to validate the results, cast values to appropriate types like floats or dates, and handle invalid records.
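    A sketch of what that Pydantic validation could look like, assuming Pydantic v2: name is required (so a null name fails validation), everything else is optional, and date strings like "10 December 1815" are parsed into real dates. The model class and date format are my own assumptions, not part of GLiNER2.

```python
from datetime import date, datetime
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator

class Person(BaseModel):
    name: str                       # required: a null name makes the record invalid
    gender: Optional[str] = None
    alias: Optional[str] = None
    description: Optional[str] = None
    birth_date: Optional[date] = None

    @field_validator("birth_date", mode="before")
    @classmethod
    def parse_date(cls, v):
        # GLiNER2 returns dates as plain strings like "10 December 1815"
        if isinstance(v, str):
            return datetime.strptime(v, "%d %B %Y").date()
        return v

records = [
    {"name": "Augusta Ada King", "birth_date": "10 December 1815"},
    {"name": None},  # invalid: name is required
]
people = []
for r in records:
    try:
        people.append(Person(**r))
    except ValidationError:
        pass  # drop invalid records
print([p.name for p in people])
# → ['Augusta Ada King']
```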

    Constructing knowledge graphs

    Since GLiNER2 allows multiple extraction types in a single pass, we can combine all of the above methods to construct a knowledge graph. Rather than running separate pipelines for entity, relation, and structured data extraction, a single schema definition handles all three. This makes it easy to go from raw text to a rich, interconnected representation.

    schema = (extractor.create_schema()
        .entities({
            "Person": "Names of people, including nobility titles.",
            "Location": "Countries, cities, or geographic places.",
            "Invention": "Machines, devices, or technological creations.",
            "Event": "Historical events, wars, or conflicts."
        })
        .relations({
            "parent_of": "A person is the parent of another person",
            "married_to": "A person is married to another person",
            "worked_on": "A person contributed to or worked on an invention",
            "invented": "A person created or proposed an invention",
        })
        .structure("person")
            .field("name", dtype="str")
            .field("alias", dtype="str")
            .field("description", dtype="str")
            .field("birth_date", dtype="str")
    )
    
    results = extractor.extract(text, schema)

    How you map these outputs to your graph (nodes, relationships, properties) depends on your data model. In this example, we use the following data model:

    Knowledge graph construction result. Image by author.

    You may notice that we include the original text chunk in the graph as well, which allows us to retrieve and reference the source material when querying the graph, enabling more accurate and traceable results. The import Cypher looks like the following:

    import_cypher_query = """
    // Create Chunk node from text
    CREATE (c:Chunk {text: $text})
    
    // Create Person nodes with properties
    WITH c
    CALL (c) {
      UNWIND $data.person AS p
      WITH p
      WHERE p.name IS NOT NULL
      MERGE (n:__Entity__ {name: p.name})
      SET n.description = p.description,
          n.birth_date = p.birth_date
      MERGE (c)-[:MENTIONS]->(n)
      WITH p, n WHERE p.alias IS NOT NULL
      MERGE (m:__Entity__ {name: p.alias})
      MERGE (n)-[:ALIAS_OF]->(m)
    }
    
    // Create entity nodes dynamically with __Entity__ base label + dynamic label
    CALL (c) {
      UNWIND keys($data.entities) AS label
      UNWIND $data.entities[label] AS entityName
      MERGE (n:__Entity__ {name: entityName})
      SET n:$(label)
      MERGE (c)-[:MENTIONS]->(n)
    }
    
    // Create relationships dynamically
    CALL (c) {
      UNWIND keys($data.relation_extraction) AS relType
      UNWIND $data.relation_extraction[relType] AS rel
      MATCH (a:__Entity__ {name: rel[0]})
      MATCH (b:__Entity__ {name: rel[1]})
      MERGE (a)-[:$(toUpper(relType))]->(b)
    }
    RETURN distinct 'import completed' AS result
    """

    The Cypher query takes the GliNER2 output and stores it in Neo4j. We could also include embeddings for the text chunks, entities, and so on.
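    To glue the two sides together, the combined extraction results need to be reshaped into the $text and $data parameters the query expects. A sketch, assuming the combined output exposes person, entities, and relation_extraction keys (the shapes implied by the query above); the driver call at the end is illustrative, with placeholder connection details:

```python
# Reshape GLiNER2 combined-extraction results into the parameters
# expected by the Cypher import query.

def build_params(text, results):
    return {
        "text": text,
        "data": {
            "person": results.get("person", []),
            "entities": results.get("entities", {}),
            "relation_extraction": results.get("relation_extraction", {}),
        },
    }

params = build_params(
    "Ada Lovelace worked on the analytical engine.",
    {
        "person": [{"name": "Ada Lovelace", "alias": None,
                    "description": None, "birth_date": None}],
        "entities": {"Person": ["Ada Lovelace"],
                     "Invention": ["analytical engine"]},
        "relation_extraction": {
            "worked_on": [["Ada Lovelace", "analytical engine"]]},
    },
)

# With the official Neo4j Python driver, the import would then run as:
# from neo4j import GraphDatabase
# with GraphDatabase.driver("bolt://localhost:7687",
#                           auth=("neo4j", "password")) as driver:
#     driver.execute_query(import_cypher_query, **params)
```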

    Summary

    GliNER2 is a step in the right direction for structured data extraction. With the rise of LLMs, it’s easy to reach for ChatGPT or Claude whenever you need to pull information from text, but that’s often overkill. Running a multi-billion-parameter model to extract a few entities and relationships feels wasteful when smaller, specialized tools can do the job on a CPU.

    GliNER2 unifies named entity recognition, relation extraction, and structured JSON output into a single framework. It’s well-suited for tasks like knowledge graph construction, where you need consistent, schema-driven extraction rather than open-ended generation.
    That said, the model has its limitations. It works best for direct extraction rather than inference or reasoning, and results can be inconsistent. But the progress from the original GliNER to GliNER2 is encouraging, and hopefully we’ll see continued development in this space. For many use cases, a focused extraction model beats an LLM that’s doing far more than you need.

    The code is available on GitHub.


