The Rise of Semantic Entity Resolution

This post introduces the emerging field of semantic entity resolution for knowledge graphs, which uses language models to automate the most painful part of building knowledge graphs from text: deduplicating records. Knowledge graphs extracted from text power most autonomous agents, but these contain many duplicates. The work below includes original research, so this post is necessarily technical.

Semantic entity resolution uses language models to bring an increased level of automation to schema alignment, blocking (grouping records into smaller, efficient blocks for all-pairs comparison at quadratic, n² complexity), matching and even merging duplicate nodes and edges. In the past, entity resolution systems relied on statistical tricks such as string distance, static rules or complex ETL to schema align, block, match and merge records. Semantic entity resolution uses representation learning to gain a deeper understanding of records’ meaning in the domain of a business to automate the same process as part of a knowledge graph factory.
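To make the quadratic cost concrete, here is a small sketch of the arithmetic. The record count and block sizes are made-up numbers chosen for illustration, not figures from the demos in this post.

```python
from math import comb

def pairs_without_blocking(n: int) -> int:
    """Number of unordered record pairs in an all-pairs comparison."""
    return comb(n, 2)

def pairs_with_blocking(block_sizes: list[int]) -> int:
    """Pairs compared when records are only compared within their own block."""
    return sum(comb(size, 2) for size in block_sizes)

n = 10_000
blocks = [10] * 1_000  # assumption: 1,000 equal blocks of 10 records each

print(pairs_without_blocking(n))    # 49995000
print(pairs_with_blocking(blocks))  # 45000
```

Real blocks are never this even, but the gap between the two counts is why blocking comes before matching.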

TLDR

The same technology that transformed textbooks, customer service and programming is coming for entity resolution. Skeptical? Try the interactive demos below… they show potential 🙂

Don’t Just Say It: Prove It

I don’t want to convince you, I want to convert you with interactive demos in each post. Try them, edit the data, see what they can do. Play with it. I hope these simple examples prove the potential of a semantic approach to entity resolution.

This post has two demos. In the first demo we extract companies from news plus Wikipedia for enrichment. In the second demo we deduplicate those companies in a single prompt using semantic matching.
In a second post I’ll demonstrate semantic blocking, a term I define as meaning “using deep embeddings and semantic clustering to build smaller groups of records for pairwise comparison.”
In a third post I’ll show how semantic blocking and matching combine to improve text-to-Cypher of a real knowledge graph in KuzuDB.

Agent-Based Knowledge Graph Explosion!

Why does semantic entity resolution matter at all? It’s about agents!
Autonomous agents are hungry for knowledge, and recent models like Gemini 2.5 Pro make extracting knowledge graphs from text easy. LLMs are so good at extracting structured information from text that there will be more knowledge graphs built from unstructured data in the next eighteen months than have ever existed before. The source of most web traffic is already hungry LLMs consuming text to produce structured information. Autonomous agents are increasingly powered by text-to-query of a graph database via tools like Text2Cypher.

The semantic web turned out to be highly individualistic: every company of any size is about to have its own knowledge graph of its problem domain as a core asset to power the agents that automate its business.

Subplot: Powerful Agents Need Entity Resolved KGs

Companies building agents are about to run straight into entity resolution for knowledge graphs as a complex, often cost-prohibitive problem preventing them from harnessing their organizational knowledge. Extracting knowledge graphs from text with LLMs produces large numbers of duplicate nodes and edges. Garbage in: garbage out. When concepts are split across multiple entities, wrong answers emerge. This limits raw, extracted graphs’ ability to power agents. Entity resolved knowledge graphs are required for agents to do their jobs.

Entity Resolution for Knowledge Graphs

There are several steps to entity resolution for knowledge graphs to go from raw data to retrievable knowledge. Let’s define them to understand how semantic entity resolution improves the process.

Node Deduplication

A low cost blocking function groups similar nodes into smaller blocks (groups) for pairwise comparison, because pairwise comparison scales at n² complexity.
A matching function makes a match decision for each pair of nodes within each block, often with a confidence score and an explanation.
New SAME_AS edges are created between each matched pair of nodes.
This forms clusters of connected nodes called connected components. One component corresponds to one resolved record.
Nodes in components are merged: fields may become lists, which are then deduplicated. Merging nodes can be automated with LLMs.
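The steps above can be sketched end to end in a few lines. This is a toy illustration, not the pipeline used in the demos: `block_key` and `is_match` are hypothetical stand-ins (first letter and first word of the name) for what would be a semantic blocking function and a language model match decision, and the records are invented.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Nvidia Corporation"},
    {"id": 2, "name": "Nvidia"},
    {"id": 3, "name": "NVIDIA Corp."},
    {"id": 4, "name": "Advanced Micro Devices"},
]

def block_key(record):
    # Toy blocking function (assumption): first letter of the name.
    return record["name"][0].lower()

def is_match(a, b):
    # Toy matching function (assumption): same first word, ignoring case.
    # In semantic ER this decision would come from a language model.
    return a["name"].split()[0].lower() == b["name"].split()[0].lower()

def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def resolve(records):
    # Blocking: group records so only pairs within a block are compared.
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    # Matching: every matched pair becomes a SAME_AS edge.
    same_as = [(a["id"], b["id"])
               for block in blocks.values()
               for a, b in combinations(block, 2) if is_match(a, b)]
    # Connected components over the SAME_AS edges.
    parent = {r["id"]: r["id"] for r in records}
    for a, b in same_as:
        parent[find(parent, a)] = find(parent, b)
    components = defaultdict(list)
    for r in records:
        components[find(parent, r["id"])].append(r)
    # Merging: fields become de-duplicated lists.
    merged = [{"merged_ids": sorted(m["id"] for m in members),
               "names": sorted({m["name"] for m in members})}
              for members in components.values()]
    return same_as, merged

same_as, merged = resolve(records)
print(same_as)  # [(1, 2), (1, 3), (2, 3)]
print([m["merged_ids"] for m in merged])  # [[1, 2, 3], [4]]
```

The union-find structure is one common way to compute connected components on a single machine; at larger scale a graph library would do the same job.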

The diagram below illustrates this process:

A Survey of Blocking and Filtering Techniques for Entity Resolution, Papadakis et al, 2020

Edge Deduplication

Merged nodes combine the edges of their source nodes, which includes duplicate edges of the same type that must themselves be combined. Blocking for edges is simpler, but merging can be complex depending on edge properties.

Edges are GROUPED BY their source node id, destination node id and edge type to create edge blocks.
An edge matching function makes a match decision for each pair of edges within an edge block.
Edges are then merged using rules for how to combine properties like weights.
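A minimal sketch of edge blocking and merging follows. It makes two simplifying assumptions that are not from the post: every edge inside a block is treated as a match (the edge matching function is skipped), and the merge rule for weights is a plain sum. The node ids and edge type are invented.

```python
from collections import defaultdict

edges = [
    {"src": "A", "dst": "B", "type": "PARTNERS_WITH", "weight": 1.0},
    {"src": "A", "dst": "B", "type": "PARTNERS_WITH", "weight": 2.0},
    {"src": "A", "dst": "C", "type": "PARTNERS_WITH", "weight": 1.0},
]

def merge_edges(edges):
    # GROUP BY (source id, destination id, edge type) to form edge blocks.
    blocks = defaultdict(list)
    for e in edges:
        blocks[(e["src"], e["dst"], e["type"])].append(e)
    merged = []
    for (src, dst, etype), block in blocks.items():
        merged.append({
            "src": src,
            "dst": dst,
            "type": etype,
            # Assumed merge rule: sum the weights of the duplicates.
            "weight": sum(e["weight"] for e in block),
            "merged_count": len(block),
        })
    return merged

for edge in merge_edges(edges):
    print(edge)
```

Other property types need other rules: taking the maximum, keeping the most recent value, or concatenating and deduplicating lists.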

The resulting entity resolved knowledge graph now accurately represents expertise in the problem domain. Text2Cypher over this knowledge base becomes a powerful way to drive autonomous agents… but not before entity resolution occurs.

Where Existing Tools Come Up Short

Entity resolution for knowledge graphs is a hard problem, so existing ER tools for knowledge graphs are complex. Most entity linking libraries from academia aren’t effective in real world scenarios. Commercial entity resolution products are stuck in a SQL centric world, often limited to people and company records, and can be prohibitively expensive, especially for large knowledge graphs. Both sets of tools match but don’t merge nodes and edges for you, which requires a lot of manual effort through complex ETL. There is an acute need for the simpler, automated workflow semantic entity resolution represents.

Semantic Entity Resolution for Graphs

Modern semantic entity resolution schema aligns, blocks, matches and merges records using pre-trained language models: deep embeddings, semantic clustering and generative AI. It can group, match and merge records in an automated process, using the same transformers that are replacing so many legacy systems because they comprehend the actual meaning of data in the context of a business or problem domain.

Semantic ER isn’t new: it has been state-of-the-art since Ditto used BERT to both block and match in the landmark 2020 paper Deep Entity Matching with Pre-Trained Language Models (Li et al, 2020), beating previous benchmarks by as much as 29%. We used Ditto and BERT to do entity resolution for billions of nodes at Deep Discovery in 2021. Both Google and Amazon have semantic ER offerings… what’s new is its simplicity, making it more accessible to developers. Semantic blocking still uses sentence transformers, with today’s powerful embeddings. Matching has transitioned from custom transformer models to large language models. Merging with language models emerged just this year. It continues to evolve.

Semantic Blocking: Clustering Embedded Data

Semantic blocking uses the same sentence transformer models powering today’s Retrieval Augmented Generation (RAG) systems to convert records into dense vector representations for semantic retrieval using vector similarity measures like cosine similarity. Semantic blocking uses semantic clustering on the fixed-length vector representations provided by sentence encoder models (e.g. SBERT) to group records likely to match based on their semantic similarity in the terms of the data’s problem domain.

Every dimension in a semantic embedding vector has its own meaning, Meet AI’s multitool: Vector embeddings

Semantic clustering is an efficient method of blocking that results in smaller blocks with more positive matches. Unlike traditional syntactic blocking methods, which use string similarity measures to form blocking keys to group records, semantic clustering leverages the rich contextual understanding of modern language models to capture deeper relationships between the fields of records, even when their strings differ dramatically.
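As a rough sketch of the idea, the snippet below groups records whose embeddings are close by cosine similarity. The three-dimensional vectors are hand-written stand-ins for sentence encoder output, and the greedy seed-based grouping and the 0.9 threshold are assumptions chosen to keep the example tiny; a real system would use a proper clustering or nearest-neighbor method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def semantic_blocks(embeddings, threshold=0.9):
    """Greedy blocking: join the first block whose seed is similar enough."""
    blocks = []  # each block is a list of record keys; the first key is its seed
    for key, vec in embeddings.items():
        for block in blocks:
            if cosine(vec, embeddings[block[0]]) >= threshold:
                block.append(key)
                break
        else:
            blocks.append([key])  # no similar seed found: start a new block
    return blocks

# Toy vectors (assumption): stand-ins for real sentence embeddings.
embeddings = {
    "Nvidia Corporation": [0.90, 0.10, 0.05],
    "NVIDIA Corp.": [0.88, 0.12, 0.04],
    "Advanced Micro Devices": [0.10, 0.95, 0.10],
    "AMD": [0.12, 0.90, 0.15],
}

print(semantic_blocks(embeddings))
# [['Nvidia Corporation', 'NVIDIA Corp.'], ['Advanced Micro Devices', 'AMD']]
```

Note that "AMD" and "Advanced Micro Devices" share almost no characters, which is exactly the case where string-based blocking keys struggle and embeddings help.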

You can see semantic clusters emerge in the vector similarity matrix of semantic representations below: they’re the blocks along the diagonals… and they can be beautiful 🙂

You shall know an object by the company it keeps: An investigation of semantic representations derived from object co-occurrence in visual scenes, Sadeghi et al, 2015

While off-the-shelf, pre-trained embeddings can work well, semantic blocking can be greatly enhanced by fine-tuning sentence transformers for entity resolution. I’ve been working on exactly that using contrastive learning for people and company names in a project called Eridu (huggingface). It’s a work in progress, but my prototype address matching model works surprisingly well using synthetic data from GPT4o. You can fine-tune embeddings to both cluster and match.

I’ll demonstrate the specifics of semantic blocking in my second post. Stay tuned!

Align, Match and Merge Data with LLMs

Prompting Large Language Models to both match and merge two or more records is a new and powerful technique. The latest generation of Large Language Models is surprisingly powerful for matching JSON records, which shouldn’t be surprising given how well they can perform information extraction. My initial experiment used BAML to match and merge company records in a single step and worked surprisingly well. Given the rapid pace of improvement in LLMs, it isn’t hard to see that this is the future of entity resolution.

Can an LLM be trusted to perform entity resolution? This should be judged on merit, not preconception. It’s strange to think that LLMs can be trusted to build knowledge graphs whole-cloth, but can’t be trusted to deduplicate their entities! Chain-of-Thought can be employed to produce an explanation for each match. I discuss workloads below, but as the diversity of knowledge graphs expands to cover every business and its agents, there will be a strong demand for simple ER solutions extending the KG construction pipeline using the same tools that make it up: BAML, DSPy and LLMs.

Low-Code Proof-of-Concept

There are two interactive Prompt Fiddle demos below. The entities extracted from the first demo are used as records to be entity resolved in the second.

Extracting Companies from News and Wikipedia

The first demo is an interactive demo showing how to perform information extraction from news and Wikipedia using BAML and Gemini 2.5 Pro. BAML models are based on Jinja2 templates and define what semi-structured data is extracted from a given prompt. They can be exported as Pydantic models, via the baml-cli generate command. The following demo extracts companies from the Wikipedia article on Nvidia.

Click for live demo: Interactive demo of information extraction of companies using BAML + Gemini – Prompt Fiddle

I’ve been doing the above for the past three months for my investment club and… I’ve hardly found a single mistake. Any time I’ve thought a company was erroneous, it was actually a good idea to include it: Meta when Llama models were mentioned. By comparison, state-of-the-art, traditional information extraction tools… don’t work very well. Gemini is far ahead of other models when it comes to information extraction… provided you use the right tool.

BAML and DSPy feel like disruptive technologies. They provide enough accuracy that LLMs become practical for many tasks. They are to LLMs what Ruby on Rails was to web development: they make using LLMs joyous. So much fun! An introduction to BAML is here and you can also check out Ben Lorica’s show about BAML.

A truncated version of the company model appears below. It has 10 fields, most of which won’t be extracted from any one article… so I threw in Wikipedia, which gets most of them. The question marks after properties like exchange string? mean optional, which is important because BAML won’t extract an entity missing a required field. @description gives guidance to the LLM in interpreting the field for both extraction and matching and merging.

Note the type annotations used in the schema guide the process of schema alignment, matching and merging!

Semantic ER Accelerates Enrichment

Once entity resolution is automated, it becomes trivial to flesh out any public facing entity using the wikipedia PyPI package (or a commercial API like Diffbot or Google Knowledge Graph), so in the examples I included Wikipedia articles for some companies, along with a pair of articles about NVIDIA and AMD. Enriching public facing entities from Wikipedia was always on the TODO list when building a knowledge graph but… so often to date, it didn’t get done due to the overhead of schema alignment, entity resolution and merging records. For this post, I added it in minutes. This convinced me there will be a lot of downstream impact from the rapidity of semantic ER.

Semantic Multi-Match-Merge with BAML, Gemini 2.5 Pro

The second demo below performs entity matching on the Company entities extracted during the first demo, along with several more company Wikipedia articles. It merges all 39 records at once without a single mistake! Talk about potential!? It’s not a fast prompt… but you don’t actually need Gemini 2.5 Pro to do it, faster models will work and LLMs can merge many more records than this at once in a 1M token window… and growing fast 🙂

Click for live demo: LLM MulitMatch + MultiMerge – Prompt Fiddle

Merging Guided by Field Descriptions

If you look, you’ll notice that the merge of companies above automatically chooses the full company name when multiple forms are present, owing to the description of the Company.name field: Formal name of the company with corporate suffix. I didn’t have to give that instruction in the prompt! It’s possible to use record metadata to guide schema alignment, matching and merging without directly editing a prompt. Along with merging multiple records in an LLM, I believe this is original work… I stumbled into 🙂

The field annotation in the BAML schema:

class Company {
  name string
  @description("Formal name of the company with corporate suffix")
  ...
}
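To illustrate why a field description can steer a merge without touching the prompt, here is a rough Python sketch in which descriptions travel with the schema into the prompt text. This is not BAML’s rendering logic: `COMPANY_SCHEMA`, `build_merge_prompt`, the `ceo` description and the prompt wording are all hypothetical, written only to show the idea.

```python
import json

# Hypothetical schema: field name -> description shown to the model.
COMPANY_SCHEMA = {
    "name": "Formal name of the company with corporate suffix",
    "ceo": "Current chief executive officer",  # made-up description
}

def build_merge_prompt(schema, records):
    """Render field descriptions and records into one match-and-merge prompt."""
    field_lines = "\n".join(f"- {field}: {desc}" for field, desc in schema.items())
    record_lines = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return (
        "Match and merge the records below that refer to the same company.\n"
        "Fields:\n" + field_lines + "\n"
        "Records:\n" + record_lines
    )

prompt = build_merge_prompt(
    COMPANY_SCHEMA,
    [{"name": "Nvidia Corporation"}, {"name": "Nvidia"}],
)
print(prompt)
```

Because the description of the name field asks for the formal name with corporate suffix, a model reading this prompt has what it needs to prefer the longer form without an explicit merge rule.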

The original two records, one extracted from news, the other from Wikipedia:

{
  "name": "Nvidia Corporation",
  "ticker": {
    "symbol": "NVDA",
    "exchange": "NASDAQ"
  },
  "description": "An American technology company, founded in 1993, specializing in GPUs (e.g., Blackwell), SoCs, and full-stack AI computing platforms like DGX Cloud. A dominant player in the AI, gaming, and data center markets, it's led by CEO Jensen Huang and headquartered in Santa Clara, California.",
  "website_url": "null",
  "headquarters_location": "Santa Clara, California, USA",
  "revenue_usd": 10918000000,
  "employees": null,
  "founded_year": 1993,
  "ceo": "Jensen Huang",
  "linkedin_url": "null"
}
{
  "name": "Nvidia",
  "ticker": null,
  "description": "A company specializing in GPUs and full-stack AI computing platforms, including the GB200 and Blackwell series, and platforms like DGX Cloud.",
  "website_url": "null",
  "headquarters_location": "null",
  "revenue_usd": null,
  "employees": null,
  "founded_year": null,
  "ceo": "null",
  "linkedin_url": "null"
}

The matched and merged record appears below. Note that the longer Nvidia Corporation was chosen without specific guidance, based on the field description. Also, the description is a summary of both the Nvidia mention in the article and the Wikipedia entry. And no, the schemas don’t have to be the same 🙂

{
  "name": "Nvidia Corporation",
  "ticker": {
    "symbol": "NVDA",
    "exchange": "NASDAQ"
  },
  "description": "An American technology company, founded in 1993, specializing in GPUs (e.g., Blackwell), SoCs, and full-stack AI computing platforms like DGX Cloud. A dominant player in the AI, gaming, and data center markets, it's led by CEO Jensen Huang and headquartered in Santa Clara, California.",
  "website_url": "null",
  "headquarters_location": "Santa Clara, California, USA",
  "revenue_usd": 10918000000,
  "employees": null,
  "founded_year": 1993,
  "ceo": "Jensen Huang",
  "linkedin_url": "null"
}

Below is the prompt, all pretty and branded for a slide:

This simple prompt both matches and merges 39 records in the above demo, guided by the type annotations.

Now to be clear: there’s a lot more than matching in a production entity resolution system… you need to assign unique identifiers to new records and include the merged IDs as a field, to keep track of which records were merged… at a minimum. I do this in my investment club’s pipeline. My goal is to show you the potential of semantic matching and merging using large language models… if you’d like to take it further, I can help. We do that at Graphlet AI 🙂
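The bookkeeping described above, unique identifiers plus a field listing the merged IDs, might look something like this minimal sketch. The function names are hypothetical, and the "first non-null value wins" rule is a placeholder for the LLM merge, not how the pipeline in this post decides field values.

```python
import uuid

def assign_ids(records):
    """Give every incoming record a unique identifier if it lacks one."""
    return [{**r, "id": r.get("id") or str(uuid.uuid4())} for r in records]

def merge_with_lineage(matched):
    """Merge matched records and record which source IDs went into the result."""
    merged = {
        "id": str(uuid.uuid4()),  # the merged record gets its own new ID
        "merged_ids": sorted(r["id"] for r in matched),
    }
    for r in matched:
        for field, value in r.items():
            if field == "id" or value is None:
                continue
            # Placeholder merge rule (assumption): first non-null value wins.
            merged.setdefault(field, value)
    return merged

records = assign_ids([
    {"name": "Nvidia Corporation", "ceo": None},
    {"name": "Nvidia", "ceo": "Jensen Huang"},
])
merged = merge_with_lineage(records)
print(merged["name"], merged["ceo"], len(merged["merged_ids"]))
```

Keeping `merged_ids` on the merged record means a later run can trace any resolved entity back to the raw extractions it came from.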

Schema Alignment? Coming Up!

Another tough problem in entity resolution is schema alignment: different sources of data for the same type of entity have fields that don’t exactly match. Schema alignment is a painful process that usually occurs before entity resolution is possible… with semantic matching and similar names or descriptions, schema alignment just happens. The records being matched and merged will align using the power of representation learning… which understands that the underlying concepts are the same, so the schemas align.

Beyond Matching

An interesting aspect of doing multiple record comparisons at once is that it provides an opportunity for the language model to observe, evaluate and comment on the group of records in the prompt. In my own entity resolution pipeline, I combine and summarize multiple descriptions of companies in Company objects, extracted from different news articles, each of which summarizes the company as it appears in that particular article. This provides a comprehensive description of a company in terms of its relationships that is not otherwise available.

I believe there are many opportunities like this, given that even last year’s LLMs can do linear and non-linear regression… check out From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples (Vacareanu et al, 2024).

From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples, Vacareanu 2024.

There is no end to the observations an LLM might make about groups of records: tasks related to entity resolution, but not limited to it.

Cost and Scalability

The early, high cost of large language model APIs and the historically high price of GPU inference have created skepticism about whether semantic entity resolution can scale.

Scaling Blocking via Semantic Clustering

Matching in entity resolution for knowledge graphs is just link prediction of SAME_AS edges, a common graph machine learning task. There is little question that semantic clustering for link prediction can cost-efficiently scale, as the technique was proven at Google by Google Grale (Halcrow et al, 2020, NeurIPS presentation). That paper’s authors include graph learning luminary Bryan Perozzi, recent winner of KDD’s Test of Time Award for his invention of graph embeddings.

It scales for Google… Grale: Designing Networks for Graph Learning, Jonathan Halcrow, Google Research

Semantic clustering in Grale is an important part of the machine learning behind many features across Google’s web properties, including recommendations at YouTube. Note that Google also uses language models to match nodes during link prediction in Grale 🙂 Google also uses semantic clustering in its Entity Reconciliation API for its Enterprise Knowledge Graph service.

Clustering in Grale uses Locality Sensitive Hashing (LSH). Another efficient method of clustering via information retrieval is to use L2 / Approximate K-Nearest Neighbors clustering in a vector database such as Facebook FAISS (blog post) or Milvus. In FAISS, records are clustered during indexing and may be retrieved as groups of similar records via A-KNN.
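For readers new to LSH, here is a minimal sketch of one textbook variant, random hyperplane hashing, in plain Python. It is not Grale’s implementation and not how FAISS indexes vectors; the dimension, bit count, seed and toy vectors are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )

def lsh_buckets(embeddings, hyperplanes):
    """Records with identical signatures land in the same candidate block."""
    buckets = defaultdict(list)
    for key, vec in embeddings.items():
        buckets[lsh_signature(vec, hyperplanes)].append(key)
    return dict(buckets)

planes = make_hyperplanes(dim=3, n_bits=4)
# Toy vectors (assumption): stand-ins for real sentence embeddings.
embeddings = {
    "Nvidia Corporation": [0.90, 0.10, 0.05],
    "NVIDIA Corp.": [0.88, 0.12, 0.04],
    "Advanced Micro Devices": [0.10, 0.95, 0.10],
}
for signature, keys in lsh_buckets(embeddings, planes).items():
    print(signature, keys)
```

Vectors pointing in similar directions tend to share a signature, so each bucket is a candidate block produced without any pairwise comparison. Using fewer bits gives larger buckets, and more bits gives smaller ones, which is the basic tuning knob.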

I’ll talk more about scaling semantic blocking in my second post!

Scaling Matching via Large Language Models

Large Language Models are resource intensive and employ GPUs for efficiency in both training and inference. There are three reasons to be optimistic about their efficiency for entity resolution.

1. LLMs are constantly, rapidly becoming less expensive… don’t fit your budget today? Wait a month.

State of Foundation Models, 2025 by Innovation Endeavors

…and more capable. Not accurate enough today? Wait a week for the new best model. Given time, your satisfaction is inevitable.

The economics of matching via an LLM were first explored in Cost-Efficient Prompt Engineering for Unsupervised Entity Resolution (Nananukul et al, 2023). The authors include Mayank Kejriwal, who wrote the bible of KGs. They achieved surprisingly accurate results, given how bad GPT3.5 now appears.

2. Semantic blocking can be more effective, meaning smaller blocks with more positive matches. I’ll demonstrate this process in my next post.

3. Multiple records, even multiple blocks, can be matched simultaneously in a single prompt, given that modern LLMs have 1 million token context windows. 39 records match and merge at once in the demo above, but ultimately, thousands will at once.

In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration, Fu et al, 2025.
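One way to picture matching several blocks per prompt is as a packing problem: fill each prompt with whole blocks until a token budget is reached. The sketch below is an assumption-laden illustration, not the method from the cited paper or from the demo: `estimate_tokens` uses a crude four-characters-per-token guess, and `pack_blocks` is a hypothetical helper.

```python
import json

def estimate_tokens(record):
    # Crude assumption: roughly 4 characters per token of serialized JSON.
    return len(json.dumps(record)) // 4 + 1

def pack_blocks(blocks, token_budget):
    """Greedily pack whole blocks into prompt-sized batches, in order."""
    batches, current, used = [], [], 0
    for block in blocks:
        cost = sum(estimate_tokens(r) for r in block)
        # Start a new batch when this block would overflow the budget.
        if current and used + cost > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        batches.append(current)
    return batches

record = {"name": "x" * 36}
blocks = [[record], [record], [record]]
budget = 2 * estimate_tokens(record)  # room for two single-record blocks
print([len(batch) for batch in pack_blocks(blocks, budget)])  # [2, 1]
```

Blocks are never split across prompts here, since every pair inside a block needs to be visible to the model at once. A block larger than the budget still gets its own batch, so a real pipeline would need to cap block size upstream.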

Skepticism: A Tale of Two Workloads

Some workloads are appropriate for semantic entity resolution today, while others are not yet. Let’s explore what works today and what doesn’t.

Semantic entity resolution is best suited for knowledge graphs that have been extracted from unstructured text using a large language model, which you already trust to generate the data. You also trust embeddings to retrieve the data. Why wouldn’t you trust embeddings to block your data into matching groups, followed by an LLM to match and merge records?

Modern LLMs and tools like BAML are so powerful for information extraction from text that the next two years will see a proliferation of knowledge graphs covering everything from traditional domains like science, e-commerce, marketing, finance, manufacturing and biomedicine to… anything and everything: sports, fashion, cosmetics, hip-hop, crafts, entertainment, non-fiction (every book gets a KG), even fiction (I predict a huge Cthulhu Mythos KG… which I may now build). These kinds of workloads will skip traditional entity resolution tools entirely and perform semantic entity resolution as another step in their KG construction pipelines.

Idempotence for Entity Resolution

Semantic entity resolution isn’t ready for finance and medicine, both of which have strict idempotence (reproducibility) as a legal requirement. This has led to scare tactics that pretend this applies to all workloads.

LLM output varies for several reasons. GPUs execute multiple threads concurrently that finish in varying orders. There are hardware and software settings to reduce or remove variation to improve consistency at a performance hit, but it isn’t clear these remove all variation even on the same hardware. Strict idempotence is only possible when hosting large language models on the same hardware between runs using a variety of hardware and software settings and at a performance penalty… it requires a proof-of-concept. That is likely to change via special hardware designed for financial institutions as LLMs take over the rest of the world. Regulations are also likely to change over time to accommodate statistical precision rather than precise determinism.

For explanations of matching and merging records, idempotent workloads must also address the fact that Reasoning Models Don’t Always Say What They Think (Chen et al, 2025). See more recently, Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens, Zhao et al, 2025. This is possible with sufficient validation using emerging tools like prompt tuning for accurate, fully reproducible behavior.

Data Provenance

If you use semantic methods to block, match and merge for existing entity resolution workloads, you must still track the reason for a match and maintain data provenance: a complete lineage of records. This is hard work! That means that most businesses will choose a tool that leverages language models, rather than doing their own entity resolution. Keep in mind that most knowledge graphs two years from now will be new knowledge graphs built by large language models in other domains.

Abzu Capital

I’m not a vendor selling you a product… I strongly believe in open source, open data tools. I’m in an investment club that built an entity resolved knowledge graph of AI, robotics and data-center related industries using this technology. We wanted to invest in smaller technology companies with high growth potential that cut deals and form strategic relationships with bigger players with large capital expenditures… but reading Form 10-K reports, tracking the news and adding up the deals for even a handful of investments became a full time job. So we built agents powered by a knowledge graph of companies, technologies and products to automate the process! This is the place from which this post comes.

Conclusion

In this post, we explored semantic entity resolution. We demonstrated proof-of-concept information extraction and entity matching using Large Language Models (LLMs). I encourage you to play with the provided demos and come to your own conclusions about semantic entity matching. I think the simple result above, combined with the other two posts, will show early adopters this is the way the market will turn, one workload at a time.

Up Next…

This is the first post in a series of three posts. In the second post, I’ll demonstrate semantic blocking by semantic clustering of sentence encoded records. In my final post, I’ll provide an end-to-end example of semantic entity resolution to improve text-to-Cypher on a real knowledge graph for a real-world use case. Stick around, I think you’ll be pleased 🙂

At Graphlet AI we build autonomous agents powered by entity resolved knowledge graphs for companies large and small. We build large knowledge graphs from structured and unstructured data: millions, billions or trillions of nodes and edges. I lead the Spark GraphFrames project, widely used in entity resolution for connected components. I have a 20 year background in, and teach, network science, graph machine learning and NLP. I built and product managed LinkedIn InMaps and Career Explorer. I was a visualization engineer at Ning (Marc Andreessen’s social network), evangelist at Hortonworks and Principal Data Scientist at Walmart. I coined the term “agile data science” in 2009 (from 0 hits on Google) and wrote the first agile data science methodology in Agile Data Science (O’Reilly Media, 2013). I improved it in Agile Data Science 2.0 (O’Reilly Media, 2017), which has a 4-star rating on Amazon 8 years later (code still works). I wrote the first fully data-driven market report for O’Reilly Media in 2015. I’m an Apache Committer on DataFu, I wrote the Apache Druid onboarding docs, and I maintain graph sampler Little Ball of Fur and graph embedding collection Karate Club.

This post originally appeared on the Graphlet AI Blog.
