Knowledge graphs have become the most widely used enterprise semantic layer, offering a unified view of a company’s suppliers, contracts, products, partners and so on. Consequently, they evolve organically over time to become very large, with millions of nodes (entities) and many times more edges (relations).
Even with governance controls and ontologies in place, adherence across the different pipelines feeding data into the graph is often inconsistent. New business rules emerge, naming conventions change, and older areas of the graph are frequently left untouched because of the sheer complexity and computational cost of upgrading them.
All of this makes a large graph increasingly difficult to maintain. One of the biggest operational problems occurs at the ingestion layer. For every new document that needs to be added, a few questions come up repeatedly, such as the following:
- Does Sony Corp already exist in the graph? And if so, under what name?
- Is the “Sony Corp” listed in this new document the same entity as the “Sony Interactive Entertainment” already present in the graph? Or do they hold different relationships to our organization, thereby requiring a distinct, new node?
- What relationships already exist? Semantic ambiguities (supplies, provides, is contracted for) make reconciliation increasingly difficult at scale.
In the absence of an effective tool that can narrow the search space, ingestion pipelines are forced to execute expensive global graph searches to scan for variations, which degrade performance and incur large computational costs.
What if there were a scalable, low-cost and fast way to scan thousands of historical documents that have already been ingested into the graph and determine the probable entities and relations before querying the knowledge graph? Even better would be to use the context thus gathered for semantic localization: telling the pipeline exactly which specific region of the graph to update, rather than forcing it to traverse the whole thing.
The obvious choice for this pre-filtering step is a vector index.
However, traditional Retrieval-Augmented Generation (RAG) is entirely unsuitable for this task. Standard vector chunking fragments a document into isolated snippets, with no common structural narrative. While chunks may be able to find an entity name, they strip away the surrounding context needed to accurately extract the relationships between companies, products, people, places and so on.
That’s where the Proxy-Pointer architecture comes in.
In this article, I’ll demonstrate a novel approach to quickly and reliably extract entities and relationships from historical documents. By using vector matches as “pointers” to retrieve intact structural sections of a document, we can shift the burden of entity reconciliation away from the expensive Knowledge Graph and onto a significantly faster, cheaper, and more accurate vector retrieval pipeline.
Quick Recap: What is Proxy-Pointer?
Standard vector RAG splits documents into blind chunks, embeds them, and retrieves the top-K by cosine similarity. The synthesizer LLM sees fragmented, context-less text, and frequently hallucinates or misses the answer entirely.
Proxy-Pointer fixes this with five zero-cost engineering techniques:
- Skeleton Tree: Parse Markdown headings into a hierarchical tree (pure Python, no LLM needed)
- Breadcrumb Injection: Prepend the full structural path (AMD > Financial Statements > Cash Flows) to every chunk before embedding
- Structure-Guided Chunking: Split text within section boundaries, never across them
- Noise Filtering: Remove distracting sections (TOC, glossary, executive summaries) from the index
- Pointer-Based Context: Use retrieved chunks as pointers to load the full, unbroken document section for the synthesizer
The result: every chunk knows where it lives in the document, and the synthesizer sees full sections, not fragments.
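To make the first three techniques concrete, here is a minimal sketch in pure Python. It is not the code from the Proxy-Pointer repository: the names `Section`, `build_sections` and `chunk_sections`, the fixed-width splitter and the `max_chars` default are all assumptions made for illustration, and noise filtering is left out.

```python
import re
from dataclasses import dataclass, field

# ATX-style Markdown headings: one to six '#' characters, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Section:
    """One node of the skeleton tree: a heading plus the text directly under it."""
    title: str
    level: int
    breadcrumb: str
    lines: list = field(default_factory=list)


def build_sections(doc_name, markdown):
    """Walk the Markdown headings and record each section with its full path."""
    sections = []
    stack = []  # (level, title) pairs from the root down to the current heading
    current = None
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            # Pop siblings and deeper headings before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
            path = " > ".join([doc_name] + [t for _, t in stack])
            current = Section(title, level, path)
            sections.append(current)
        elif current is not None:
            current.lines.append(line)
    return sections


def chunk_sections(sections, max_chars=500):
    """Split text inside each section only, prefixing every chunk with its breadcrumb."""
    chunks = []
    for idx, section in enumerate(sections):
        text = "\n".join(section.lines).strip()
        for start in range(0, len(text), max_chars):
            body = text[start:start + max_chars]
            chunks.append({
                "section_id": idx,  # pointer back to the full section
                "text": f"{section.breadcrumb}\n{body}",
            })
    return chunks
```

Because the splitter never looks past the end of a section, no chunk can straddle two headings, and the `section_id` stored with each chunk is what later lets a vector hit act as a pointer.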
How Knowledge Graphs Handle Reconciliation
While it’s clear why traditional vector databases aren’t suitable for reconciliation, it’s worth examining how knowledge graphs tackle this problem. Almost all enterprise graph databases can perform semantic similarity matching over nodes and relationships. In addition, graph databases deploy a variety of tools: ontology matching, alias tables, fuzzy matching and graph neural networks (GNNs). But perhaps the most well-known and widely used technique is embedding similarity.
In a modern graph, the nodes and edges carry vector embeddings. Node embeddings include not only the node name (e.g., Sony Corp) but also its metadata (tags such as industry) and its localized topology (neighborhood nodes and relations). In principle, this allows the system to identify nodes that are semantically close even when names differ. For example, a graph search for Sony + gaming ecosystem + supplier may retrieve nodes such as PlayStation ecosystem, Sony Corp or Sony Interactive Entertainment.
However, this approach becomes increasingly difficult at enterprise scale. As the number of semantically similar entities proliferates, whether by design or due to messy historical data, it becomes harder to predict which specific entity node is the correct target for the new relationship we are trying to ingest.
Consider this single sentence: “AMD partnered with Sony for PlayStation semi-custom SoCs”. It contains entity identity (AMD, Sony, PlayStation) but also relationship semantics (partnered with), platform context (PlayStation) and business role (semi-custom SoCs). Implicitly, this sentence maps to multiple distinct relationships: AMD is the chip designer/supplier, Sony is the platform owner/customer, and the interaction is hardware-oriented.
In a large knowledge graph, such diverse relationships aren’t stored close together; they are likely to be distributed across multiple nodes and complex edge paths. Yet in the source document, they are all part of the one section where this sentence occurs.
This makes deciding which of the “Sony-related” nodes is the correct anchor for the new ingestion a massive, computationally expensive challenge.
So, how does Proxy-Pointer solve this?
The answer is that while a traditional vector database is used as a store for a random bag of words (chunks), Proxy-Pointer treats it as a structural index. Using a two-step pipeline that bridges the gap between an exact match and a semantic search, we can address reconciliation efficiently enough to control entity and relation sprawl.
For this test, I downloaded and embedded the publicly available 10-K filings of AMD for 2020 and 2021 using the Proxy-Pointer tree-based indexing. Each of these documents is over 120 pages long, and together they result in a total of ~1000 chunks. This forms our historical dataset, which we assume is already ingested and present in the knowledge graph. I then took the AMD 2022 10-K as the document to be ingested and created 4 test queries around entities detected there.
Here is how this works in practice:
1. Building the Entity Profile (The Query Builder)
When an ingestion pipeline processes a new document (e.g., a 2022 filing), an upstream LLM first extracts the entities and then creates an “Entity Profile” for each one. Instead of just extracting the name “Sony”, it collates the key facts and business context surrounding that entity in the new document.
For example: "Sony is referenced in AMD's 2022 filing as the owner of the PlayStation registered trademark. In the Gaming segment, AMD states that the Sony PlayStation 5 and Microsoft Xbox Series S and X game consoles feature AMD's RDNA graphics architecture. Under Semi-Custom Products, AMD states that it developed the semi-custom SoC products that power both the Sony PlayStation 5 as well as the Microsoft Xbox Series S and X game consoles. AMD's semi-custom SoC revenue is dependent on consumer market conditions, including the success of the Sony PlayStation 5 game console systems worldwide."
Our Query Builder automatically converts this profile into a multi-track vector search strategy. It generates one query for the pure entity name (Sub-Query 1: “Sony”), designed to find any section that explicitly mentions the word “Sony” or any of its variants. Next, we parse the profile to generate multiple questions targeted at finding whether the same or a similar relation exists between “Sony” and another entity. This creates the following two queries in this case:
- Sub-Query 2: "Sony owns the PlayStation trademark. Does the same or a similar relation exist?"
- Sub-Query 3: "Sony uses AMD's RDNA graphics architecture in the PlayStation 5. Does the same or a similar relation exist?"
By sending both the raw entity name and the decomposed relationship questions, Proxy-Pointer creates a customized “semantic net”, improving the probability that the Reconciler gets every relevant section of the document needed to verify the node from different perspectives before ingesting into the Knowledge Graph. However, finding the right chunks in the vector database is only half the battle. This is where traditional RAG fails, and where the Proxy-Pointer architecture makes the leap.
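The multi-track idea can be sketched in a few lines. This is a simplified illustration, not the actual Query Builder: it assumes the relation statements have already been pulled out of the Entity Profile by an upstream step, and the function name `build_sub_queries` and the constant `RELATION_SUFFIX` are hypothetical.

```python
# Question appended to each relation statement, matching the sub-queries above.
RELATION_SUFFIX = "Does the same or a similar relation exist?"


def build_sub_queries(entity_name, relation_statements):
    """Return the name-only query followed by one question per relation statement."""
    queries = [entity_name]
    seen = set()
    for statement in relation_statements:
        cleaned = statement.strip().rstrip(".")
        if not cleaned or cleaned.lower() in seen:
            continue  # skip blanks and repeated statements
        seen.add(cleaned.lower())
        queries.append(f"{cleaned}. {RELATION_SUFFIX}")
    return queries


sub_queries = build_sub_queries(
    "Sony",
    [
        "Sony owns the PlayStation trademark.",
        "Sony uses AMD's RDNA graphics architecture in the PlayStation 5.",
    ],
)
```

Each entry in `sub_queries` is then embedded and searched independently, so a section that only matches one relationship can still surface.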
2. The Vector Hit is Just a “Pointer”
In traditional RAG, the vector database would return a fragmented 500-character chunk, perhaps just the sentence mentioning PlayStation. Proxy-Pointer ignores the text of the chunk entirely. Instead, it uses the chunk’s metadata as a “pointer” to retrieve the full, structurally intact document section (from heading to next heading).
This is what lets the LLM Reconciler see the full semantic context needed to derive relations between entities, e.g., Sony is the owner of PlayStation.
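A minimal sketch of the pointer step follows. The data shapes are assumptions for illustration: each hit is taken to be a dict with a `metadata["section_id"]` field, hits arrive sorted best-first, and `sections` maps a section id to the stored full section. The name `resolve_pointers` is hypothetical.

```python
def resolve_pointers(hits, sections, top_k=5):
    """Map ranked chunk hits to unique full sections, ignoring the chunk text."""
    resolved = []
    seen = set()
    for hit in hits:  # assumed to be sorted best-first by the vector store
        section_id = hit["metadata"]["section_id"]
        if section_id in seen:
            continue  # several chunks can point at the same section
        seen.add(section_id)
        resolved.append(sections[section_id])
        if len(resolved) == top_k:
            break
    return resolved
```

Note that the hit's own `text` is never read: only the pointer matters, and duplicates collapse so the budget of `top_k` is spent on distinct sections.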
3. LLM-Driven Strict Reconciliation
We take the unique top-k sections returned by all the queries and create the context for the Reconciler LLM. The Reconciler is prompted to output all the different variations of the entity name as well as the relations it can see with other entities. Because the LLM can read the entire section, it doesn’t need to guess relationships based on isolated sentences.
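The sketch below shows one way this step could be wired together. It is an assumption-laden illustration rather than the actual prompt: the instruction wording, the function names and the section dict shape (`breadcrumb`, `text`) are hypothetical, and `EXPECTED_KEYS` lists only a subset of the keys visible in the responses shown in this article. The LLM call itself is deliberately left out.

```python
import json

# Subset of the keys seen in the Reconciler responses in this article.
EXPECTED_KEYS = {
    "entity_name",
    "candidate_matches",
    "candidate_relationships",
    "graph_neighborhood",
    "sources",
}


def merge_sections(per_query_sections):
    """Union the sections returned by every sub-query, keeping first-seen order."""
    merged, seen = [], set()
    for sections in per_query_sections:
        for section in sections:
            if section["breadcrumb"] not in seen:
                seen.add(section["breadcrumb"])
                merged.append(section)
    return merged


def build_reconciler_prompt(entity_profile, sections):
    """Assemble the instructions, the entity profile and the full-section evidence."""
    evidence = "\n\n".join(f"[{s['breadcrumb']}]\n{s['text']}" for s in sections)
    instructions = (
        "Using only the evidence below, list every name variation of the entity "
        "and every relationship it has with other entities. "
        "Reply in JSON with the keys: " + ", ".join(sorted(EXPECTED_KEYS)) + "."
    )
    return f"{instructions}\n\nENTITY PROFILE:\n{entity_profile}\n\nEVIDENCE:\n{evidence}"


def parse_reconciler_reply(raw_reply):
    """Parse the JSON reply and fail loudly if an expected key is missing."""
    reply = json.loads(raw_reply)
    missing = EXPECTED_KEYS - set(reply)
    if missing:
        raise ValueError(f"Reconciler reply is missing keys: {sorted(missing)}")
    return reply
```

Labelling each evidence block with its breadcrumb is what allows the Reconciler to cite sources in the form `AMD_2021_10K > Additional Information`.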
What about the variations not in top-k?
A valid architectural question at this point is: what happens if the entity is mentioned across hundreds of historical documents? Since our vector search is limited to retrieving the `top-k` (e.g., 3 to 7) sections, won’t we miss vital historical context?
The answer is that we don’t need all of it. The goal of the Proxy-Pointer filter pipeline is not an exhaustive analysis of the historical documents; it is “Semantic Localization” for graph ingestion. By retrieving just a few highly relevant, full-context sections, the Reconciler LLM captures a critical mass of entity aliases and business relationships, which helps the graph query pinpoint the region where it should look to find and merge this entity and relation.
As we’ll see in the next section, for our “Sony” example, we are able to find the canonical legal entity “Sony Interactive Entertainment, Inc.” In other cases, it may point to broader areas, such as gaming systems or AMD, which can be used as anchors to localize our search in those graph neighborhoods.
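One possible way to turn a Reconciler reply into localization anchors is sketched below. The decision rule and the `merge` / `create` action labels are assumptions for illustration and are not taken from the Proxy-Pointer code; only the field names `candidate_matches`, `name`, `graph_neighborhood` and `related_entity` follow the responses shown in this article.

```python
def localization_anchors(reply):
    """Pick graph anchors: known aliases if any, otherwise neighbouring entities."""
    matches = [m["name"] for m in reply.get("candidate_matches", [])]
    if matches:
        # The entity already exists under these names: search and merge there.
        return {"action": "merge", "anchors": matches}
    # New entity: attach it near the related entities found in the evidence.
    neighbours = [n["related_entity"] for n in reply.get("graph_neighborhood", [])]
    return {"action": "create", "anchors": neighbours}
```

The ingestion pipeline can then restrict its graph lookup to the returned anchors instead of running a global search.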
Test Results
I deployed this architecture against several difficult ingestion scenarios, with remarkable results.
Bridging Aliases (The “Sony” and “Valve” Tests):
When queried with the “Sony” query (Sub-Query 1) mentioned in the earlier sections, the Reconciler read the retrieved historical sections and accurately identified the formal legal entity already in the graph: “Sony Interactive Entertainment, Inc.”. It also successfully resolved the relationship owns trademark PlayStation (Sub-Query 2), confirming that it already exists in the historical disclosures and preventing duplicate edge creation.
More importantly, it also found indirect evidence for Sub-Query 3 (Sony uses AMD's RDNA graphics architecture) in the graph_neighborhood. From the historical 2020 and 2021 documents, it inferred that AMD leverages its core graphics IP to design Semi-Custom SoCs that power the PlayStation 5, which itself is built on the RDNA 2 architecture. This complex multi-hop relationship is therefore already accounted for, which prevents the creation of multiple redundant edges.
Here is the full response:
{
"entity_name": "Sony",
"candidate_matches": [
{
"name": "Sony Interactive Entertainment, Inc.",
"confidence": "HIGH",
"sources": [
"AMD_2020_10K > Pending Acquisition > Additional Information",
"AMD_2021_10K > Additional Information"
],
"reasoning": "The evidence explicitly identifies 'PlayStation' as a registered trademark of 'Sony Interactive Entertainment, Inc.', which is the formal legal entity for the Sony gaming division referenced in the 2022 context."
}
],
"candidate_relationships": [
{
"relationship": "owns trademark",
"target_entity": "PlayStation",
"confidence": "HIGH",
"evidence_snippet": "PlayStation is a registered trademark or trademark of Sony Interactive Entertainment, Inc."
},
{
"relationship": "partner/customer",
"target_entity": "AMD",
"confidence": "HIGH",
"evidence_snippet": "We developed the semi-custom SoC products that power both the Sony PlayStation®5 as well as the Microsoft®Xbox Series X™ and Microsoft® Xbox Series S™ game consoles."
}
],
"graph_neighborhood": [
{
"related_entity": "PlayStation 5",
"relationship_to_target": "Product owned by the target entity",
"evidence_snippet": "We have developed RDNA™ 2, a high performing and power efficient graphics architecture, which is the foundation for next-generation PC gaming graphics, the PlayStation 5 and Xbox Series S and X consoles."
},
{
"related_entity": "Semi-Custom Products",
"relationship_to_target": "Business segment where AMD and the target entity collaborate",
"evidence_snippet": "We leverage our core IP, including our graphics and processing technologies to develop semi-custom solutions... We developed the semi-custom SoC products that power both the Sony PlayStation®5"
}
],
"summary": "The target entity 'Sony' is identified in the 2020-2021 evidence primarily through its subsidiary 'Sony Interactive Entertainment, Inc.', which holds the trademark for the PlayStation brand. The evidence confirms a direct business relationship where AMD provides semi-custom SoC products for the Sony PlayStation 5 console, consistent with the 2022 context.",
"sources": [
"AMD_2020_10K > Pending Acquisition > Additional Information",
"AMD_2021_10K > Additional Information"
]
}
When queried with “Valve”, it successfully extracted two distinct variations, “Valve” and “Valve Corporation”, by analyzing both a product partnership (the Steam Deck) and a trademark disclosure in the same pass. Similar to the Sony query, the candidate_relationships and graph_neighborhood indicate that the relationships in the query are already represented in the graph.
Here are the full input query and the response:
Query: Valve is referenced in AMD's 2022 filing in the context of semi-custom gaming products. AMD states that it partnered with Valve to create a semi-custom APU optimized for handheld gaming to power the Steam Deck. Steam and the Steam logo are identified as trademarks and/or registered trademarks of Valve. The partnership falls within AMD's Gaming segment, which primarily includes discrete GPUs, semi-custom SoC products, and development services.
{
"entity_name": "Valve",
"candidate_matches": [
{
"name": "Valve",
"confidence": "HIGH",
"sources": [
"AMD_2021_10K > The Enterprise, Embedded and Semi-Custom Markets"
],
"reasoning": "The entity is explicitly named in the 2021 10-K as a partner for the Steam Deck."
},
{
"name": "Valve Corporation",
"confidence": "HIGH",
"sources": [
"AMD_2021_10K > Additional Information"
],
"reasoning": "Full legal name provided in the trademark disclosures."
}
],
"candidate_relationships": [
{
"relationship": "partnered with",
"target_entity": "AMD",
"confidence": "HIGH",
"evidence_snippet": "We also recently partnered with Valve to create a custom APU optimized for handheld gaming to power the Steam Deck™."
},
{
"relationship": "owns trademark",
"target_entity": "Steam",
"confidence": "HIGH",
"evidence_snippet": "Steam and the Steam logo are trademarks and/or registered trademarks of Valve Corporation in the United States and/or other countries."
}
],
"graph_neighborhood": [
{
"related_entity": "Steam Deck",
"relationship_to_target": "Product developed by the target entity (Valve) in partnership with AMD.",
"evidence_snippet": "We also recently partnered with Valve to create a custom APU optimized for handheld gaming to power the Steam Deck™."
},
{
"related_entity": "Semi-Custom",
"relationship_to_target": "Business segment under which the partnership between AMD and Valve is categorized.",
"evidence_snippet": "We leverage our core IP, including our graphics and processing technologies to develop semi-custom solutions."
}
],
"summary": "The entity 'Valve' (and its full name 'Valve Corporation') is present in the 2021 evidence. It is identified as a partner of AMD in the 'Semi-Custom' business segment, specifically regarding the development of a custom APU for the 'Steam Deck' product. The evidence also confirms Valve's ownership of the 'Steam' trademark.",
"sources": [
"AMD_2021_10K > The Enterprise, Embedded and Semi-Custom Markets",
"AMD_2021_10K > Additional Information"
]
}
Semantic Localization – 1 (The “Pensando” Test)
This is where the architecture really shines. We asked the pipeline to reconcile “Pensando Systems” (a company AMD acquired in 2022) against the 2020-2021 corpus. As expected, it correctly identified that this is a new node not yet present in the graph. But more importantly, based on the Pensando entity profile, it identified entities from the AMD documents (Data center, Networking and AMD) to which Pensando is likely related. This localizes the graph neighborhood for search, thereby streamlining the ingestion. Here are the query and response:
Query: Pensando Systems, Inc. was acquired by AMD in May 2022 for approximately $1.9 billion. With this acquisition, AMD offers high-performance DPUs and next-generation data center solutions. The AMD Pensando DPUs are P4 programmable and designed to help offload data center infrastructure services from the CPU. Coupled with a comprehensive software stack, they help enable cloud and enterprise customers to optimize performance for network, storage, and security services at cloud scale. The DPUs are designed for minimal latency, jitter, and power impact, and can help large Infrastructure as a Service (IaaS) cloud providers improve hosted virtualized and bare-metal workload performance. The DPUs power the Aruba CX 10000 top-of-rack network switch.
{
"entity_name": "Pensando Systems",
"candidate_matches": [],
"candidate_relationships": [],
"graph_neighborhood": [
{
"related_entity": "Data center",
"relationship_to_target": "The target entity (Pensando Systems) provides DPUs designed to offload infrastructure services from the CPU in this environment.",
"evidence_snippet": "Today’s data centers require new technologies and configuration models to meet the demand driven by the growing amount of data that needs to be stored, accessed, analyzed and managed."
},
{
"related_entity": "Networking",
"relationship_to_target": "The target entity specializes in network services; this is a key focus area for AMD's embedded products.",
"evidence_snippet": "Embedded products address computing needs in enterprise-class telecommunications, networking, security, storage systems and thin clients"
},
{
"related_entity": "AMD",
"relationship_to_target": "AMD is the parent company that acquired the target entity in 2022.",
"evidence_snippet": "Advanced Micro Devices, Inc. (AMD) was incorporated under the laws of Delaware on May 1, 1969"
},
{
"related_entity": "Infrastructure as a Service (IaaS)",
"relationship_to_target": "The target entity's DPUs are designed to improve performance for IaaS cloud providers.",
"evidence_snippet": "Servers are also used in cloud computing, which is a computing model where data, applications and services are delivered over the internet or an intranet"
}
],
"summary": "Pensando Systems does not appear in the 2020-2021 evidence, which is consistent with the fact that the acquisition occurred in 2022. The graph neighborhood confirms that AMD was already heavily invested in the 'Enterprise, Embedded and Semi-Custom' markets, specifically focusing on networking, security, and data center infrastructure, which aligns with the strategic rationale for the later acquisition of Pensando Systems.",
"sources": []
}
Semantic Localization – 2 (The “Product Family” Test)
When asked about “AMD EPYC 9004 Series”, the pipeline correctly inferred that although this entity isn’t present in the graph, it is part of the product series “AMD EPYC Series”, and that its predecessor “AMD EPYC 7003 Series”, powered by the “Zen 3” architecture, is likely to be present within the “Data Center” portfolio. This greatly simplifies finding at least one of these entities in the existing graph and adding the new node there. Here are the query and response:
Query: AMD EPYC 9004 Series processors are part of AMD's data center computing portfolio, listed under the AMD EPYC Series alongside the AMD EPYC 7003 Series and AMD EPYC 7002 Series. The 4th Gen AMD EPYC 9004 Series processors are built on the "Zen 4" core and are designed to deliver leadership performance and energy efficiency across multiple market segments and workloads. They appear in the Data Center segment, which primarily includes server CPUs, GPUs, DPUs, FPGAs, and Adaptive SoC products. The Data Center segment leverages technology to address computational and visual data processing needs, where different combinations of CPUs, GPUs, DPUs, FPGAs, and Adaptive SoCs enable the optimization of performance and power for a diverse set of workloads.
{
"entity_name": "AMD EPYC 9004 Series",
"candidate_matches": [],
"candidate_relationships": [],
"graph_neighborhood": [
{
"related_entity": "AMD EPYC Series",
"relationship_to_target": "The target entity is a member of this product series.",
"evidence_snippet": "Our microprocessors for server platforms currently include the AMD EPYC™ Series processors."
},
{
"related_entity": "AMD EPYC 7003 Series",
"relationship_to_target": "The target entity is the successor to this product line.",
"evidence_snippet": "We launched our 3rd Gen AMD EPYC processors, the AMD EPYC 7003 Series CPUs, in March 2021."
},
{
"related_entity": "Zen 3",
"relationship_to_target": "The target entity uses Zen 4, while the evidence highlights the predecessor Zen 3 architecture.",
"evidence_snippet": "Our new AMD EPYC 7003 Series processors are powered by our “Zen 3” core architecture"
},
{
"related_entity": "Data Center",
"relationship_to_target": "The target entity is part of the Data Center computing portfolio.",
"evidence_snippet": "We leverage our technology to address the computational and visual data processing needs in the data center market where we design CPUs, GPUs, and software for HPC, cloud gaming, and cloud and enterprise customers."
}
],
"summary": "The 'AMD EPYC 9004 Series' does not appear in the 2020-2021 evidence, which is expected as the series was launched after this period. The evidence corpus contains the 'AMD EPYC Series' and the 'AMD EPYC 7003 Series' (3rd Gen), which serve as the immediate predecessors and the relevant product family context for the target entity.",
"sources": []
}
Conclusion
As enterprise knowledge graphs grow to hundreds of millions of nodes, reconciliation increasingly becomes the dominant ingestion bottleneck. Determining whether an entity already exists, under what name, and in which operational context often requires expensive global graph exploration across fragmented relationships and inconsistent semantics.
Proxy-Pointer approaches this problem differently.
Instead of forcing the graph to perform large-scale semantic reconstruction from isolated nodes and edges, it retrieves intact contextual evidence directly from historical documents, where entities, relationships, business roles, and ecosystem context naturally exist together.
In effect, Proxy-Pointer shifts much of the reconciliation burden away from the Knowledge Graph and onto a significantly faster and cheaper vector retrieval pipeline. The graph no longer has to globally infer meaning first. It can focus on what it does best: localized structural traversal, persistence, and governance.
Proxy-Pointer doesn’t replace the Knowledge Graph. It helps the graph understand where to look.
Further Reading
While entity and relation reconciliation in knowledge graphs is a persistent industry pain point, the above pipeline is a relatively simple technical adaptation of the Proxy-Pointer pipeline, available as open source at the Proxy-Pointer GitHub repository. However, Proxy-Pointer is capable of much more. Learn how it can solve one of the most complex use cases with applications across industries: effective domain-contextual comparison of complex enterprise documents, such as contracts and research papers, using a multi-step implementation of the original architecture. The article is at Proxy-Pointer RAG — Structure-Aware Document Comparison at Enterprise Scale.
Join with me and share your feedback at www.linkedin.com/in/partha-sarkar-lets-talk-AI
All documents used in this benchmark are publicly available 10-K filings at SEC.gov. Code and benchmark results are open-source under the MIT License. Images used in this article are generated using Google Gemini.

