    Proxy-Pointer RAG — Structure-Aware Document Comparison at Enterprise Scale

    By Editor Times Featured, May 16, 2026 (10 min read)

    Among the most important AI use cases in an enterprise today, document comparison sits right alongside conversational chatbots. Organizations spend an enormous number of person-hours comparing contracts, policies, technical specifications, legal petitions, research papers and more to identify differences, risks, revisions and semantic inconsistencies.

    However, document comparison is far more complex than a traditional text diff. For one, these tools are meant to be effective assistants to legal and commercial professionals, scientists and others who expect the analysis to have the depth and language of a junior professional in the field.

    An even harder problem is that meaning in enterprise documents usually isn’t contained in isolated chunks. It is embedded within sections, hierarchies, clause groupings and relationships, and these sections may be scattered across a document spanning more than 100 pages. For example, a credit agreement may define collateral limitations in one section, list exceptions to them several pages later, and describe enforcement rights under a completely different article. If another agreement is compared against it using criteria such as “collateral structure, security interests, and lien requirements,” the system must identify, retrieve, and synthesize all of these structurally scattered sections together before any meaningful comparison can occur.

    The Proxy-Pointer architecture, with its structure-aware yet low-cost retrieval pipeline that preserves document hierarchy during retrieval and comparison, is well suited to this task. Using a combination of hierarchical breadcrumb embeddings and a lightweight LLM re-ranker, it extracts semantically aligned regions across documents before comparative reasoning begins.
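To make the idea of hierarchical breadcrumb embeddings more concrete, here is a minimal sketch of how a section's position in the hierarchy might be folded into the text that gets embedded. The `Node` class, the `breadcrumb_text` function and the " > " separator are assumptions made for illustration; they are not taken from the Proxy-Pointer repository.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One section of a document tree (hypothetical structure)."""
    title: str
    body: str = ""
    children: list["Node"] = field(default_factory=list)


def breadcrumb_text(root: Node, sep: str = " > ") -> dict[str, str]:
    """Map each section's breadcrumb path to the text that would be embedded.

    The breadcrumb is prepended to the section body so that the resulting
    vector carries the section's place in the hierarchy, not just its words.
    """
    out: dict[str, str] = {}

    def walk(node: Node, trail: list[str]) -> None:
        path = trail + [node.title]
        crumb = sep.join(path)
        out[crumb] = f"{crumb}\n{node.body}".strip()
        for child in node.children:
            walk(child, path)

    walk(root, [])
    return out


# Illustrative tree; the titles and body text are made up for this example.
agreement = Node("Credit Agreement", children=[
    Node("Article VI Negative Covenants", children=[
        Node("6.01 Liens", "The Borrower shall not create any Lien ..."),
    ]),
])
for crumb in breadcrumb_text(agreement):
    print(crumb)
```

Each resulting string would then be passed to an embedding model, so two sections with similar wording but different positions in the document produce different vectors.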

    In this article, I’m sharing the design and real-world results of a versatile document comparator capable of analyzing both highly complex financial Credit Agreements and academic research papers. As the architecture in the next section shows, the core comparison engine is separated from the upstream document processing and the downstream report formatting and generation, so the system can be adapted to a new document domain (such as insurance policies, medical guidelines, or tax codes). All that is required is an upstream extraction pipeline to structure the input for hierarchical tree generation, and a downstream update to the LLM’s analytical persona and report formatter. The core multi-stage retrieval and comparison pipeline stays untouched.

    I’m also adding the full code to my existing open-source Proxy-Pointer GitHub repository, along with a 5-minute quickstart.

    Document Comparator Architecture

    Here is an overview of the logical architecture. The LLM used is gemini-3-flash, along with gemini-embedding-001 (dimension: 1536) for vector embeddings.

    Architectural Tiers

    Upstream Extraction Layer

    Converts any incoming raw document structure into a standardized, machine-readable hierarchy.

    Packages Concerned

    • extract_pdf_to_md.py: Handles upstream ingestion, converting PDFs into clean, hierarchically formatted Markdown.
    • build_doc_index.py: Parses Markdown headers, filters administrative noise, and builds the hierarchical JSON structure map (_structure.json).
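As a rough illustration of the header-parsing step, the sketch below turns Markdown headers into a nested section tree and drops one kind of administrative noise. The function name `build_structure`, the dictionary keys and the choice of "table of contents" as the noise filter are assumptions for this example; the actual layout of `_structure.json` may differ.

```python
import json
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_structure(
    markdown: str,
    skip_titles: frozenset[str] = frozenset({"table of contents"}),
) -> list[dict]:
    """Parse Markdown headers into a nested list of section dicts."""
    roots: list[dict] = []
    stack: list[dict] = []  # chain of currently open sections, outermost first
    skipping = False        # True while inside a filtered-out section
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            skipping = title.lower() in skip_titles
            if skipping:
                continue
            node = {"title": title, "level": level, "text": "", "children": []}
            # Close every open section at the same or a deeper level.
            while stack and stack[-1]["level"] >= level:
                stack.pop()
            (stack[-1]["children"] if stack else roots).append(node)
            stack.append(node)
        elif stack and not skipping and line.strip():
            stack[-1]["text"] += line.strip() + "\n"
    return roots


sample = """# Agreement
Preamble text.
## Table of Contents
1. Definitions ... 3
## Article I Definitions
### 1.01 Defined Terms
As used in this Agreement ...
"""
print(json.dumps(build_structure(sample), indent=2))
```

This simplified version only filters the body of a skipped section; sub-headers beneath it would still be attached to the nearest open parent.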

    Core Comparison Engine

    Coordinates semantic search over hierarchical document nodes.

    Packages Concerned

    • criteria_validator.py: Dynamically detects the doc_type (e.g., Academic vs. Legal) and performs an initial feasibility check on the user’s comparison criteria to determine whether the criteria are relevant for the identified document type.
    • section_selector.py: Implements Stage 1 Proxy-Pointer retrieval. It identifies and extracts the most relevant sections of Document 1 based on the user criteria, using FAISS semantic search and an LLM re-ranker.
    • cross_retriever.py: Implements Stage 2 Proxy-Pointer retrieval. It performs a targeted semantic search within Document 2’s vector space using the context of the selected Document 1 sections (pairing the Document 1 section content with the user’s criteria as the query). The Proxy-Pointer pipeline is highly accurate at identifying the semantically analogous sections for comparison.
    • section_comparator.py: Coordinates pairwise evaluations of matching sections, passing them to the LLM to analyze alignments and discrepancies.
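The two retrieval stages described above can be sketched in a few lines. To keep the example self-contained, a toy bag-of-words cosine similarity stands in for the Gemini embeddings and the FAISS index, and the LLM re-ranking step is omitted; the function names, the sample sections and the default values of `k1` and `k2` are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, sections: dict[str, str], k: int) -> list[str]:
    """Return the titles of the k sections most similar to the query."""
    q = embed(query)
    ranked = sorted(
        sections,
        key=lambda title: cosine(q, embed(f"{title}\n{sections[title]}")),
        reverse=True,
    )
    return ranked[:k]


def match_sections(criteria: str, doc1: dict[str, str], doc2: dict[str, str],
                   k1: int = 2, k2: int = 3) -> dict[str, list[str]]:
    """Stage 1 picks Document 1 sections; Stage 2 finds their Document 2 analogues."""
    matches: dict[str, list[str]] = {}
    for title in top_k(criteria, doc1, k1):       # Stage 1: criteria as the query
        query = f"{criteria}\n{doc1[title]}"      # Stage 2: criteria + section content
        matches[title] = top_k(query, doc2, k2)   # a real pipeline would re-rank here
    return matches


# Made-up section text, for illustration only.
CRITERIA = "collateral structure, security interests, and lien requirements"
DOC1 = {
    "6.01 Liens": "The borrower shall not create any lien on collateral or security interests",
    "2.05 Interest": "Interest accrues on each loan at the applicable rate",
    "8.01 Events of Default": "Failure to pay principal constitutes an event of default",
}
DOC2 = {
    "7.2 Limitation on Liens": "No loan party shall permit any lien upon its property or collateral",
    "3.1 Interest Rates": "Each loan bears interest at the applicable rate",
    "9.1 Default": "An event of default occurs upon failure to pay",
}
print(match_sections(CRITERIA, DOC1, DOC2, k1=1, k2=2))
```

The point of the second stage is that the query carries the content of the selected Document 1 section, so the search in Document 2 is anchored to what that section actually says rather than to the criteria alone.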

    Downstream Presentation Layer

    Tailors the analytical output to the target audience and formats the final visualization.

    Packages Concerned

    • build_comparison_prompt (in criteria_validator.py): The prompt assigns the appropriate persona (e.g., Expert Academic Researcher or Senior Legal Counsel) based on the detected doc_type.
    • report_builder.py: Renders the final comparison report side-by-side using professional CSS colors and a highly readable layout. The report can also be downloaded as a Markdown file.
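Persona selection by document type could look something like the sketch below. This is not the repository's build_comparison_prompt: the function name `sketch_comparison_prompt`, the `PERSONAS` table, the doc_type keys and the prompt wording are assumptions. Only the two persona names and the two verdict labels come from this article.

```python
# doc_type -> (persona, label for the directional verdict); keys are hypothetical.
PERSONAS = {
    "legal": ("Senior Legal Counsel", "Risk Direction"),
    "academic": ("Expert Academic Researcher", "Methodological Tradeoff"),
}


def sketch_comparison_prompt(doc_type: str, criteria: str,
                             section_1: str, section_2: str) -> str:
    """Assemble a comparison prompt whose persona depends on doc_type."""
    try:
        persona, verdict_label = PERSONAS[doc_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported doc_type: {doc_type!r}") from None
    return (
        f"Act as {persona}.\n"
        f"Compare the two sections below with respect to: {criteria}\n"
        f"Frame the analysis from the perspective of Document 1.\n"
        f"Report a Role, a Discrepancy Score, and a {verdict_label}.\n\n"
        f"--- Document 1 section ---\n{section_1}\n\n"
        f"--- Document 2 section ---\n{section_2}\n"
    )


print(sketch_comparison_prompt("Legal", "lien requirements",
                               "<Document 1 text>", "<Document 2 text>"))
```

Keeping the persona in a lookup table means a new domain only needs a new table entry, which is in line with the article's claim that the comparison engine itself does not change.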

    Dataset Used

    For the prototype, two publicly available Credit Agreements are used: Emerson (136 pages) and Texas Roadhouse (190 pages). These were deliberately chosen because they have different structures and belong to different industries. Emerson is a utility provider, and its agreement reads like a sovereign corporate treasury document based on credit agency ratings, whereas Texas Roadhouse’s agreement is highly customized, built specifically around restaurant leases, multi-entity subsidiary structures, and dynamic leverage ratios.

    In addition, I added the ability to compare research papers, for which I selected VectorFusion and VectorPainter, both used in my article on Multimodal Answers RAG. Both papers are in the highly specialized field of text-to-vector graphics generation. While they share an identical technical foundation, using differentiable rendering (such as DiffVG) to optimize Scalable Vector Graphics (SVG) paths via diffusion models, they differ significantly in their methodological execution. This narrow, shared-domain relationship is a hard test of the comparison engine’s ability to bypass surface-level similarities and instead evaluate subtle architectural differences, as the results below show.

    Comparison of Credit Agreements

    I ran several queries with a diverse set of criteria; the detailed reports are included in full in the repository, and a snapshot is shared below. The Streamlit UI accepts two documents (in either .pdf or .md format) as input, with the comparison performed strictly from the perspective of Document 1. For example, if Document 1 is Emerson and Document 2 is Texas Roadhouse, the final comparison is framed around Emerson.

    There are three steps to the process. First, the system selects all sections from the Emerson agreement that are relevant to the user’s criteria. Second, for each selected section, it finds up to three comparable sections in Texas Roadhouse. Third, it performs a side-by-side analysis. Along with the detailed analysis, the system provides a functional Role, a Discrepancy Score, and a Risk Direction (or Methodological Tradeoff for academic papers).
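The per-section output described above can be pictured as a small record that is later rendered into the Markdown report. The sketch below is an assumption about shape only: the class name `SectionComparison`, its field names and the table layout are hypothetical, and the sample values are placeholders rather than real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionComparison:
    """One Document 1 section compared against its Document 2 matches."""
    doc1_section: str
    doc2_sections: tuple[str, ...]  # zero to three matched sections
    role: str
    discrepancy: str
    direction: str  # Risk Direction, or Methodological Tradeoff for papers

    def __post_init__(self) -> None:
        if len(self.doc2_sections) > 3:
            raise ValueError("at most three matched sections are expected")


def to_markdown(rows: list[SectionComparison],
                direction_label: str = "Risk Direction") -> str:
    """Render comparison records as a Markdown table."""
    header = ("| Document 1 section | Document 2 sections | Role "
              f"| Discrepancy | {direction_label} |")
    divider = "|---|---|---|---|---|"
    body = [
        f"| {r.doc1_section} | {'; '.join(r.doc2_sections)} | {r.role} "
        f"| {r.discrepancy} | {r.direction} |"
        for r in rows
    ]
    return "\n".join([header, divider, *body])


# Placeholder values only; these are not results from the article.
row = SectionComparison(
    doc1_section="<Document 1 section title>",
    doc2_sections=("<match 1>", "<match 2>"),
    role="<functional role>",
    discrepancy="<score>",
    direction="<direction>",
)
print(to_markdown([row]))
```

For research papers, the same record could be rendered with `direction_label="Methodological Tradeoff"`.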

    In the following four cases, Document 1 is Emerson and Document 2 is Texas Roadhouse.

    Criteria 1: collateral structure, security interests, guarantees, and lien requirements

    Criteria 2: events of default, lender remedies, acceleration rights, and cure periods

    Criteria 3: financial covenants, leverage ratio requirements, and borrower compliance obligations

    Criteria 4a: representations and warranties, material adverse effect clauses, and disclosure obligations

    For edge case testing, here is the above “warranties” criteria with the documents switched. In the following, Document 1 is Texas Roadhouse and Document 2 is Emerson.

    Criteria 4b: representations and warranties, material adverse effect clauses, and disclosure obligations

    Analysis of Credit Agreement Comparison

    The results above show that Proxy-Pointer is not just matching clauses by keywords or incomplete chunks. It reads them from the persona of a legal analyst, someone who understands how credit works, across two very different industries: one an investment-grade utility, the other a midsize restaurant chain. For instance, it identifies the economic and legal consequences hidden beneath superficially similar language, such as structural subordination risk inside a negative pledge, enterprise-value preservation inside disposition covenants, or litigation exposure inside disclosure representations.

    Another observation is that the analysis remained directionally consistent when the documents were flipped. It did not anchor itself to Emerson as Document 1, but instead re-evaluated the agreements from the Texas Roadhouse perspective. It correctly identified which agreement placed more restrictions on the borrower, which gave lenders greater control during defaults, which was more vulnerable to assets being moved out of reach, and which required the company to disclose more information. None of these is explicitly written in either agreement. They become evident to a legal analyst when multiple clauses, exceptions, thresholds, and definitions are read together. The result feels less like a simple clause comparison and more like an understanding of how risk and control are shared between the borrower and the lender.

    Research Paper Comparison

    For the VectorFusion and VectorPainter papers, I used the following criteria: Compare how each paper approaches style control and primitive initialization in vector graphics synthesis. Specifically, analyze how VectorFusion uses path reinitialization and raster sample initialization versus how VectorPainter extracts and rearranges vectorized strokes from a reference image using stroke imitation learning and style-preserving losses.

    Right here is one comparability:

    The analysis is a deep, domain-intensive comparison that a researcher can use to compare both papers without reading them in their entirety. Proxy-Pointer moves beyond surface-level architecture matching and identifies the deeper design philosophy behind each paper. It correctly recognizes that VectorFusion treats SVG generation as a dynamic optimization problem with continuous path reinitialization, while VectorPainter approaches it as a style-guided synthesis problem focused on artistic consistency and learned stroke history. It was also interesting that it could connect ideas spread across entirely different sections of the papers and weigh their underlying limitations. This demonstrates a fine-grained analysis of two techniques that sit in the same narrow domain but work differently.

    Open-Source Repository

    Proxy-Pointer is fully open-source (MIT License) and can be accessed at the Proxy-Pointer GitHub repository. The Document Comparator is being added to the repo alongside the existing Text-Only and Multimodal Answering bots.

    A 5-minute quickstart lets you test quickly with the available data.

    DocComparator/
    ├── src/
    │   ├── comparison/
    │   │   ├── cross_retriever.py    # Stage 2 PP Retrieval (Doc 2)
    │   │   ├── section_comparator.py # Pairwise LLM analysis engine
    │   │   └── section_selector.py   # Stage 1 PP Retrieval (Doc 1)
    │   ├── extraction/
    │   │   └── extract_pdf_to_md.py  # LlamaParse PDF ingestion & formatting
    │   ├── indexing/
    │   │   └── build_doc_index.py    # Skeleton tree & FAISS vector builder
    │   ├── report/
    │   │   └── report_builder.py     # Markdown report generation logic
    │   ├── validation/
    │   │   └── criteria_validator.py # Persona injection & criteria feasibility
    │   └── config.py                 # Core configurations and model definitions
    ├── data/                         # Unified Data Hub
    │   └── uploads/                  # Raw PDFs and test documents
    ├── results/                      # Artifact reports for the test cases tried
    └── app.py                        # Streamlit Comparator UI

    Conclusion

    Document comparison using a Chunk-Embed-Match approach is unlikely to give good results. In a complex enterprise document such as Contract Terms and Conditions, semantic meaning is encapsulated in sections and subsections containing dense text. Each of these sections could be pages long and part of a very long document. For effective comparison and analysis, sections, definitions, exceptions, and structural relationships need to be extracted together, because they only make sense when read together.

    Proxy-Pointer, with its accurate two-step retrieval pipeline, is well suited to this task. As the results above show, even with a budget LLM such as gemini-flash, one can compare agreements or research papers in a way that preserves the underlying intent and trade-offs hidden across structurally disparate sections.

    The three-tier architecture of the Document Comparator can scale to other domains with no change to the comparison engine itself. This allows structure-aware retrieval to generalize better than a custom-built tool that works only for a specific type of document. Organizations can adapt it to their specific industries and use cases with minimal incremental engineering effort.

    Clone the repo. Try your own documents. Let me know your thoughts.

    Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

    All research papers used in this article are available at VectorFusion and VectorPainter with a CC-BY license. The credit agreements are publicly available at SEC.gov. Code and benchmark results are open-source under the MIT License. Images used in this article were generated using Google Gemini.


