Ms don’t deserve the classic playbook. Article 3 said there is no such thing as THE RAG method. You still have to choose one. This article is the diagnostic that tells you which.
Most teams building RAG systems reach for the same playbook: parse the document into chunks, embed each chunk, drop them in a vector store, embed the question, retrieve the top-k by cosine similarity, hand the result to an LLM. Call it the classic RAG playbook. Every tutorial teaches it. Every demo runs on it.
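To make the playbook concrete, here is a minimal sketch of it in Python. It illustrates the steps under stated assumptions and is not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector store" is a plain list, the final LLM call is left out, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40) -> list[str]:
    """Naive chunker: fixed-size windows of `size` words, blind to layout."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, document: str, k: int = 3, size: int = 40) -> list[str]:
    """Chunk, embed, rank every chunk by cosine similarity, keep the top-k."""
    index = [(c, embed(c)) for c in chunk(document, size)]  # the "vector store"
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """The prompt the playbook hands to an LLM (the call itself is not shown)."""
    return "Answer using only this context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {question}"
```

Run it on a three-sentence policy with `size=8` and the date "1 March 2024" is split across two chunks. The top-ranked chunk for "What is the effective date?" therefore ends in "is 1": a small preview of why parsing decisions matter.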
The real problems vary far more than the playbook suggests. Here are three real cases, at three different extremes.
Templated, high-volume documents. Insurance certificates, KYC forms, regulatory filings, monthly brokerage statements. The same software writes the same format on every document. A hundred lines of regex extract the fields in microseconds. The classic playbook works here too, but it pays an LLM to do what the format gives you for free.
Same shape across industries: payroll stubs, bank statements, lab test reports, tax filings, compliance attestations, supplier invoices from one ERP. Wherever one piece of software writes every document, the format is a contract.
Sarcasm in customer-service transcripts. “Find every sarcastic remark in this month’s call recordings.” Standard sentiment scoring (anger, frustration, joy) is largely solved by a sentiment lexicon: unacceptable, ridiculous, annoyed all flag cleanly. Sarcasm is the canonical exception. “Oh, fantastic service, only had to wait 45 minutes” scores positive on every lexicon, and the embedding clusters it with the sincere version because the surface words are nearly the same. The only honest method is an LLM that reads each call in full and judges the gap between what is said and what is meant.
Same shape across functions: HR exit interviews searched for hidden frustration, internal-chat archives scanned for cultural red flags before an M&A deal closes, earnings-call transcripts searched for places where the CFO hedged, sales-call recordings checked for promises the contract didn’t authorise. Tone and intent, with no anchor in the text.
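A toy version of lexicon scoring shows why sarcasm escapes it. The six-entry lexicon below is hypothetical and much smaller than any real one. The mechanics are the same, though: sum the per-word polarities and ignore everything else.

```python
import re

# Hypothetical mini-lexicon: word -> polarity. Real lexicons are larger but score the same way.
LEXICON = {
    "fantastic": 2, "great": 2, "thanks": 1,
    "unacceptable": -2, "ridiculous": -2, "annoyed": -2,
}


def lexicon_score(utterance: str) -> int:
    """Sum the polarity of every known word; tone and intent are invisible to it."""
    return sum(LEXICON.get(token, 0) for token in re.findall(r"[a-z']+", utterance.lower()))


sincere = "Fantastic service, the agent fixed it in two minutes."
sarcastic = "Oh, fantastic service, only had to wait 45 minutes."
angry = "This is ridiculous and unacceptable."
```

The sincere remark and the sarcastic one both score +2, while the openly angry one scores -4. The lexicon handles explicit sentiment and is blind to the gap between what is said and what is meant.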
Engineering schematics (a different axis altogether). Drawings, slides where the data lives in the chart, technical specs with embedded images. Pure-text RAG returns the caption and misses the schematic. Vision models fit here, and only here.
Same shape: architectural blueprints, scanned handwritten records, slide decks where the data lives in the chart, lab notebook pages, medical imaging reports. Anywhere the meaning lives in the pixels.
The classic playbook is overkill on templated documents (regex would do), dimensionally wrong on call transcripts (no anchor exists), and modality-blind on schematics (vision is required). It fits a middle band of problems and ships as if it covered everything. That middle band is real, and Section 3.3 catalogues it; the cost of mismatch on the rest is what this article exists to prevent.
This article is the diagnostic. Three steps, in order.
- Identify the two axes: RAG problems aren’t a single problem. They sit on a picture with two axes: how structured your documents are, and how controlled your questions are. Each combination calls for a different stack.
- Identify the techniques per region: Each region of the picture has its own stack: regex, section retrieval, hybrid retrieval (lexical search + embedding similarity), vision, SQL aggregation. A third axis (the agentic dimension, section 2.4) sits on top of these and decides how much runtime control the LLM gets. The catalog later in the article maps each region to its technique zone.
- Locate your own case: Where do your documents sit on the complexity axis? Where do your questions sit on the control axis? The intersection points to a region, and to the techniques that fit it.
You’re not here to build everything. You’re here to find where you sit, then read the parts of the series that match. Most readers will skip half of it.
A note before the article gets technical. Most enterprise RAG comes in two shapes: extracting fields from templated documents (the regex case in the opener), or answering free-form questions about heterogeneous documents like contracts and reports (where the rest of the series spends most of its time). Conversational transcripts are a real third shape, common in customer service, HR, and compliance; sarcasm is the hardest question they raise. Pure vision content (schematics, slide decks) and corpus-scale questions (Part IV) come up less often. You may meet one or two of these. The grid below lets you locate your case at a glance.
This diagnostic is one piece of a larger framing: Enterprise Document Intelligence Volume 1 builds enterprise RAG brick by brick, and each region of the grid this article maps points to the articles in the series where its technique gets built.
1. Two axes: doc complexity and query management
Every problem we’ll meet in this series sits somewhere on two axes:
- Document complexity: How redundant is the structure across your documents? Can a parser address fields by position, by heading, or do you need a model that sees the page?
- Question control: Who frames the question? An engineer writing a fixed prompt, or a user typing freely into a chat box, possibly with no idea what to ask?
These two axes are almost independent. The one coupling: a fixed-template document (Tier 1, below) usually forces engineer-templated questions (Tier A), since the user never types a question. Outside that corner, any document tier can pair with any question tier.
1.1 Document axis: from a fixed template to a vision model
Volume 1 stays inside the PDF scope. Multi-format documents (Word, Excel, PowerPoint, mail) are Volume 2’s territory; everything below describes one PDF at a time.
Documents vary in structural redundancy: how much of their layout is shared across the corpus. Five tiers cover most enterprise situations.

Tier 1: Fixed template: Every document has the same structure, the same fields in the same place, often produced by the same software: insurance certificates from a single broker, KYC forms, tax filings, internal compliance attestations. The structure is so predictable that you can address fields by their coordinates on the page. Technique: regex or coordinate-based extraction, no model.
Tier 2: Family of templates: Documents follow a recognizable pattern with variations (different vendor, different software, different year): invoices across suppliers, leases across landlords, employment contracts across companies in the same legal framework. Technique: a regex per template plus a few-shot LLM as fallback when the template drifts.
Tier 3: Heterogeneous structured: Each document has its own structure (sections, headings, tables of contents) but the structures don’t repeat across documents: custom legal contracts, technical manuals from different vendors, financial reports. Technique: parse the structure, retrieve via the document’s own table of contents.
Tier 4: Unstructured / OCR’d: Scanned PDFs, photos of paper, emails, free-form notes: the text is there but the layout is degraded or absent. Technique: OCR with confidence scoring, then hybrid retrieval (lexical + embeddings) over the noisy text.
Tier 5: Visually rich: Documents where the meaning lives in the visuals: schematics, dense data tables embedded as images, slide decks with charts, engineering drawings. A pure-text parse loses the answer. Technique: a vision-capable model on the page image, often combined with text-side RAG.
The further down this axis you sit, the more you pay per document. The right move is to push every problem as far up as honest analysis allows. A team that decides their corpus is “too complex for regex” without checking the structural redundancy is choosing the expensive answer by default.
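Tier 4’s technique, OCR with confidence scoring followed by hybrid retrieval, is the least self-explanatory of the five. Below is a minimal sketch. It assumes the OCR engine reports a per-chunk confidence between 0 and 1, and that a lexical ranker and an embedding ranker have already produced their ranked lists of chunk ids. The fusion rule shown is reciprocal rank fusion (RRF). It is one common choice, not necessarily the one the series uses.

```python
from collections import defaultdict


def low_confidence(chunk_confidence: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Chunk ids whose mean OCR confidence (0-1) falls below the threshold, for review."""
    return sorted(cid for cid, conf in chunk_confidence.items() if conf < threshold)


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

With a lexical ranking `["c3", "c1", "c2"]` and an embedding ranking `["c1", "c4", "c3"]`, the fused order is `c1, c3, c4, c2`. A chunk both rankers agree on rises to the top, even if neither put it first.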
1.2 Question axis: from a fixed prompt to a multi-turn chatbot
The question axis is the one most teams skip. Two questions can look identical syntactically yet require completely different stacks. The dimension that matters is who controls the question and how much.

Tier A: Engineer-templated: The question is a parameter of the system: “Extract the effective date.”, “What is the policy number?”. The engineer wrote the prompt, calibrated it, tested it on a thousand documents. The user, if any, doesn’t even type a question. Technique: field extraction, structured output, no question-parsing step needed.
Tier B: User fills slots: The question is a template with user-supplied values: “Show me the clause about {topic} in this contract.” The user picks the topic from a list, or types a tag. The shape of the query is fixed, only one slot varies. Technique: section retrieval, lookup against a known taxonomy.
Tier C: Free user query, one-shot: The user types whatever they want, the system answers in one go: “Why does this contract differ from last year’s?”. This is the classic chat-with-your-document setup, where the pipeline must parse the question, decide what to retrieve, and answer. Technique: single-document RAG with question parsing.
Tier D: Free query plus clarification. Same as C, but the system can ask the user back when the question is ambiguous: “Which page do you mean? Did you mean the sub-tenant or the main tenant?” This is what real chatbots do, and it dramatically widens the range of questions a system can serve. Technique: question parsing plus a clarification loop.
A small example to make the clarification idea concrete. Imagine a user asks: “What is the deductible?” on a single insurance contract that mentions deductibles in three sections (home, auto, travel coverage). A naive pipeline retrieves something plausible and returns a confident wrong answer. A system that can ask back (“Which coverage: home, auto, or travel?”) fixes the problem at the source.
This pushes a constraint upstream into parsing. To detect that the user mentioned “page 3” or “the second appendix”, your parser must have preserved page numbers, section indices, and heading text as metadata on every chunk. The page number sounds trivial when you look at any single document, but it is the simplest example of a parsing decision that the question side depends on. Article 5 covers this in detail.
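The clarification idea and the metadata constraint can be sketched together. The idea is to keep page and section on every chunk, narrow the retrieved chunks to a page the user names, and ask back when what remains spans several sections. Everything here is a hypothetical illustration: the `Chunk` fields, the "page N" pattern, and the deductible amounts.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str
    page: int     # preserved by the parser as metadata
    section: str  # heading text, also preserved


PAGE_REF = re.compile(r"\bpage\s+(\d+)\b", re.IGNORECASE)


def page_reference(question: str) -> Optional[int]:
    """Detect an explicit page reference such as "page 3" in the question."""
    match = PAGE_REF.search(question)
    return int(match.group(1)) if match else None


def scope(question: str, hits: list[Chunk]) -> list[Chunk]:
    """Narrow retrieved chunks to the page the user named; only possible if pages survived parsing."""
    page = page_reference(question)
    return hits if page is None else [c for c in hits if c.page == page]


def answer_or_ask(question: str, hits: list[Chunk]) -> tuple[str, str]:
    """Answer when the scoped hits come from one section; otherwise ask the user back."""
    scoped = scope(question, hits)
    if not scoped:
        return "ask", "I could not find that. Which part of the document do you mean?"
    sections = sorted({c.section for c in scoped})
    if len(sections) == 1:
        return "answer", scoped[0].text
    if len(sections) == 2:
        options = " or ".join(sections)
    else:
        options = ", ".join(sections[:-1]) + ", or " + sections[-1]
    return "ask", f"Which one do you mean: {options}?"
```

With deductible chunks from the home, auto, and travel sections, "What is the deductible?" gets a clarifying question back. "What is the deductible on page 7?" gets a direct answer, but only because the parser kept the page number.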
Corpus scale is a separate concern, not a tier on this axis. “How many PDFs are in your corpus, and are they homogeneous or heterogeneous?” is a data-side concern, picked up by section 3.2 of the diagnostic and developed in Part IV (Articles 14-17). Mixing it into the question axis blurs two different things, so it stays out.
1.3 From case to technique zone
Cross the two axes and every single-PDF RAG problem lands somewhere on the picture. Each region calls for a different stack. Most teams build for one or two regions and pretend the rest don’t exist. The grid below is a thinking tool, not a strict taxonomy: real problems often sit between two cases, and the boundaries between zones are fuzzy on purpose.

The top-left corner (rows 1-2, columns A-B) is deterministic territory: fixed templates, controlled questions. No LLM is needed for the field extraction itself; an LLM appears at most as a fallback when the template drifts. Most enterprise document workflows fall here, and most of them are over-engineered. The templated-document case from the opening is the canonical example: an LLM stack at sixty thousand euros a year where a hundred-line regex would do.
The middle band (rows 2-4, columns C-D) is single-document RAG. The chat-with-your-PDF use case every vendor demo shows. It is real, it is hard, and the rest of the series spends most of its time here. Chunking (splitting the document into searchable units), retrieval (picking the right ones), reranking (a precision pass on the shortlist), and evaluation (knowing it works) all matter when the document is heterogeneous and the question is open.
The bottom row (row 5, all columns) is vision territory. Charts, schematics, dense tables. A text parser loses the answer regardless of how clever the retrieval is. Vision models fit here, and only here. Article 10 discusses when the vision step is worth its cost and when it isn’t.
Corpus-scale cases sit off the grid, since the grid is one PDF at a time. When the question targets many PDFs at once (“find every supplier contract with a liability cap below one million”), the diagnostic routes to Part IV (Articles 14-17): classification at ingestion, structured fields, SQL on the structured side, RAG on the residual unstructured questions.
The grid isn’t a recipe. It’s a sanity check. Locate your problem, look at the technique zone, and ask whether the system you’re building matches. If you’re building deeper than the case calls for, you’re paying for nothing. If you’re building shallower, you’ll discover the gap in production.
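The zones described in this section can be written down as a plain lookup, which makes the fuzzy cells explicit. This is a hypothetical encoding of the regions as the text states them. Cells that no named zone covers, such as a Tier 1 document with a free question, come back as "between zones" rather than being forced into one.

```python
DOC_TIERS = (1, 2, 3, 4, 5)            # 1 fixed template ... 5 visually rich
QUESTION_TIERS = ("A", "B", "C", "D")  # A engineer-templated ... D free query + clarification


def zone(doc_tier: int, question_tier: str, corpus_scale: bool = False) -> str:
    """Map a grid cell to its technique zone, following the regions described in section 1.3."""
    if doc_tier not in DOC_TIERS or question_tier not in QUESTION_TIERS:
        raise ValueError("document tier must be 1-5 and question tier A-D")
    if corpus_scale:
        return "corpus territory (Part IV)"
    if doc_tier == 5:
        return "vision territory"
    if doc_tier <= 2 and question_tier in ("A", "B"):
        return "deterministic territory"
    if 2 <= doc_tier <= 4 and question_tier in ("C", "D"):
        return "single-document RAG (Parts II and III)"
    return "between zones: walk the diagnostic questions"
```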
2. The techniques per case, and what isn’t a technique
Once you’ve placed your problem on the grid, you know roughly which family of techniques applies. The rest of the series develops each technique in detail.

The deterministic family (regex, section anchors that locate a heading by name, coordinate-based extraction that pulls a field from a fixed bounding box on the page) doesn’t have its own article. It’s the baseline: every engineer reading this series should already know how to write a regex. The point of including it on the map is to remind you that it’s an option. When the structure of your input is fixed, it’s the option.
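For reference, here is what the deterministic baseline looks like in practice: one anchored pattern per field, keyed on the labels the producing software always prints. The certificate labels (`Policy No.:`, `Insured:`, `Effective:`) and the field formats below are hypothetical.

```python
import re
from typing import Optional

# Hypothetical labels printed by the one piece of software that writes every certificate.
# [ \t]* (not \s*) keeps each match on a single line.
CERTIFICATE_FIELDS = {
    "policy_number": re.compile(r"^Policy No\.:[ \t]*([A-Z]{2}-\d{6})[ \t]*$", re.MULTILINE),
    "insured": re.compile(r"^Insured:[ \t]*(.+?)[ \t]*$", re.MULTILINE),
    "effective_date": re.compile(r"^Effective:[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$", re.MULTILINE),
}


def extract_certificate(text: str) -> dict[str, Optional[str]]:
    """Apply one anchored pattern per field; None means the format contract was broken."""
    fields: dict[str, Optional[str]] = {}
    for name, pattern in CERTIFICATE_FIELDS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields
```

When a field comes back `None`, the template has drifted. That is the signal to route the document to a fallback rather than trust a guess.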
The single-document RAG family is what Parts II and III of the series are about. Layout-aware parsing (Article 5), question parsing and calibration (Article 6), retrieval as scope selection (Article 7), generation as controlled execution (Article 8), hybrid retrieval and TOC routing (Article 9), adaptive parsing including vision (Article 10), cross-references (Article 11), listing and synthesis (Article 12), composite pipelines with feedback loops (Article 13). Each of these is a technique you’ll reach for in the central band of the grid.
The corpus-scale family is Part IV. The corpus problem (Article 14), preparing a queryable corpus from a folder of PDFs (Article 15), the corpus ontology (Article 16), querying with SQL filter first and retrieval second (Article 17). These come in when you go from one PDF to a corpus of PDFs.
If your problem is in the top-left corner of the grid, you can read Article 5 (parsing), skip the rest of Parts II and III, and go straight to Article 15 (preparing a queryable corpus). If your problem sits in the middle band, you’ll need Parts II and III. If your problem is corpus-scale, you’ll need Part IV on top of that foundation. The map tells you which.
2.1 Pick the simplest technique that works
The instinct of every engineering team is to build the most powerful pipeline they can justify. That instinct is wrong here. The right instinct is to pick the least powerful technique that solves the actual problem. Three reasons:
- Cost: At two million docs a year, a regex on a VM is a rounding error; an LLM per document is sixty thousand euros.
- Latency: Microseconds vs seconds, the difference between “feels instant” and “feels like waiting”.
- Reliability: A regex either matches or it doesn’t, and the engineer can read the rule; an LLM produces answers that are sometimes subtly wrong, with failure modes that are harder to detect, which disqualifies it for audit-grade extraction.
Most production document workflows land on a hybrid: a deterministic core handling the bulk cleanly, with an LLM fallback for the cases where the format breaks. That hybrid is almost always the right shape, and almost never what teams build first.
2.2 Long context isn’t a way out
Every few months someone announces that “RAG is dead” because context windows just got bigger. The argument: dump the whole document in the prompt and let the model figure it out.
This works for one document and one user. It doesn’t work in production for four reasons:
- Wasteful: A typical question doesn’t need the whole document. The effective date of a contract sits on one page; sending the other thirty-nine pays for tokens that won’t be used.
- Misses information: Language models use information at the start and end of a long context far more reliably than information in the middle (Liu et al., Lost in the Middle), so the relevant page can go effectively unread even when it’s in the prompt.
- Doesn’t scale: Real use cases involve many documents. No context window will ever hold a corporate archive; at any meaningful scale you have to choose what to send, and that choice is retrieval.
- No grounded answer: Without explicit retrieval and citation, you can’t tell which part of the document the answer came from, you can’t verify it, you can’t audit it. For any enterprise use case where the answer needs to be traceable, that’s disqualifying.
Long contexts are useful as a tool, especially for single-document deep analysis. They’re not a substitute for retrieval. Anyone telling you otherwise is selling something.
2.3 Fancy techniques are usually keyword work in disguise
Techniques sold as “advanced” often turn out to be keyword work in another form, and often the wrong form. HyDE (Hypothetical Document Embeddings, Gao et al., 2022) is the clearest example. The protocol asks an LLM to write the hypothetical document that would answer the query, then retrieves against the embedding of that hypothetical. The pitch is that the hypothetical carries the vocabulary a real answer would use, widening the cosine margin.
The companion notebook tests this on the Attention paper: ask why multi-head attention, let HyDE generate its passage, compare against the actual vocabulary of section 3.2.2. The two lists overlap on exactly one phrase, the section title. HyDE writes ML-textbook vocabulary (semantic relationships, contextual dependencies, parallel processing, attention patterns); the paper writes operational vocabulary (attention layers, encoder-decoder attention, different positions, linear transformations).
HyDE understood the question. It never read the document. In enterprise the keywords exist somewhere on the page and the domain expert who has read the page knows them. HyDE pays per query to invent vocabulary that often does not even land on the page. The expert dictionary (Article 6), a curated list of the corpus’s actual vocabulary built once with the domain expert, gets the same job done at a fraction of the cost, reused across every future question.
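The expert dictionary can be as simple as a mapping from the words users type to the words the corpus actually uses. Below is a minimal sketch with hypothetical entries for a contract and insurance corpus; how the series builds and stores the dictionary is Article 6’s subject. Unlike HyDE, the expansion costs no LLM call at query time.

```python
# Hypothetical expert dictionary: user wording -> the corpus's own terms.
# Built once with the domain expert, then reused for every question.
EXPERT_DICTIONARY = {
    "terminate": ["termination", "notice period"],
    "cancel": ["termination", "rescission"],
    "deductible": ["deductible", "excess"],
}


def expand(question: str, dictionary: dict[str, list[str]] = EXPERT_DICTIONARY) -> list[str]:
    """Return the corpus terms triggered by the question, deduplicated, in dictionary order."""
    q = question.lower()
    terms: list[str] = []
    for trigger, corpus_terms in dictionary.items():
        if trigger in q:
            for term in corpus_terms:
                if term not in terms:
                    terms.append(term)
    return terms
```

"Can I cancel or terminate early?" expands to `termination`, `notice period`, `rescission`. These are words that do occur in the corpus, so they can feed the lexical side of retrieval directly.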
2.4 Letting the LLM pick the case
Each combination of document tier and question tier is an elementary case, with one matching technique. In Volume 1, the engineer picks the case at compile-time and ships the technique. The dispatcher (Article 13) encodes the team’s routing wisdom in Python; the LLM critiques outputs inside fixed loops; every brick is auditable. That is enough for the vast majority of enterprise RAG.
A natural extension has the LLM itself pick the case at runtime: it looks at the question, classifies it into a case, and chooses the technique to apply. That is what the industry, as of 2026, calls agentic RAG. Volume 3 (Agentic Bricks) builds that runtime-pick layer on top of the bricks Volume 1 produces. The shift is about who decides when, not about the bricks themselves: agentic stacks still reach for the same parsing, retrieval, and generation primitives that Volume 1 audits and tests.
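The difference between compile-time and runtime picking fits in a few lines. In this sketch the handlers are hypothetical stubs that would each wrap a brick from the series. The `classify` callable stands in for the LLM that makes the choice in agentic RAG, and the case keys (document tier plus question tier) are an illustrative convention.

```python
from typing import Callable

Handler = Callable[[str], str]


# Hypothetical technique stubs; in a real system each would wrap a brick from the series.
def regex_extract(question: str) -> str:
    return f"regex -> {question}"


def section_retrieval(question: str) -> str:
    return f"section retrieval -> {question}"


def single_doc_rag(question: str) -> str:
    return f"single-document RAG -> {question}"


# Case = document tier + question tier, as on the grid.
HANDLERS: dict[str, Handler] = {
    "1A": regex_extract,
    "2B": section_retrieval,
    "3C": single_doc_rag,
}


def compile_time_pipeline(case: str) -> Handler:
    """Volume 1: the engineer picks the case once, when the system is built."""
    return HANDLERS[case]


def runtime_pipeline(question: str, classify: Callable[[str], str]) -> str:
    """Agentic variant: a classifier (an LLM, in agentic RAG) picks the case per question."""
    return HANDLERS[classify(question)](question)
```

The handlers are identical in both paths; only the moment and the author of the routing decision change.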
3. Locate your case, in practice
3.1 Position the system around the expert who exists
The diagnostic below needs one input most teams skip: who is the user of this system?
For almost all enterprise RAG, the answer is the expert who already knows the documents. Not an open-domain user typing any question. Not a curious browser exploring a public archive. The lawyer reading a contract. The underwriter checking a quote. The compliance officer auditing a clause. Someone who has read documents like these for years, and who knows the vocabulary, the cases where one term means two things, and the failure modes to watch for.
The job of the system is then clear: amplify that expert, not replace them. Codify their vocabulary, their disambiguations, their year-by-year heuristics. Let the pipeline handle the volume; let the expert stay the source of truth.
This matters before the grid, because it changes which cases are realistic. A team that says “anyone can ask anything across the whole archive” is choosing the bottom-right case by default: open question, mixed corpus, the hardest one. A team that says “our underwriter checks a known field on a known document type” is choosing the top-left, often regex territory.
The framing is rarely a property of the documents or the questions. It is a choice the team makes. Most teams inherit it from consumer chatbots without noticing. First, position the system around the expert who is already there. Then read the case on the grid the answer points to.
3.2 The diagnostic questions
Before writing any code, work through these questions. Out loud, in front of a whiteboard, with the domain experts in the room.
About the documents: How alike are they across the corpus? Native text or OCR? How many PDFs do you have, and are they homogeneous or heterogeneous? (This is where corpus-scale concerns enter the diagnostic; they route to Part IV.) Static or daily ingestion? Where on the document axis do they sit?
About the questions: Who frames them? An engineer at design time, or a user at run time? Is the system one-shot or can it ask back for clarification? Is the answer always in one document, or distributed across several? What does no answer mean: acceptable, or unacceptable? Where on the question axis do they sit?
About the constraints: Does the answer need to be traceable to the source? How precise (best-effort, or audit-grade: every citation traceable to a source line, every answer replayable)? What’s the cost budget per document? Sometimes the difference between regex and LLM is the difference between profitable and not.
The answers point you to a case on the grid. The case points you to a technique zone. The technique zone points you to the articles in the rest of the series you’ll need.
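Some of the answers to these questions can be turned into a grid position mechanically. The sketch below is one possible encoding, and both the yes/no fields and the order of the checks are hypothetical judgment calls. Visual content is checked first because a text parse would lose it. After that, the checks try the cheapest document tier first, in the spirit of pushing every problem as far up the axis as honest analysis allows.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    # Document side (section 1.1)
    meaning_in_visuals: bool     # charts, schematics, drawings carry the answer
    one_layout_everywhere: bool  # the same software writes every document
    handful_of_layouts: bool     # a few recognizable template families
    scanned: bool                # OCR needed, layout degraded
    own_headings: bool           # each document has its own sections and TOC
    # Question side (section 1.2)
    engineer_writes_prompt: bool
    user_fills_slot: bool
    can_ask_back: bool


def doc_tier(a: Answers) -> int:
    if a.meaning_in_visuals:
        return 5
    if a.one_layout_everywhere:
        return 1
    if a.handful_of_layouts:
        return 2
    if a.own_headings and not a.scanned:
        return 3
    return 4


def question_tier(a: Answers) -> str:
    if a.engineer_writes_prompt:
        return "A"
    if a.user_fills_slot:
        return "B"
    return "D" if a.can_ask_back else "C"


def locate(a: Answers) -> tuple[int, str]:
    return doc_tier(a), question_tier(a)
```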
3.3 Common enterprise cases on the grid
A handful of patterns show up repeatedly in real engagements. Most readers will recognize themselves in one of these.
Field extraction from a fixed-template form. Think insurance certificates from one broker, KYC forms from one bank, tax filings from one administration: the same software writes the same layout on every page. Case: doc tier 1, question A, top-left corner. Stack: regex on coordinate-addressable fields, with an LLM fallback for the rare drift. The classic playbook is overkill here, and that’s the most common mistake we meet in real projects.
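Coordinate-addressable extraction can be sketched as follows. The sketch assumes an upstream PDF parser that returns each word with its bounding box; pdfplumber’s `Page.extract_words()`, for instance, returns dictionaries with `text`, `x0`, `x1`, `top` and `bottom` keys, among others. The field boxes are hypothetical values that would be measured once on the template.

```python
from typing import TypedDict


class Word(TypedDict):
    text: str
    x0: float
    top: float
    x1: float
    bottom: float


Box = tuple[float, float, float, float]  # (x0, top, x1, bottom) in PDF points

# Hypothetical field boxes, measured once on the fixed template.
TEMPLATE_BOXES: dict[str, Box] = {
    "policy_number": (400.0, 90.0, 560.0, 110.0),
    "effective_date": (400.0, 120.0, 560.0, 140.0),
}


def inside(word: Word, box: Box) -> bool:
    x0, top, x1, bottom = box
    return word["x0"] >= x0 and word["x1"] <= x1 and word["top"] >= top and word["bottom"] <= bottom


def extract_by_coordinates(words: list[Word], boxes: dict[str, Box] = TEMPLATE_BOXES) -> dict[str, str]:
    """Join, in reading order, the words that fall entirely inside each field's box."""
    fields = {}
    for name, box in boxes.items():
        hits = sorted((w for w in words if inside(w, box)), key=lambda w: (w["top"], w["x0"]))
        fields[name] = " ".join(w["text"] for w in hits)
    return fields
```

An empty string for a field means nothing landed in its box, which is the cue for the LLM fallback.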
Field extraction across template variants. Think invoices across hundreds of suppliers, leases across landlords, employment contracts across companies in the same legal framework: every document follows one of a handful of recognizable patterns. Case: doc tier 2, question A or B. Stack: a regex per recognized template, plus a few-shot LLM extraction when the document doesn’t match anything in the registry. Classification before extraction.
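Here is a sketch of classification before extraction. Each entry in a hypothetical template registry carries a fingerprint that recognizes the layout, plus one pattern per field. The supplier names and field patterns are assumptions for illustration, as is the `fallback` callable, which stands in for the few-shot LLM call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Template:
    name: str
    fingerprint: re.Pattern  # recognizes which layout wrote the document
    fields: dict[str, re.Pattern]


# Hypothetical registry: one entry per recognized supplier invoice layout.
REGISTRY = [
    Template("supplier_a", re.compile(r"ACME SUPPLIES"),
             {"total": re.compile(r"Total due:\s*([\d.,]+)")}),
    Template("supplier_b", re.compile(r"Globex GmbH"),
             {"total": re.compile(r"Gesamtbetrag\s+([\d.,]+)")}),
]


def extract(text: str, fallback: Callable[[str], dict]) -> tuple[str, dict[str, Optional[str]]]:
    """Classify first, then extract with the matching template; fall back when nothing fits."""
    for template in REGISTRY:
        if template.fingerprint.search(text):
            values = {}
            for name, pattern in template.fields.items():
                match = pattern.search(text)
                values[name] = match.group(1) if match else None
            if all(v is not None for v in values.values()):
                return template.name, values
            break  # known layout, but it drifted: hand over to the fallback
    return "fallback", fallback(text)
```

The fallback fires in two situations: when no fingerprint matches, and when a known layout has drifted so that one of its fields no longer parses.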
Q&A on a long custom contract: Each contract is structured differently, sections vary, ten-page glossaries don’t repeat. The user asks free-form questions about the contract in front of them. Case: doc tier 3, question C or D, middle band. Stack: full single-document RAG with TOC routing, hybrid retrieval, schema-driven generation. This is where the four bricks of the series each carry their own weight.
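TOC routing, the first step of that stack, can be sketched with nothing but the standard library. Find the numbered headings, treat them as the document’s own table of contents, and route the question to the section whose heading shares the most words with it. The heading pattern and the word-overlap scorer are simplifying assumptions; the series develops TOC routing properly in Article 9.

```python
import re

# A numbered heading on its own line, e.g. "2. Payment Terms" or "4.1 Notice".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?[ \t]+([A-Z][^\n]*)$", re.MULTILINE)


def build_toc(text: str) -> list[tuple[str, str, int, int]]:
    """Return (number, title, body_start, body_end) for every numbered heading."""
    matches = list(HEADING.finditer(text))
    toc = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        toc.append((m.group(1), m.group(2).strip(), m.end(), end))
    return toc


def word_set(s: str) -> set[str]:
    return set(re.findall(r"[a-z]+", s.lower()))


def route(question: str, text: str) -> str:
    """Pick the section whose heading shares the most words with the question."""
    toc = build_toc(text)
    if not toc:
        return text
    number, title, start, end = max(toc, key=lambda entry: len(word_set(question) & word_set(entry[1])))
    return f"{number} {title}: {text[start:end].strip()}"
```

Word overlap is the crudest possible scorer. "How do I terminate?" shares no word with the heading "Termination", so it falls through to the first section, which is exactly the vocabulary gap the expert dictionary (Article 6) exists to close.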
Reading a slide deck or a schematic: Think engineering drawings, financial decks where data lives in the chart, technical specs with embedded images: pure-text parsing loses the answer outright. Case: doc tier 5, any question column, bottom row. Stack: vision-capable model on the page image, combined with text-side RAG for the prose around the visuals.
Off the grid – corpus territory: “Find every supplier contract with a liability cap below one million” on hundreds or thousands of contracts. The single-PDF grid stops being the right frame; the question targets the corpus, not one document. Stack: field extraction at ingestion, structured fields stored in a database, SQL on the structured side, RAG only as a fallback for the residual unstructured questions. Articles 14-17 (Part IV) develop this.
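Here is a minimal sketch of "SQL filter first", using Python’s built-in `sqlite3` module and an in-memory database. The table, its columns, and the rows are hypothetical; they stand in for fields extracted at ingestion. In this sketch, contracts whose cap could not be extracted are stored as NULL and left for retrieval. Because SQL comparisons with NULL are never true, the filter skips them automatically.

```python
import sqlite3
from typing import Optional


def build_corpus(rows: list[tuple[str, str, Optional[float]]]) -> sqlite3.Connection:
    """Load hypothetical ingestion output: one row per contract with its extracted cap."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contracts (id TEXT PRIMARY KEY, supplier TEXT, liability_cap_eur REAL)")
    conn.executemany("INSERT INTO contracts VALUES (?, ?, ?)", rows)
    return conn


def contracts_below_cap(conn: sqlite3.Connection, cap_eur: float) -> list[str]:
    """Structured side: the filter runs in SQL, with no retrieval involved."""
    rows = conn.execute(
        "SELECT id FROM contracts WHERE liability_cap_eur < ? ORDER BY id", (cap_eur,)
    ).fetchall()
    return [r[0] for r in rows]


def contracts_needing_rag(conn: sqlite3.Connection) -> list[str]:
    """Residual: contracts whose cap was not extracted; only these go to retrieval."""
    rows = conn.execute("SELECT id FROM contracts WHERE liability_cap_eur IS NULL ORDER BY id").fetchall()
    return [r[0] for r in rows]
```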
Off the grid – no structure to anchor on: a novel, an intent classification, sarcasm detection. The document has no structure, the vocabulary has no characteristic terms, and the question requires understanding tone or intent rather than locating a passage. Stack: an LLM that scans the whole text paragraph by paragraph, deciding what to flag. Not a RAG problem in Volume 1’s sense; section 2.4 hints at where this kind of runtime decision-making belongs (Volume 3).
If your case doesn’t quite match any of these, walk the diagnostic in section 3.2 and the result will tell you which of the patterns above is closest.
4. Conclusion
Run the diagnostic on your own corpus before writing code, ideally with the domain experts in the room. The output is the list of articles in the rest of the series you need to read, and the list you can skip. Teams that get RAG to ship in production are the ones that located their problem on the grid first. Teams still tuning six months in are usually the ones that started building before they did.
The next article opens Part II with the first brick: document parsing. Everything lost there cannot be recovered later, no matter how clever the retrieval.
5. Sources and further reading
The two-axis grid is a map of where each approach fits across document complexity and question control on a single PDF. The claim that long context doesn’t replace retrieval, which the grid leans on, is grounded by Liu et al. (Lost in the Middle, TACL 2024) and Lee et al. (long-context benchmark, 2024). The vision row maps to Faysse et al. (ColPali, 2024). The HyDE demo uses the technique from Gao et al. (HyDE, 2022). The agentic extension hinted at in section 2.4 (the LLM picking the case at runtime) is the direction Volume 3 develops on top of the bricks built here.
Same direction as the article:
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts, TACL 2024 (arXiv:2307.03172). Models systematically miss information in the middle of the input. Supports the claim that long context is not a way out.
- Lee et al., Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?, 2024 (arXiv:2406.13121). Concrete data on where long context replaces retrieval and where it breaks.
- Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Vision-language retrieval on the page image itself. Anchors the visual row of the grid.
- Gao et al., Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE), 2022 (arXiv:2212.10496). The hypothetical-document-embedding technique tested in section 2.3.
Different angle, different context:
- Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023 (arXiv:2210.03629). Founding paper of the LLM-picks-tools-at-runtime line. Volume 3 develops this on top of the bricks Volume 1 builds.
- Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools, NeurIPS 2023 (arXiv:2302.04761). Same direction as ReAct.
- Gao et al., Retrieval-Augmented Generation for Large Language Models: A Survey, 2024 (arXiv:2312.10997). RAG survey; treats RAG as one paradigm with shared concerns (retriever quality, generator faithfulness).

