RAG architectures — twelve patterns and three stacks

Home
AI
Concepts
RAG architectures

Date: 24.03.2026

Introduction

Retrieval-augmented generation (RAG) connects an LLM to your data: embed documents, search by similarity, inject the best passages into the prompt. Quality rarely comes from one trick. The patterns below are orthogonal knobs — chunking shapes what gets embedded; re-ranking fixes what vectors miss; query expansion helps vague user input. This page condenses twelve patterns, three common combinations, and mistakes that show up in real systems — with concrete scenarios that are illustrative, not tied to a single vendor write-up.

RAG stack in isometric purple tiles: chunk and embed, retrieve, augment prompt, generate — Stacking retrieval patterns mirrors stacking reliable system layers: start with a solid base, then add precision and flexibility.

Twelve retrieval patterns

Fixed-size chunking

Splits the raw text on a fixed budget—usually max tokens or max characters per chunk—using a sliding window down the file. The next chunk starts after a step (often chunk_size − overlap). Cuts happen where the counter lands, not necessarily at paragraphs, headings, or list items.

Flow: document → fixed window + overlap step → slice at budget → embed → index

What it does: Splits the raw text on a fixed budget—usually max tokens or max characters per chunk—using a sliding window down the file. The next chunk starts after a step (often chunk_size − overlap). Cuts happen where the counter lands, not necessarily at paragraphs, headings, or list items.

When to use it: Quick baselines, uniform logs, or when you do not yet have a reliable structure parser. Pair with measurement before investing in heavier chunking.

Example: A 512-character window can split a numbered procedure so step 3 and step 4 land in different vectors; retrieval may return one without the other.

Pros: Trivial to implement, fast ingest, predictable chunk counts, no extra parser for document structure.

Cons: You often slice through sentences, tables, and topic boundaries, so one vector can mix unrelated ideas; retrieval gets noisier than structure-aware splits unless the document is uniform.

#Chunking

#Baseline

#RAG

Structure-aware chunking: source document, sections by headings, embeddings and vector index

Context-aware chunking

LangChain SemanticChunker with HuggingFaceEmbeddings (e.g. sentence-transformers/all-MiniLM-L6-v2): embed sentences, compare neighbors in vector space, split where distances pass a percentile threshold—no markdown or headings required.

Flow: text → HuggingFaceEmbeddings → SemanticChunker (percentile) → create_documents → index

What it does: The sketch wires SemanticChunker to HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"). It embeds sentences, scores consecutive gaps, and splits where the gap exceeds the rule set by breakpoint_threshold_type="percentile" and breakpoint_threshold_amount (e.g. 95).

When to use it: Plain prose or messy extractions with no reliable headings; use when you accept embed-time cost to avoid blind character cuts.

Example: The sample string jumps from API/auth lines to PostgreSQL/billing/tenant lines — a large embedding-distance step between those regions is a natural split candidate for the chunker.

Pros: Matches the stack many teams already use for retrieval (same Hugging Face / sentence-transformers ecosystem); create_documents returns LangChain documents ready for downstream loaders; lowering the percentile amount usually yields more chunks, raising it fewer.

Cons: Pulls sentence-transformers (and typically PyTorch) locally; very short inputs give unstable distance stats; clean extracted text before embedding if the source is PDF or HTML.

#Embeddings

#Chunking

#RAG

Late chunking

Runs long text through the encoder once, then pools spans so chunk vectors are conditioned on full-document context—strong when local clauses depend on distant sections.

Flow: full doc (window) → token states → pool per span → vectors

What it does: Runs a long document through the transformer first, then forms chunk vectors from token embeddings (or pooled spans) so each chunk’s representation is conditioned on full document context — not independent encodings of text slices.

When to use it: Dense contracts and specs where local text is meaningless without global document state.

Example: A clause references “the termination provision in Section 9”; late chunking keeps cross-references coherent better than embedding Section 12 in isolation.

Pros: Representations for each span are built after the model has seen the full sequence, so cross-sentence cues inform every chunk vector — not independent encodes of slices. Works best when your encoder can ingest long inputs in one forward pass. Chunk meaning tends to track document-level semantics more closely than naive per-slice embedding.

Cons: You depend on a stack with enough context length and VRAM to run whole documents (or large windows), and on correct pooling over token states — more moving parts than “split text, call embed.” Anything beyond the model’s max tokens still needs chunking or truncation at the document level, so the approach is bounded by that hard limit.

#Long context

#Encoder

#Contracts

Contextual retrieval

Prepends short document-level context to each chunk before embedding so similar-looking snippets stay distinguishable (policy year, product, jurisdiction).

Flow: chunk + document excerpt + title → LLM enriches (1–2 sentences) → prefix + chunk → embed → index → query

What it does: Prepends short document-level context to each chunk before embedding — often one or two sentences produced by an LLM describing how the chunk fits the whole file.

When to use it: High-stakes corpora (legal, clinical, financial) where a snippet alone is ambiguous without “which policy / year / jurisdiction.”

Example: A chunk that only says “the deductible is $500” gets prefixed with “From the 2024 PPO member handbook, pharmacy benefits chapter,” so similarity search distinguishes it from the same sentence in a dental addendum.

Pros: Anthropic’s published evaluation shows a sizeable reduction in retrieval misses compared to unstuffed chunks, because each block carries enough framing to read well on its own. The enriched string still indexes cleanly for dense vectors, lexical search, or a hybrid of the two.

Cons: One LLM call per chunk at ingest (e.g. Hugging Face Inference quotas or paid endpoints); local embedding models still cost disk and CPU; stored strings are longer than raw slices, so the index uses more space and I/O.

#LLM ingest

#Embeddings

#High-stakes RAG

Re-ranking

Two-stage retrieval: fast ANN returns a wide shortlist, then a cross-encoder or reranker scores query–passage pairs for precision before generation.

Flow: query → vector top-k → rerank → top-n → LLM

What it does: Two-stage retrieval: a fast vector index returns a wide set (often tens of candidates), then a cross-encoder or dedicated reranker scores query–passage pairs for precision.

When to use it: When wrong context is costly — support bots, compliance Q&A, anything where “close in cosine space” is not the same as “answers the question.”

Example: “Refund SLA for enterprise tier” pulls 40 chunks from a knowledge base; the reranker promotes the paragraph that names “Enterprise” and “72 hours” over a generic returns page that happened to share vocabulary.

Pros: The second stage scores actual query–passage pairs, so relevance usually beats cosine ranking alone. You can retrieve a broad shortlist from ANN yet only hand the model a tight handful of passages. When embeddings pick the wrong neighbor, a cross-encoder or reranker often corrects the order before generation.

Cons: End-to-end latency exceeds ANN-only retrieval because every candidate pair gets a heavier scoring pass. That extra forward pass means more CPU or GPU time and a higher per-query bill than pure vector search at the same candidate count.

#Reranker

#Precision

#RAG

Query expansion

Uses an LLM to turn short or vague user text into richer phrasing before embedding or keyword search so retrieval matches how your corpus is written.

Flow: user query → LLM expands → search → passages

What it does: Turns a short user query into a richer phrasing (or multiple sub-questions) using an LLM before embedding or keyword search.

When to use it: Chat UIs where people type “it’s broken” or “billing” — one-word or ambiguous inputs that match poorly in embedding space.

Example: “deploy failed” becomes “CI/CD pipeline error, Kubernetes rollout, rollback steps, last successful deployment” so retrieval targets ops runbooks instead of generic “failure” pages.

Pros: Rewritten queries tend to align better with how your corpus is phrased, so hits are less random for underspecified input. A single enriched string keeps the retrieval step straightforward compared with running several variants in parallel.

Cons: You add one LLM hop before search, so time-to-first-hit grows versus passing the raw string straight to the index. Over-expansion can steer an already-clear question toward the wrong subtopic, and you pay extra tokens on every query.

#Query LLM

#Search

#UX

Multi-query RAG

Generates several paraphrases, runs retrieval for each, then merges and deduplicates so ambiguous questions explore multiple interpretations in parallel.

Flow: query → N paraphrases → N searches → merge/dedupe → LLM

What it does: Generates several paraphrases of the same question, runs retrieval for each, then merges and deduplicates results.

When to use it: Broad or underspecified questions that admit multiple interpretations (“performance” could mean latency, throughput, or cost).

Example: For “Python async,” one query stresses asyncio event loops, another aiohttp I/O, a third migration from threads — unioning hits covers users who meant different docs.

Pros: Ambiguous wording is less of a dead end: if one phrasing misses the right doc, another may still hit it. You deliberately sample different angles on the same information need. Fan-out lookups can run in parallel so end-to-end latency stays closer to a single search than four sequential ones.

Cons: Every variant triggers its own retrieval call, so vector DB and embed traffic scale with how many lines you generate — plus the LLM cost to produce them. Even after deduping, overlapping passages from multiple queries can bloat the candidate set before merge.

#Parallel search

#Recall

#RAG

Agentic RAG

Exposes multiple retrieval backends (vector, SQL, APIs); an agent picks tools per question and can chain hops across heterogeneous data.

Flow: question → agent plans → tool(s) → observe → maybe more tools → answer

What it does: Exposes multiple retrieval tools (vector DB, SQL, API, file search); an agent chooses which to invoke based on the question.

When to use it: Heterogeneous data — some answers live in PDFs, others in structured tables or live systems.

Example: “How many open P1 incidents last week?” routes to a metrics API or SQL, while “what’s our status-page comms template?” routes to the doc index — one orchestrator, different tools.

Pros: One controller can steer questions to vectors, SQL, HTTP tools, or file search instead of maintaining a separate integration for each path. The planner can chain or mix tools when a single hop is not enough. Fits messy estates where answers are split across systems and formats.

Cons: You own more moving parts: tool contracts, error handling, and safety checks than in a plain retriever pipeline. Behavior shifts with prompts and model mood, so p95 latency and spend are harder to budget than a linear chain. Reasoning plus tool calls stretches wall-clock time versus one-shot retrieval.

#Agents

#Tools

#Orchestration

Self-reflective RAG

Scores retrieval quality and can rewrite the query and search again before answering—useful when false negatives are costly.

Flow: retrieve → judge relevance → (optional) rewrite → re-retrieve → generate

What it does: After an initial retrieval, the system judges relevance; if scores are weak, it rewrites the query and searches again — sometimes iteratively.

When to use it: Research-style or high-accuracy settings where extra latency is acceptable and false negatives are painful.

Example: First pass retrieves only marketing blurbs; the reflector notes missing technical terms, rewrites the query to include the product codename and “API rate limit,” and the second pass lands in engineering docs.

Pros: The system can notice thin or off-topic passages and adjust the search before it commits to an answer, instead of trusting the first hit. Each pass can sharpen the question or the candidate set. A weak initial retrieval is less fatal when a rewrite steers the index toward better ground.

Cons: Relevance checks and query rewrites stack extra model calls on top of retrieval, so end-to-end delay and billable tokens typically exceed simpler pipelines. That overhead buys better odds, not a guarantee — genuinely hard questions or noisy sources can still return junk after several tries.

#Reflection

#Quality

#Latency tradeoff

Knowledge graph RAG: vector seeds, expand along graph edges, merge context for the LLM

Knowledge graphs

Combines vector search with typed entities and relations so answers can follow dependencies, not only cosine neighbors in embedding space.

Flow: retrieve seeds → expand graph → merge context → LLM

What it does: Combines vector search with a graph of entities and relations so you can walk dependencies, not only nearest neighbors in embedding space.

When to use it: Domains where answers are inherently relational — who reports to whom, which drugs interact, which accounts roll up to a parent.

Example: Vector search finds a mention of “Project Aurora”; the graph expands to dependent services and on-call teams so the answer includes blast radius, not one isolated paragraph.

Pros: Dense vectors approximate similarity; typed edges spell out who links to whom or what blocks what — structure plain k-NN can smear. Answers grounded in explicit triples are less likely to invent relationships that aren’t in the data. Strong fit when your domain is a network of entities rather than flat documents.

Cons: You operate a graph database (or equivalent) alongside vectors, plus jobs to load and refresh nodes and edges. Building and maintaining extraction or mapping from raw text to entities is its own product. Queries that fan out along paths cost more wall-clock and engineering than a single ANN lookup, and bad extraction poisons both search and answers.

#Graph

#Entities

#Hybrid RAG

Hierarchical RAG

Indexes fine child chunks for precision but can promote parent sections for model context when a child matches—good for manuals and policies.

Flow: child match → fetch parent context → optional neighbors → LLM

What it does: Indexes small child chunks for precision but can promote or return larger parent sections for model context when a child matches.

When to use it: Manuals, policies, and papers with clear outline hierarchy — you want to hit a fine-grained step but show the surrounding section to the LLM.

Example: A match on a single API parameter description pulls the whole “Authentication” parent block so the model sees prerequisites and error codes, not one line in isolation.

Pros: Retrieval targets fine-grained spans where the signal is, while the promoted parent supplies surrounding context so the LM is not starved of setup. Smaller leaf units keep off-topic neighbors out of the shortlist compared with dumping whole chapters. Pairs naturally with manuals and policies that already follow headings and nested sections.

Cons: You need stable parent–child metadata in the index and ingestion logic that updates both levels when content moves. The pipeline is heavier than a single flat chunk list. If the outline you encode does not match how authors wrote the doc, joins between leaf and parent work against you.

#Hierarchy

#Context

#Docs

Fine-tuned embeddings: query and document pairs trained into an adapted vector space, then reindex

RAFT - Retriecval Augmented Fine-Tuning (Fine-tuned Embeddings)

RAFT stands for Retrieval-Augmented Fine-Tuning. It’s a hybrid training strategy that combines two powerful ideas in LLM optimization: Retrieval-Augmented Generation (RAG) and Fine-tuning (Supervised training of LLMs). Fine-tuning is the process of taking a pre-trained model and making small adjustments to its parameters on a new, often smaller dataset to adapt it to a specific task while retaining the knowledge from the original training.

Flow: collect pairs → train/adapt embedder → reindex → serve

What it does: Trains or adapts an embedding model on domain query–document pairs (or contrastive pairs from your logs) so the metric matches your users’ language.

When to use it: Specialized vocabularies — clinical notes, legal citations, internal codenames — where off-the-shelf models underperform.

Example: Pairs of (customer ticket subject, resolution article) teach the encoder that “SSO handshake” should sit near your IdP runbook, not generic “handshake” articles.

Pros: Benchmarks and internal evals often show a clear lift once the similarity metric is trained on in-domain positives and negatives — sometimes on the order of a few to low double-digit points, depending on the task. The model picks up terminology, typos, and product names that generic encoders treat as unrelated. A smaller specialized checkpoint can beat a much larger off-the-shelf embedder on your own query logs.

Cons: You need a steady supply of query–passage pairs or reliable weak labels from clicks and tickets. Training consumes GPU time and ML workflow, not just an API call. As products and language shift, embeddings go stale unless you schedule periodic retraining or continual updates.

#Fine-tuning

#Domain

#Open weights

Stacked combinations

Teams rarely deploy all twelve. Three stacks recur: production-ready (semantic chunking + re-ranking + query expansion + agentic tool choice), high-accuracy (contextual retrieval + multi-query + re-ranking + self-reflective loops), and domain expert (fine-tuned embeddings + contextual retrieval + knowledge graphs + re-ranking). Names describe intent — tune for your latency and cost envelope.

🔥 Production-ready stack: Fast baseline with good precision: structure-aware chunks, expand vague queries, rerank the candidate pool, let an agent pick the right backend when you have more than one index.
Production-ready stack
Performance: 92% accuracy, 1.2s average latency
Cost: $0 marginal per query at inference if you self-host open weights (e.g. sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-reranker-large, Qwen2.5-7B-Instruct)—you pay electricity / GPU only. Illustrative ~$0.003/query with a typical API mix: gpt-4o-mini (expand + agent) + text-embedding-3-small (chunk search)—check live vendor pricing.
Best for: General-purpose production systems, customer support, internal knowledge bases
🔥 High-accuracy stack: When mistakes are expensive: rich chunk context, multiple query phrasings, reranker, and a reflection loop before answering.
High-accuracy stack
Performance: 96% accuracy, 2.5s average latency
Cost: $0 marginal per query at inference when you self-host (e.g. DeepSeek-R1-Distill-Qwen-32B or Qwen2.5-72B-Instruct for stuffing / reflection, bge-reranker-large, local embeddings). Illustrative ~$0.008/query with APIs: gpt-4o / gpt-4o-mini (context + multi-query + judge loops) + text-embedding-3-large—multi-call stacks add up; verify pricing.
Best for: Medical, legal, financial applications where errors are costly
🔥 Domain expert stack: When generic embeddings and flat text search hit a ceiling: teach the metric with pairs, keep contextual prefixes, add graphs for relational questions, rerank everything that reaches the LLM.
Domain expert stack
Performance: 94% accuracy for domain queries, 1.8s latency
Cost: After fine-tune, $0 embedding inference on your hardware (e.g. adapted all-MiniLM-L6-v2 checkpoint); training is a one-time GPU cost, not per query. Illustrative ~$0.005/query with hosted APIs: gpt-4o-mini (generation) + text-embedding-3-small + graph / rerank services—model line items determine the bill.
Best for: Medical, legal, financial, technical domains with specialized terminology

Common mistakes

🔥 Turning every knob at once — unmaintainable pipelines and no idea what helped. Start with semantic chunking + measurement; add reranking and expansion before exotic loops.
🔥 No baseline metrics — without labeled questions or human grades, “better” is subjective. Track answer correctness (or groundedness) and p95 latency per release.
🔥 Fixed-size-only chunking — cheap to implement but fragments meaning. Prefer structure-aware splitting first.
🔥 Skipping re-ranking — vector similarity is a proxy for relevance; a reranker is often the best ROI after decent chunking.
🔥 Raw queries only — if users don’t write like your docs, add expansion or synonym handling at minimum.
🔥 One retrieval path for every question — a single index and tool rarely fits; agentic routing or explicit query classification reduces forced errors.

Conclusion

Effective RAG is compositional: chunking and embeddings define what is searchable, retrieval and reranking decide what is relevant, and query formulation and orchestration connect user intent to those pieces. Add complexity when measurements justify it — not because a checklist has twelve patterns.

References

LangChain — Retrievals
LangChain docs — Retrieval concepts
YouTube — Every RAG Strategy Explained in 13 Minutes
Towards AI — Building RAG systems: eleven patterns (supplementary read)

Introduction

Twelve retrieval patterns

Fixed-size chunking

Context-aware chunking

Late chunking

Contextual retrieval

Re-ranking

Query expansion

Multi-query RAG

Agentic RAG

Self-reflective RAG

Knowledge graphs

Hierarchical RAG

RAFT - Retriecval Augmented Fine-Tuning (Fine-tuned Embeddings)

Stacked combinations

Production-ready stack

High-accuracy stack

Domain expert stack

Common mistakes

Conclusion

References