AI basics – RAG systems

RAG (Retrieval-Augmented Generation): why add retrieval to LLMs, pipeline stages, chunking, naive vs advanced RAG

A «Bare minimum» article on retrieval-augmented generation: how to ground a language model in external documents and up-to-date knowledge.

What RAG is

Retrieval-Augmented Generation means the model answers not only from weights learned at training time but also from fragments retrieved from an external store.

The LLM-only issue: knowledge is largely static after training — the model does not automatically learn what happened next, does not self-update, and reflects the world as of the training cutoff. That is weak for fresh facts, internal playbooks, or a personal paper library.

  • RAG is the technique of wiring external sources into the generation step.
  • It mitigates stale knowledge by pulling relevant chunks from a current corpus.
  • The model is augmented with document context, not just pre-trained parameters.

RAG does not eliminate hallucinations, but it grounds answers in retrievable snippets that humans (or rules) can verify more easily.
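
As a rough illustration of that last point, the "augmentation" is usually nothing more exotic than pasting retrieved snippets into the prompt ahead of the question. The sketch below uses made-up chunks and a generic instruction template, not any particular framework's API.

def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    # Number the snippets so the answer can point back at its sources.
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

chunks = [
    "The refund window for annual plans is 30 days from purchase.",
    "Monthly plans can be cancelled at any time; no partial refunds are issued.",
]
print(build_rag_prompt("How long do I have to request a refund on an annual plan?", chunks))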

Stages of a RAG pipeline

Think of it as two phases: offline indexing and online query answering.

  • Ingestion and chunking. Documents enter the system and are split into pieces sized for the index and the context window.
  • Vector index build. Each chunk gets an embedding; similarity search finds chunks “close” to the query in meaning.
  • Retrieval. For a user question, the system selects the most relevant chunks from the index.
  • Generation. Those chunks are placed in the prompt (with instructions and the question), and the LLM produces an answer conditioned on that context.

Concrete choices (embeddings, vector DB, how many chunks to inject) strongly affect quality, but the pattern retrieve → inject → generate stays the same.
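
A toy end-to-end sketch of that retrieve → inject → generate loop follows. The bag-of-words "embedding" and the final print stand in for a real embedding model, vector database, and LLM call; only the shape of the pipeline is the point.

import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a sparse bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Offline phase: embed and index the chunks once.
corpus = [
    "RAG retrieves relevant chunks from an external corpus.",
    "Chunk size and overlap strongly affect retrieval quality.",
    "The retrieved chunks are injected into the prompt before generation.",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

# Online phase: retrieve the top-k chunks for a question and build the prompt.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

question = "What gets injected into the prompt?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # in a real system this prompt goes to the LLM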

Chunking

Chunking splits documents into smaller segments. Those segments are the basic units of indexing and search — what you embed and what you pass to the model.

  • The LLM sees fragments, not whole documents at once — full docs rarely fit the window, and search needs granular matches.
  • Chunk quality drives whether facts are reachable: split a coherent block in the wrong place and retrieval may miss it; make chunks huge and noise dilutes the signal.

In practice people tune chunk size and the overlap between neighboring chunks, and sometimes use structure-aware splits (headings, paragraphs) rather than fixed character counts alone.
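
A minimal fixed-size chunker with overlap might look like the sketch below; the character-based sizes are purely illustrative, and real splitters are often token-based and structure-aware.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Slide a window of chunk_size characters, stepping back by `overlap`
    # so that text near a boundary appears in both neighboring chunks.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

Overlap trades index size for robustness: a larger overlap duplicates more text in the index, but lowers the chance that a coherent passage is cut so that neither half matches the query well.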

Types of RAG systems

People loosely contrast “naive” and “advanced” pipelines — the boundary is fuzzy, but the labels help navigate complexity.

  • Naive RAG: query → nearest-chunk search → chunks go straight into the model without extra processing. Easy to ship; quality hinges on corpus, chunks, and embeddings.
  • Advanced RAG: extra steps around retrieval and generation, such as query rewriting or expansion, reranking with a cross-encoder or another model, deduplication of overlapping hits, and sometimes metadata filters. The goal is a sharper, cleaner context for the LLM.

For coursework or a research prototype, naive RAG is a common start; you add sophistication where you see failure modes like retrieving the wrong paragraph, duplicate hits, or vocabulary mismatch between the query and the documents.
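
To make the "advanced" additions concrete, here is a sketch of two post-retrieval steps: dropping near-duplicate hits and reranking the survivors. The keyword-overlap scorer is only a stand-in for a real reranker such as a cross-encoder; the shape of the step is the point, not the scoring function.

def dedupe(hits: list[str]) -> list[str]:
    # Drop hits that are identical after normalizing case and whitespace.
    seen, unique = set(), []
    for hit in hits:
        key = " ".join(hit.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique

def rerank(question: str, hits: list[str], top_k: int = 3) -> list[str]:
    # Toy relevance score: how many query terms each hit shares with the question.
    q_terms = set(question.lower().split())
    return sorted(hits, key=lambda hit: len(q_terms & set(hit.lower().split())), reverse=True)[:top_k]

hits = [
    "Chunk overlap helps keep coherent blocks together.",
    "chunk overlap helps keep coherent blocks together.",  # near-duplicate
    "Vector indexes support fast similarity search.",
    "Query rewriting can fix vocabulary mismatch between query and docs.",
]
print(rerank("Why does chunk overlap help retrieval?", dedupe(hits), top_k=2))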

Who uses it

Teams adopt RAG when answers must be grounded in a chosen document set — internal, customer-facing, or personal — rather than only in the model’s training-time knowledge.

  • Enterprises. Knowledge bases, policies, support playbooks: employees or customers ask questions, the system retrieves relevant snippets, and the model answers with reference to up-to-date org text.
  • Developers and product teams. Assistants over docs, wikis, tickets: less guessing about APIs from the open web — a controlled corpus sets the boundary.
  • Education and research. Working with a curated stack of papers, notes, and PDFs: ask questions over course materials or a literature review without replacing source checks.
  • Regulated or expert domains. Legal, clinical, finance, and similar settings where tying answers to company or regulatory text matters — always with human verification and data-access policies.