alexi.sh AI Engineering Lab

What Is RAG? Retrieval-Augmented Generation Explained (2026)

PrivSec Lab · 3 min read
RAG (Retrieval-Augmented Generation) lets an LLM answer from your own documents by retrieving relevant text and feeding it into the prompt, instead of relying only on what it memorized. How it works, why it cuts hallucination, and its honest limits.

Ask a plain LLM about your company's internal docs or a private codebase and it will either say it doesn't know or, worse, confidently make something up: it was never trained on your data. RAG (Retrieval-Augmented Generation) is how you fix that without retraining anything: retrieve the relevant text first, then let the model answer grounded in it. This guide explains what RAG is, how the pipeline works step by step, why it beats fine-tuning for facts, and its honest limits.

What RAG actually is

RAG combines two parts: a retriever that finds relevant passages from a knowledge source, and a generator (the LLM) that writes an answer using those passages. Instead of hoping the model memorized the right fact during training, you fetch the fact at answer time and put it in the prompt.

The key mental model: the LLM doesn't learn your documents. Each time you ask, the system pulls the relevant pieces and the model reads them fresh, like an open-book exam rather than recall from memory.
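To make the open-book idea concrete, here is a minimal Python sketch of the prompt-assembly step. The function name `build_prompt`, the instruction wording, and the sample passages are all assumptions made for illustration; real systems phrase and format the prompt in many different ways.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Place retrieved passages in the prompt so the model answers from them."""
    # Number each passage so the answer can cite it as [1], [2], ...
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer (cite passage numbers):"
    )


# Hypothetical passages, standing in for whatever the retriever returned.
prompt = build_prompt(
    "How long is the refund window?",
    [
        "Refunds are accepted within 30 days of purchase.",
        "Shipping takes 3 to 5 days.",
    ],
)
print(prompt)
```

Nothing about the model changes between questions; only the text placed in the context section does.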

Image: code on a screen. A RAG pipeline indexes your own documents or codebase so the model can retrieve and cite the relevant pieces.

How the pipeline works

  1. Chunk: split documents into passages small enough to be precise but large enough to keep context.
  2. Embed: turn each chunk into a vector (a numeric representation of meaning) with an embedding model.
  3. Store: keep the vectors in a vector database or index.
  4. Retrieve: embed the incoming question and find the most similar chunks.
  5. Augment & generate: insert the retrieved chunks into the prompt next to the question; the LLM answers grounded in them, ideally with citations.
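The five steps above can be sketched end to end in a few lines of Python. This is a toy under stated assumptions, not a production design: `embed` counts words instead of calling a real embedding model, the "vector store" is a plain list, and the documents and question are invented for illustration. The final generation call is left out, since it depends on which LLM you use.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Step 1: split a document into passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Step 2 (toy stand-in): a word-count vector, not a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Step 4: embed the question and return the k most similar chunks."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical documents, for illustration only.
docs = [
    "Refunds are accepted within 30 days of purchase with a receipt.",
    "Standard shipping takes three to five business days.",
    "Support is available by email on weekdays.",
]

# Step 3: the "vector store" here is just a list of (chunk, vector) pairs.
index = [(c, embed(c)) for doc in docs for c in chunk(doc)]

question = "Within how many days are refunds accepted?"
top = retrieve(question, index, k=1)

# Step 5: put the retrieved chunks next to the question in the prompt.
prompt = "Context:\n" + "\n".join(top) + f"\n\nQuestion: {question}"
print(prompt)
```

Word counts only match exact tokens, so "refund" and "refunds" would not match here. A learned embedding model is what lets real retrieval match on meaning rather than spelling.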

To update what the system knows, change the documents and re-index the changed chunks. No retraining, and no waiting on a training run.
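As a rough illustration of why updates are cheap, the sketch below overwrites one entry in a tiny in-memory index. `TinyIndex`, its `upsert` method, and the word-count `embed` function are hypothetical stand-ins for a real vector store and embedding model.

```python
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


class TinyIndex:
    """Hypothetical in-memory index keyed by document id."""

    def __init__(self) -> None:
        self.entries: dict[str, tuple[str, Counter]] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # Re-embedding the changed text is the whole update; the LLM is untouched.
        self.entries[doc_id] = (text, embed(text))

    def delete(self, doc_id: str) -> None:
        # Removing an entry makes the text unavailable to later queries.
        self.entries.pop(doc_id, None)


index = TinyIndex()
index.upsert("refund-policy", "Refunds are accepted within 30 days.")
# The policy changes: overwrite the entry so later queries see the new text.
index.upsert("refund-policy", "Refunds are accepted within 14 days.")
print(index.entries["refund-policy"][0])
```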

RAG vs fine-tuning

The two are often confused. Fine-tuning adjusts the model's weights: good for changing style or skill, but expensive and unreliable for injecting facts, and stale the moment your data changes. RAG leaves the model fixed and supplies facts at query time, so knowledge stays current and citable, and your documents never need to enter a training set. For "answer questions about my documents or code," RAG is almost always the right tool. Reach for fine-tuning to change behaviour, not to memorize a knowledge base.

The honest limits

RAG reduces hallucination but doesn't abolish it. It is only as good as its retrieval:

  • If the right passage isn't retrieved, the model may still guess.
  • If irrelevant chunks are injected, they can mislead the answer.
  • Chunking strategy and the embedding model often matter more than which LLM you use.
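One common mitigation for the first two failure modes is to drop weakly matching chunks and decline to answer when nothing clears the bar. The sketch below assumes the retriever returns (chunk, similarity score) pairs; the `min_score` value of 0.3 is an arbitrary placeholder that would need tuning against your own data, and the scores shown are invented.

```python
def select_context(
    scored_chunks: list[tuple[str, float]],
    min_score: float = 0.3,  # assumed threshold; tune it on real queries
    k: int = 3,
) -> list[str]:
    """Keep at most k chunks whose similarity score clears min_score."""
    relevant = [(text, score) for text, score in scored_chunks if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in relevant[:k]]


# Hypothetical retrieval results with made-up similarity scores.
scored = [
    ("Refunds are accepted within 30 days.", 0.62),
    ("Shipping takes 3 to 5 days.", 0.12),
]
context = select_context(scored)

# When nothing is relevant enough, abstain instead of letting the model guess.
weak = select_context([("Shipping takes 3 to 5 days.", 0.12)])
reply = "Context found." if weak else "I could not find this in the indexed documents."
print(context, reply)
```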

RAG is grounding, not a guarantee: treat retrieval quality as the thing to engineer.
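Engineering retrieval quality starts with measuring it. One simple measure, sketched below, is the hit rate at k: the share of test questions for which at least one known-relevant chunk appears in the top k results. The chunk ids and relevance labels are invented for illustration; in practice you would label a small set of real questions by hand.

```python
def hit_rate_at_k(
    retrieved_ids: list[list[str]],
    relevant_ids: list[set[str]],
    k: int,
) -> float:
    """Fraction of questions with at least one relevant chunk in the top k."""
    if not retrieved_ids:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(retrieved_ids, relevant_ids)
        if relevant & set(ranked[:k])
    )
    return hits / len(retrieved_ids)


# Hypothetical evaluation set: ranked results and labeled answers per question.
retrieved_ids = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
relevant_ids = [{"c1"}, {"c9"}, {"c0"}]

print(hit_rate_at_k(retrieved_ids, relevant_ids, k=1))
print(hit_rate_at_k(retrieved_ids, relevant_ids, k=3))
```

Re-running a measure like this after changing the chunk size or the embedding model shows whether the change actually helped retrieval.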

Building it privately

You can run the entire pipeline on your own hardware: a local embedding model and LLM through Ollama, plus a local vector store, so sensitive documents never leave your machine. For choosing the model that generates the final answer, see our guide to the best local LLMs for coding. The architecture is identical whether you run it locally or in the cloud; only where the compute and data live changes.
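A minimal sketch of the local calls, using only the Python standard library, might look like the following. It assumes Ollama is running at its default address and that the `/api/embed` and `/api/generate` endpoints and response fields behave as described in Ollama's REST API documentation; check them against your installed version. The model names `nomic-embed-text` and `llama3.2` are assumptions, so substitute whichever models you have pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for a local Ollama endpoint."""
    return urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request to the local server and decode the JSON reply."""
    with urllib.request.urlopen(build_request(path, payload)) as response:
        return json.load(response)


def embed_local(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Embed text locally. The model name is an assumption."""
    reply = post_json("/api/embed", {"model": model, "input": text})
    return reply["embeddings"][0]


def generate_local(prompt: str, model: str = "llama3.2") -> str:
    """Generate an answer locally. The model name is an assumption."""
    reply = post_json(
        "/api/generate",
        {"model": model, "prompt": prompt, "stream": False},
    )
    return reply["response"]
```

With a server running, `embed_local` would take the place of the embedding step and `generate_local` would receive the assembled prompt. Both requests go to localhost, so the document text stays on the machine.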

The bottom line

RAG is the practical way to make an LLM answer accurately about information it was never trained on: retrieve relevant text, ground the answer in it, cite the source. It beats fine-tuning for facts, can run fully private with local models, and cuts hallucination, provided you invest in good retrieval: RAG is only ever as strong as the passages it pulls.

Photo: Unsplash
