What Is RAG? Retrieval-Augmented Generation Explained (2026)

PrivSec Lab · June 14, 2026 · 3 min read

RAG (Retrieval-Augmented Generation) lets an LLM answer from your own documents by retrieving relevant text and feeding it into the prompt - instead of relying only on what it memorized. How it works, why it cuts hallucination, and its honest limits.

Ask a plain LLM about your company's internal docs or a private codebase and it will either say it doesn't know or, worse, confidently make something up - it was never trained on your data. RAG (Retrieval-Augmented Generation) is how you fix that without retraining anything: retrieve the relevant text first, then let the model answer grounded in it. This guide explains what RAG is, how the pipeline works step by step, why it beats fine-tuning for facts, and its honest limits.

What RAG actually is

RAG combines two parts: a retriever that finds relevant passages from a knowledge source, and a generator (the LLM) that writes an answer using those passages. Instead of hoping the model memorized the right fact during training, you fetch the fact at answer time and put it in the prompt.

The key mental model: the LLM doesn't learn your documents. Each time you ask, the system pulls the relevant pieces and the model reads them fresh - like an open-book exam rather than recall from memory.
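To make the open-book idea concrete, here is a minimal sketch of how retrieved passages might be placed in the prompt. The function name `build_prompt`, the instruction wording, and the sample passages are illustrative assumptions, not a fixed format; real systems vary the template.

```python
def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Assemble an 'open-book' prompt from (source_id, text) pairs."""
    # Label each passage with its source so the model can cite it.
    context = "\n\n".join(f"[{source}] {text}" for source, text in passages)
    return (
        "Answer using only the context below. "
        "Cite sources in square brackets. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical passages that a retriever might have returned.
passages = [
    ("handbook.md", "Employees accrue 2 vacation days per month."),
    ("policy.md", "Unused vacation days expire on 31 March."),
]
prompt = build_prompt("When do unused vacation days expire?", passages)
print(prompt)
```

Nothing here changes the model. The facts travel inside the prompt, which is why the same question can get a different answer as soon as the passages change.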

Figure: Code on a screen - a RAG pipeline indexes your own documents or codebase so the model can retrieve and cite the relevant pieces.

How the pipeline works

1. Chunk - split documents into passages small enough to be precise but large enough to keep context.
2. Embed - turn each chunk into a vector (a numeric representation of meaning) with an embedding model.
3. Store - keep the vectors in a vector database or index.
4. Retrieve - embed the incoming question and find the most similar chunks.
5. Augment & generate - insert the retrieved chunks into the prompt next to the question; the LLM answers grounded in them, ideally with citations.
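The stages above can be sketched end to end in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the "vector database" is a Python list, and the documents are invented. It shows the shape of the pipeline, not production retrieval quality.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    # Chunk: split a document into passages of at most max_words words.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    # Embed: stand-in for an embedding model (word counts, not learned meaning).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(docs: dict[str, str]) -> list[tuple[str, str, Counter]]:
    # Store: an in-memory list of (source, chunk_text, vector) rows.
    return [(name, c, embed(c)) for name, text in docs.items() for c in chunk(text)]


def retrieve(index, question: str, k: int = 2) -> list[tuple[str, str]]:
    # Retrieve: embed the question and return the k most similar chunks.
    q = embed(question)
    ranked = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    return [(name, text) for name, text, _ in ranked[:k]]


docs = {
    "deploy.md": "Deploys run from the main branch. Rollbacks use the previous release tag.",
    "oncall.md": "The on-call engineer is paged for every failed deploy.",
    "style.md": "Python code is formatted with four spaces per indent level.",
}
index = build_index(docs)
question = "How do rollbacks work?"
top = retrieve(index, question, k=1)

# Augment: the retrieved text goes into the prompt that would be sent to the LLM.
context = "\n".join(f"[{name}] {text}" for name, text in top)
print(f"Context:\n{context}\n\nQuestion: {question}")
```

Swapping the toy `embed` for a real embedding model and the list for a vector store leaves the five stages unchanged.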

Update your knowledge by changing the documents - no retraining, no waiting.
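As a rough illustration of what "changing the documents" means for the index, the sketch below re-embeds a document only when its content hash changes. `DocIndex` and its methods are hypothetical names, and the `embed` argument is whatever embedding function you use; here `len` is passed as a placeholder.

```python
import hashlib


class DocIndex:
    """Tiny in-memory index that re-embeds a document only when it changes."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = {}  # doc_id -> (content_hash, vector)

    def upsert(self, doc_id: str, text: str) -> bool:
        """Add or refresh a document; return True if it was (re)embedded."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        current = self.entries.get(doc_id)
        if current and current[0] == digest:
            return False  # unchanged, keep the existing vector
        self.entries[doc_id] = (digest, self.embed(text))
        return True

    def remove(self, doc_id: str) -> None:
        """Forget a deleted document so it can no longer be retrieved."""
        self.entries.pop(doc_id, None)


index = DocIndex(embed=len)  # placeholder embedding function
index.upsert("pricing.md", "Plan A costs 10 EUR.")
index.upsert("pricing.md", "Plan A costs 12 EUR.")  # edited doc replaces the old vector
index.remove("pricing.md")
```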

RAG vs fine-tuning

The two are often confused. Fine-tuning adjusts the model's weights - good for changing style or skill, but an expensive and unreliable way to inject facts, and the result goes stale the moment your data changes. RAG leaves the model fixed and supplies facts at query time, so knowledge stays current, private and citable. For "answer questions about my documents or code," RAG is almost always the right tool. Reach for fine-tuning to change behaviour, not to memorize a knowledge base.

The honest limits

RAG reduces hallucination but doesn't abolish it. It is only as good as its retrieval:

If the right passage isn't retrieved, the model may still guess.
If irrelevant chunks are injected, they can mislead the answer.
Chunking strategy and the embedding model often matter more than which LLM you use.

RAG is grounding, not a guarantee - treat retrieval quality as the thing to engineer.
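One common way to act on these limits is to filter weak matches and decline to answer when nothing relevant was found. The sketch below assumes each retrieved chunk arrives with a similarity score. The function names and the `min_score` value of 0.35 are illustrative assumptions; a usable threshold depends on the embedding model and needs tuning on your own data.

```python
def select_context(
    scored_chunks: list[tuple[float, str]], min_score: float = 0.35, k: int = 3
) -> list[str]:
    """Keep at most k chunks whose similarity score clears min_score."""
    kept = sorted((sc for sc in scored_chunks if sc[0] >= min_score), reverse=True)
    return [text for _, text in kept[:k]]


def answer_or_abstain(scored_chunks: list[tuple[float, str]], generate) -> str:
    """Call the generator only when some chunk is relevant enough."""
    context = select_context(scored_chunks)
    if not context:
        # Nothing relevant retrieved: say so instead of letting the model guess.
        return "I couldn't find this in the indexed documents."
    return generate(context)


scored = [
    (0.82, "Rollbacks use the previous release tag."),
    (0.12, "Python code is formatted with four spaces."),
]
print(select_context(scored))
```

The low-scoring chunk is dropped before it can mislead the answer, and an empty result short-circuits generation entirely.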

Building it privately

You can run the entire pipeline on your own hardware: a local embedding model and LLM through Ollama, plus a local vector store, so sensitive documents never leave your machine. For choosing the model that generates the final answer, see our guide to the best local LLMs for coding. The architecture is identical whether you run it locally or in the cloud - only where the compute and data live changes.
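For a sense of what the local setup looks like, here is a minimal sketch that talks to an Ollama server over its REST API using only the standard library. It assumes Ollama is running on its default port (11434) and that the `/api/embed` and `/api/generate` endpoints behave as in current Ollama documentation. The model names are examples you would need to have pulled, and the response fields are worth checking against your installed version.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the local Ollama server and decode the reply."""
    req = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def embed(text: str, post=post_json, model: str = "nomic-embed-text") -> list[float]:
    # Assumed response shape: {"embeddings": [[...]]} with one vector per input.
    return post("/api/embed", {"model": model, "input": text})["embeddings"][0]


def generate(prompt: str, post=post_json, model: str = "llama3.1") -> str:
    # stream=False asks for a single JSON object instead of a token stream.
    body = post("/api/generate", {"model": model, "prompt": prompt, "stream": False})
    return body["response"]
```

With the server running, `embed("some chunk")` returns a vector to store and `generate(prompt)` returns the answer text. Every request stays on localhost, so no document content leaves the machine.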

The bottom line

RAG is the practical way to make an LLM answer accurately about information it was never trained on: retrieve relevant text, ground the answer in it, cite the source. It beats fine-tuning for facts, can run fully privately on local models, and cuts hallucination - as long as you invest in good retrieval, because RAG is only ever as strong as the passages it pulls.

Photo: Unsplash

FAQ

What is RAG?

RAG stands for Retrieval-Augmented Generation. It is a technique that gives a large language model access to external knowledge at answer time: instead of relying only on what the model memorized during training, the system first retrieves relevant passages from a document collection (your wiki, codebase, PDFs) and inserts them into the prompt, so the model answers grounded in that retrieved text. It is the standard way to make an LLM answer accurately about private, specific or up-to-date information it was never trained on.

How does RAG work, step by step?

Five stages. (1) Chunk: split your documents into passages. (2) Embed: convert each chunk into a vector (a numeric representation of meaning) with an embedding model. (3) Store: keep those vectors in a vector database or index. (4) Retrieve: when a question comes in, embed it too and find the most similar chunks. (5) Augment and generate: paste the retrieved chunks into the prompt alongside the question, and the LLM writes an answer grounded in them. The model never 'learns' your data - it reads the relevant pieces fresh each time.
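The chunking stage is where the "precise but with enough context" trade-off is decided. One common approach, sketched here as an assumption rather than a recommendation, is fixed-size word windows that overlap, so a sentence cut at a boundary still appears whole in a neighbouring chunk. The name `chunk_words` and the default sizes are illustrative.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Splitting on headings, paragraphs or code structure is often a better fit than raw word counts; the overlap idea carries over either way.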

Why use RAG instead of fine-tuning?

They solve different problems. Fine-tuning changes the model's weights to adjust its style or skills, but it is expensive, slow to update, and a poor way to inject facts - the model can still confidently get details wrong. RAG keeps the model fixed and supplies facts at query time, so you can update knowledge by simply changing the documents, keep data private and current, and cite sources. For 'answer questions about my documents/code', RAG is usually the right tool; fine-tuning is for changing behaviour, not for memorizing a knowledge base.

Does RAG stop hallucinations?

It reduces them, but does not eliminate them. By grounding answers in retrieved source text, RAG makes the model far less likely to invent facts and lets you show citations. But it is only as good as its retrieval: if the right passage isn't retrieved, the model may still guess, and if irrelevant chunks are injected, the answer can be misled. Good chunking, a solid embedding model, and returning enough relevant context often matter more than the choice of LLM. RAG is grounding, not a guarantee.
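Because the answer depends on retrieval, it helps to measure retrieval separately from generation. A simple check, sketched below, is the hit rate at k: for a small set of test questions with known relevant chunks, how often does at least one relevant chunk appear in the top k results? The function name, the chunk ids and the test data are hypothetical.

```python
def hit_rate_at_k(
    results: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 3
) -> float:
    """Share of questions whose top-k retrieved ids include a relevant id."""
    hits = sum(1 for q, ids in results.items() if relevant[q] & set(ids[:k]))
    return hits / len(results) if results else 0.0


# Ranked chunk ids returned by the retriever for each test question.
results = {"q1": ["c3", "c1"], "q2": ["c9", "c8"]}
# Chunk ids a human marked as containing the answer.
relevant = {"q1": {"c1"}, "q2": {"c2"}}

print(hit_rate_at_k(results, relevant, k=2))  # 0.5
```

Tracking a number like this while changing chunk size or embedding model shows whether retrieval actually improved, independently of which LLM writes the final answer.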

Can I build RAG privately on my own machine?

Yes. You can run the whole pipeline locally: a local embedding model and LLM via a tool like Ollama, plus a local vector store, so your documents never leave your machine. That makes RAG a strong fit for sensitive or proprietary data - internal docs, private code - where sending content to a hosted API isn't acceptable. The trade-off is the usual local-vs-cloud one: local gives privacy and no per-query API fees, though you pay for the hardware; the largest hosted models still lead on the hardest reasoning.
