Theory
Lesson 1/5

What is RAG?

Understanding Retrieval-Augmented Generation and why Vector Databases are needed


1. The Problem with LLMs

1.1 LLM Limitations

Large Language Models (GPT-4, Claude, Gemini) have three major problems:

Text
┌─────────────────────────────────────────────────────────────┐
│                       LLM LIMITATIONS                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. 📅 KNOWLEDGE CUTOFF                                     │
│     "Who won the 2025 World Cup?"                           │
│     → "I don't have information after my training date"     │
│                                                             │
│  2. 🏢 NO PRIVATE DATA                                      │
│     "What's our company's revenue policy?"                  │
│     → "I don't have access to your company documents"       │
│                                                             │
│  3. 🎭 HALLUCINATION                                        │
│     "What's the price of Product X?"                        │
│     → "Product X costs $199" (completely made up!)          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

1.2 Traditional Solutions

Approach          Limitation
Fine-tuning       Expensive, time-consuming, static
Prompt with docs  Context window limit (~100K tokens)
Retrain model     Impractical for most companies

We need a better solution → RAG!


2. What is RAG?

2.1 Definition

RAG = Retrieval-Augmented Generation

A technique that enhances LLM responses by:

  1. Retrieving relevant documents from your data
  2. Augmenting the prompt with retrieved context
  3. Generating accurate answers based on your data

2.2 RAG Architecture

Text
┌─────────────────────────────────────────────────────────────┐
│                        RAG PIPELINE                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  USER QUERY                                                 │
│  "What's our refund policy?"                                │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────┐                                            │
│  │  EMBEDDING  │ → Convert query to vector                  │
│  └─────────────┘                                            │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────────────────────┐                            │
│  │      VECTOR DATABASE        │                            │
│  │  ┌───┐ ┌───┐ ┌───┐          │                            │
│  │  │Doc│ │Doc│ │Doc│ ...      │ → Find similar docs        │
│  │  └───┘ └───┘ └───┘          │                            │
│  └─────────────────────────────┘                            │
│       │                                                     │
│       ▼                                                     │
│  RETRIEVED CONTEXT:                                         │
│  "Refund policy: Full refund within 30 days..."             │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────┐                                            │
│  │     LLM     │ → Generate answer with context             │
│  └─────────────┘                                            │
│       │                                                     │
│       ▼                                                     │
│  ANSWER: "Our refund policy allows full refunds             │
│          within 30 days of purchase..."                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

2.3 Why RAG Works

Without RAG                    With RAG
LLM guesses based on training  LLM answers based on YOUR data
May hallucinate                Grounded in retrieved documents
Static knowledge               Dynamic, up-to-date knowledge
Generic answers                Specific, accurate answers

3. Vector Embeddings Explained

3.1 What is an Embedding?

Embedding = Converting text to numbers (vectors) that capture meaning

Text
1"dog" → [0.2, 0.8, 0.1, 0.5, ...] (1536 dimensions)
2"cat" → [0.3, 0.7, 0.2, 0.4, ...] (similar to dog!)
3"car" → [0.9, 0.1, 0.8, 0.2, ...] (very different)

3.2 Semantic Similarity

Similar meanings → Similar vectors → Close in vector space

Text
Vector Space (simplified 2D)

  ▲
  │      "puppy" ●     ● "kitten"
  │        "dog" ●     ● "cat"
  │
  │
  │    "vehicle" ●     ● "automobile"
  │        "car" ●
  │
──┼──────────────────────────────────►
  │

Animals cluster together, vehicles cluster together

3.3 Embedding Models

Model                          Dimensions  Best For
OpenAI text-embedding-3-small  1536        General use
OpenAI text-embedding-3-large  3072        Higher accuracy
Cohere embed-v3                1024        Multilingual
BGE (open source)              768-1024    Free alternative
Sentence Transformers          384-768     Local deployment

3.4 Creating Embeddings

Python
from openai import OpenAI

client = OpenAI()

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Example
doc_embedding = get_embedding("Our refund policy allows returns within 30 days")
query_embedding = get_embedding("How do I get a refund?")

# These will be similar because they're about the same topic!
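
You can check this numerically with cosine similarity. A minimal sketch using numpy (the `cosine_similarity` helper below is our own, not part of the OpenAI SDK):

Python
import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = (a · b) / (|a| · |b|)
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The related pair scores noticeably higher than an unrelated one
print(cosine_similarity(doc_embedding, query_embedding))       # higher
print(cosine_similarity(doc_embedding, get_embedding("dog")))  # lower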

4. Vector Databases

4.1 Why Vector Database?

Regular databases can't do similarity search:

SQL
-- This doesn't work in SQL!
SELECT * FROM documents
WHERE embedding SIMILAR TO query_embedding

Vector Databases are designed for:

  • Storing millions of vectors efficiently
  • Fast similarity search (milliseconds)
  • Combined metadata filtering + similarity search

4.2 Popular Vector Databases

Database  Type                  Best For
Pinecone  Managed cloud         Production, scale
Weaviate  Self-hosted/cloud     Flexibility
Qdrant    Self-hosted/cloud     Performance
Chroma    Local/embedded        Prototyping
Milvus    Self-hosted           Enterprise
pgvector  PostgreSQL extension  Existing Postgres users

4.3 Vector Search Types

Cosine Similarity (most common):

Text
similarity = cos(θ) between vectors
Range: -1 to 1 (1 = identical, 0 = unrelated)

Euclidean Distance:

Text
distance = √(Σ(a - b)²)
Lower = more similar

Dot Product:

Text
score = Σ(a × b)
Higher = more similar
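
All three metrics are a few lines of numpy. A side-by-side sketch using made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

Python
import numpy as np

a = np.array([0.2, 0.8, 0.1])  # illustrative vectors, not real embeddings
b = np.array([0.3, 0.7, 0.2])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)  # √(Σ(a - b)²)
dot = np.dot(a, b)                 # Σ(a × b)

print(cosine, euclidean, dot)

For normalized (unit-length) vectors, cosine similarity and dot product produce the same ranking, which is why many embedding providers normalize their outputs.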

5. RAG Pipeline Steps

Step 1: Document Ingestion

Python
# Load documents
documents = [
    "Our refund policy allows full refunds within 30 days...",
    "Shipping takes 3-5 business days for domestic orders...",
    "Premium members get 20% discount on all purchases...",
]

# Chunk documents (split long docs into smaller pieces).
# NOTE: split_text isn't a library function — a minimal character-based
# version (our own helper) might look like this:
def split_text(text, chunk_size=500, overlap=50):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = []
for doc in documents:
    # Split into ~500 character chunks with overlap
    chunks.extend(split_text(doc, chunk_size=500, overlap=50))

Step 2: Create & Store Embeddings

Python
import chromadb

# Initialize Chroma under its own name so it doesn't shadow the OpenAI `client`
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("company_docs")

# Add documents with embeddings
for i, chunk in enumerate(chunks):
    embedding = get_embedding(chunk)
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[embedding],
        documents=[chunk],
        metadatas=[{"source": "policy.pdf"}]
    )

Step 3: Query & Retrieve

Python
def retrieve_context(query, n_results=3):
    # Get query embedding
    query_embedding = get_embedding(query)

    # Search vector database
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )

    return results["documents"][0]  # Top N relevant chunks

Step 4: Generate Answer

Python
def generate_answer(query, context):
    prompt = f"""Answer the question based ONLY on the following context:

Context:
{context}

Question: {query}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

# Full RAG pipeline
query = "What's your refund policy?"
context = "\n".join(retrieve_context(query))  # join the retrieved chunks
answer = generate_answer(query, context)
print(answer)

6. RAG vs Fine-tuning

When to Use RAG

Use RAG when:

  • Data changes frequently
  • Need source attribution
  • Limited training data
  • Quick implementation needed
  • Privacy concerns (data stays local)

When to Use Fine-tuning

Use Fine-tuning when:

  • Specific writing style needed
  • Domain-specific vocabulary
  • Consistent behavior required
  • Data is stable
  • Performance is critical

Comparison

Aspect           RAG              Fine-tuning
Cost             Low (API calls)  High (training)
Setup time       Hours            Days/weeks
Data freshness   Real-time        Static
Source citation  ✅ Easy          ❌ Difficult
Hallucination    Lower            Can still occur
Customization    Limited style    Full control

Best Practice: Combine Both

Text
Fine-tuned model (for tone/style)
        +
RAG (for current data)
        =
Best of both worlds!

7. Real-World RAG Use Cases

7.1 Customer Support Bot

Text
1User: "I received a damaged product, what should I do?"
2
3RAG retrieves:
4- Return policy document
5- Damage claim process
6- Contact information
7
8Bot: "I'm sorry to hear that. According to our policy,
9you can file a damage claim within 48 hours..."

7.2 Internal Knowledge Base

Text
1Employee: "What's the process for requesting PTO?"
2
3RAG retrieves:
4- HR policy document
5- PTO request form
6- Manager approval workflow
7
8Answer: "To request PTO, submit a request in HR portal
9at least 2 weeks in advance..."

7.3 Legal Document Analysis

Text
1Lawyer: "What does clause 5.2 say about liability?"
2
3RAG retrieves:
4- Contract section 5.2
5- Related amendments
6- Previous interpretations
7
8Answer: "Clause 5.2 states that liability is limited to..."

7.4 Code Documentation Assistant

Text
1Developer: "How do I authenticate API requests?"
2
3RAG retrieves:
4- API documentation
5- Code examples
6- Authentication guide
7
8Answer: "Use Bearer token authentication. Here's an example..."

8. RAG Challenges & Solutions

Challenge 1: Chunking Strategy

Problem: How to split documents?

Solutions:

Python
# Fixed size chunks
chunks = split_by_tokens(doc, size=500)

# Semantic chunks (by paragraph/section)
chunks = split_by_headers(doc)

# Sliding window with overlap
chunks = sliding_window(doc, size=500, overlap=100)

# (These helper names are illustrative, not a specific library API —
# see the sketch below for one concrete implementation.)
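
As one concrete example, here is a minimal sketch of the semantic-chunking idea, assuming markdown-style `#` headers mark section boundaries:

Python
import re

def split_by_headers(doc):
    # Split at markdown-style headers (#, ##, ...), keeping each header
    # attached to the text that follows it
    parts = re.split(r"(?m)^(?=#{1,6} )", doc)
    return [p.strip() for p in parts if p.strip()]

doc = "# Refunds\nFull refund within 30 days.\n\n# Shipping\n3-5 business days."
print(split_by_headers(doc))
# ['# Refunds\nFull refund within 30 days.', '# Shipping\n3-5 business days.']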

Challenge 2: Retrieval Quality

Problem: Retrieved docs aren't relevant

Solutions:

  • Hybrid search (keyword + semantic)
  • Re-ranking retrieved results
  • Query expansion/rewriting
  • Metadata filtering (see the sketch below)
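
As one example, metadata filtering restricts the candidate set before similarity ranking. A minimal sketch using Chroma's `where` filter, reusing the collection and query embedding from Section 5 (the filter value is illustrative):

Python
# Only rank chunks whose metadata matches the filter
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    where={"source": "policy.pdf"}  # illustrative metadata filter
)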

Challenge 3: Context Window Limits

Problem: Too much context for LLM

Solutions:

  • Summarize retrieved chunks
  • Use compression techniques
  • Selective context inclusion (sketched below)
  • Hierarchical retrieval
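
One simple form of selective inclusion is a distance threshold: retrieve generously, then keep only chunks close to the query. A sketch with Chroma, which can return distances alongside documents (the 0.5 cutoff is an illustrative value to tune per embedding model):

Python
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10,
    include=["documents", "distances"]
)

# Keep only close matches so the prompt stays small
context_chunks = [
    doc for doc, dist in zip(results["documents"][0], results["distances"][0])
    if dist < 0.5  # illustrative threshold
]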

Challenge 4: Hallucination

Problem: LLM still makes things up

Solutions:

  • Strong prompting ("Only use provided context")
  • Fact verification step
  • Source citation requirement (example below)
  • Confidence scoring
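
Source citation can be pushed through the prompt itself: number the retrieved chunks and require the model to cite them. A minimal sketch building on `context_chunks` from the previous example (the prompt wording is illustrative, not a guaranteed fix):

Python
numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
prompt = f"""Answer using ONLY the numbered sources below.
Cite a source like [1] after each claim. If the answer is not in the
sources, reply "I don't know".

Sources:
{numbered}

Question: {query}
"""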

9. Hands-on: Simple RAG System

Prerequisites

Bash
pip install openai chromadb python-dotenv

Complete Example

Python
from openai import OpenAI
import chromadb
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

# Sample knowledge base
knowledge_base = [
    "MinAI Learning Platform offers courses in AI, Data Science, and Automation.",
    "Course pricing starts at 500,000 VND for basic courses.",
    "Premium subscription costs 2,000,000 VND per year with unlimited access.",
    "Refunds are available within 7 days of purchase.",
    "Contact support at support@minai.vn for assistance.",
]

# Initialize vector database
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("minai_kb")

# Index documents
print("Indexing documents...")
for i, doc in enumerate(knowledge_base):
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=doc
    ).data[0].embedding

    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[embedding],
        documents=[doc]
    )

def ask_rag(question):
    # Get query embedding
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding

    # Retrieve relevant docs
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=2
    )
    context = "\n".join(results["documents"][0])

    # Generate answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based only on the provided context. If you don't know, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )

    return response.choices[0].message.content

# Test
questions = [
    "How much does a premium subscription cost?",
    "What's the refund policy?",
    "How do I contact support?",
]

for q in questions:
    print(f"\nQ: {q}")
    print(f"A: {ask_rag(q)}")

10. Homework

Exercise 1: Concept Review

Answer the following questions:

  1. Which LLM problems does RAG solve?
  2. What is an embedding and why is it needed?
  3. When should you use RAG vs Fine-tuning?

Exercise 2: Build a Simple RAG System

  1. Prepare 10 FAQs on a topic you're interested in
  2. Index them into ChromaDB
  3. Build a Q&A system
  4. Test it with 5 different questions

Exercise 3: Explore Vector Databases

  1. Research Pinecone or Weaviate
  2. Compare their features with ChromaDB
  3. Try deploying a vector database

Summary

In this lesson you learned:

  • ✅ LLM limitations and why RAG is needed
  • ✅ RAG architecture: Retrieve → Augment → Generate
  • ✅ Vector embeddings and semantic similarity
  • ✅ Vector databases and similarity search
  • ✅ The RAG pipeline from ingestion to generation
  • ✅ RAG vs Fine-tuning trade-offs
  • ✅ Real-world use cases
  • ✅ Common challenges and solutions

Next: Lesson 2 - Vector Databases Deep Dive - Pinecone, Weaviate, Chroma comparison