Lesson 1: What is RAG?
1. The Problem with LLMs
1.1 LLM Limitations
Large Language Models (GPT-4, Claude, Gemini) have three major limitations:
```
┌─────────────────────────────────────────────────────────────┐
│                       LLM LIMITATIONS                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. 📅 KNOWLEDGE CUTOFF                                      │
│     "Who won the 2025 World Cup?"                            │
│     → "I don't have information after my training date"      │
│                                                              │
│  2. 🏢 NO PRIVATE DATA                                       │
│     "What's our company's revenue policy?"                   │
│     → "I don't have access to your company documents"        │
│                                                              │
│  3. 🎭 HALLUCINATION                                         │
│     "What's the price of Product X?"                         │
│     → "Product X costs $199" (completely made up!)           │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```
1.2 Traditional Solutions
| Approach | Limitation |
|---|---|
| Fine-tuning | Expensive, time-consuming, static |
| Prompt with docs | Context window limit (~100K tokens) |
| Retrain model | Impractical for most companies |
We need a better solution → RAG!
2. What is RAG?
2.1 Definition
RAG = Retrieval-Augmented Generation
A technique that enhances LLM responses in three steps (see the minimal sketch after this list):
- Retrieving relevant documents from your data
- Augmenting the prompt with retrieved context
- Generating accurate answers based on your data
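In code, the whole idea fits in a few lines. The following is a conceptual sketch only; `search_vector_db` and `llm` are hypothetical placeholders for the retrieval and generation components built later in this lesson:

```python
# Conceptual sketch only: search_vector_db and llm are placeholders
# for the real components built in the following sections.
def rag_answer(query):
    docs = search_vector_db(query, top_k=3)            # 1. Retrieve
    prompt = f"Context:\n{docs}\n\nQuestion: {query}"  # 2. Augment
    return llm(prompt)                                 # 3. Generate
```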
2.2 RAG Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                         RAG PIPELINE                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  USER QUERY                                                  │
│  "What's our refund policy?"                                 │
│        │                                                     │
│        ▼                                                     │
│  ┌─────────────┐                                             │
│  │  EMBEDDING  │  → Convert query to vector                  │
│  └─────────────┘                                             │
│        │                                                     │
│        ▼                                                     │
│  ┌─────────────────────────────┐                             │
│  │       VECTOR DATABASE       │                             │
│  │   ┌───┐  ┌───┐  ┌───┐       │                             │
│  │   │Doc│  │Doc│  │Doc│ ...   │  → Find similar docs        │
│  │   └───┘  └───┘  └───┘       │                             │
│  └─────────────────────────────┘                             │
│        │                                                     │
│        ▼                                                     │
│  RETRIEVED CONTEXT:                                          │
│  "Refund policy: Full refund within 30 days..."              │
│        │                                                     │
│        ▼                                                     │
│  ┌─────────────┐                                             │
│  │     LLM     │  → Generate answer with context             │
│  └─────────────┘                                             │
│        │                                                     │
│        ▼                                                     │
│  ANSWER: "Our refund policy allows full refunds              │
│           within 30 days of purchase..."                     │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```
2.3 Why RAG Works
| Without RAG | With RAG |
|---|---|
| LLM guesses based on training | LLM answers based on YOUR data |
| May hallucinate | Grounded in retrieved documents |
| Static knowledge | Dynamic, up-to-date knowledge |
| Generic answers | Specific, accurate answers |
3. Vector Embeddings Explained
3.1 What is an Embedding?
Embedding = Converting text to numbers (vectors) that capture meaning
1"dog" → [0.2, 0.8, 0.1, 0.5, ...] (1536 dimensions)2"cat" → [0.3, 0.7, 0.2, 0.4, ...] (similar to dog!)3"car" → [0.9, 0.1, 0.8, 0.2, ...] (very different)3.2 Semantic Similarity
Similar meanings → Similar vectors → Close in vector space
```
Vector Space (simplified 2D)

                 ▲
      "puppy" ●  │  ● "kitten"
        "dog" ●  │  ● "cat"
                 │
                 │        ● "automobile"
                 │   ● "vehicle"
                 │        ● "car"
  ───────────────┼──────────────────►
                 │

  Animals cluster together, vehicles cluster together
```
3.3 Embedding Models
| Model | Dimensions | Best For |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General use |
| OpenAI text-embedding-3-large | 3072 | Higher accuracy |
| Cohere embed-v3 | 1024 | Multilingual |
| BGE (open source) | 768-1024 | Free alternative |
| Sentence Transformers | 384-768 | Local deployment |
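If you prefer not to call an external API, the open-source rows in the table can run locally. A minimal sketch, assuming the `sentence-transformers` package and the `all-MiniLM-L6-v2` model (384 dimensions):

```python
# Local embeddings, no API key required (pip install sentence-transformers)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small 384-dimension model

texts = [
    "Our refund policy allows returns within 30 days",
    "How do I get a refund?",
]
embeddings = model.encode(texts)  # numpy array of shape (2, 384)
print(embeddings.shape)
```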
3.4 Creating Embeddings
```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Example
doc_embedding = get_embedding("Our refund policy allows returns within 30 days")
query_embedding = get_embedding("How do I get a refund?")

# These will be similar because they're about the same topic!
```
4. Vector Databases
4.1 Why Vector Database?
Regular databases can't do similarity search:
```sql
-- This doesn't work in SQL!
SELECT * FROM documents
WHERE embedding SIMILAR TO query_embedding
```
Vector databases are designed to:
- Store millions of vectors efficiently
- Run similarity search in milliseconds
- Combine metadata filters with search
4.2 Popular Vector Databases
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production, scale |
| Weaviate | Self-hosted/cloud | Flexibility |
| Qdrant | Self-hosted/cloud | Performance |
| Chroma | Local/embedded | Prototyping |
| Milvus | Self-hosted | Enterprise |
| pgvector | PostgreSQL extension | Existing Postgres users |
4.3 Vector Search Types
Cosine Similarity (most common):
```
similarity = cos(θ) between vectors
Range: -1 to 1  (1 = identical, 0 = unrelated)
```
Euclidean Distance:
```
distance = √(Σ(a-b)²)
Lower = more similar
```
Dot Product:
```
score = Σ(a × b)
Higher = more similar
```
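To make the three metrics concrete, here is a small NumPy sketch that computes each one for two toy vectors (illustrative values, not real embeddings):

```python
import numpy as np

# Two toy vectors standing in for embeddings (not real model output)
a = np.array([0.2, 0.8, 0.1, 0.5])
b = np.array([0.3, 0.7, 0.2, 0.4])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # closer to 1 = more similar
euclidean = np.linalg.norm(a - b)                                # lower = more similar
dot = np.dot(a, b)                                               # higher = more similar

print(f"cosine={cosine:.3f}, euclidean={euclidean:.3f}, dot={dot:.3f}")
```

Note that for vectors normalized to unit length (which many embedding models return), cosine similarity and dot product produce the same ranking.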
5. RAG Pipeline Steps
Step 1: Document Ingestion
```python
# Load documents
documents = [
    "Our refund policy allows full refunds within 30 days...",
    "Shipping takes 3-5 business days for domestic orders...",
    "Premium members get 20% discount on all purchases...",
]

def split_text(text, chunk_size=500, overlap=50):
    # Simple character-based splitter with overlapping windows
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Chunk documents (split long docs into smaller pieces)
chunks = []
for doc in documents:
    # Split into ~500 character chunks with overlap
    chunks.extend(split_text(doc, chunk_size=500, overlap=50))
```
Step 2: Create & Store Embeddings
```python
import chromadb

# Initialize Chroma (named chroma_client so it doesn't shadow the OpenAI client)
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("company_docs")

# Add documents with embeddings
for i, chunk in enumerate(chunks):
    embedding = get_embedding(chunk)
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[embedding],
        documents=[chunk],
        metadatas=[{"source": "policy.pdf"}]
    )
```
Step 3: Query & Retrieve
```python
def retrieve_context(query, n_results=3):
    # Get query embedding
    query_embedding = get_embedding(query)

    # Search vector database
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )

    return results["documents"][0]  # Top N relevant chunks
```
Step 4: Generate Answer
```python
def generate_answer(query, context):
    prompt = f"""Answer the question based ONLY on the following context:

Context:
{context}

Question: {query}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

# Full RAG pipeline
query = "What's your refund policy?"
context = "\n".join(retrieve_context(query))  # join retrieved chunks into one string
answer = generate_answer(query, context)
print(answer)
```
6. RAG vs Fine-tuning
When to Use RAG
✅ Use RAG when:
- Data changes frequently
- Need source attribution
- Limited training data
- Quick implementation needed
- Privacy concerns (data stays local)
When to Use Fine-tuning
✅ Use Fine-tuning when:
- Specific writing style needed
- Domain-specific vocabulary
- Consistent behavior required
- Data is stable
- Performance is critical
Comparison
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Cost | Low (API calls) | High (training) |
| Setup time | Hours | Days/weeks |
| Data freshness | Real-time | Static |
| Source citation | ✅ Easy | ❌ Difficult |
| Hallucination | Lower | Can still occur |
| Customization | Limited style | Full control |
Best Practice: Combine Both
```
Fine-tuned model (for tone/style)
        +
RAG (for current data)
        =
Best of both worlds!
```
7. Real-World RAG Use Cases
7.1 Customer Support Bot
```
User: "I received a damaged product, what should I do?"

RAG retrieves:
- Return policy document
- Damage claim process
- Contact information

Bot: "I'm sorry to hear that. According to our policy,
you can file a damage claim within 48 hours..."
```
7.2 Internal Knowledge Base
```
Employee: "What's the process for requesting PTO?"

RAG retrieves:
- HR policy document
- PTO request form
- Manager approval workflow

Answer: "To request PTO, submit a request in the HR portal
at least 2 weeks in advance..."
```
7.3 Legal Document Analysis
```
Lawyer: "What does clause 5.2 say about liability?"

RAG retrieves:
- Contract section 5.2
- Related amendments
- Previous interpretations

Answer: "Clause 5.2 states that liability is limited to..."
```
7.4 Code Documentation Assistant
```
Developer: "How do I authenticate API requests?"

RAG retrieves:
- API documentation
- Code examples
- Authentication guide

Answer: "Use Bearer token authentication. Here's an example..."
```
8. RAG Challenges & Solutions
Challenge 1: Chunking Strategy
Problem: How to split documents?
Solutions:
```python
# Three common strategies (illustrative helper names, not a specific library):

# Fixed size chunks
chunks = split_by_tokens(doc, size=500)

# Semantic chunks (by paragraph/section)
chunks = split_by_headers(doc)

# Sliding window with overlap
chunks = sliding_window(doc, size=500, overlap=100)
```
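As one concrete example, here is a minimal header-based splitter for markdown-style documents. It is only a sketch of the idea behind the illustrative `split_by_headers` name above, not a library function:

```python
def split_by_headers(text):
    """Split markdown-style text into chunks at heading lines (# ...)."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:  # a new section starts here
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```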
Challenge 2: Retrieval Quality
Problem: Retrieved docs aren't relevant
Solutions:
- Hybrid search (keyword + semantic)
- Re-ranking retrieved results
- Query expansion/rewriting
- Metadata filtering (see the sketch below)
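As an example of metadata filtering, Chroma lets you combine semantic search with a `where` filter. A sketch, assuming the `collection` from Step 2 (where each chunk was stored with a "source" metadata field) and the `get_embedding` helper from section 3.4:

```python
# Metadata filtering: only search chunks whose metadata matches the filter.
results = collection.query(
    query_embeddings=[get_embedding("How do I get a refund?")],
    n_results=3,
    where={"source": "policy.pdf"}  # restrict the search to one source document
)
print(results["documents"][0])
```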
Challenge 3: Context Window Limits
Problem: Too much context for LLM
Solutions:
- Summarize retrieved chunks
- Use compression techniques
- Selective context inclusion (sketched below)
- Hierarchical retrieval
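A simple form of selective context inclusion is a budget: keep adding the best-ranked chunks until a rough character (or token) limit is reached. A minimal sketch, assuming the chunks are already sorted from most to least relevant:

```python
def select_context(ranked_chunks, max_chars=4000):
    """Keep the best-ranked chunks until a rough character budget is used up."""
    selected, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted: most relevant first
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    return "\n\n".join(selected)
```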
Challenge 4: Hallucination
Problem: LLM still makes things up
Solutions:
- Strong prompting ("Only use provided context"), as in the sketch below
- Fact verification step
- Source citation requirement
- Confidence scoring
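The first three ideas can be combined in a stricter prompt that forces the model to cite which chunks it used and to say "I don't know" when the context doesn't contain the answer. A sketch, reusing `client` and `retrieve_context` from section 5:

```python
GROUNDED_PROMPT = """Answer the question using ONLY the numbered context below.
Cite the context numbers you used, e.g. [1], [2].
If the context does not contain the answer, reply exactly: "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def grounded_answer(question):
    chunks = retrieve_context(question)  # from Step 3
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": GROUNDED_PROMPT.format(context=context, question=question)}]
    )
    return response.choices[0].message.content
```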
9. Hands-on: Simple RAG System
Prerequisites
```bash
pip install openai chromadb python-dotenv
```
Complete Example
```python
import os
from openai import OpenAI
import chromadb
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

# Sample knowledge base
knowledge_base = [
    "MinAI Learning Platform offers courses in AI, Data Science, and Automation.",
    "Course pricing starts at 500,000 VND for basic courses.",
    "Premium subscription costs 2,000,000 VND per year with unlimited access.",
    "Refunds are available within 7 days of purchase.",
    "Contact support at support@minai.vn for assistance.",
]

# Initialize vector database
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("minai_kb")

# Index documents
print("Indexing documents...")
for i, doc in enumerate(knowledge_base):
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=doc
    ).data[0].embedding

    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[embedding],
        documents=[doc]
    )

def ask_rag(question):
    # Get query embedding
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding

    # Retrieve relevant docs
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=2
    )
    context = "\n".join(results["documents"][0])

    # Generate answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based only on the provided context. If you don't know, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )

    return response.choices[0].message.content

# Test
questions = [
    "How much does a premium subscription cost?",
    "What's the refund policy?",
    "How do I contact support?",
]

for q in questions:
    print(f"\nQ: {q}")
    print(f"A: {ask_rag(q)}")
```
10. Homework
Exercise 1: Concept Review
Answer the following questions:
- What LLM problems does RAG solve?
- What is an embedding and why is it needed?
- When should you use RAG vs fine-tuning?
Exercise 2: Build a Simple RAG System
- Prepare 10 FAQs on a topic you're interested in
- Index them into ChromaDB
- Build a Q&A system
- Test it with 5 different questions
Exercise 3: Explore Vector Databases
- Research Pinecone or Weaviate
- Compare their features with ChromaDB
- Try deploying a vector database
Summary
In this lesson you learned:
- ✅ LLM limitations and why RAG is needed
- ✅ RAG architecture: Retrieve → Augment → Generate
- ✅ Vector embeddings and semantic similarity
- ✅ Vector databases and similarity search
- ✅ The RAG pipeline from ingestion to generation
- ✅ RAG vs Fine-tuning trade-offs
- ✅ Real-world use cases
- ✅ Common challenges and solutions
Next: Lesson 2 - Vector Databases Deep Dive - Pinecone, Weaviate, Chroma comparison
