Lesson 1: What is RAG?
1. The Problem with LLMs
1.1 LLM Limitations
Large Language Models (GPT-4, Claude, Gemini) have three major limitations:
```
┌─────────────────────────────────────────────────────────────┐
│                       LLM LIMITATIONS                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. 📅 KNOWLEDGE CUTOFF                                      │
│     "Who won the 2025 World Cup?"                            │
│     → "I don't have information after my training date"      │
│                                                              │
│  2. 🏢 NO PRIVATE DATA                                       │
│     "What's our company's revenue policy?"                   │
│     → "I don't have access to your company documents"        │
│                                                              │
│  3. 🎭 HALLUCINATION                                         │
│     "What's the price of Product X?"                         │
│     → "Product X costs $199" (completely made up!)           │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```
1.2 Traditional Solutions
| Approach | Limitation |
|---|---|
| Fine-tuning | Expensive, time-consuming, static |
| Prompt with docs | Context window limit (~100K tokens) |
| Retrain model | Impractical for most companies |
We need a better solution → RAG!
2. What is RAG?
2.1 Definition
RAG = Retrieval-Augmented Generation
A technique that enhances LLM responses in three steps (see the minimal sketch after this list):
- Retrieving relevant documents from your data
- Augmenting the prompt with retrieved context
- Generating accurate answers based on your data
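In code, the whole idea fits in a few lines. The following is a conceptual sketch only; `search_vector_db` and `llm` are hypothetical placeholders for the retrieval and generation components built later in this lesson:

```python
# Conceptual sketch only: search_vector_db and llm are placeholders
# for the real components built in the following sections.
def rag_answer(query):
    docs = search_vector_db(query, top_k=3)            # 1. Retrieve
    prompt = f"Context:\n{docs}\n\nQuestion: {query}"  # 2. Augment
    return llm(prompt)                                 # 3. Generate
```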
2.2 RAG Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                         RAG PIPELINE                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  USER QUERY                                                  │
│  "What's our refund policy?"                                 │
│        │                                                     │
│        ▼                                                     │
│  ┌─────────────┐                                             │
│  │  EMBEDDING  │  → Convert query to vector                  │
│  └─────────────┘                                             │
│        │                                                     │
│        ▼                                                     │
│  ┌─────────────────────────────┐                             │
│  │       VECTOR DATABASE       │                             │
│  │   ┌───┐  ┌───┐  ┌───┐       │                             │
│  │   │Doc│  │Doc│  │Doc│ ...   │  → Find similar docs        │
│  │   └───┘  └───┘  └───┘       │                             │
│  └─────────────────────────────┘                             │
│        │                                                     │
│        ▼                                                     │
│  RETRIEVED CONTEXT:                                          │
│  "Refund policy: Full refund within 30 days..."              │
│        │                                                     │
│        ▼                                                     │
│  ┌─────────────┐                                             │
│  │     LLM     │  → Generate answer with context             │
│  └─────────────┘                                             │
│        │                                                     │
│        ▼                                                     │
│  ANSWER: "Our refund policy allows full refunds              │
│           within 30 days of purchase..."                     │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```
2.3 Why RAG Works
| Without RAG | With RAG |
|---|---|
| LLM guesses based on training | LLM answers based on YOUR data |
| May hallucinate | Grounded in retrieved documents |
| Static knowledge | Dynamic, up-to-date knowledge |
| Generic answers | Specific, accurate answers |
3. Vector Embeddings Explained
3.1 What is an Embedding?
Embedding = Converting text to numbers (vectors) that capture meaning
1"dog" → [0.2, 0.8, 0.1, 0.5, ...] (1536 dimensions)2"cat" → [0.3, 0.7, 0.2, 0.4, ...] (similar to dog!)3"car" → [0.9, 0.1, 0.8, 0.2, ...] (very different)3.2 Semantic Similarity
Similar meanings → Similar vectors → Close in vector space
```
Vector Space (simplified 2D)

                 ▲
      "puppy" ●  │  ● "kitten"
        "dog" ●  │  ● "cat"
                 │
                 │        ● "automobile"
                 │   ● "vehicle"
                 │        ● "car"
  ───────────────┼──────────────────►
                 │

  Animals cluster together, vehicles cluster together
```
3.3 Embedding Models
| Model | Dimensions | Best For |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General use |
| OpenAI text-embedding-3-large | 3072 | Higher accuracy |
| Cohere embed-v3 | 1024 | Multilingual |
| BGE (open source) | 768-1024 | Free alternative |
| Sentence Transformers | 384-768 | Local deployment |
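If you prefer not to call an external API, the open-source rows in the table can run locally. A minimal sketch, assuming the `sentence-transformers` package and the `all-MiniLM-L6-v2` model (384 dimensions):

```python
# Local embeddings, no API key required (pip install sentence-transformers)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small 384-dimension model

texts = [
    "Our refund policy allows returns within 30 days",
    "How do I get a refund?",
]
embeddings = model.encode(texts)  # numpy array of shape (2, 384)
print(embeddings.shape)
```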
3.4 Creating Embeddings
```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Example
doc_embedding = get_embedding("Our refund policy allows returns within 30 days")
query_embedding = get_embedding("How do I get a refund?")

# These will be similar because they're about the same topic!
```
4. Vector Databases
4.1 Why Vector Database?
Regular databases can't do similarity search:
```sql
-- This doesn't work in SQL!
SELECT * FROM documents
WHERE embedding SIMILAR TO query_embedding
```
Vector databases are designed to:
- Store millions of vectors efficiently
- Run similarity search in milliseconds
- Combine metadata filters with search
4.2 Popular Vector Databases
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud | Production, scale |
| Weaviate | Self-hosted/cloud | Flexibility |
| Qdrant | Self-hosted/cloud | Performance |
| Chroma | Local/embedded | Prototyping |
| Milvus | Self-hosted | Enterprise |
| pgvector | PostgreSQL extension | Existing Postgres users |
4.3 Vector Search Types
Cosine Similarity (most common):
```
similarity = cos(θ) between vectors
Range: -1 to 1  (1 = identical, 0 = unrelated)
```
Euclidean Distance:
```
distance = √(Σ(a-b)²)
Lower = more similar
```
Dot Product:
```
score = Σ(a × b)
Higher = more similar
```
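To make the three metrics concrete, here is a small NumPy sketch that computes each one for two toy vectors (illustrative values, not real embeddings):

```python
import numpy as np

# Two toy vectors standing in for embeddings (not real model output)
a = np.array([0.2, 0.8, 0.1, 0.5])
b = np.array([0.3, 0.7, 0.2, 0.4])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # closer to 1 = more similar
euclidean = np.linalg.norm(a - b)                                # lower = more similar
dot = np.dot(a, b)                                               # higher = more similar

print(f"cosine={cosine:.3f}, euclidean={euclidean:.3f}, dot={dot:.3f}")
```

Note that for vectors normalized to unit length (which many embedding models return), cosine similarity and dot product produce the same ranking.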
5. RAG Pipeline Steps
Step 1: Document Ingestion
```python
# Load documents
documents = [
    "Our refund policy allows full refunds within 30 days...",
    "Shipping takes 3-5 business days for domestic orders...",
    "Premium members get 20% discount on all purchases...",
]

def split_text(text, chunk_size=500, overlap=50):
    # Simple character-based splitter with overlapping windows
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Chunk documents (split long docs into smaller pieces)
chunks = []
for doc in documents:
    # Split into ~500 character chunks with overlap
    chunks.extend(split_text(doc, chunk_size=500, overlap=50))
```
Step 2: Create & Store Embeddings
```python
import chromadb

# Initialize Chroma (named chroma_client so it doesn't shadow the OpenAI client)
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("company_docs")

# Add documents with embeddings
for i, chunk in enumerate(chunks):
    embedding = get_embedding(chunk)
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[embedding],
        documents=[chunk],
        metadatas=[{"source": "policy.pdf"}]
    )
```
Step 3: Query & Retrieve
```python
def retrieve_context(query, n_results=3):
    # Get query embedding
    query_embedding = get_embedding(query)

    # Search vector database
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )

    return results["documents"][0]  # Top N relevant chunks
```
Step 4: Generate Answer
```python
def generate_answer(query, context):
    prompt = f"""Answer the question based ONLY on the following context:

Context:
{context}

Question: {query}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

# Full RAG pipeline
query = "What's your refund policy?"
context = "\n".join(retrieve_context(query))  # join retrieved chunks into one string
answer = generate_answer(query, context)
print(answer)
```
6. RAG vs Fine-tuning
When to Use RAG
✅ Use RAG when:
- Data changes frequently
- Need source attribution
- Limited training data
- Quick implementation needed
- Privacy concerns (data stays local)
When to Use Fine-tuning
✅ Use Fine-tuning when:
- Specific writing style needed
- Domain-specific vocabulary
- Consistent behavior required
- Data is stable
- Performance is critical
Comparison
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Cost | Low (API calls) | High (training) |
| Setup time | Hours | Days/weeks |
| Data freshness | Real-time | Static |
| Source citation | ✅ Easy | ❌ Difficult |
| Hallucination | Lower | Can still occur |
| Customization | Limited style | Full control |
Best Practice: Combine Both
```
Fine-tuned model (for tone/style)
        +
RAG (for current data)
        =
Best of both worlds!
```
7. Real-World RAG Use Cases
7.1 Customer Support Bot
```
User: "I received a damaged product, what should I do?"

RAG retrieves:
- Return policy document
- Damage claim process
- Contact information

Bot: "I'm sorry to hear that. According to our policy,
you can file a damage claim within 48 hours..."
```
7.2 Internal Knowledge Base
```
Employee: "What's the process for requesting PTO?"

RAG retrieves:
- HR policy document
- PTO request form
- Manager approval workflow

Answer: "To request PTO, submit a request in the HR portal
at least 2 weeks in advance..."
```
7.3 Legal Document Analysis
```
Lawyer: "What does clause 5.2 say about liability?"

RAG retrieves:
- Contract section 5.2
- Related amendments
- Previous interpretations

Answer: "Clause 5.2 states that liability is limited to..."
```
7.4 Code Documentation Assistant
```
Developer: "How do I authenticate API requests?"

RAG retrieves:
- API documentation
- Code examples
- Authentication guide

Answer: "Use Bearer token authentication. Here's an example..."
```
8. RAG Challenges & Solutions
Challenge 1: Chunking Strategy
Problem: How to split documents?
Solutions:
```python
# Three common strategies (illustrative helper names, not a specific library):

# Fixed size chunks
chunks = split_by_tokens(doc, size=500)

# Semantic chunks (by paragraph/section)
chunks = split_by_headers(doc)

# Sliding window with overlap
chunks = sliding_window(doc, size=500, overlap=100)
```
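As one concrete example, here is a minimal header-based splitter for markdown-style documents. It is only a sketch of the idea behind the illustrative `split_by_headers` name above, not a library function:

```python
def split_by_headers(text):
    """Split markdown-style text into chunks at heading lines (# ...)."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:  # a new section starts here
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```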
Challenge 2: Retrieval Quality
Problem: Retrieved docs aren't relevant
Solutions:
- Hybrid search (keyword + semantic)
- Re-ranking retrieved results
- Query expansion/rewriting
- Metadata filtering (see the sketch below)
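As an example of metadata filtering, Chroma lets you combine semantic search with a `where` filter. A sketch, assuming the `collection` from Step 2 (where each chunk was stored with a "source" metadata field) and the `get_embedding` helper from section 3.4:

```python
# Metadata filtering: only search chunks whose metadata matches the filter.
results = collection.query(
    query_embeddings=[get_embedding("How do I get a refund?")],
    n_results=3,
    where={"source": "policy.pdf"}  # restrict the search to one source document
)
print(results["documents"][0])
```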
Challenge 3: Context Window Limits
Problem: Too much context for LLM
Solutions:
- Summarize retrieved chunks
- Use compression techniques
- Selective context inclusion (sketched below)
- Hierarchical retrieval
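A simple form of selective context inclusion is a budget: keep adding the best-ranked chunks until a rough character (or token) limit is reached. A minimal sketch, assuming the chunks are already sorted from most to least relevant:

```python
def select_context(ranked_chunks, max_chars=4000):
    """Keep the best-ranked chunks until a rough character budget is used up."""
    selected, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted: most relevant first
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    return "\n\n".join(selected)
```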
Challenge 4: Hallucination
Problem: LLM still makes things up
Solutions:
- Strong prompting ("Only use provided context"), as in the sketch below
- Fact verification step
- Source citation requirement
- Confidence scoring
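The first three ideas can be combined in a stricter prompt that forces the model to cite which chunks it used and to say "I don't know" when the context doesn't contain the answer. A sketch, reusing `client` and `retrieve_context` from section 5:

```python
GROUNDED_PROMPT = """Answer the question using ONLY the numbered context below.
Cite the context numbers you used, e.g. [1], [2].
If the context does not contain the answer, reply exactly: "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def grounded_answer(question):
    chunks = retrieve_context(question)  # from Step 3
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": GROUNDED_PROMPT.format(context=context, question=question)}]
    )
    return response.choices[0].message.content
```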
9. Hands-on: Simple RAG System
Prerequisites
```bash
pip install openai chromadb python-dotenv
```
Complete Example
```python
import os
from openai import OpenAI
import chromadb
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

# Sample knowledge base
knowledge_base = [
    "MinAI Learning Platform offers courses in AI, Data Science, and Automation.",
    "Course pricing starts at 500,000 VND for basic courses.",
    "Premium subscription costs 2,000,000 VND per year with unlimited access.",
    "Refunds are available within 7 days of purchase.",
    "Contact support at support@minai.vn for assistance.",
]

# Initialize vector database
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("minai_kb")

# Index documents
print("Indexing documents...")
for i, doc in enumerate(knowledge_base):
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=doc
    ).data[0].embedding

    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[embedding],
        documents=[doc]
    )

def ask_rag(question):
    # Get query embedding
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding

    # Retrieve relevant docs
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=2
    )
    context = "\n".join(results["documents"][0])

    # Generate answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based only on the provided context. If you don't know, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )

    return response.choices[0].message.content

# Test
questions = [
    "How much does a premium subscription cost?",
    "What's the refund policy?",
    "How do I contact support?",
]

for q in questions:
    print(f"\nQ: {q}")
    print(f"A: {ask_rag(q)}")
```
10. Homework
Exercise 1: Concept Review
Answer the following questions:
- What LLM problems does RAG solve?
- What is an embedding and why is it needed?
- When should you use RAG vs fine-tuning?
Exercise 2: Build a Simple RAG System
- Prepare 10 FAQs on a topic you're interested in
- Index them into ChromaDB
- Build a Q&A system
- Test it with 5 different questions
Exercise 3: Explore Vector Databases
- Research Pinecone or Weaviate
- Compare their features with ChromaDB
- Try deploying a vector database
Summary
In this lesson you learned:
- ✅ LLM limitations and why RAG is needed
- ✅ RAG architecture: Retrieve → Augment → Generate
- ✅ Vector embeddings and semantic similarity
- ✅ Vector databases and similarity search
- ✅ The RAG pipeline from ingestion to generation
- ✅ RAG vs Fine-tuning trade-offs
- ✅ Real-world use cases
- ✅ Common challenges and solutions
Next: Lesson 2 - Vector Databases Deep Dive - Pinecone, Weaviate, Chroma comparison
