🎯 Lesson Objectives
Chunking is the most important step in a RAG pipeline. Good chunks mean good retrieval, which means good answers. Poor-quality chunks will degrade the entire system.
After this lesson, you will be able to:
- ✅ Understand why chunking matters
- ✅ Apply fixed-size, recursive, and semantic chunking
- ✅ Optimize overlap strategy and chunk size
- ✅ Add metadata enrichment to chunks
🔍 Why Chunking Matters
Chunking Strategy Pipeline
Impact on RAG Quality
| Chunk Size | Retrieval | Generation | Problem |
|---|---|---|---|
| Too large (2000+) | Low precision | Noisy context | Answer buried in irrelevant text |
| Too small (<100) | High precision | Missing context | Incomplete information |
| Optimal (300-800) | Balanced | Good context | Best trade-off |
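To build intuition for the table above, here is a toy fixed-size chunker in pure Python (a simplified stand-in for the library splitters; the `naive_chunks` helper is hypothetical, not a langchain API). Larger chunks mean fewer of them, so each retrieved chunk carries more surrounding text, relevant or not:

```python
def naive_chunks(text, size, overlap=0):
    """Split text into fixed-size character chunks with optional overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "lorem ipsum " * 200  # ~2400 characters of dummy text
for size in (100, 500, 2000):
    chunks = naive_chunks(doc, size)
    print(f"size={size}: {len(chunks)} chunks")
```

With 2000-character chunks the whole document fits in one or two pieces, so a query matching one sentence drags in everything around it; with 100-character chunks a match is precise but may cut the answer mid-sentence.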
Checkpoint
Do you understand why chunk size directly affects RAG quality?
📐 Chunking Methods
Fixed-Size Chunking (Simple)
```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,      # Max characters per chunk
    chunk_overlap=100,   # Overlap between chunks
    length_function=len
)

chunks = splitter.split_text(long_document)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(chunk[:100], "...")
    print()
```
Recursive Character Splitting (Recommended)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Best general-purpose splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],
    # Tries each separator in order:
    # 1. Double newline (paragraph breaks)
    # 2. Single newline
    # 3. Period + space (sentences)
    # 4. Space (words)
    # 5. Empty string (characters) — last resort
)

chunks = splitter.split_text(document_text)
```
Markdown / Code Splitting
```python
from langchain.text_splitter import MarkdownTextSplitter

md_splitter = MarkdownTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
# Splits on: #, ##, ###, code fences, ---, etc.

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Code-aware splitting
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50
)
# Splits on: class, def, if, for, etc.
```
Semantic Chunking
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Group sentences by semantic similarity
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # Higher = fewer, larger chunks
)

chunks = semantic_splitter.split_text(document_text)

# Each chunk contains semantically related sentences
for i, chunk in enumerate(chunks[:3]):
    print(f"Semantic Chunk {i}: {len(chunk)} chars")
    print(chunk[:150], "...")
    print()
```
Checkpoint
Do you understand the differences between fixed-size, recursive, Markdown, and semantic chunking?
⚡ Chunk Size Optimization
Finding Optimal Size
```python
def evaluate_chunk_sizes(document, query, sizes=[200, 500, 800, 1200]):
    """Test different chunk sizes to find the optimal one."""
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import Chroma
    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    results = {}

    for size in sizes:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=size,
            chunk_overlap=int(size * 0.2)  # 20% overlap
        )
        chunks = splitter.split_text(document)

        # Create temp vector store
        vectorstore = Chroma.from_texts(chunks, embeddings)

        # Search
        docs = vectorstore.similarity_search_with_score(query, k=3)

        results[size] = {
            "n_chunks": len(chunks),
            "avg_chunk_len": sum(len(c) for c in chunks) / len(chunks),
            "top_score": docs[0][1] if docs else None,
            "top_content": docs[0][0].page_content[:100] if docs else None
        }

        print(f"Size {size}: {len(chunks)} chunks, avg {results[size]['avg_chunk_len']:.0f} chars")

    return results
```
Overlap Strategy
| Without Overlap | With Overlap (20%) |
|---|---|
| `[Chunk 1][Chunk 2]` — no shared content | `[Chunk 1 ~~~overlap~~~ Chunk 2]` — shared boundary |
| ❌ Info at the boundary is LOST | ✅ Context preserved |
Recommended overlap: 10-20% of `chunk_size`

- `chunk_size=500` → `overlap=50-100`
- `chunk_size=1000` → `overlap=100-200`
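To see concretely what overlap buys, here is a minimal pure-Python sketch (not the langchain splitter; the `chunk` helper is hypothetical) where a phrase straddling a chunk boundary survives only when chunks overlap:

```python
def chunk(text, size, overlap):
    """Fixed-size character chunks with a given overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# The phrase "refund policy" spans the boundary at index 100
text = ("A" * 90) + "refund policy" + ("B" * 97)

no_ov = chunk(text, size=100, overlap=0)
with_ov = chunk(text, size=100, overlap=20)

print(any("refund policy" in c for c in no_ov))    # False: phrase cut at the boundary
print(any("refund policy" in c for c in with_ov))  # True: an overlapping chunk keeps it intact
```

Without overlap, a query for "refund policy" matches neither chunk well; with 20% overlap, one chunk contains the full phrase.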
Guidelines by Document Type
| Document Type | Chunk Size | Overlap | Splitter |
|---|---|---|---|
| Technical docs | 500-800 | 100 | Recursive |
| Legal/policy | 800-1200 | 200 | Recursive |
| Chat logs | 200-400 | 50 | Character |
| Code | 500-1000 | 50 | Language-aware |
| Q&A pairs | Per pair | 0 | Custom |
| Markdown docs | 500-800 | 100 | Markdown splitter |
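The guidelines table can be captured as a small lookup so indexing code picks settings by document type. This is a sketch with hypothetical names (`CHUNK_CONFIG`, `splitter_kwargs`); the numbers are midpoints of the ranges above:

```python
# Midpoints of the recommended ranges per document type
CHUNK_CONFIG = {
    "technical": {"chunk_size": 650,  "chunk_overlap": 100},
    "legal":     {"chunk_size": 1000, "chunk_overlap": 200},
    "chat":      {"chunk_size": 300,  "chunk_overlap": 50},
    "code":      {"chunk_size": 750,  "chunk_overlap": 50},
    "markdown":  {"chunk_size": 650,  "chunk_overlap": 100},
}

def splitter_kwargs(doc_type):
    """Return splitter settings for a document type, falling back to the general default."""
    return CHUNK_CONFIG.get(doc_type, {"chunk_size": 500, "chunk_overlap": 100})

print(splitter_kwargs("legal"))
```

These kwargs can then be passed straight to a splitter, e.g. `RecursiveCharacterTextSplitter(**splitter_kwargs("legal"))`.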
Checkpoint
Do you know how to choose an appropriate chunk size and overlap for each document type?
📝 Metadata Enrichment
Adding Context to Chunks
```python
from langchain_core.documents import Document

def enrich_chunks(documents, source_info):
    """Add metadata to chunks for better filtering."""
    enriched = []

    for i, doc in enumerate(documents):
        # Add positional metadata
        doc.metadata.update({
            "chunk_index": i,
            "total_chunks": len(documents),
            "char_count": len(doc.page_content),
            "word_count": len(doc.page_content.split()),

            # Source info
            "source": source_info.get("filename", "unknown"),
            "category": source_info.get("category", "general"),
            "department": source_info.get("department", ""),

            # Content hints
            "has_code": "```" in doc.page_content,
            "has_numbers": any(c.isdigit() for c in doc.page_content),
        })
        enriched.append(doc)

    return enriched

# Usage
chunks = splitter.split_documents(raw_docs)
chunks = enrich_chunks(chunks, {"filename": "hr_policy.pdf", "category": "policy", "department": "HR"})
```
Parent Document Strategy
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma

# Small chunks for search, return parent (larger) chunk
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

vectorstore = Chroma(embedding_function=embeddings)
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Index documents
retriever.add_documents(raw_docs)

# Search: matches small chunk, returns parent context
relevant_docs = retriever.invoke("query about policy")
# Returns larger 1000-char parents, not small 200-char children
```
Checkpoint
Do you understand metadata enrichment and the Parent Document Retriever strategy?
💻 Complete Chunking Pipeline
```python
class ChunkingPipeline:
    def __init__(self, chunk_size=500, chunk_overlap=100):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

    def process(self, documents):
        """Full pipeline: split, clean, enrich."""
        # 1. Split
        chunks = self.splitter.split_documents(documents)
        print(f"Split into {len(chunks)} chunks")

        # 2. Clean
        cleaned = []
        for chunk in chunks:
            text = chunk.page_content.strip()
            if len(text) < 30:  # Skip tiny chunks
                continue
            chunk.page_content = " ".join(text.split())  # Normalize whitespace
            cleaned.append(chunk)

        # 3. Enrich metadata
        for i, chunk in enumerate(cleaned):
            chunk.metadata["chunk_id"] = i
            chunk.metadata["word_count"] = len(chunk.page_content.split())

        print(f"After cleaning: {len(cleaned)} chunks")
        return cleaned

# Usage
pipeline = ChunkingPipeline(chunk_size=500, chunk_overlap=100)
final_chunks = pipeline.process(raw_documents)
```
Checkpoint
Have you built a complete chunking pipeline with split, clean, and enrich steps?
🎯 Summary
📝 Quiz
1. What is the advantage of the recursive text splitter?
   - It is the fastest
   - It tries multiple separators in priority order, preserving the document's structure
   - It produces perfectly even chunks
   - It needs no overlap
2. What is the Parent Document Retriever strategy?
   - Search over small chunks, return larger parent chunks for more context
   - Search only parent documents
   - Create additional new documents
   - Delete small chunks after searching
3. What does chunk overlap help with?
   - Faster search
   - Less storage
   - Preserving context at chunk boundaries, avoiding information loss
   - It has no effect
Key Takeaways
- Chunk size 300-800 — sweet spot for most use cases
- RecursiveCharacterTextSplitter — the default choice
- 20% overlap — preserves context at boundaries
- Metadata enrichment — enables better filtering during retrieval
- Parent document strategy — search small, return big
Self-Check Questions
- What chunk size should you set, and which factors influence the optimal choice?
- In what order does RecursiveCharacterTextSplitter try its separators, and why does this preserve document structure?
- How does the Parent Document Retriever strategy work (search small chunks, return large parents)?
- How does metadata enrichment improve retrieval, and which metadata fields should you add?
🎉 Great work! You have completed the Chunking Strategies lesson!
Up next: let's explore Query Enhancement in the next lesson!
🚀 Next Lesson
Query Enhancement — techniques for improving queries: HyDE, multi-query, step-back prompting!
