
Chunking Strategies

Techniques for splitting documents into optimal chunks for RAG retrieval

🎯 Lesson Objectives

Chunking is the most important step in a RAG pipeline. Good chunks = good retrieval = good answers. Poor-quality chunks will undermine the entire system.

After this lesson, you will be able to:

✅ Understand why chunking matters
✅ Apply fixed-size, recursive, and semantic chunking
✅ Optimize overlap strategy and chunk size
✅ Add metadata enrichment to chunks

🔍 Why Chunking Matters

Chunking Strategy Pipeline

📄 Document (10,000 words)
✂️ Chunking Strategy: Too Large → Noisy | Too Small → Fragmented | Just Right → Relevant
📑 Chunks (500 words each, with overlap)
🔍 Embeddings → Vector DB → Retrieval
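
A minimal sketch of this flow end to end, assuming the langchain, langchain-openai, and chromadb packages are installed and an OpenAI API key is configured; the file path stands in for your own document:

python.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

document_text = open("document.txt").read()  # placeholder: your 10,000-word document

# 1. Chunking: split into overlapping ~500-char chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_text(document_text)

# 2. Embeddings → Vector DB
vectorstore = Chroma.from_texts(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))

# 3. Retrieval
docs = vectorstore.similarity_search("your question here", k=3)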

Impact on RAG Quality

Chunk Size | Retrieval | Generation | Problem
Too large (2000+) | Low precision | Noisy context | Answer buried in irrelevant text
Too small (<100) | High precision | Missing context | Incomplete information
Optimal (300-800) | Balanced | Good context | Best trade-off
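
To see the trade-off concretely, you can split the same document at the extremes and compare chunk counts and average sizes (a minimal sketch; the file path is a placeholder):

python.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = open("document.txt").read()  # placeholder path

for size in (100, 500, 2000):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=0)
    chunks = splitter.split_text(sample_text)
    avg = sum(len(c) for c in chunks) / len(chunks)
    print(f"chunk_size={size}: {len(chunks)} chunks, avg {avg:.0f} chars")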

Checkpoint

Do you understand why chunk size directly affects RAG quality?

📐 Chunking Methods

Fixed-Size Chunking (Simple)

python.py
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,      # Max characters per chunk
    chunk_overlap=100,   # Overlap between chunks
    length_function=len
)

chunks = splitter.split_text(long_document)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(chunk[:100], "...")
    print()

Recursive Character Splitting (Recommended)

python.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Best general-purpose splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],
    # Tries each separator in order:
    # 1. Double newline (paragraph breaks)
    # 2. Single newline
    # 3. Period + space (sentences)
    # 4. Space (words)
    # 5. Empty string (characters) — last resort
)

chunks = splitter.split_text(document_text)

Markdown / Code Splitting

python.py
from langchain.text_splitter import MarkdownTextSplitter

md_splitter = MarkdownTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
# Splits on: #, ##, ###, ```, ---, etc.

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Code-aware splitting
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50
)
# Splits on: class, def, if, for, etc.
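
A quick check that the language-aware splitter prefers structural boundaries; this sketch uses a deliberately small chunk_size so the short snippet is forced to split at a def/class boundary:

python.py
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

demo_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=80,
    chunk_overlap=0
)

sample_code = '''\
class Greeter:
    def greet(self, name):
        return f"Hello, {name}"

def main():
    print(Greeter().greet("world"))
'''

for i, chunk in enumerate(demo_splitter.split_text(sample_code)):
    print(f"--- chunk {i} ---")
    print(chunk)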

Semantic Chunking

python.py
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Group sentences by semantic similarity
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # Higher = fewer, larger chunks
)

chunks = semantic_splitter.split_text(document_text)

# Each chunk contains semantically related sentences
for i, chunk in enumerate(chunks[:3]):
    print(f"Semantic Chunk {i}: {len(chunk)} chars")
    print(chunk[:150], "...")
    print()

Checkpoint

Do you understand the differences between fixed-size, recursive, Markdown, and semantic chunking?

⚡ Chunk Size Optimization

Finding Optimal Size

python.py
def evaluate_chunk_sizes(document, query, sizes=(200, 500, 800, 1200)):
    """Test different chunk sizes to find the optimal one."""
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import Chroma
    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    results = {}

    for size in sizes:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=size,
            chunk_overlap=int(size * 0.2)  # 20% overlap
        )
        chunks = splitter.split_text(document)

        # Create temp vector store
        vectorstore = Chroma.from_texts(chunks, embeddings)

        # Search
        docs = vectorstore.similarity_search_with_score(query, k=3)

        results[size] = {
            "n_chunks": len(chunks),
            "avg_chunk_len": sum(len(c) for c in chunks) / len(chunks),
            "top_score": docs[0][1] if docs else None,
            "top_content": docs[0][0].page_content[:100] if docs else None
        }

        print(f"Size {size}: {len(chunks)} chunks, avg {results[size]['avg_chunk_len']:.0f} chars")

    return results
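
Hypothetical usage, assuming document_text holds your corpus; note that Chroma's similarity_search_with_score returns a distance, so lower is better:

python.py
results = evaluate_chunk_sizes(document_text, "What is the refund policy?")

# Pick the size whose top hit is closest to the query (lowest distance)
best_size = min(
    (s for s in results if results[s]["top_score"] is not None),
    key=lambda s: results[s]["top_score"]
)
print(f"Best chunk size for this query: {best_size}")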

Overlap Strategy

Overlap Strategy Comparison

Without Overlap | With Overlap (20%)
[Chunk 1][Chunk 2] — no shared content | [Chunk 1 ~~~overlap~~~ Chunk 2] — shared boundary
❌ Info at boundary LOST | ✅ Context preserved

Recommended overlap: 10-20% of chunk_size

  • chunk_size=500 → overlap=50-100
  • chunk_size=1000 → overlap=100-200
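
A small sketch to verify the shared boundary: with overlap enabled, the tail of one chunk reappears at the head of the next (sample_text is synthetic):

python.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = " ".join(f"sentence{i}." for i in range(200))

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_text(sample_text)

# The end of chunk 0 also appears at the start of chunk 1
print(chunks[0][-50:])
print(chunks[1][:50])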

Guidelines by Document Type

Document Type | Chunk Size | Overlap | Splitter
Technical docs | 500-800 | 100 | Recursive
Legal/policy | 800-1200 | 200 | Recursive
Chat logs | 200-400 | 50 | Character
Code | 500-1000 | 50 | Language-aware
Q&A pairs | Per pair | 0 | Custom
Markdown docs | 500-800 | 100 | Markdown splitter
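
One way to encode the table above, as a hypothetical helper; the type names and exact defaults here are illustrative choices, not a library API:

python.py
from langchain.text_splitter import (
    CharacterTextSplitter,
    Language,
    MarkdownTextSplitter,
    RecursiveCharacterTextSplitter,
)

def splitter_for(doc_type: str):
    """Illustrative mapping from document type to a configured splitter."""
    if doc_type == "technical":
        return RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    if doc_type == "legal":
        return RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)
    if doc_type == "chat":
        return CharacterTextSplitter(separator="\n", chunk_size=400, chunk_overlap=50)
    if doc_type == "code":
        return RecursiveCharacterTextSplitter.from_language(
            language=Language.PYTHON, chunk_size=1000, chunk_overlap=50
        )
    if doc_type == "markdown":
        return MarkdownTextSplitter(chunk_size=800, chunk_overlap=100)
    raise ValueError(f"Unknown document type: {doc_type}")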

Checkpoint

Do you know how to choose the right chunk size and overlap for each document type?

📝 Metadata Enrichment

Adding Context to Chunks

python.py
from langchain_core.documents import Document

def enrich_chunks(documents, source_info):
    """Add metadata to chunks for better filtering."""
    enriched = []

    for i, doc in enumerate(documents):
        # Add positional metadata
        doc.metadata.update({
            "chunk_index": i,
            "total_chunks": len(documents),
            "char_count": len(doc.page_content),
            "word_count": len(doc.page_content.split()),

            # Source info
            "source": source_info.get("filename", "unknown"),
            "category": source_info.get("category", "general"),
            "department": source_info.get("department", ""),

            # Content hints
            "has_code": "```" in doc.page_content,
            "has_numbers": any(c.isdigit() for c in doc.page_content),
        })
        enriched.append(doc)

    return enriched

# Usage
chunks = splitter.split_documents(raw_docs)
chunks = enrich_chunks(chunks, {"filename": "hr_policy.pdf", "category": "policy", "department": "HR"})
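
The payoff comes at retrieval time: most vector stores can filter on this metadata. A minimal sketch with Chroma, using the enriched chunks from above (the query and filter values are illustrative):

python.py
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))

# Restrict the search to HR policy chunks via metadata
docs = vectorstore.similarity_search(
    "How many vacation days do employees get?",
    k=3,
    filter={"department": "HR"}
)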

Parent Document Strategy

python.py
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Small chunks for search, return the (larger) parent chunk
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

vectorstore = Chroma(embedding_function=embeddings)
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Index documents
retriever.add_documents(raw_docs)

# Search: matches a small chunk, returns the parent context
relevant_docs = retriever.invoke("query about policy")
# Returns larger 1000-char parents, not small 200-char children

Checkpoint

Do you understand metadata enrichment and the Parent Document Retriever strategy?

💻 Complete Chunking Pipeline

python.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

class ChunkingPipeline:
    def __init__(self, chunk_size=500, chunk_overlap=100):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

    def process(self, documents):
        """Full pipeline: split, clean, enrich."""
        # 1. Split
        chunks = self.splitter.split_documents(documents)
        print(f"Split into {len(chunks)} chunks")

        # 2. Clean
        cleaned = []
        for chunk in chunks:
            text = chunk.page_content.strip()
            if len(text) < 30:  # Skip tiny chunks
                continue
            chunk.page_content = " ".join(text.split())  # Normalize whitespace
            cleaned.append(chunk)

        # 3. Enrich metadata
        for i, chunk in enumerate(cleaned):
            chunk.metadata["chunk_id"] = i
            chunk.metadata["word_count"] = len(chunk.page_content.split())

        print(f"After cleaning: {len(cleaned)} chunks")
        return cleaned

# Usage
pipeline = ChunkingPipeline(chunk_size=500, chunk_overlap=100)
final_chunks = pipeline.process(raw_documents)

Checkpoint

Have you built a complete chunking pipeline with split, clean, and enrich steps?

🎯 Summary

📝 Quiz

  1. What is the advantage of the recursive text splitter?

    • Fastest
    • Tries multiple separators in priority order, preserving the document's structure
    • Produces perfectly uniform chunks
    • Needs no overlap
  2. What is the Parent Document Retriever strategy?

    • Search over small chunks, return the larger parent chunks for more context
    • Search only parent documents
    • Create additional new documents
    • Delete small chunks after searching
  3. What does chunk overlap help with?

    • Faster search
    • Less storage
    • Preserves context at chunk boundaries, avoiding information loss
    • Has no effect

Key Takeaways

  1. Chunk size 300-800 — Sweet spot for most use cases
  2. RecursiveCharacterTextSplitter — Default choice
  3. 20% overlap — Preserve context at boundaries
  4. Metadata enrichment — Better filtering in retrieval
  5. Parent document strategy — Search small, return big

Self-check Questions

  1. What chunk size should you use, and which factors influence the optimal choice?
  2. In what order does RecursiveCharacterTextSplitter try its separators, and why does that preserve document structure?
  3. How does the Parent Document Retriever strategy work (search small chunks, return large parents)?
  4. How does metadata enrichment improve retrieval, and which metadata fields are worth adding?

🎉 Excellent! You have completed the Chunking Strategies lesson!

Up next: let's explore Query Enhancement in the next lesson!


🚀 Next Lesson

Query Enhancement — Techniques for improving queries: HyDE, multi-query, step-back prompting!