✂️ Text Splitting Strategies
🎯 Learning Objectives
After this lesson, you will:
✅ Understand why your text splitting strategy affects RAG quality
✅ Know the 4 splitting methods: Character, Recursive, Markdown, Semantic
✅ Know how to choose an overlap strategy and a suitable chunk size
✅ Add metadata to chunks to improve retrieval
✅ Test and evaluate chunking quality
How you split documents directly affects RAG quality. This lesson covers strategies from basic to advanced.
💡 Why Does Text Splitting Matter?
Checkpoint
What problems do chunks that are too large or too small cause? What does a 'just right' chunk size look like?
🛠️ Splitting Methods
1. Character Splitter (Basic)
```javascript
// Simple fixed-size splitting
// n8n Text Splitter: Character
// Chunk Size: 1000
// Overlap: 200

// Problem: it can cut in the middle of a sentence
// "The company was founded in" | "2020 by John Smith."
```
2. Recursive Character Splitter (Recommended)
```javascript
// Splits by hierarchy: paragraph > sentence > word
// n8n Text Splitter: Recursive Character
// Chunk Size: 1000
// Chunk Overlap: 200
// Separators: ["\n\n", "\n", ". ", " "]

// Flow:
// 1. Try split by "\n\n" (paragraphs)
// 2. If still too large, split by "\n" (lines)
// 3. If still too large, split by ". " (sentences)
// 4. Last resort: split by " " (words)
```
3. Markdown Splitter
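```javascript
// Respect markdown structure
// Splits by: headers, code blocks, paragraphs

// Code node: Custom markdown splitter
// Note: this simplified version only splits at H1-H3 headings. A section longer
// than maxChars stays as one oversized chunk, and a line starting with "# "
// inside a fenced code block is treated as a heading.
function splitMarkdown(text, maxChars = 1000) {
  // The lookahead keeps each heading attached to the section it introduces.
  const sections = text.split(/(?=^#{1,3} )/m);
  const chunks = [];
  let current = '';

  for (const section of sections) {
    if ((current + section).length > maxChars && current) {
      // Adding this section would exceed the limit: close the current chunk.
      chunks.push(current.trim());
      current = section;
    } else {
      current += section;
    }
  }
  if (current) chunks.push(current.trim());

  return chunks;
}

// Assumes the Code node runs in "Run Once for All Items" mode.
const chunks = splitMarkdown($input.first().json.content);
return chunks.map((c, i) => ({ json: { text: c, index: i } }));
```
4. Semantic Splitter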
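A minimal sketch of the similarity-based idea behind semantic splitting: split into sentences, embed each one, compare adjacent embeddings, and start a new chunk where similarity drops. The `embed` function here is a toy word-count vector, an assumption made so the example runs without an embedding model. `semanticSplit`, `cosineSimilarity`, and the `0.15` threshold are likewise illustrative names and values, not part of n8n.

```javascript
// Toy "embedding": a word-count map. A real pipeline would call an embedding model instead.
function embed(sentence) {
  const counts = new Map();
  for (const word of sentence.toLowerCase().match(/[a-z0-9]+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse word-count vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [word, count] of a) dot += count * (b.get(word) || 0);
  const norm = (v) => Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

function semanticSplit(text, threshold = 0.15) {
  const sentences = (text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [])
    .map((s) => s.trim())
    .filter(Boolean);
  if (sentences.length === 0) return [];

  const groups = [[sentences[0]]];
  for (let i = 1; i < sentences.length; i++) {
    const similarity = cosineSimilarity(embed(sentences[i - 1]), embed(sentences[i]));
    if (similarity < threshold) groups.push([]); // similarity dropped: start a new chunk
    groups[groups.length - 1].push(sentences[i]);
  }
  return groups.map((group) => group.join(" "));
}

const semanticChunks = semanticSplit(
  "Refunds are issued within 30 days. Refunds are sent to the original card. " +
    "Passwords must have 12 characters. Passwords expire every 90 days."
);
console.log(semanticChunks); // two chunks: one about refunds, one about passwords
```

With a real embedding model, the threshold would need tuning against your own documents, since similarity scores vary widely between models.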
```javascript
// Split by semantic similarity
// Group related sentences together

// Code node: Semantic chunking concept
// 1. Split into sentences
// 2. Embed each sentence
// 3. Compare adjacent sentence embeddings
// 4. Split where similarity drops significantly

function splitIntoSentences(text) {
  // The second alternative keeps a trailing sentence that lacks end punctuation.
  return text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [text];
}

// In practice, use AI to determine chunk boundaries:
const semanticPrompt = `
Split this text into logical chunks. Each chunk should be about 
one topic or idea. Mark split points with "---SPLIT---".

Text: ${$json.text}
`;
```
Checkpoint
Compare the 4 splitting methods. Why is the Recursive Character Splitter recommended?
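To make the comparison concrete, here is a rough sketch of the recursive strategy from method 2: try the coarsest separator first and fall back to finer ones only for pieces that are still too large. This is an illustrative implementation, not the code n8n runs. `recursiveSplit` and `splitKeepingSeparator` are hypothetical helper names, and overlap is left out to keep the separator hierarchy easy to follow.

```javascript
// Split on a separator but keep it attached to the preceding part,
// so that joining the parts reproduces the original text.
function splitKeepingSeparator(text, separator) {
  const parts = text.split(separator);
  return parts.map((part, i) => (i < parts.length - 1 ? part + separator : part));
}

function recursiveSplit(text, chunkSize = 1000, separators = ["\n\n", "\n", ". ", " "]) {
  if (text.length <= chunkSize) return text.trim() ? [text.trim()] : [];

  // Out of separators: fall back to a hard cut every chunkSize characters.
  if (separators.length === 0) {
    const pieces = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize));
    }
    return pieces;
  }

  const [separator, ...finerSeparators] = separators;
  const chunks = [];
  let current = "";

  for (const part of splitKeepingSeparator(text, separator)) {
    if ((current + part).length <= chunkSize) {
      current += part; // still fits: keep merging parts
      continue;
    }
    if (current.trim()) chunks.push(current.trim());
    if (part.length > chunkSize) {
      // A single part is too large: retry it with the next, finer separator.
      chunks.push(...recursiveSplit(part, chunkSize, finerSeparators));
      current = "";
    } else {
      current = part;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

const sampleText =
  "The company was founded in 2020 by John Smith. It sells tea.\n\n" +
  "Refunds are accepted within 30 days.";
const recursiveChunks = recursiveSplit(sampleText, 50);
console.log(recursiveChunks);
```

With a 50-character limit, the first paragraph is too long to keep whole, so it is split at the sentence boundary instead of mid-sentence, which is the main reason this method is usually preferred over fixed-size cuts.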
🔗 Overlap Strategy
| Chunk Size | Recommended Overlap | Percentage |
|---|---|---|
| 500 chars | 50-100 | 10-20% |
| 1000 chars | 100-200 | 10-20% |
| 2000 chars | 200-400 | 10-20% |
Overlap helps preserve context at chunk boundaries. Too much overlap produces duplicated results; too little loses context.
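A small sketch of how overlap works mechanically, assuming plain fixed-size character splitting: each chunk starts `chunkSize - overlap` characters after the previous one, so the tail of one chunk is repeated at the head of the next. `splitWithOverlap` is a hypothetical helper written for this illustration, and the tiny sizes are chosen only to make the effect visible.

```javascript
// Fixed-size splitting with overlap.
function splitWithOverlap(text, chunkSize, overlap) {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const step = chunkSize - overlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk already reached the end
  }
  return chunks;
}

const sentence = "The company was founded in 2020 by John Smith.";

// No overlap: the year is separated from "founded in".
const withoutOverlap = splitWithOverlap(sentence, 27, 0);

// 8 characters of overlap: the second chunk repeats the end of the first.
const withOverlap = splitWithOverlap(sentence, 27, 8);

console.log(withoutOverlap);
console.log(withOverlap);
```

The trade-off in the table follows from the step size: a larger overlap means a smaller step, so the same text is stored and retrieved in more chunks.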
Checkpoint
What should the overlap percentage be? What problems do too much and too little overlap cause?
📏 Chunk Size Selection
| Document Type | Recommended Size (chars) | Why |
|---|---|---|
| FAQ | 200-500 | Short, focused answers |
| Technical docs | 500-1000 | Need context but precise |
| Legal contracts | 1000-2000 | Long clauses need full context |
| Chat logs | 300-500 | Short messages |
| Books/articles | 500-1500 | Balanced |
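One way to turn the table above into a starting configuration is a small lookup. The ranges are copied from the table; taking the midpoint of each range and a 15% overlap are assumptions made for this sketch, and `suggestSplitterConfig` and its keys are hypothetical names.

```javascript
// Ranges (in characters) copied from the chunk size table.
const CHUNK_SIZE_RANGES = {
  faq: [200, 500],
  technical: [500, 1000],
  legal: [1000, 2000],
  chat: [300, 500],
  book: [500, 1500],
};

// Suggest a starting chunk size (range midpoint) and overlap for a document type.
function suggestSplitterConfig(docType, overlapRatio = 0.15) {
  const range = CHUNK_SIZE_RANGES[docType];
  if (!range) throw new Error(`Unknown document type: ${docType}`);
  const [min, max] = range;
  const chunkSize = Math.round((min + max) / 2);
  return { chunkSize, chunkOverlap: Math.round(chunkSize * overlapRatio) };
}

const legalConfig = suggestSplitterConfig("legal");
console.log(legalConfig); // { chunkSize: 1500, chunkOverlap: 225 }
```

Treat the result as a first guess to refine by measuring retrieval quality, not as a fixed rule.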
Checkpoint
What chunk sizes are recommended for FAQs, technical docs, and legal contracts? Why do they differ?
🏷️ Metadata for Chunks
```javascript
// Code node: Enrich chunks with metadata
// Assumes "Run Once for All Items" mode, with the file name on the first item.
const chunks = $input.all();
const fileName = $input.first().json.source;

return chunks.map((chunk, i) => ({
  json: {
    content: chunk.json.text,
    metadata: {
      source: fileName,
      chunkIndex: i,
      totalChunks: chunks.length,
      position: i === 0 ? "beginning" : i === chunks.length - 1 ? "end" : "middle",
      charCount: chunk.json.text.length,
      heading: extractHeading(chunk.json.text)
    }
  }
}));

// Returns the text of the first markdown heading in the chunk, if any.
function extractHeading(text) {
  const match = text.match(/^#+\s+(.+)/m);
  return match ? match[1] : "untitled";
}
```
Checkpoint
Which fields should chunk metadata include? Why are position and heading important?
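As a hint for the second question, here is a sketch of two ways metadata can be used at retrieval time: pulling in the chunks next to a hit via `chunkIndex`, and narrowing candidates by `heading`. The `storedChunks` array is a hypothetical in-memory stand-in for a vector store, and `expandWithNeighbors` and `filterByHeading` are illustrative helper names.

```javascript
// Hypothetical stored records; in a real workflow these would come from the vector store.
const storedChunks = [
  {
    content: "# Refund Policy\nRefunds are accepted within 30 days.",
    metadata: { source: "policy.md", chunkIndex: 0, totalChunks: 3, heading: "Refund Policy" },
  },
  {
    content: "To request one, email support with your order ID.",
    metadata: { source: "policy.md", chunkIndex: 1, totalChunks: 3, heading: "untitled" },
  },
  {
    content: "# Shipping\nOrders ship within 2 business days.",
    metadata: { source: "policy.md", chunkIndex: 2, totalChunks: 3, heading: "Shipping" },
  },
];

// Return the hit plus its neighbors from the same source, in document order.
function expandWithNeighbors(hit, allChunks, radius = 1) {
  const { source, chunkIndex } = hit.metadata;
  return allChunks
    .filter(
      (c) =>
        c.metadata.source === source &&
        Math.abs(c.metadata.chunkIndex - chunkIndex) <= radius
    )
    .sort((a, b) => a.metadata.chunkIndex - b.metadata.chunkIndex);
}

// Keep only chunks that sit under a given heading.
function filterByHeading(allChunks, heading) {
  return allChunks.filter((c) => c.metadata.heading === heading);
}

const expandedContext = expandWithNeighbors(storedChunks[0], storedChunks);
const refundChunks = filterByHeading(storedChunks, "Refund Policy");
console.log(expandedContext.map((c) => c.metadata.chunkIndex)); // [ 0, 1 ]
console.log(refundChunks.length); // 1
```

Neighbor expansion is one way to recover context that a chunk boundary cut off, without having to increase the chunk size for every document.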
📊 Testing Chunk Quality
```javascript
// Evaluate chunking by testing retrieval
// Code node: Chunk quality metrics
const testQueries = [
  "What is the refund policy?",
  "How do I reset my password?",
  "What are the pricing plans?"
];

// For each query:
// 1. Search vector store
// 2. Check if top result contains the answer
// 3. Measure relevance score

// Example report shape; the values here are illustrative placeholders.
const qualityReport = {
  totalQueries: testQueries.length,
  avgRelevanceScore: 0.85,
  queriesWithAnswer: 2,
  queriesWithout: 1,
  suggestion: "Consider smaller chunks for FAQ-style content"
};
```
Checkpoint
How do you evaluate chunking quality? What do you need to test?
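The three evaluation steps can be sketched as a small loop that fills in the report from real results. The `keywordSearch` function is a toy stand-in for a vector store query, an assumption made so the example is self-contained. The `expected` strings and sample chunks are made-up test data, and `evaluateChunking` is a hypothetical helper name.

```javascript
// Toy retriever: ranks chunks by the share of query words they contain.
// A real evaluation would query the vector store instead.
function keywordSearch(chunks, query) {
  const words = query.toLowerCase().match(/[a-z0-9]+/g) || [];
  return chunks
    .map((text) => ({
      text,
      score: words.filter((w) => text.toLowerCase().includes(w)).length / (words.length || 1),
    }))
    .sort((a, b) => b.score - a.score);
}

function evaluateChunking(chunks, testCases, search = keywordSearch) {
  let hits = 0;
  let totalScore = 0;
  for (const { query, expected } of testCases) {
    const [top] = search(chunks, query); // 1. search
    if (top && top.text.toLowerCase().includes(expected.toLowerCase())) hits++; // 2. answer present?
    totalScore += top ? top.score : 0; // 3. relevance score
  }
  return {
    totalQueries: testCases.length,
    queriesWithAnswer: hits,
    queriesWithout: testCases.length - hits,
    avgRelevanceScore: testCases.length ? totalScore / testCases.length : 0,
  };
}

const chunksUnderTest = [
  "Refund policy: refunds are accepted within 30 days of purchase.",
  "To reset your password, open Settings and choose Reset Password.",
];
const testCases = [
  { query: "What is the refund policy?", expected: "30 days" },
  { query: "How do I reset my password?", expected: "Settings" },
  { query: "What are the pricing plans?", expected: "per month" },
];

const report = evaluateChunking(chunksUnderTest, testCases);
console.log(report);
```

Running the same test cases against chunks produced with different sizes or splitters gives a like-for-like comparison of chunking strategies.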
📝 Hands-on Exercises
- Experiment: split the same document with 3 different chunk sizes (300, 800, 1500)
- Compare retrieval quality for each chunk size
- Build a markdown-aware splitter for technical docs
- Add metadata enrichment to the chunking pipeline
Checkpoint
Which 3 chunk sizes will you experiment with? Why do you need to compare retrieval quality?
🚀 Next Lesson
Embedding Documents: create embeddings and index documents into the vector store.
