Text Splitting Strategies

Effective text splitting strategies for RAG systems

0

🎯 Lesson objectives

After this lesson, you will:

✅ Understand why the text splitting strategy affects RAG quality

✅ Know four splitting methods: Character, Recursive, Markdown, Semantic

✅ Know how to choose a suitable overlap strategy and chunk size

✅ Add metadata to chunks to improve retrieval

✅ Test and evaluate chunking quality

How you split documents directly affects RAG quality. This lesson covers strategies from basic to advanced.

1

💡 Why does text splitting matter?


Checkpoint

What problems do chunks that are too large or too small cause? What makes a chunk size 'just right'?

2

🛠️ Splitting Methods

1. Character Splitter (Basic)

JavaScript
// Simple fixed-size splitting
// n8n Text Splitter: Character
// Chunk Size: 1000
// Overlap: 200

// Problem: a chunk boundary can fall in the middle of a sentence
// "The company was founded in" | "2020 by John Smith."
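To make the mid-sentence cut concrete, here is a minimal sketch of fixed-size splitting in plain JavaScript. It is not the n8n implementation; the function name `splitByCharacters` and the tiny chunk size are assumptions chosen so the effect is visible on one sentence.

```javascript
// Illustrative fixed-size splitter (hypothetical helper, not n8n's code).
function splitByCharacters(text, chunkSize = 1000, overlap = 200) {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

const sample = "The company was founded in 2020 by John Smith.";
const pieces = splitByCharacters(sample, 27, 5);
console.log(pieces); // the second piece starts mid-word, inside "founded"
```

The splitter only counts characters, so where a boundary lands depends on nothing but the position in the text.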

2. Recursive Character Splitter (Recommended)

JavaScript
// Splits by hierarchy: paragraph > sentence > word
// n8n Text Splitter: Recursive Character
// Chunk Size: 1000
// Chunk Overlap: 200
// Separators: ["\n\n", "\n", ". ", " "]

// Flow:
// 1. Try split by "\n\n" (paragraphs)
// 2. If still too large, split by "\n" (lines)
// 3. If still too large, split by ". " (sentences)
// 4. Last resort: split by " " (words)
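The flow above can be sketched as a small recursive function. This is a simplified illustration, not the n8n or LangChain implementation: it omits overlap, and the name `recursiveSplit` is an assumption.

```javascript
// Simplified recursive splitter (hypothetical sketch, no overlap handling).
function recursiveSplit(text, chunkSize = 1000, separators = ["\n\n", "\n", ". ", " "]) {
  if (text.length <= chunkSize) return text.trim() ? [text.trim()] : [];
  if (separators.length === 0) {
    // No separators left: fall back to a hard cut.
    const out = [];
    for (let i = 0; i < text.length; i += chunkSize) out.push(text.slice(i, i + chunkSize));
    return out;
  }
  const [sep, ...rest] = separators;
  const parts = text.split(sep);
  if (parts.length === 1) return recursiveSplit(text, chunkSize, rest);

  const chunks = [];
  let current = "";
  for (const part of parts) {
    // Merge neighbouring parts while they still fit into one chunk.
    const candidate = current ? current + sep + part : part;
    if (candidate.length <= chunkSize) {
      current = candidate;
      continue;
    }
    // `current` may itself be too large, so retry it with finer separators.
    if (current) chunks.push(...recursiveSplit(current, chunkSize, rest));
    current = part;
  }
  if (current) chunks.push(...recursiveSplit(current, chunkSize, rest));
  return chunks;
}

const text = "Para one is short.\n\nPara two has two sentences. Here is the second one.\n\nEnd.";
const result = recursiveSplit(text, 30);
console.log(result);
```

Note that in this sketch the separator is dropped wherever a chunk boundary falls, so a sentence that ends a chunk can lose its trailing period.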

3. Markdown Splitter

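JavaScript
// Respect markdown structure
// A full splitter would split by headers, code blocks, and paragraphs;
// this simple version splits on level 1-3 headers only.

// Code node: Custom markdown splitter
function splitMarkdown(text, maxChars = 1000) {
  // The lookahead keeps each header attached to the section it introduces
  const sections = text.split(/(?=^#{1,3} )/m);
  const chunks = [];
  let current = '';

  for (const section of sections) {
    // Start a new chunk when adding this section would exceed the limit
    if ((current + section).length > maxChars && current) {
      chunks.push(current.trim());
      current = section;
    } else {
      current += section;
    }
  }
  if (current) chunks.push(current.trim());

  // Note: a single section longer than maxChars is returned as one oversized chunk
  return chunks;
}

// Assumes the Code node runs in "Run Once for All Items" mode
// and the document text is in the first item's `content` field
const chunks = splitMarkdown($input.first().json.content);
return chunks.map((c, i) => ({ json: { text: c, index: i } }));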
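A header-based splitter can still emit a chunk larger than the limit when one section is very long. The sketch below, runnable outside n8n, adds one possible fallback that cuts such a section on blank lines. The fallback and the name `splitMarkdownSections` are assumptions, not part of the lesson's splitter.

```javascript
// Header-based splitting with a paragraph fallback (hypothetical variant).
function splitMarkdownSections(text, maxChars = 1000) {
  const sections = text.split(/(?=^#{1,3} )/m);
  const chunks = [];
  let current = "";
  const flush = () => {
    if (current.trim()) chunks.push(current.trim());
    current = "";
  };

  for (const section of sections) {
    if (section.length > maxChars) {
      flush();
      // Assumed fallback: cut an oversized section on blank lines.
      for (const para of section.split(/\n{2,}/)) {
        if ((current + "\n\n" + para).length > maxChars) flush();
        current = current ? current + "\n\n" + para : para;
      }
      flush();
      continue;
    }
    if ((current + section).length > maxChars) flush();
    current += section;
  }
  flush();
  return chunks;
}

const doc = "# Intro\nShort intro.\n\n## Setup\nInstall the tool.\n\n## Usage\nRun it.";
const mdChunks = splitMarkdownSections(doc, 40);
console.log(mdChunks);
```

A single paragraph longer than `maxChars` would still come out oversized here; combining this with a recursive character splitter is one way to handle that case.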

4. Semantic Splitter

JavaScript
// Split by semantic similarity
// Group related sentences together

// Code node: Semantic chunking concept
// 1. Split into sentences
// 2. Embed each sentence
// 3. Compare adjacent sentence embeddings
// 4. Split where similarity drops significantly

function splitIntoSentences(text) {
  // The trailing `|$` keeps a final sentence that lacks closing punctuation
  const matches = text.match(/[^.!?]+(?:[.!?]+|$)/g) || [text];
  return matches.map((s) => s.trim()).filter(Boolean);
}

// In practice, use AI to determine chunk boundaries:
const semanticPrompt = `
Split this text into logical chunks. Each chunk should be about
one topic or idea. Mark split points with "---SPLIT---".

Text: ${$json.text}
`;
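Steps 3 and 4 of the concept above can be sketched with cosine similarity between adjacent sentence embeddings. The embeddings here are toy 2-D vectors with made-up values, and the fixed `threshold` is a simplifying assumption; a real pipeline would call an embedding model and might use a relative drop instead.

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

// Start a new chunk wherever adjacent sentences fall below the threshold.
function semanticChunks(sentences, embeddings, threshold = 0.5) {
  if (sentences.length === 0) return [];
  const chunks = [];
  let current = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    const similarity = cosineSimilarity(embeddings[i - 1], embeddings[i]);
    if (similarity < threshold) {
      chunks.push(current.join(" "));
      current = [];
    }
    current.push(sentences[i]);
  }
  chunks.push(current.join(" "));
  return chunks;
}

// Toy vectors stand in for real embeddings (hypothetical values).
const sentences = [
  "Refunds take 5 days.",
  "Refunds go to the original card.",
  "Passwords reset by email.",
];
const embeddings = [[1, 0.1], [0.9, 0.2], [0.1, 1]];
const semantic = semanticChunks(sentences, embeddings, 0.5);
console.log(semantic);
```

The two refund sentences point in nearly the same direction and stay together, while the password sentence starts a new chunk.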

Checkpoint

Compare the four splitting methods. Why is the Recursive Character Splitter recommended?

3

🔗 Overlap Strategy

Overlap Guidelines

| Chunk size | Recommended overlap | Percentage |
|---|---|---|
| 500 chars | 50-100 chars | 10-20% |
| 1000 chars | 100-200 chars | 10-20% |
| 2000 chars | 200-400 chars | 10-20% |

Overlap preserves context at chunk boundaries. Too much overlap produces duplicated results at retrieval time; too little loses context across the boundary.
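The trade-off can be checked with simple arithmetic: each chunk advances by `chunkSize - overlap` characters, so a larger overlap means more chunks covering the same text. The helper below is a hypothetical sketch for a sliding-window splitter; `overlapPlan` is not an n8n function.

```javascript
// Estimate overlap, step, and chunk count for a sliding-window splitter.
function overlapPlan(textLength, chunkSize, overlapRatio = 0.15) {
  const overlap = Math.round(chunkSize * overlapRatio);
  const step = chunkSize - overlap; // characters of new text per chunk
  const chunkCount =
    textLength <= chunkSize ? 1 : Math.ceil((textLength - chunkSize) / step) + 1;
  return { overlap, step, chunkCount };
}

// Same 10,000-character document, three overlap settings.
const plan10 = overlapPlan(10000, 1000, 0.1);
const plan20 = overlapPlan(10000, 1000, 0.2);
const plan50 = overlapPlan(10000, 1000, 0.5);
console.log(plan10, plan20, plan50);
```

Moving from 10% to 50% overlap nearly doubles the number of chunks to embed and store, without adding any new text.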

Checkpoint

What should the overlap percentage be? What problems do too much and too little overlap cause?

4

📏 Chunk Size Selection

| Document type | Recommended size (chars) | Why |
|---|---|---|
| FAQ | 200-500 | Short, focused answers |
| Technical docs | 500-1000 | Need context but precise |
| Legal contracts | 1000-2000 | Long clauses need full context |
| Chat logs | 300-500 | Short messages |
| Books/articles | 500-1500 | Balanced |
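One way to apply the table in a pipeline is a lookup that returns splitter settings per document type. This is a sketch under assumptions: the key names are invented, taking the midpoint of each range is an arbitrary starting point, and the 15% overlap ratio is simply a value inside the 10-20% guideline.

```javascript
// Size ranges copied from the table above (in characters).
const CHUNK_SIZE_RANGES = {
  faq: [200, 500],
  technicalDocs: [500, 1000],
  legalContracts: [1000, 2000],
  chatLogs: [300, 500],
  booksArticles: [500, 1500],
};

// Hypothetical helper: pick a starting chunk size and overlap for a doc type.
function chunkSettingsFor(docType, overlapRatio = 0.15) {
  const range = CHUNK_SIZE_RANGES[docType];
  if (!range) throw new Error(`Unknown document type: ${docType}`);
  const [min, max] = range;
  const chunkSize = Math.round((min + max) / 2); // midpoint is an assumption
  return { chunkSize, chunkOverlap: Math.round(chunkSize * overlapRatio) };
}

const legal = chunkSettingsFor("legalContracts");
const faq = chunkSettingsFor("faq");
console.log(legal, faq);
```

Treat the returned values as a first guess to refine by testing retrieval, not as fixed settings.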

Checkpoint

What chunk sizes are recommended for FAQs, technical docs, and legal contracts? Why do they differ?

5

🏷️ Metadata for Chunks

JavaScript
// Code node: Enrich chunks with metadata
// Assumes "Run Once for All Items" mode, one input item per chunk,
// and the source file name on the first item's `source` field
const chunks = $input.all();
const fileName = $input.first().json.source;

// Return the first markdown heading found in the chunk, or "untitled"
function extractHeading(text) {
  const match = text.match(/^#+\s+(.+)/m);
  return match ? match[1].trim() : "untitled";
}

return chunks.map((chunk, i) => ({
  json: {
    content: chunk.json.text,
    metadata: {
      source: fileName,
      chunkIndex: i,
      totalChunks: chunks.length,
      position: i === 0 ? "beginning" : i === chunks.length - 1 ? "end" : "middle",
      charCount: chunk.json.text.length,
      heading: extractHeading(chunk.json.text)
    }
  }
}));
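Once chunks carry metadata, retrieval can narrow the candidate set before or after similarity search. The sketch below filters in plain JavaScript; the file names and the helper `filterByMetadata` are hypothetical, and many vector stores offer their own metadata filters that would replace this step.

```javascript
// Keep only chunks whose metadata matches every key in `filter`.
function filterByMetadata(chunks, filter) {
  return chunks.filter((chunk) =>
    Object.entries(filter).every(([key, value]) => chunk.metadata[key] === value)
  );
}

// Sample enriched chunks (made-up content and file names).
const enriched = [
  { content: "# Refunds\nRefunds take 5 days.", metadata: { source: "policy.md", heading: "Refunds", position: "beginning" } },
  { content: "# Shipping\nWe ship worldwide.", metadata: { source: "policy.md", heading: "Shipping", position: "end" } },
  { content: "# Refunds\nLegacy refund rules.", metadata: { source: "archive.md", heading: "Refunds", position: "beginning" } },
];

const currentRefunds = filterByMetadata(enriched, { source: "policy.md", heading: "Refunds" });
console.log(currentRefunds.map((c) => c.content));
```

Filtering on `source` here keeps an outdated document from competing with the current one for the same query.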

Checkpoint

Which fields should chunk metadata include? Why do position and heading matter?

6

📊 Testing Chunk Quality

JavaScript
// Evaluate chunking by testing retrieval
// Code node: Chunk quality metrics
const testQueries = [
  "What is the refund policy?",
  "How do I reset my password?",
  "What are the pricing plans?"
];

// For each query:
// 1. Search vector store
// 2. Check if top result contains the answer
// 3. Measure relevance score

// Example report shape; the numbers below are illustrative placeholders,
// not measured results. Fill them in from the steps above.
const qualityReport = {
  totalQueries: testQueries.length,
  avgRelevanceScore: 0.85,
  queriesWithAnswer: 2,
  queriesWithout: 1,
  suggestion: "Consider smaller chunks for FAQ-style content"
};
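The per-query steps can be turned into a small harness that fills in such a report from real checks. In this sketch the vector store is replaced by a naive keyword search so the code runs on its own; `keywordSearch`, `evaluateRetrieval`, the sample chunks, and the expected answer snippets are all assumptions for illustration.

```javascript
// Stand-in for a vector store: rank chunks by how many query words they contain.
function keywordSearch(query, chunks) {
  const words = (query.toLowerCase().match(/\w+/g) || []).filter((w) => w.length > 3);
  const score = (chunk) => words.filter((w) => chunk.toLowerCase().includes(w)).length;
  return [...chunks].sort((a, b) => score(b) - score(a));
}

// Count how often the top result contains the expected answer snippet.
function evaluateRetrieval(testCases, chunks, search) {
  let hits = 0;
  const misses = [];
  for (const { query, expected } of testCases) {
    const top = search(query, chunks)[0];
    if (top && top.toLowerCase().includes(expected.toLowerCase())) hits++;
    else misses.push(query);
  }
  return {
    totalQueries: testCases.length,
    queriesWithAnswer: hits,
    queriesWithout: misses.length,
    hitRate: hits / testCases.length,
    misses,
  };
}

// Made-up chunks and expected snippets.
const sampleChunks = [
  "Refunds are issued within 14 days of purchase.",
  "To reset your password, open Settings and choose Reset.",
  "Our office is open Monday to Friday.",
];
const testCases = [
  { query: "What is the refund policy?", expected: "14 days" },
  { query: "How do I reset my password?", expected: "Settings" },
  { query: "What are the pricing plans?", expected: "per month" },
];

const report = evaluateRetrieval(testCases, sampleChunks, keywordSearch);
console.log(report);
```

The list of missed queries is usually the most useful output: it points at content that is either absent from the documents or split in a way that hides the answer.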

Checkpoint

How do you evaluate chunking quality? What needs to be tested?

7

📝 Practice exercises

Exercises
  1. Experiment: split the same document with three chunk sizes (300, 800, 1500)
  2. Compare retrieval quality for each chunk size
  3. Build a markdown-aware splitter for technical docs
  4. Add metadata enrichment to the chunking pipeline
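As a starting point for the first exercise, the sketch below splits one document at the three sizes and prints basic statistics. The synthetic document, the 15% overlap ratio, and the helper names are assumptions; swap in a real document and add a retrieval check to complete the comparison.

```javascript
// Minimal sliding-window splitter used only for this experiment.
function splitByCharacters(text, chunkSize, overlap) {
  const step = chunkSize - overlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Collect chunk count and length statistics for each chunk size.
function compareChunkSizes(text, sizes, overlapRatio = 0.15) {
  return sizes.map((chunkSize) => {
    const overlap = Math.round(chunkSize * overlapRatio);
    const chunks = splitByCharacters(text, chunkSize, overlap);
    const totalChars = chunks.reduce((sum, c) => sum + c.length, 0);
    return {
      chunkSize,
      overlap,
      chunkCount: chunks.length,
      avgLength: Math.round(totalChars / chunks.length),
      maxLength: Math.max(...chunks.map((c) => c.length)),
    };
  });
}

// Synthetic stand-in for a real document.
const documentText = Array.from(
  { length: 200 },
  (_, i) => `Sentence ${i} explains one small detail.`
).join(" ");

const stats = compareChunkSizes(documentText, [300, 800, 1500]);
console.table(stats);
```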

Checkpoint

Which three chunk sizes will you experiment with? Why compare retrieval quality?

🚀 Next lesson

Embedding Documents: create embeddings and index documents into a vector store.