📄 Document Processing & Indexing
RAG quality depends heavily on how documents are processed. This lesson covers document loading, chunking strategies, and indexing.
Document Processing Pipeline
```mermaid
graph LR
    D[Documents] --> L[Load]
    L --> C[Clean]
    C --> S[Split/Chunk]
    S --> E[Enrich Metadata]
    E --> Em[Embed]
    Em --> I[Index]
```

Document Loaders
1. PDF Documents
```javascript
// n8n PDF Loader node
// Or custom with pdf-parse

const fs = require('fs');
const pdfParse = require('pdf-parse');

const dataBuffer = fs.readFileSync('document.pdf');
const data = await pdfParse(dataBuffer);

return [{
  json: {
    text: data.text,
    pages: data.numpages,
    info: data.info
  }
}];
```

2. Web Pages
```javascript
// HTTP Request node to fetch the page
// Then parse the HTML

const cheerio = require('cheerio');

const html = $input.first().json.body;
// Use a name other than `$` to avoid shadowing n8n's built-in $() helper
const $page = cheerio.load(html);

// Extract main content
const content = $page('article').text() || $page('main').text() || $page('body').text();

// Clean up: collapse blank lines first, then runs of spaces/tabs
const cleaned = content
  .replace(/\n{2,}/g, '\n')
  .replace(/[ \t]+/g, ' ')
  .trim();

return [{ json: { text: cleaned, url: $json.url } }];
```

3. Notion Pages
```javascript
// Notion API integration
const { Client } = require('@notionhq/client');

const notionClient = new Client({ auth: process.env.NOTION_API_KEY });

const page = await notionClient.pages.retrieve({ page_id: pageId });
const blocks = await notionClient.blocks.children.list({ block_id: pageId });

// Convert blocks to text
const text = blocks.results
  .map(block => {
    if (block.type === 'paragraph') {
      return block.paragraph.rich_text
        .map(t => t.plain_text)
        .join('');
    }
    // Handle other block types...
    return '';
  })
  .filter(Boolean)
  .join('\n\n');
```

4. Google Docs
```javascript
// Google Docs API
// n8n has a built-in Google Docs node

// Get document content
// Export as plain text or HTML
// Parse and clean
```
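If you do go the custom route, one way is to export the Doc as plain text via the Google Drive API. This is a minimal sketch, assuming an authenticated `auth` client and a `documentId` supplied by a previous node (the built-in Google Docs node hides these details):

```javascript
const { google } = require('googleapis');

// `auth` is an authenticated OAuth2 / service-account client (assumed to exist)
const drive = google.drive({ version: 'v3', auth });

// Export the Doc as plain text
const res = await drive.files.export({
  fileId: documentId,
  mimeType: 'text/plain'
});

// Normalize line endings and collapse blank lines before chunking
const text = res.data
  .replace(/\r\n/g, '\n')
  .replace(/\n{3,}/g, '\n\n')
  .trim();

return [{ json: { text, source: documentId } }];
```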
Chunking Strategies

1. Fixed Size Chunking
Simplest approach:
```javascript
function fixedSizeChunk(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  let start = 0;

  while (start < text.length) {
    const end = start + chunkSize;
    chunks.push({
      text: text.slice(start, end),
      start,
      end
    });
    start = end - overlap;
  }

  return chunks;
}
```

Pros: Simple and predictable. Cons: It can cut mid-sentence or mid-paragraph.
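For example, with the defaults above, a 1,200-character input (the `longText` variable here is hypothetical) produces three overlapping chunks:

```javascript
// longText is any string of ~1200 characters (hypothetical example input)
const exampleChunks = fixedSizeChunk(longText, 500, 50);
// exampleChunks[0] covers characters 0-500
// exampleChunks[1] covers characters 450-950 (50-character overlap with the previous chunk)
// exampleChunks[2] covers characters 900-1200
```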
2. Sentence-Based Chunking
```javascript
function sentenceChunk(text, maxChunkSize = 500) {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks = [];
  let currentChunk = '';

  for (const sentence of sentences) {
    if ((currentChunk + sentence).length > maxChunkSize && currentChunk) {
      chunks.push(currentChunk.trim());
      currentChunk = sentence;
    } else {
      currentChunk += sentence;
    }
  }

  if (currentChunk) {
    chunks.push(currentChunk.trim());
  }

  return chunks;
}
```

3. Recursive Character Splitting
n8n's Text Splitter node uses this:
```text
// Config
Separators: ["\n\n", "\n", " ", ""]   // Try in order
Chunk Size: 500
Chunk Overlap: 50

// Logic:
// 1. Try to split on \n\n (paragraphs)
// 2. If chunks are too large, split on \n
// 3. If still too large, split on spaces
// 4. Last resort: character-level
```
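The built-in node handles this for you, but the idea can be sketched in plain JavaScript. This is a simplified illustration, not the node's actual implementation, and overlap handling is omitted:

```javascript
// Recursive character splitting: try coarse separators first,
// fall back to finer ones only for pieces that are still too large
function recursiveSplit(text, chunkSize = 500, separators = ['\n\n', '\n', ' ', '']) {
  if (text.length <= chunkSize) return [text];

  const [sep, ...finer] = separators;
  const parts = sep === '' ? [...text] : text.split(sep);

  const chunks = [];
  let current = '';

  const flush = () => {
    if (!current) return;
    // If a piece is still oversized, retry with the next (finer) separator
    if (current.length > chunkSize && finer.length > 0) {
      chunks.push(...recursiveSplit(current, chunkSize, finer));
    } else {
      chunks.push(current);
    }
    current = '';
  };

  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length > chunkSize) {
      flush();
      current = part;
    } else {
      current = candidate;
    }
  }
  flush();

  return chunks;
}
```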
4. Semantic Chunking
Based on meaning, not size:
```javascript
// Use embeddings to find natural break points
async function semanticChunk(text) {
  const sentences = splitIntoSentences(text);
  const embeddings = await embed(sentences);

  const chunks = [];
  let currentChunk = [sentences[0]];

  for (let i = 1; i < sentences.length; i++) {
    const similarity = cosineSimilarity(
      embeddings[i],
      embeddings[i - 1]
    );

    if (similarity < 0.5) {
      // Low similarity = new topic
      chunks.push(currentChunk.join(' '));
      currentChunk = [sentences[i]];
    } else {
      currentChunk.push(sentences[i]);
    }
  }

  chunks.push(currentChunk.join(' '));
  return chunks;
}
```
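The sketch above assumes `splitIntoSentences`, `embed`, and `cosineSimilarity` helpers exist; cosine similarity itself is only a few lines:

```javascript
// Cosine similarity between two embedding vectors of equal length
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}
```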
Metadata Enrichment

Adding Context
```javascript
const chunks = $input.all();

const enrichedChunks = chunks.map((chunk, index) => ({
  json: {
    text: chunk.json.text,
    metadata: {
      source: $json.filename,
      sourceType: "pdf",
      chunkIndex: index,
      totalChunks: chunks.length,
      dateProcessed: new Date().toISOString(),
      category: detectCategory(chunk.json.text),
      language: detectLanguage(chunk.json.text)
    }
  }
}));

return enrichedChunks;
```
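`detectCategory` is defined in the next section, but `detectLanguage` is not shown in this lesson; a crude placeholder might look like the sketch below (a dedicated library such as `franc` would be more reliable):

```javascript
// Very rough language detection: Vietnamese diacritics vs. default English
// (illustrative only; swap in a proper language-detection library for production)
function detectLanguage(text) {
  const vietnameseChars = /[ăâđêôơưạảấầẩẫậắằẳẵặẹẻẽếềểễệịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹ]/i;
  return vietnameseChars.test(text) ? 'vi' : 'en';
}
```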
Automatic Categorization

```javascript
function detectCategory(text) {
  const keywords = {
    'returns': ['return', 'refund', 'exchange', 'money back'],
    'shipping': ['delivery', 'shipping', 'tracking', 'arrive'],
    'pricing': ['price', 'cost', 'discount', 'promotion'],
    'technical': ['install', 'setup', 'error', 'troubleshoot']
  };

  const textLower = text.toLowerCase();

  for (const [category, words] of Object.entries(keywords)) {
    if (words.some(word => textLower.includes(word))) {
      return category;
    }
  }

  return 'general';
}
```

Embedding
OpenAI Embeddings
```javascript
// n8n Embeddings node
// Model: text-embedding-3-small (1536 dimensions)
// or:    text-embedding-3-large (3072 dimensions)

// Batch processing
const BATCH_SIZE = 100;
const chunks = $input.all();
const results = [];

for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
  const batch = chunks.slice(i, i + BATCH_SIZE);
  const embeddings = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: batch.map(c => c.json.text)
  });

  batch.forEach((chunk, idx) => {
    results.push({
      json: {
        ...chunk.json,
        embedding: embeddings.data[idx].embedding
      }
    });
  });
}

return results;
```

Cost Estimation
```javascript
// text-embedding-3-small: $0.00002 per 1K tokens
// Estimate: ~1 token per 4 characters

const totalChars = chunks.reduce((sum, c) => sum + c.json.text.length, 0);
const estimatedTokens = totalChars / 4;
const estimatedCost = (estimatedTokens / 1000) * 0.00002;

console.log(`Estimated embedding cost: $${estimatedCost.toFixed(4)}`);
```
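For a sense of scale: 1,000 chunks of roughly 800 characters each is about 200K tokens, which comes to around $0.004 with text-embedding-3-small, so embedding cost is rarely the bottleneck for small-to-medium knowledge bases.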
Indexing

Pinecone Upsert
```javascript
// Batch upsert to Pinecone
const vectors = chunks.map(chunk => ({
  id: `doc-${chunk.json.sourceId}-chunk-${chunk.json.chunkIndex}`,
  values: chunk.json.embedding,
  metadata: chunk.json.metadata
}));

// Upsert in batches of 100
const BATCH_SIZE = 100;
for (let i = 0; i < vectors.length; i += BATCH_SIZE) {
  await pinecone.index('my-index').upsert(
    vectors.slice(i, i + BATCH_SIZE)
  );
}
```

Supabase Insert
```javascript
// Insert with an RPC function
const { error } = await supabase.rpc('insert_document_chunks', {
  chunks: chunks.map(c => ({
    content: c.json.text,
    embedding: c.json.embedding,
    metadata: c.json.metadata
  }))
});
```
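If no RPC function is set up, a plain insert into a pgvector-backed table also works. The `documents` table and its column names in this sketch are assumptions, not a fixed schema:

```javascript
// Assumes a `documents` table with columns: content (text), embedding (vector), metadata (jsonb)
const { error: insertError } = await supabase
  .from('documents')
  .insert(chunks.map(c => ({
    content: c.json.text,
    embedding: c.json.embedding,
    metadata: c.json.metadata
  })));

if (insertError) {
  throw new Error(`Supabase insert failed: ${insertError.message}`);
}
```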
Complete Indexing Workflow

```text
File Trigger (S3/Google Drive)
  ↓
Download File
  ↓
IF: File Type
  ├─ PDF  → PDF Loader
  ├─ DOCX → Word Loader
  └─ TXT  → Text Loader
  ↓
Clean Text (Code node)
  ↓
Text Splitter
  ↓
Enrich Metadata
  ↓
Batch Embeddings
  ↓
Upsert to Vector Store
  ↓
Log to Database (tracking)
  ↓
Notify (Slack/Email)
```

Best Practices
Document Processing Tips
- Clean before chunking - Remove headers, footers, noise (see the cleaning sketch after this list)
- Preserve structure - Keep headings, lists intact
- Right chunk size - 500-1000 chars usually good
- Overlap chunks - 10-20% overlap
- Add context - Source, date, category
- Test retrieval - Verify chunks are retrievable
- Version documents - Track updates
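A minimal cleaning pass, run before the Text Splitter, could look like this. The patterns are illustrative and should be tuned to your document sources:

```javascript
// Strip common noise before chunking; adjust patterns to your documents
function cleanText(text) {
  return text
    .replace(/^Page \d+( of \d+)?$/gim, '')  // page-number footers
    .replace(/\r\n/g, '\n')                  // normalize line endings
    .replace(/[ \t]+/g, ' ')                 // collapse spaces and tabs
    .replace(/\n{3,}/g, '\n\n')              // collapse runs of blank lines
    .trim();
}
```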
Hands-on Exercise
Build Document Indexing Pipeline:
- Create a workflow to:
  - Watch a folder for new files
  - Load PDF/TXT/web pages
  - Clean and chunk
  - Add metadata
  - Embed and index
- Index 10+ documents
- Test retrieval quality

Target: Automated indexing with proper metadata
Up Next
Next lesson: RAG Query Pipeline - building the query side.
