Theory
35 minutes
Lesson 2/3

Document Processing & Indexing

Processing and indexing documents for RAG systems in n8n

RAG quality depends heavily on how documents are processed. This lesson covers document loading, chunking strategies, and indexing.

Document Processing Pipeline

Diagram
graph LR
    D[Documents] --> L[Load]
    L --> C[Clean]
    C --> S[Split/Chunk]
    S --> E[Enrich Metadata]
    E --> Em[Embed]
    Em --> I[Index]

Document Loaders

1. PDF Documents

JavaScript
// n8n PDF Loader node
// Or custom with pdf-parse

const fs = require('fs');
const pdfParse = require('pdf-parse');

const dataBuffer = fs.readFileSync('document.pdf');
const data = await pdfParse(dataBuffer);

return [{
  json: {
    text: data.text,
    pages: data.numpages,
    info: data.info
  }
}];

2. Web Pages

JavaScript
// HTTP Request node to fetch page
// Then parse HTML

const cheerio = require('cheerio');

const html = $input.first().json.body;
// Use a separate name to avoid shadowing n8n's built-in $ helper
const $page = cheerio.load(html);

// Extract main content
const content = $page('article').text() || $page('main').text() || $page('body').text();

// Clean up: collapse spaces/tabs first, then repeated newlines
const cleaned = content
  .replace(/[ \t]+/g, ' ')
  .replace(/\n+/g, '\n')
  .trim();

return [{ json: { text: cleaned, url: $json.url } }];

3. Notion Pages

JavaScript
// Notion API integration
const { Client } = require('@notionhq/client');
const notionClient = new Client({ auth: process.env.NOTION_API_KEY });

// pageId comes from a previous node (e.g. $json.pageId)
const page = await notionClient.pages.retrieve({ page_id: pageId });
const blocks = await notionClient.blocks.children.list({ block_id: pageId });

// Convert blocks to text
const text = blocks.results
  .map(block => {
    if (block.type === 'paragraph') {
      return block.paragraph.rich_text
        .map(t => t.plain_text)
        .join('');
    }
    // Handle other block types...
    return '';
  })
  .join('\n\n');

return [{ json: { text } }];

4. Google Docs

JavaScript
// Google Docs API
// n8n has a built-in Google Docs node

// Get document content
// Export as plain text or HTML
// Parse and clean
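
If you call the Google Docs API directly instead of the built-in node, a minimal sketch for pulling plain text out of a document could look like this (assumes an already-authorized googleapis auth client and a documentId coming from a previous node):

JavaScript
const { google } = require('googleapis');

// `auth` is assumed to be an authorized OAuth2 or service-account client
const docs = google.docs({ version: 'v1', auth });

const res = await docs.documents.get({ documentId });

// A Google Doc body is a list of structural elements;
// paragraphs contain elements with textRun.content
const text = (res.data.body.content || [])
  .filter(el => el.paragraph)
  .map(el => el.paragraph.elements
    .map(e => (e.textRun ? e.textRun.content : ''))
    .join(''))
  .join('');

return [{ json: { text } }];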

Chunking Strategies

1. Fixed Size Chunking

Simplest approach:

JavaScript
function fixedSizeChunk(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  let start = 0;

  while (start < text.length) {
    const end = start + chunkSize;
    chunks.push({
      text: text.slice(start, end),
      start,
      end
    });
    start = end - overlap;
  }

  return chunks;
}

Pros: simple, predictable. Cons: can cut in the middle of a sentence or paragraph.

2. Sentence-Based Chunking

JavaScript
function sentenceChunk(text, maxChunkSize = 500) {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks = [];
  let currentChunk = '';

  for (const sentence of sentences) {
    if ((currentChunk + sentence).length > maxChunkSize && currentChunk) {
      chunks.push(currentChunk.trim());
      currentChunk = sentence;
    } else {
      currentChunk += sentence;
    }
  }

  if (currentChunk) {
    chunks.push(currentChunk.trim());
  }

  return chunks;
}

3. Recursive Character Splitting

n8n's Recursive Character Text Splitter node uses this approach:

JavaScript
// Config
Separators: ["\n\n", "\n", " ", ""]  // Try in order
Chunk Size: 500
Chunk Overlap: 50

// Logic:
// 1. Try to split on \n\n (paragraphs)
// 2. If chunks too large, split on \n
// 3. If still too large, split on space
// 4. Last resort: character-level
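
As a rough sketch, the same idea in plain JavaScript (simplified: no chunk overlap and no re-merging of small pieces, which the real splitter does):

JavaScript
function recursiveSplit(text, separators = ['\n\n', '\n', ' ', ''], chunkSize = 500) {
  if (text.length <= chunkSize) return [text];

  const [sep, ...rest] = separators;

  // Last resort: hard cut every chunkSize characters
  if (sep === '') {
    const chunks = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      chunks.push(text.slice(i, i + chunkSize));
    }
    return chunks;
  }

  // Split on the current separator, recurse into pieces that are still too large
  return text
    .split(sep)
    .filter(piece => piece.length > 0)
    .flatMap(piece =>
      piece.length > chunkSize ? recursiveSplit(piece, rest, chunkSize) : [piece]
    );
}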

4. Semantic Chunking

Based on meaning, not size:

JavaScript
// Use embeddings to find natural break points
// (splitIntoSentences, embed, and cosineSimilarity are helpers you provide;
//  see the cosine similarity sketch below)
async function semanticChunk(text) {
  const sentences = splitIntoSentences(text);
  const embeddings = await embed(sentences);

  const chunks = [];
  let currentChunk = [sentences[0]];

  for (let i = 1; i < sentences.length; i++) {
    const similarity = cosineSimilarity(
      embeddings[i],
      embeddings[i - 1]
    );

    if (similarity < 0.5) {
      // Low similarity = new topic
      chunks.push(currentChunk.join(' '));
      currentChunk = [sentences[i]];
    } else {
      currentChunk.push(sentences[i]);
    }
  }

  chunks.push(currentChunk.join(' '));
  return chunks;
}
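
A cosine similarity helper for the sketch above (the standard dot-product formula, nothing n8n-specific):

JavaScript
function cosineSimilarity(a, b) {
  // dot(a, b) / (|a| * |b|)
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}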

Metadata Enrichment

Adding Context

JavaScript
const chunks = $input.all();

// detectCategory and detectLanguage are helper functions (see the sections below)
const enrichedChunks = chunks.map((chunk, index) => ({
  json: {
    text: chunk.json.text,
    metadata: {
      source: $json.filename,
      sourceType: "pdf",
      chunkIndex: index,
      totalChunks: chunks.length,
      dateProcessed: new Date().toISOString(),
      category: detectCategory(chunk.json.text),
      language: detectLanguage(chunk.json.text)
    }
  }
}));

return enrichedChunks;

Automatic Categorization

JavaScript
function detectCategory(text) {
  const keywords = {
    'returns': ['return', 'refund', 'exchange', 'money back'],
    'shipping': ['delivery', 'shipping', 'tracking', 'arrive'],
    'pricing': ['price', 'cost', 'discount', 'promotion'],
    'technical': ['install', 'setup', 'error', 'troubleshoot']
  };

  const textLower = text.toLowerCase();

  for (const [category, words] of Object.entries(keywords)) {
    if (words.some(word => textLower.includes(word))) {
      return category;
    }
  }

  return 'general';
}
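
The detectLanguage helper referenced in the enrichment snippet is not defined above. A crude heuristic sketch, assuming you only need to tell Vietnamese from English (for anything serious, use a real language-detection library such as franc):

JavaScript
function detectLanguage(text) {
  // Vietnamese-specific diacritic characters (rough heuristic, not a real detector)
  const vietnameseChars = /[ăâđêôơưàảãạáằẳẵặắầẩẫậấèẻẽẹéềểễệếìỉĩịíòỏõọóồổỗộốờởỡợớùủũụúừửữựứỳỷỹỵý]/i;
  if (vietnameseChars.test(text)) return 'vi';
  return 'en';
}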

Embedding

OpenAI Embeddings

JavaScript
// n8n Embeddings node
Model: text-embedding-3-small   // 1536 dimensions
// or text-embedding-3-large    // 3072 dimensions

// Batch processing (custom Code node)
const OpenAI = require('openai');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const BATCH_SIZE = 100;
const chunks = $input.all();
const results = [];

for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
  const batch = chunks.slice(i, i + BATCH_SIZE);
  const embeddings = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: batch.map(c => c.json.text)
  });

  batch.forEach((chunk, idx) => {
    results.push({
      json: {
        ...chunk.json,
        embedding: embeddings.data[idx].embedding
      }
    });
  });
}

return results;

Cost Estimation

JavaScript
// text-embedding-3-small: $0.00002 per 1K tokens
// Estimate: ~1 token per 4 characters

const chunks = $input.all();
const totalChars = chunks.reduce((sum, c) => sum + c.json.text.length, 0);
const estimatedTokens = totalChars / 4;
const estimatedCost = (estimatedTokens / 1000) * 0.00002;

console.log(`Estimated embedding cost: $${estimatedCost.toFixed(4)}`);

Indexing

Pinecone Upsert

JavaScript
// Batch upsert to Pinecone
const { Pinecone } = require('@pinecone-database/pinecone');
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });

// chunks come from the embedding step
const vectors = chunks.map(chunk => ({
  id: `doc-${chunk.json.sourceId}-chunk-${chunk.json.chunkIndex}`,
  values: chunk.json.embedding,
  metadata: chunk.json.metadata
}));

// Upsert in batches of 100
const BATCH_SIZE = 100;
for (let i = 0; i < vectors.length; i += BATCH_SIZE) {
  await pinecone.index('my-index').upsert(
    vectors.slice(i, i + BATCH_SIZE)
  );
}

Supabase Insert

JavaScript
// Insert with an RPC function
const { createClient } = require('@supabase/supabase-js');
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_SERVICE_KEY);

const { error } = await supabase.rpc('insert_document_chunks', {
  chunks: chunks.map(c => ({
    content: c.json.text,
    embedding: c.json.embedding,
    metadata: c.json.metadata
  }))
});
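
If you have not created an insert_document_chunks RPC in Postgres, a plain table insert also works, assuming a documents table with content, embedding (pgvector), and metadata columns:

JavaScript
const { error } = await supabase
  .from('documents')
  .insert(chunks.map(c => ({
    content: c.json.text,
    embedding: c.json.embedding,
    metadata: c.json.metadata
  })));

if (error) throw new Error(`Supabase insert failed: ${error.message}`);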

Complete Indexing Workflow

Text
File Trigger (S3/Google Drive)
        ↓
Download File
        ↓
IF: File Type
  ├─ PDF  → PDF Loader
  ├─ DOCX → Word Loader
  └─ TXT  → Text Loader
        ↓
Clean Text (Code node)
        ↓
Text Splitter
        ↓
Enrich Metadata
        ↓
Batch Embeddings
        ↓
Upsert to Vector Store
        ↓
Log to Database (tracking)
        ↓
Notify (Slack/Email)
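
The Clean Text step can be a small Code node. A minimal sketch; what counts as noise depends on your documents, so adjust the patterns:

JavaScript
const items = $input.all();

return items.map(item => {
  const cleaned = item.json.text
    // Drop page-number-only lines left over from PDF extraction
    .replace(/^\s*\d+\s*$/gm, '')
    // Collapse runs of blank lines
    .replace(/\n{3,}/g, '\n\n')
    // Collapse repeated spaces/tabs
    .replace(/[ \t]{2,}/g, ' ')
    .trim();

  return { json: { ...item.json, text: cleaned } };
});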

Best Practices

Document Processing Tips
  1. Clean before chunking - Remove headers, footers, noise
  2. Preserve structure - Keep headings and lists intact
  3. Right chunk size - 500-1000 characters usually works well
  4. Overlap chunks - 10-20% overlap
  5. Add context - Source, date, category
  6. Test retrieval - Verify chunks are retrievable (see the sketch below)
  7. Version documents - Track updates
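
A quick retrieval smoke test, as a rough sketch: embed a few questions the documents should answer and check that the expected source shows up in the top results. The questions, file names, and index name below are placeholders, and the OpenAI/Pinecone clients are assumed to be initialized as in the earlier snippets:

JavaScript
const testQueries = [
  { question: 'How do I return a product?', expectedSource: 'returns-policy.pdf' },
  { question: 'How long does shipping take?', expectedSource: 'shipping-faq.pdf' }
];

for (const test of testQueries) {
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: test.question
  });

  const res = await pinecone.index('my-index').query({
    vector: data[0].embedding,
    topK: 3,
    includeMetadata: true
  });

  const sources = res.matches.map(m => m.metadata.source);
  console.log(test.question, sources.includes(test.expectedSource) ? 'OK' : 'MISS', sources);
}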

Practice Exercise

Hands-on Exercise

Build Document Indexing Pipeline:

  1. Create workflow to:

    • Watch folder for new files
    • Load PDF/TXT/Web pages
    • Clean and chunk
    • Add metadata
    • Embed and index
  2. Index 10+ documents

  3. Test retrieval quality

Target: automated indexing with proper metadata


Up Next

Next lesson: RAG Query Pipeline - building the query side.