📄 Document Processing & Indexing
RAG quality depends heavily on how documents are processed. This lesson covers document loading, chunking strategies, and indexing.
Document Processing Pipeline
```mermaid
graph LR
    D[Documents] --> L[Load]
    L --> C[Clean]
    C --> S[Split/Chunk]
    S --> E[Enrich Metadata]
    E --> Em[Embed]
    Em --> I[Index]
```

Document Loaders
1. PDF Documents
```javascript
// n8n PDF Loader node
// Or custom with pdf-parse

const fs = require('fs');
const pdfParse = require('pdf-parse');

const dataBuffer = fs.readFileSync('document.pdf');
const data = await pdfParse(dataBuffer);

return [{
  json: {
    text: data.text,
    pages: data.numpages,
    info: data.info
  }
}];
```

2. Web Pages
```javascript
// HTTP Request node to fetch the page
// Then parse the HTML

const cheerio = require('cheerio');

const html = $input.first().json.body;
// Use a name other than `$` to avoid shadowing n8n's built-in $() helper
const $page = cheerio.load(html);

// Extract main content
const content = $page('article').text() || $page('main').text() || $page('body').text();

// Clean up: collapse blank lines first, then runs of spaces/tabs
const cleaned = content
  .replace(/\n{2,}/g, '\n')
  .replace(/[ \t]+/g, ' ')
  .trim();

return [{ json: { text: cleaned, url: $json.url } }];
```

3. Notion Pages
```javascript
// Notion API integration
const { Client } = require('@notionhq/client');

const notionClient = new Client({ auth: process.env.NOTION_API_KEY });

const page = await notionClient.pages.retrieve({ page_id: pageId });
const blocks = await notionClient.blocks.children.list({ block_id: pageId });

// Convert blocks to text
const text = blocks.results
  .map(block => {
    if (block.type === 'paragraph') {
      return block.paragraph.rich_text
        .map(t => t.plain_text)
        .join('');
    }
    // Handle other block types...
    return '';
  })
  .filter(Boolean)
  .join('\n\n');
```

4. Google Docs
```javascript
// Google Docs API
// n8n has a built-in Google Docs node

// Get document content
// Export as plain text or HTML
// Parse and clean
```
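If you do go the custom route, one way is to export the Doc as plain text via the Google Drive API. This is a minimal sketch, assuming an authenticated `auth` client and a `documentId` supplied by a previous node (the built-in Google Docs node hides these details):

```javascript
const { google } = require('googleapis');

// `auth` is an authenticated OAuth2 / service-account client (assumed to exist)
const drive = google.drive({ version: 'v3', auth });

// Export the Doc as plain text
const res = await drive.files.export({
  fileId: documentId,
  mimeType: 'text/plain'
});

// Normalize line endings and collapse blank lines before chunking
const text = res.data
  .replace(/\r\n/g, '\n')
  .replace(/\n{3,}/g, '\n\n')
  .trim();

return [{ json: { text, source: documentId } }];
```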
Chunking Strategies

1. Fixed Size Chunking
Simplest approach:
```javascript
function fixedSizeChunk(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  let start = 0;

  while (start < text.length) {
    const end = start + chunkSize;
    chunks.push({
      text: text.slice(start, end),
      start,
      end
    });
    start = end - overlap;
  }

  return chunks;
}
```

Pros: Simple and predictable. Cons: It can cut mid-sentence or mid-paragraph.
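For example, with the defaults above, a 1,200-character input (the `longText` variable here is hypothetical) produces three overlapping chunks:

```javascript
// longText is any string of ~1200 characters (hypothetical example input)
const exampleChunks = fixedSizeChunk(longText, 500, 50);
// exampleChunks[0] covers characters 0-500
// exampleChunks[1] covers characters 450-950 (50-character overlap with the previous chunk)
// exampleChunks[2] covers characters 900-1200
```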
2. Sentence-Based Chunking
```javascript
function sentenceChunk(text, maxChunkSize = 500) {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks = [];
  let currentChunk = '';

  for (const sentence of sentences) {
    if ((currentChunk + sentence).length > maxChunkSize && currentChunk) {
      chunks.push(currentChunk.trim());
      currentChunk = sentence;
    } else {
      currentChunk += sentence;
    }
  }

  if (currentChunk) {
    chunks.push(currentChunk.trim());
  }

  return chunks;
}
```

3. Recursive Character Splitting
n8n's Text Splitter node uses this:
```text
// Config
Separators: ["\n\n", "\n", " ", ""]   // Try in order
Chunk Size: 500
Chunk Overlap: 50

// Logic:
// 1. Try to split on \n\n (paragraphs)
// 2. If chunks are too large, split on \n
// 3. If still too large, split on spaces
// 4. Last resort: character-level
```
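The built-in node handles this for you, but the idea can be sketched in plain JavaScript. This is a simplified illustration, not the node's actual implementation, and overlap handling is omitted:

```javascript
// Recursive character splitting: try coarse separators first,
// fall back to finer ones only for pieces that are still too large
function recursiveSplit(text, chunkSize = 500, separators = ['\n\n', '\n', ' ', '']) {
  if (text.length <= chunkSize) return [text];

  const [sep, ...finer] = separators;
  const parts = sep === '' ? [...text] : text.split(sep);

  const chunks = [];
  let current = '';

  const flush = () => {
    if (!current) return;
    // If a piece is still oversized, retry with the next (finer) separator
    if (current.length > chunkSize && finer.length > 0) {
      chunks.push(...recursiveSplit(current, chunkSize, finer));
    } else {
      chunks.push(current);
    }
    current = '';
  };

  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length > chunkSize) {
      flush();
      current = part;
    } else {
      current = candidate;
    }
  }
  flush();

  return chunks;
}
```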
4. Semantic Chunking
Based on meaning, not size:
```javascript
// Use embeddings to find natural break points
async function semanticChunk(text) {
  const sentences = splitIntoSentences(text);
  const embeddings = await embed(sentences);

  const chunks = [];
  let currentChunk = [sentences[0]];

  for (let i = 1; i < sentences.length; i++) {
    const similarity = cosineSimilarity(
      embeddings[i],
      embeddings[i - 1]
    );

    if (similarity < 0.5) {
      // Low similarity = new topic
      chunks.push(currentChunk.join(' '));
      currentChunk = [sentences[i]];
    } else {
      currentChunk.push(sentences[i]);
    }
  }

  chunks.push(currentChunk.join(' '));
  return chunks;
}
```
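The sketch above assumes `splitIntoSentences`, `embed`, and `cosineSimilarity` helpers exist; cosine similarity itself is only a few lines:

```javascript
// Cosine similarity between two embedding vectors of equal length
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}
```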
Metadata Enrichment

Adding Context
```javascript
const chunks = $input.all();

const enrichedChunks = chunks.map((chunk, index) => ({
  json: {
    text: chunk.json.text,
    metadata: {
      source: $json.filename,
      sourceType: "pdf",
      chunkIndex: index,
      totalChunks: chunks.length,
      dateProcessed: new Date().toISOString(),
      category: detectCategory(chunk.json.text),
      language: detectLanguage(chunk.json.text)
    }
  }
}));

return enrichedChunks;
```
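`detectCategory` is defined in the next section, but `detectLanguage` is not shown in this lesson; a crude placeholder might look like the sketch below (a dedicated library such as `franc` would be more reliable):

```javascript
// Very rough language detection: Vietnamese diacritics vs. default English
// (illustrative only; swap in a proper language-detection library for production)
function detectLanguage(text) {
  const vietnameseChars = /[ăâđêôơưạảấầẩẫậắằẳẵặẹẻẽếềểễệịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹ]/i;
  return vietnameseChars.test(text) ? 'vi' : 'en';
}
```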
Automatic Categorization

```javascript
function detectCategory(text) {
  const keywords = {
    'returns': ['return', 'refund', 'exchange', 'money back'],
    'shipping': ['delivery', 'shipping', 'tracking', 'arrive'],
    'pricing': ['price', 'cost', 'discount', 'promotion'],
    'technical': ['install', 'setup', 'error', 'troubleshoot']
  };

  const textLower = text.toLowerCase();

  for (const [category, words] of Object.entries(keywords)) {
    if (words.some(word => textLower.includes(word))) {
      return category;
    }
  }

  return 'general';
}
```

Embedding
OpenAI Embeddings
```javascript
// n8n Embeddings node
// Model: text-embedding-3-small (1536 dimensions)
// or:    text-embedding-3-large (3072 dimensions)

// Batch processing
const BATCH_SIZE = 100;
const chunks = $input.all();
const results = [];

for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
  const batch = chunks.slice(i, i + BATCH_SIZE);
  const embeddings = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: batch.map(c => c.json.text)
  });

  batch.forEach((chunk, idx) => {
    results.push({
      json: {
        ...chunk.json,
        embedding: embeddings.data[idx].embedding
      }
    });
  });
}

return results;
```

Cost Estimation
```javascript
// text-embedding-3-small: $0.00002 per 1K tokens
// Estimate: ~1 token per 4 characters

const totalChars = chunks.reduce((sum, c) => sum + c.json.text.length, 0);
const estimatedTokens = totalChars / 4;
const estimatedCost = (estimatedTokens / 1000) * 0.00002;

console.log(`Estimated embedding cost: $${estimatedCost.toFixed(4)}`);
```
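For a sense of scale: 1,000 chunks of roughly 800 characters each is about 200K tokens, which comes to around $0.004 with text-embedding-3-small, so embedding cost is rarely the bottleneck for small-to-medium knowledge bases.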
Indexing

Pinecone Upsert
```javascript
// Batch upsert to Pinecone
const vectors = chunks.map(chunk => ({
  id: `doc-${chunk.json.sourceId}-chunk-${chunk.json.chunkIndex}`,
  values: chunk.json.embedding,
  metadata: chunk.json.metadata
}));

// Upsert in batches of 100
const BATCH_SIZE = 100;
for (let i = 0; i < vectors.length; i += BATCH_SIZE) {
  await pinecone.index('my-index').upsert(
    vectors.slice(i, i + BATCH_SIZE)
  );
}
```

Supabase Insert
```javascript
// Insert with an RPC function
const { error } = await supabase.rpc('insert_document_chunks', {
  chunks: chunks.map(c => ({
    content: c.json.text,
    embedding: c.json.embedding,
    metadata: c.json.metadata
  }))
});
```
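If no RPC function is set up, a plain insert into a pgvector-backed table also works. The `documents` table and its column names in this sketch are assumptions, not a fixed schema:

```javascript
// Assumes a `documents` table with columns: content (text), embedding (vector), metadata (jsonb)
const { error: insertError } = await supabase
  .from('documents')
  .insert(chunks.map(c => ({
    content: c.json.text,
    embedding: c.json.embedding,
    metadata: c.json.metadata
  })));

if (insertError) {
  throw new Error(`Supabase insert failed: ${insertError.message}`);
}
```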
Complete Indexing Workflow

```text
File Trigger (S3/Google Drive)
  ↓
Download File
  ↓
IF: File Type
  ├─ PDF  → PDF Loader
  ├─ DOCX → Word Loader
  └─ TXT  → Text Loader
  ↓
Clean Text (Code node)
  ↓
Text Splitter
  ↓
Enrich Metadata
  ↓
Batch Embeddings
  ↓
Upsert to Vector Store
  ↓
Log to Database (tracking)
  ↓
Notify (Slack/Email)
```

Best Practices
Document Processing Tips
- Clean before chunking - Remove headers, footers, noise (see the cleaning sketch after this list)
- Preserve structure - Keep headings, lists intact
- Right chunk size - 500-1000 chars usually good
- Overlap chunks - 10-20% overlap
- Add context - Source, date, category
- Test retrieval - Verify chunks are retrievable
- Version documents - Track updates
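A minimal cleaning pass, run before the Text Splitter, could look like this. The patterns are illustrative and should be tuned to your document sources:

```javascript
// Strip common noise before chunking; adjust patterns to your documents
function cleanText(text) {
  return text
    .replace(/^Page \d+( of \d+)?$/gim, '')  // page-number footers
    .replace(/\r\n/g, '\n')                  // normalize line endings
    .replace(/[ \t]+/g, ' ')                 // collapse spaces and tabs
    .replace(/\n{3,}/g, '\n\n')              // collapse runs of blank lines
    .trim();
}
```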
Hands-on Exercise
Build Document Indexing Pipeline:
- Create a workflow to:
  - Watch a folder for new files
  - Load PDF/TXT/web pages
  - Clean and chunk
  - Add metadata
  - Embed and index
- Index 10+ documents
- Test retrieval quality

Target: Automated indexing with proper metadata
Up Next
Next lesson: RAG Query Pipeline - building the query side.
