🧬 Embedding Documents

🎯 Learning Objectives

After this lesson, you will:

✅ Understand what embeddings are and how they work

✅ Know how to configure the Embeddings node in n8n

✅ Build a full indexing pipeline, from loading to indexing

✅ Implement batch indexing with rate limiting

✅ Set up a re-indexing strategy for document updates

Embeddings turn text into vectors, which lets machines measure semantic similarity. This lesson covers the embedding workflow in n8n.

🔍 What Are Embeddings?

"Cat" and "Dog" have vectors that lie close together (both are animals). "Car" has a vector far away from both.

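To make "close together" concrete, here is a minimal sketch of cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors are made-up toy values chosen for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1;
// values near 1 mean the vectors point in nearly the same direction.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors (hypothetical values, not real model output).
const cat = [0.9, 0.8, 0.1];
const dog = [0.8, 0.9, 0.2];
const car = [0.1, 0.2, 0.9];

const catDog = cosineSimilarity(cat, dog);
const catCar = cosineSimilarity(cat, car);
console.log(catDog > catCar); // true
```

With these toy values, the cat/dog similarity is close to 1 while the cat/car similarity is much lower, which is the pattern a real embedding model is expected to produce for related and unrelated concepts.
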
Checkpoint

What do embeddings turn text into? Why do 'Cat' and 'Dog' have vectors that are close together?

🔄 Embedding Workflow in n8n

Embeddings Node Configuration

JavaScript

// n8n Embeddings Node
// Provider: OpenAI
// Model: text-embedding-3-small
// Dimensions: 1536 (default)

// Or for higher quality:
// Model: text-embedding-3-large
// Dimensions: 3072

// Cost comparison (per 1M tokens):
// text-embedding-3-small: $0.02
// text-embedding-3-large: $0.13
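
As a rough illustration of the cost comparison, the sketch below estimates the embedding cost of a corpus from its token count. The prices are the per-1M-token figures listed above; the corpus size is a made-up example, and `estimateCost` is a hypothetical helper, not part of n8n or the OpenAI API.

```javascript
// Price per 1M tokens in USD, taken from the comparison above.
const PRICE_PER_MILLION = {
  'text-embedding-3-small': 0.02,
  'text-embedding-3-large': 0.13,
};

// Hypothetical helper: cost = tokens / 1,000,000 * price per 1M tokens.
function estimateCost(model, totalTokens) {
  const price = PRICE_PER_MILLION[model];
  if (price === undefined) {
    throw new Error(`Unknown model: ${model}`);
  }
  return (totalTokens / 1_000_000) * price;
}

// Example: a corpus of 5 million tokens (made-up figure).
const corpusTokens = 5_000_000;
for (const model of Object.keys(PRICE_PER_MILLION)) {
  console.log(model, estimateCost(model, corpusTokens).toFixed(2));
}
```

For this example corpus the small model comes to about $0.10 and the large model to about $0.65, so the large model costs 6.5 times as much for the same text.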

Checkpoint

Describe the steps in the embedding workflow. Compare the cost and quality of the two OpenAI embedding models.

🏗️ Full Indexing Pipeline

Step 1: Load Documents

JavaScript

// HTTP Request or Google Drive node
// Supports: PDF, TXT, DOCX, CSV, Markdown, HTML

// For PDF: Use n8n PDF loader
// For web pages: HTTP Request + Extract HTML
// For Google Docs: Google Drive node

Step 2: Clean Text

JavaScript

// Code node: Clean and normalize text
// (assumes the "Run Once for Each Item" mode, where $json is the current item)
function cleanText(text) {
  return text
    .replace(/\u0000/g, '')         // Remove null chars
    .replace(/\r\n?/g, '\n')        // Normalize line endings to \n
    .replace(/[^\S\n]+/g, ' ')      // Collapse spaces and tabs, keep newlines
    .replace(/ ?\n ?/g, '\n')       // Drop spaces around newlines
    .replace(/\n{3,}/g, '\n\n')     // Multiple newlines to double
    .trim();
}

return { json: { text: cleanText($json.text), source: $json.source } };

Step 3: Split and Embed

JavaScript

// Text Splitter node → Embeddings node → Vector Store node
// These are connected as sub-nodes in the Vector Store Insert operation

// Vector Store Insert node configuration:
// Mode: Insert
// Embedding: OpenAI Embeddings (sub-node)
// Text Splitter: Recursive Character (sub-node)
//   - Chunk Size: 800
//   - Chunk Overlap: 150
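
To show what chunk size and chunk overlap mean, here is a simplified sketch of fixed-size splitting with overlap. It is not the Recursive Character splitter itself, which also tries to break on separators such as paragraphs and sentences; `splitWithOverlap` is a hypothetical helper for illustration only.

```javascript
// Simplified fixed-size splitter (assumption: character-based, with no
// separator awareness). Each chunk starts (chunkSize - overlap) characters
// after the previous one, so neighbouring chunks share `overlap` characters.
function splitWithOverlap(text, chunkSize = 800, overlap = 150) {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const step = chunkSize - overlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached
  }
  return chunks;
}

const sample = 'x'.repeat(2000);
const chunks = splitWithOverlap(sample);
console.log(chunks.map((c) => c.length)); // [ 800, 800, 700 ]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.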

Checkpoint

List the steps in the full indexing pipeline, from Google Drive to Report Stats.

📦 Batch Indexing

JavaScript

// Code node: Batch processing for large document sets
const documents = $input.all();
const batchSize = 20;
const batches = [];

for (let i = 0; i < documents.length; i += batchSize) {
  batches.push(documents.slice(i, i + batchSize));
}

// Process each batch with a wait between them
// to avoid rate limits
return batches.map((batch, i) => ({
  json: {
    batchIndex: i,
    documents: batch.map(d => d.json),
    totalBatches: batches.length
  }
}));
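
The code above only builds the batches; the pause between them still has to happen somewhere. In n8n this is typically done with the Loop Over Items and Wait nodes. The sketch below shows the same idea in plain JavaScript. `processBatch` is a placeholder for whatever performs the embedding call, and the delay value is an assumption to tune against your provider's rate limits.

```javascript
// Pause helper built on setTimeout.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Process batches one at a time, waiting `delayMs` between batches.
// `processBatch` is a hypothetical async callback supplied by the caller.
async function processWithRateLimit(batches, processBatch, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < batches.length; i++) {
    results.push(await processBatch(batches[i], i));
    if (i < batches.length - 1) {
      await sleep(delayMs); // no wait needed after the final batch
    }
  }
  return results;
}

// Usage with a stand-in processor that only counts documents.
const demoBatches = [['a', 'b'], ['c', 'd'], ['e']];
processWithRateLimit(demoBatches, async (batch) => batch.length, 50)
  .then((counts) => console.log(counts)); // [ 2, 2, 1 ]
```

Processing sequentially keeps the request rate predictable; a smaller batch size or a longer delay trades indexing speed for a lower chance of hitting rate limits.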

Checkpoint

Why is batch processing needed when indexing? What batch size should you use?

✅ Verifying Indexed Data

JavaScript

// After indexing, verify with test queries
const testQueries = [
  "What is our return policy?",
  "Company mission statement",
  "How to contact support"
];

// Vector Store Search node
// Query each test query
// Verify relevant chunks are returned
// Check similarity scores > 0.7
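
The score check can be automated once search results are available. The sketch below assumes each result is an object with a numeric `score` field, which is an assumption about the output shape; adjust the field name to match what your vector store actually returns. The 0.7 threshold comes from the checklist above, and the sample results are made up.

```javascript
// Hypothetical helper: passes if the best-scoring result for a query
// exceeds the threshold. Assumes results look like { text, score }.
function verifyQuery(query, results, threshold = 0.7) {
  const topScore = results.reduce((max, r) => Math.max(max, r.score), -Infinity);
  return { query, topScore, passed: topScore > threshold };
}

// Made-up search results for illustration only.
const sampleResults = {
  'What is our return policy?': [
    { text: 'Returns are accepted within 30 days...', score: 0.84 },
    { text: 'Shipping takes 3-5 days...', score: 0.41 },
  ],
  'How to contact support': [
    { text: 'Our office is closed on holidays...', score: 0.52 },
  ],
};

const report = Object.entries(sampleResults).map(
  ([query, results]) => verifyQuery(query, results)
);
// List the queries whose best match fell below the threshold.
console.log(report.filter((r) => !r.passed).map((r) => r.query));
```

A failing query is a prompt to inspect the chunks by hand: a score check alone cannot tell whether the returned text actually answers the question.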

Checkpoint

How do you verify that data was indexed correctly? What similarity score threshold should you use?

🔄 Re-indexing Strategy

JavaScript

// Code node: Track document versions
const docTracker = {
  documentId: $json.fileId,
  lastModified: $json.modifiedTime,
  chunkCount: $json.chunks.length,
  version: ($json.currentVersion || 0) + 1,
  indexedAt: new Date().toISOString()
};

// Save to Google Sheets for tracking
return { json: docTracker };
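
A tracker record like this makes it possible to skip files that have not changed. The sketch below compares a file's current `modifiedTime` with the `lastModified` value stored at the last indexing run. `needsReindex` is a hypothetical helper and the IDs and timestamps are made up; the delete-then-insert flow is described only in comments because the exact operations depend on the vector store.

```javascript
// Hypothetical helper: decide whether a file must be re-indexed.
// `tracked` is the stored docTracker record, or undefined for new files.
function needsReindex(file, tracked) {
  if (!tracked) return true; // never indexed before
  return new Date(file.modifiedTime) > new Date(tracked.lastModified);
}

const tracked = {
  documentId: 'file-123', // made-up ID
  lastModified: '2024-01-10T08:00:00.000Z',
  version: 2,
};

const unchanged = { fileId: 'file-123', modifiedTime: '2024-01-10T08:00:00.000Z' };
const updated = { fileId: 'file-123', modifiedTime: '2024-02-01T12:30:00.000Z' };

console.log(needsReindex(unchanged, tracked)); // false
console.log(needsReindex(updated, tracked));   // true
// When true: remove the document's old chunks, index the new ones,
// then save a tracker record with version + 1.
```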

Embedding Best Practices

Consistent model: Use the same embedding model for indexing and querying
Batch processing: Index documents in batches, not one by one
Metadata: Always attach source info, date, and type
Version tracking: Track versions when re-indexing
Test after indexing: Verify with known queries

Checkpoint

Describe the steps in the re-indexing strategy when a document changes.

📝 Practice Exercises

Exercises

Build pipeline: Load 10 documents, clean, split, embed, index
Test retrieval accuracy with 5 test queries
Implement batch indexing with rate limiting
Create re-indexing workflow triggered by file changes

Checkpoint

List the 4 exercises to complete. Which exercise is the most important?

🚀 Next Lesson

Query Pipeline: build a complete query pipeline for RAG.
