Theory
30 minutes
Lesson 3/5

Document Processing & Chunking

Learn how to process documents and choose chunking strategies for RAG

📄 Document Processing & Chunking

Document processing is a key step in the RAG pipeline. How you chunk documents directly affects retrieval quality.

RAG Pipeline Overview

Diagram
graph LR
    D[Documents] --> L[Load]
    L --> C[Chunk]
    C --> E[Embed]
    E --> S[Store]
    S --> R[Retrieve]
    R --> G[Generate]

Document Loaders

LangChain provides loaders for many document types:

PDF Files

Python
from langchain_community.document_loaders import PyPDFLoader

# Load a single PDF
loader = PyPDFLoader("report.pdf")
pages = loader.load()

print(f"Loaded {len(pages)} pages")
print(pages[0].page_content[:500])
print(pages[0].metadata)  # {'source': 'report.pdf', 'page': 0}

Word Documents

Python
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("document.docx")
docs = loader.load()

Web Pages

Python
from langchain_community.document_loaders import WebBaseLoader

# Single URL
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()

# Multiple URLs
loader = WebBaseLoader([
    "https://example.com/page1",
    "https://example.com/page2"
])
docs = loader.load()

CSV/Excel

Python
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    "data.csv",
    csv_args={
        'delimiter': ',',
        'quotechar': '"'
    }
)
docs = loader.load()

Directory of Files

Python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    "./documents/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader
)
docs = loader.load()

Chunking Strategies

Why is Chunking Needed?

Chunking Importance
  • LLMs have limited context windows (4K-128K tokens)
  • Smaller chunks = more precise retrieval
  • Larger chunks = more context
  • Balance is key - see the sketch below!

1. Fixed-Size Chunking

The simplest approach - split by a fixed number of characters:

Python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,     # characters per chunk
    chunk_overlap=200,   # overlap between chunks
    length_function=len
)

chunks = text_splitter.split_documents(docs)

2. Recursive Character Splitting

Smarter - splits along a hierarchy of separators:

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]  # tried in order
)

chunks = text_splitter.split_documents(docs)

How it works (illustrated below):

  1. Try splitting on \n\n (paragraphs)
  2. If a chunk is still too large, split on \n (lines)
  3. Then fall back to spaces
  4. Finally, split character by character
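A small illustration of that fallback behavior, using a hypothetical two-paragraph string and a deliberately tiny chunk size so the splits are visible:

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "Paragraph one talks about loading documents.\n"
    "It has two short lines.\n\n"
    "Paragraph two talks about splitting them into chunks."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=60,      # tiny on purpose
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

for i, chunk in enumerate(splitter.split_text(text)):
    print(f"Chunk {i}: {chunk!r}")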

3. Semantic Chunking

Splits based on semantic meaning:

Python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

text_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

chunks = text_splitter.split_documents(docs)

4. Code Splitting

For source code:

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100
)

# Tries to keep functions/classes intact
# (code_docs: Documents loaded from your source files)
python_chunks = python_splitter.split_documents(code_docs)

5. Markdown Splitting

For markdown documents:

Python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split
)

# markdown_text: a markdown string loaded earlier
chunks = markdown_splitter.split_text(markdown_text)

# Metadata will contain the headers, e.g.
# {'Header 1': 'Introduction', 'Header 2': 'Setup'}
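Header-based sections can still exceed your target chunk size. A common follow-up, sketched here with assumed sizes, is to run the header chunks through a size-based splitter; the header metadata is carried along:

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

size_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_chunks = size_splitter.split_documents(chunks)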

Chunk Size Guidelines

Use Case            | Chunk Size | Overlap
Q&A over documents  | 500-1000   | 100-200
Summarization       | 2000-4000  | 200-400
Code analysis       | 1000-2000  | 100-200
Chat with context   | 500-1000   | 100-200

Trade-offs
  • Small chunks: precise retrieval, but less context
  • Large chunks: rich context, but may pull in irrelevant info
  • Optimal: test and iterate - a starter helper is sketched below!
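As a starting point only (the numbers above are rough guidelines, and these presets are hypothetical), a small helper that maps a use case to a splitter:

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Hypothetical presets based on the table above; tune against your own data
PRESETS = {
    "qa":            {"chunk_size": 800,  "chunk_overlap": 150},
    "summarization": {"chunk_size": 3000, "chunk_overlap": 300},
    "code":          {"chunk_size": 1500, "chunk_overlap": 150},
    "chat":          {"chunk_size": 800,  "chunk_overlap": 150},
}

def make_splitter(use_case: str) -> RecursiveCharacterTextSplitter:
    return RecursiveCharacterTextSplitter(**PRESETS[use_case])

chunks = make_splitter("qa").split_documents(docs)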

Adding Metadata

Metadata makes filtering and tracking possible:

Python
from datetime import datetime

from langchain.schema import Document

# Manual metadata
doc = Document(
    page_content="Content here...",
    metadata={
        "source": "report.pdf",
        "page": 1,
        "chapter": "Introduction",
        "author": "John Doe",
        "date": "2024-01-01"
    }
)

# Automatic metadata enrichment
def enrich_metadata(docs):
    for doc in docs:
        # Add word count
        doc.metadata["word_count"] = len(doc.page_content.split())

        # Add timestamp
        doc.metadata["processed_at"] = datetime.now().isoformat()

        # Extract title from content
        if doc.page_content.startswith("#"):
            doc.metadata["title"] = doc.page_content.split("\n")[0].strip("# ")

    return docs
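Once chunks carry metadata, most vector stores can filter on it at query time. A sketch with Chroma (assuming a vectorstore built from these documents; the filter key is one of the fields above):

Python
# Only search chunks that came from a specific source document
results = vectorstore.similarity_search(
    "What does the introduction cover?",
    k=4,
    filter={"source": "report.pdf"}
)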

Complete Pipeline

Python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load documents
loader = DirectoryLoader(
    "./docs/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True  # track position in the original doc
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# 3. Create embeddings and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

print("Documents indexed successfully!")

Evaluation

Evaluating a chunking strategy:

Python
def evaluate_chunking(chunks, test_queries):
    """Evaluate chunk quality by inspecting what gets retrieved."""
    for query in test_queries:
        # Retrieve relevant chunks
        relevant = vectorstore.similarity_search(query, k=5)

        # Manual evaluation
        print(f"\nQuery: {query}")
        for i, chunk in enumerate(relevant):
            print(f"\n--- Chunk {i+1} ---")
            print(chunk.page_content[:200])

        # Metrics to track:
        # - Relevance: do the chunks answer the query?
        # - Completeness: is the answer complete?
        # - Noise: is there irrelevant content?
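One possible way to drive this function, with made-up test queries:

Python
test_queries = [
    "What is the main conclusion of the report?",
    "Which methodology was used?",
]

evaluate_chunking(chunks, test_queries)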

Practice Exercise

Hands-on Exercise

Try out the chunking strategies:

  1. Load one PDF document
  2. Apply 3 strategies: Fixed, Recursive, Semantic
  3. Compare the number of chunks and their content
  4. Test retrieval with sample queries
  5. Pick the best strategy
Python
# Starter code
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter
)

# Load document
loader = PyPDFLoader("your_document.pdf")
docs = loader.load()

# TODO: Try different splitters
# TODO: Compare results

Up Next

In the next lesson, we will look at Query Enhancement - techniques for improving retrieval quality.

