📄 Document Processing & Chunking
Document processing is a crucial step in the RAG pipeline. How you chunk documents directly affects retrieval quality.
RAG Pipeline Overview
```mermaid
graph LR
    D[Documents] --> L[Load]
    L --> C[Chunk]
    C --> E[Embed]
    E --> S[Store]
    S --> R[Retrieve]
    R --> G[Generate]
```

Document Loaders
LangChain supports many document types:
PDF Files
```python
from langchain_community.document_loaders import PyPDFLoader

# Load single PDF
loader = PyPDFLoader("report.pdf")
pages = loader.load()

print(f"Loaded {len(pages)} pages")
print(pages[0].page_content[:500])
print(pages[0].metadata)  # {'source': 'report.pdf', 'page': 0}
```

Word Documents
```python
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("document.docx")
docs = loader.load()
```

Web Pages
```python
from langchain_community.document_loaders import WebBaseLoader

# Single URL
loader = WebBaseLoader("https://example.com/article")
docs = loader.load()

# Multiple URLs
loader = WebBaseLoader([
    "https://example.com/page1",
    "https://example.com/page2"
])
docs = loader.load()
```

CSV/Excel
```python
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    "data.csv",
    csv_args={
        'delimiter': ',',
        'quotechar': '"'
    }
)
docs = loader.load()
```

Directory of Files
```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    "./documents/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader
)
docs = loader.load()
```

Chunking Strategies
Why Chunking?
Chunking Importance
- LLMs have a context limit (4K-128K tokens)
- Smaller chunks = more precise retrieval
- Larger chunks = more context
- Balance is key!
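To make these limits concrete, here is a rough back-of-the-envelope sketch. It assumes ~4 characters per token and an 8K-token context window; both numbers are illustrative, not exact.

```python
# Back-of-the-envelope: why a whole document can't just be pasted into the prompt
CONTEXT_LIMIT_TOKENS = 8_000   # assumed context window
CHARS_PER_TOKEN = 4            # crude heuristic, not a real tokenizer

document_chars = 600_000       # e.g. a long PDF
estimated_tokens = document_chars // CHARS_PER_TOKEN
print(f"Whole document: ~{estimated_tokens:,} tokens vs. a {CONTEXT_LIMIT_TOKENS:,}-token limit")

chunk_size_chars = 1_000
print(f"Split into ~{document_chars // chunk_size_chars} chunks of "
      f"~{chunk_size_chars // CHARS_PER_TOKEN} tokens, only the most relevant few are retrieved")
```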
1. Fixed-Size Chunking
The simplest approach: split by a fixed number of characters:
```python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,    # characters per chunk
    chunk_overlap=200,  # overlap between chunks
    length_function=len
)

chunks = text_splitter.split_documents(docs)
```

2. Recursive Character Splitting
Smarter: splits according to a hierarchy of separators:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]  # tried in order
)

chunks = text_splitter.split_documents(docs)
```

How it works:
- First, try splitting on `\n\n` (paragraphs)
- If a chunk is still too large, split on `\n` (lines)
- Then continue with `" "` (spaces)
- Finally, split character by character
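To see the hierarchy in practice, you can split a small synthetic text and print the pieces; paragraph boundaries win whenever a paragraph already fits within `chunk_size`. The sample text and sizes below are made up purely for illustration.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = (
    "First paragraph about loaders. It has two sentences.\n\n"
    "Second paragraph about splitters. Also two sentences.\n\n"
    "Third paragraph about embeddings, a bit longer than the others."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=80,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

# Each paragraph fits within chunk_size, so the splitter never falls back to spaces
for i, chunk in enumerate(splitter.split_text(sample_text), 1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk!r}")
```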
3. Semantic Chunking
Splits by semantic meaning:
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

text_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

chunks = text_splitter.split_documents(docs)
```

4. Code Splitting
For source code:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100
)

# Keeps functions/classes intact where possible
python_chunks = python_splitter.split_documents(code_docs)
```

5. Markdown Splitting
For Markdown documents:
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split
)

chunks = markdown_splitter.split_text(markdown_text)

# Metadata will contain the headers
# {'Header 1': 'Introduction', 'Header 2': 'Setup'}
```

Chunk Size Guidelines
| Use Case | Chunk Size (chars) | Overlap (chars) |
|---|---|---|
| Q&A over documents | 500-1000 | 100-200 |
| Summarization | 2000-4000 | 200-400 |
| Code analysis | 1000-2000 | 100-200 |
| Chat with context | 500-1000 | 100-200 |
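One way to put the table to work is a small preset lookup when wiring up a pipeline. This is just a sketch: `CHUNK_PRESETS` and `make_splitter` are made-up names, and the values are mid-range picks from the table above.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Mid-range values from the table above (character counts)
CHUNK_PRESETS = {
    "qa":            {"chunk_size": 800,  "chunk_overlap": 150},
    "summarization": {"chunk_size": 3000, "chunk_overlap": 300},
    "code":          {"chunk_size": 1500, "chunk_overlap": 150},
    "chat":          {"chunk_size": 800,  "chunk_overlap": 150},
}

def make_splitter(use_case: str) -> RecursiveCharacterTextSplitter:
    """Build a splitter configured for one of the use cases above."""
    return RecursiveCharacterTextSplitter(**CHUNK_PRESETS[use_case])

splitter = make_splitter("qa")
```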
Trade-offs
- Small chunks: precise retrieval, but lacking context
- Large chunks: rich context, but may retrieve irrelevant info
- Optimal: test and iterate!
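"Test and iterate" can be as simple as indexing the same documents at two chunk sizes and comparing what each returns for a few representative queries. A minimal sketch, assuming `docs` is already loaded, an OpenAI API key is set, and Chroma is installed; the query and sizes are placeholders.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def preview_retrieval(chunk_size: int, query: str, k: int = 3):
    """Split docs at the given size, index in-memory, and print the top-k hits."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_size // 5
    )
    chunks = splitter.split_documents(docs)
    store = Chroma.from_documents(chunks, embeddings)  # in-memory, not persisted
    print(f"\nchunk_size={chunk_size}: {len(chunks)} chunks")
    for hit in store.similarity_search(query, k=k):
        print("-", hit.page_content[:150].replace("\n", " "))

for size in (500, 2000):
    preview_retrieval(size, "What are the key findings?")
```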
Adding Metadata
Metadata helps with filtering and tracking:
```python
from datetime import datetime

from langchain.schema import Document

# Manual metadata
doc = Document(
    page_content="Content here...",
    metadata={
        "source": "report.pdf",
        "page": 1,
        "chapter": "Introduction",
        "author": "John Doe",
        "date": "2024-01-01"
    }
)

# Automatic metadata enrichment
def enrich_metadata(docs):
    for doc in docs:
        # Add word count
        doc.metadata["word_count"] = len(doc.page_content.split())

        # Add timestamp
        doc.metadata["processed_at"] = datetime.now().isoformat()

        # Extract title from content
        if doc.page_content.startswith("#"):
            doc.metadata["title"] = doc.page_content.split("\n")[0].strip("# ")

    return docs
```

Complete Pipeline
```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load documents
loader = DirectoryLoader(
    "./docs/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True  # Track position in original doc
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# 3. Create embeddings and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

print("Documents indexed successfully!")
```

Evaluation
Evaluating a chunking strategy:
```python
def evaluate_chunking(chunks, test_queries):
    """Evaluate chunk quality against a set of test queries."""
    results = []

    for query in test_queries:
        # Retrieve relevant chunks
        relevant = vectorstore.similarity_search(query, k=5)

        # Manual evaluation
        print(f"\nQuery: {query}")
        for i, chunk in enumerate(relevant):
            print(f"\n--- Chunk {i+1} ---")
            print(chunk.page_content[:200])

    # Metrics to track:
    # - Relevance: do the chunks answer the query?
    # - Completeness: is the answer complete?
    # - Noise: is there irrelevant content?
```

Practice Exercise
Hands-on Exercise
Try out the chunking strategies:
- Load 1 PDF document
- Apply 3 strategies: Fixed, Recursive, Semantic
- Compare the number of chunks and their content
- Test retrieval with sample queries
- Pick the best strategy
```python
# Starter code
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter
)

# Load document
loader = PyPDFLoader("your_document.pdf")
docs = loader.load()

# TODO: Try different splitters
# TODO: Compare results
```

What's Next
In the next lesson, we will look at Query Enhancement - techniques to improve retrieval quality.
