Document Loaders & Formats

🎯 Learning Objectives

A RAG system has to "read" documents before it can search them. This lesson covers how to load a range of file formats and prepare their content for indexing.

After this lesson, you will be able to:

✅ Load PDF, Word, CSV, and web pages
✅ Handle multiple formats with LangChain
✅ Extract tables, images, and structured data
✅ Build a robust document pipeline

📝 LangChain Document Loaders

PDF Loader

python.py

# pip install pypdf
from langchain_community.document_loaders import PyPDFLoader

# Single PDF: each page becomes one Document
loader = PyPDFLoader("data/company_report.pdf")
pages = loader.load()

print(f"Total pages: {len(pages)}")
print(f"Page 1 content: {pages[0].page_content[:200]}")
print(f"Metadata: {pages[0].metadata}")
# e.g. {'source': 'data/company_report.pdf', 'page': 0}  (page index is zero-based)
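Because the `page` value in the metadata starts at 0, it usually needs a `+ 1` before it is shown to a reader. The sketch below illustrates this with a plain dict in place of real loader metadata; `format_citation` is a hypothetical helper, not part of LangChain.

```python
def format_citation(metadata):
    """Build a human-readable citation from loader metadata (hypothetical helper)."""
    source = metadata.get("source", "unknown source")
    page_index = metadata.get("page")
    if page_index is None:
        # Loaders that do not split by page may not set a "page" key
        return source
    # Convert the zero-based index to a one-based page number
    return f"{source}, p. {page_index + 1}"


# Sample metadata shaped like the PyPDFLoader output shown above
sample_metadata = {"source": "data/company_report.pdf", "page": 0}
print(format_citation(sample_metadata))
```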

Advanced PDF (with tables)

python.py

# pip install "unstructured[pdf]"
from langchain_community.document_loaders import UnstructuredPDFLoader

# Better for complex PDFs with tables, images
loader = UnstructuredPDFLoader(
    "data/financial_report.pdf",
    mode="elements",    # Split into elements (paragraphs, tables, etc.)
    strategy="hi_res",  # High-resolution parsing
)
elements = loader.load()

# Inspect the element type and a content preview for the first few elements
for elem in elements[:5]:
    print(f"Type: {elem.metadata.get('category', 'unknown')}")
    print(f"Content: {elem.page_content[:100]}")
    print()

Word Documents

python.py

# pip install docx2txt
from langchain_community.document_loaders import Docx2txtLoader

# The whole .docx file is returned as a single Document
loader = Docx2txtLoader("data/policy_document.docx")
docs = loader.load()
print(f"Content length: {len(docs[0].page_content)} chars")

CSV / Excel

python.py

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    "data/products.csv",
    csv_args={"delimiter": ","},
    source_column="product_name",  # Use as source in metadata
)
docs = loader.load()

# Each row becomes a document
for doc in docs[:3]:
    print(doc.page_content[:150])
    print(doc.metadata)
    print()
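To make "each row becomes a document" concrete, here is a standard-library sketch that approximates the mapping: one `column: value` line per field, plus `source` and `row` metadata. It is an illustration only, not CSVLoader's actual implementation; `rows_to_documents` and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical sample data standing in for data/products.csv
SAMPLE_CSV = """product_name,price,category
Laptop A,1200,electronics
Desk B,300,furniture
"""


def rows_to_documents(csv_text, source_column=None, default_source="products.csv"):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # One "column: value" line per field
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # Use the chosen column as the source, otherwise fall back to the file name
        source = row[source_column] if source_column else default_source
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


docs = rows_to_documents(SAMPLE_CSV, source_column="product_name")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Keeping the column names inside the text gives the embedding model some context for otherwise bare values such as `1200`.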

Web Pages

python.py

# pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader

# Single page
loader = WebBaseLoader("https://docs.python.org/3/tutorial/")
docs = loader.load()

# Multiple pages: one Document per URL
loader = WebBaseLoader([
    "https://example.com/page1",
    "https://example.com/page2",
])
docs = loader.load()
print(f"Loaded {len(docs)} web pages")

Recursive Web Crawl

python.py

# pip install beautifulsoup4
from bs4 import BeautifulSoup
from langchain_community.document_loaders import RecursiveUrlLoader

def bs4_extractor(html: str) -> str:
    """Convert raw HTML into plain text, one line per text node."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)

loader = RecursiveUrlLoader(
    url="https://docs.example.com/",
    max_depth=2,              # Follow links up to two levels deep
    extractor=bs4_extractor,  # Applied to the HTML of every crawled page
)
docs = loader.load()
print(f"Crawled {len(docs)} pages")
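The extractor is just a function from an HTML string to a text string, so it can be swapped for anything with that shape. As a sketch under that assumption, here is a variant built on the standard library's `html.parser` instead of BeautifulSoup. It also skips `<script>` and `<style>` contents; `stdlib_extractor` and `_TextExtractor` are hypothetical names.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside skipped tags
        if text and self._skip_depth == 0:
            self.parts.append(text)


def stdlib_extractor(html: str) -> str:
    """Same input/output shape as bs4_extractor: HTML in, plain text out."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample_html = (
    "<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(stdlib_extractor(sample_html))
```

Dropping script and style text before indexing keeps JavaScript and CSS out of the search results.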

Checkpoint

Do you now know how to load PDFs, Word documents, CSV files, and web pages with LangChain?

📝 Directory Loading

Load Entire Folder

python.py

from langchain_community.document_loaders import (
    DirectoryLoader,
    PyPDFLoader,
    TextLoader,
)

# Load all PDFs from a directory (the ** glob also matches subfolders)
loader = DirectoryLoader(
    "data/knowledge_base/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,  # Needs tqdm installed
)
docs = loader.load()
print(f"Loaded {len(docs)} pages from all PDFs")

# Load multiple formats: one DirectoryLoader per file type
loader_txt = DirectoryLoader(
    "data/knowledge_base/",
    glob="**/*.txt",
    loader_cls=TextLoader,
)
docs += loader_txt.load()
print(f"Loaded {len(docs)} documents in total")
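To check which files a glob pattern would pick up before running the loaders, the pattern can be tried with `pathlib` alone. This sketch assumes the loader follows pathlib-style glob semantics; `find_files` and the file names are hypothetical and created in a temporary folder.

```python
import tempfile
from pathlib import Path


def find_files(root, pattern):
    """Return files under root matching a glob pattern, as sorted relative paths."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "hr").mkdir()
    # Hypothetical knowledge base: two files at the top level, two in a subfolder
    for name in ["report.pdf", "notes.txt", "hr/policy.pdf", "hr/faq.txt"]:
        (base / name).touch()

    pdf_files = find_files(base, "**/*.pdf")   # recursive match
    top_level_pdfs = find_files(base, "*.pdf")  # top level only
    print(pdf_files)
    print(top_level_pdfs)
```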

Custom Document Loader

python.py

# pip install requests
import requests
from langchain_core.documents import Document

class CustomAPILoader:
    """Load documents from an internal API."""

    def __init__(self, api_url, api_key):
        self.api_url = api_url
        self.api_key = api_key

    def load(self):
        headers = {"Authorization": f"Bearer {self.api_key}"}
        response = requests.get(self.api_url, headers=headers, timeout=30)
        response.raise_for_status()  # Fail early on HTTP errors
        articles = response.json()["articles"]

        # Map each article to a Document, keeping useful fields as metadata
        documents = []
        for article in articles:
            doc = Document(
                page_content=article["content"],
                metadata={
                    "source": article["url"],
                    "title": article["title"],
                    "author": article["author"],
                    "published_at": article["date"],
                },
            )
            documents.append(doc)

        return documents

# Usage
loader = CustomAPILoader("https://api.company.com/articles", "key123")
docs = loader.load()

Checkpoint

Do you now know how to load an entire folder and write a custom document loader?

💻 Document Processing Pipeline

Complete Pipeline

python.py

from pathlib import Path

from langchain_community.document_loaders import (
    CSVLoader,
    Docx2txtLoader,
    PyPDFLoader,
    TextLoader,
)

class DocumentProcessor:
    """Load and process multiple document formats."""

    # Map each supported file extension to its loader class
    LOADERS = {
        ".pdf": PyPDFLoader,
        ".txt": TextLoader,
        ".docx": Docx2txtLoader,
        ".csv": CSVLoader,
    }

    def __init__(self, data_dir):
        self.data_dir = Path(data_dir)

    def load_all(self):
        """Load all supported documents from directory."""
        all_docs = []

        for file_path in self.data_dir.rglob("*"):
            ext = file_path.suffix.lower()
            if ext in self.LOADERS:
                try:
                    loader_cls = self.LOADERS[ext]
                    loader = loader_cls(str(file_path))
                    docs = loader.load()

                    # Add file-level metadata
                    for doc in docs:
                        doc.metadata["file_name"] = file_path.name
                        doc.metadata["file_type"] = ext
                        doc.metadata["file_size"] = file_path.stat().st_size

                    all_docs.extend(docs)
                    print(f"Loaded: {file_path.name} ({len(docs)} documents)")
                except Exception as e:
                    # Keep going: one unreadable file should not stop the whole run
                    print(f"Error loading {file_path.name}: {e}")

        print(f"\nTotal documents loaded: {len(all_docs)}")
        return all_docs

    def clean_documents(self, docs):
        """Clean and normalize document content."""
        cleaned = []
        seen = set()  # Normalized contents already kept, for fast duplicate checks
        for doc in docs:
            content = doc.page_content

            # Remove excessive whitespace
            content = " ".join(content.split())

            # Skip very short documents
            if len(content) < 50:
                continue

            # Skip duplicates
            if content in seen:
                continue
            seen.add(content)

            doc.page_content = content
            cleaned.append(doc)

        print(f"After cleaning: {len(cleaned)} documents (removed {len(docs) - len(cleaned)})")
        return cleaned

# Usage
processor = DocumentProcessor("data/knowledge_base/")
docs = processor.load_all()
docs = processor.clean_documents(docs)
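The cleaning step (collapse whitespace, drop very short texts, skip duplicates) can be tried on its own without LangChain installed. The sketch below uses a small dataclass as a hypothetical stand-in for `Document` and tracks SHA-256 hashes of the normalized text in a set, which is one option for spotting duplicates without holding every full text in memory twice. `SimpleDoc` and `clean_docs` are illustrative names.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def clean_docs(docs, min_length=50):
    """Normalize whitespace, drop short texts, and skip exact duplicates."""
    seen_hashes = set()
    cleaned = []
    for doc in docs:
        # Collapse all runs of whitespace into single spaces
        content = " ".join(doc.page_content.split())
        if len(content) < min_length:
            continue
        # Hash the normalized text so duplicates are detected in constant time
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        cleaned.append(SimpleDoc(content, dict(doc.metadata)))
    return cleaned


sample = [
    SimpleDoc("Annual   leave policy:\n employees receive twelve days of paid leave per year."),
    SimpleDoc("Annual leave policy: employees receive twelve days of paid leave per year."),
    SimpleDoc("Too short."),
]
result = clean_docs(sample)
print(len(result))
```

The first two samples differ only in whitespace, so they count as duplicates once normalized.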

Checkpoint

Have you built a complete document processing pipeline?

📝 Handling Vietnamese Documents

Vietnamese PDF Issues

python.py

# Vietnamese PDFs often have encoding issues
# Use unstructured for better handling
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader(
    "data/quy_dinh_lao_dong.pdf",
    mode="single",      # Entire PDF as one document
    strategy="fast",    # Fast parsing
    languages=["vie"],  # Vietnamese language hint
)
docs = loader.load()

# Verify Vietnamese text: diacritics should display correctly
print(docs[0].page_content[:500])

Text Normalization

python.py

import re
import unicodedata

def normalize_vietnamese(text):
    """Normalize Vietnamese text for better search."""
    # Unicode normalization (NFC form): combine base letters with their diacritics
    text = unicodedata.normalize("NFC", text)

    # Fix common encoding issues: non-breaking space becomes a regular space
    text = text.replace("\xa0", " ")

    # Remove control characters, keeping newlines and tabs
    text = "".join(
        c for c in text
        if c in "\n\t" or not unicodedata.category(c).startswith("C")
    )

    # Collapse runs of spaces and tabs; newlines are kept so paragraphs survive
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()

# Apply to all documents (docs comes from a loader, as in the previous example)
for doc in docs:
    doc.page_content = normalize_vietnamese(doc.page_content)
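Why NFC matters: the same Vietnamese word can be stored either with precomposed letters or as base letters followed by combining marks. The two look identical on screen but compare as different strings, so an exact-match or keyword search can miss one form. A small standard-library check, using the word "Việt" as the sample:

```python
import unicodedata

word = "Việt"

# NFC: each accented letter is a single precomposed code point
composed = unicodedata.normalize("NFC", word)
# NFD: accented letters are split into a base letter plus combining marks
decomposed = unicodedata.normalize("NFD", word)

print(len(composed), len(decomposed))  # code point counts differ
print(composed == decomposed)          # visually identical, but not equal
print(unicodedata.normalize("NFC", decomposed) == composed)  # equal after NFC
```

Applying the same normalization form to both the indexed documents and the incoming queries keeps the two sides comparable.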

Checkpoint

Do you now know how to handle Vietnamese PDFs and normalize Unicode text?
