Visual QA and Document Vision

🎯 Lesson Objectives

Visual Question Answering (VQA) lets you ask questions about an image and get answers back. Document Vision extracts information from scanned documents, receipts, and forms.

After this lesson, you will be able to:

✅ Build an interactive Visual QA system
✅ Extract data from receipts, business cards, and forms
✅ Implement table extraction from images
✅ Process multi-page documents

🔍 Visual QA System

Interactive VQA

python.py

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import base64

# Vision-capable chat model; reads OPENAI_API_KEY from the environment
llm = ChatOpenAI(model="gpt-4o")

class VisualQA:
    """Multi-turn question answering about a single image."""

    def __init__(self):
        self.history = []
        self.current_image = None

    def set_image(self, image_path):
        # Base64-encode the image and start a fresh conversation
        with open(image_path, "rb") as f:
            self.current_image = base64.b64encode(f.read()).decode()
        self.history = []

    def ask(self, question):
        if self.current_image is None:
            raise ValueError("Call set_image() before ask().")

        if not self.history:
            # The first message carries the image. It stays in the history,
            # so follow-up questions can be sent as plain text.
            messages = [HumanMessage(content=[
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{self.current_image}"
                }}
            ])]
        else:
            messages = self.history + [HumanMessage(content=question)]

        response = llm.invoke(messages)

        # Keep both the question and the answer for the next turn
        self.history = messages + [response]
        return response.content

# Usage
vqa = VisualQA()
vqa.set_image("product.png")
print(vqa.ask("What is this product?"))
print(vqa.ask("Is the price reasonable?"))
print(vqa.ask("What is its most notable feature?"))
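
The example above hard-codes `image/png` in the data URL, which is only accurate for PNG files. A minimal sketch of a helper that picks the MIME type from the file extension is shown below. The name `image_to_data_url` is an assumption for illustration and is not part of LangChain.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path):
    """Return a base64 data URL whose MIME type matches the file extension.

    Hypothetical helper: the name and behavior are illustrative only.
    """
    path = Path(image_path)
    # guess_type maps ".jpg"/".jpeg" to "image/jpeg", ".png" to "image/png", etc.
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Unsupported or unknown image type: {path.name}")
    encoded = base64.b64encode(path.read_bytes()).decode()
    return f"data:{mime_type};base64,{encoded}"
```

The returned string could be passed as the `url` value of the `image_url` content part in place of the hand-built f-string.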

Checkpoint

Do you understand how to build a Visual QA system with conversation history?

📝 Document Processing

Receipt/Invoice Extraction

python.py

from pydantic import BaseModel
from typing import List, Optional

# Reuses llm, base64, and HumanMessage from the Interactive VQA example

class LineItem(BaseModel):
    name: str
    quantity: int
    unit_price: float
    total: float

class ReceiptData(BaseModel):
    store_name: str
    date: str
    items: List[LineItem]
    subtotal: float
    tax: Optional[float] = None
    total: float
    payment_method: Optional[str] = None

# Constrain the model's answer to the ReceiptData schema
receipt_extractor = llm.with_structured_output(ReceiptData)

def extract_receipt(image_path):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    return receipt_extractor.invoke([
        HumanMessage(content=[
            {"type": "text", "text": "Extract all information from this receipt/invoice."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        ])
    ])
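
Structured output guarantees the shape of the result, not that the numbers were read correctly. One possible safeguard is an arithmetic cross-check on the extracted values. The sketch below uses plain dataclasses as stand-ins that mirror the `LineItem` and `ReceiptData` fields so it runs without an API call. The function `check_receipt`, its `tolerance` parameter, and the rule that total equals subtotal plus tax (no discounts or tips) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LineItem:  # stand-in mirroring the Pydantic LineItem fields
    name: str
    quantity: int
    unit_price: float
    total: float


@dataclass
class ReceiptData:  # stand-in mirroring the Pydantic ReceiptData fields
    store_name: str
    date: str
    items: List[LineItem]
    subtotal: float
    total: float
    tax: Optional[float] = None
    payment_method: Optional[str] = None


def check_receipt(receipt, tolerance=0.01):
    """Return a list of arithmetic inconsistencies (empty if none found)."""
    problems = []
    for item in receipt.items:
        expected = item.quantity * item.unit_price
        if abs(expected - item.total) > tolerance:
            problems.append(f"{item.name}: line total {item.total} != {expected}")

    items_sum = sum(item.total for item in receipt.items)
    if abs(items_sum - receipt.subtotal) > tolerance:
        problems.append(f"items sum to {items_sum}, subtotal is {receipt.subtotal}")

    # Assumption: total = subtotal + tax, with no discounts or tips
    expected_total = receipt.subtotal + (receipt.tax or 0.0)
    if abs(expected_total - receipt.total) > tolerance:
        problems.append(f"subtotal + tax is {expected_total}, total is {receipt.total}")
    return problems
```

Because the function only reads attributes, it should also accept the Pydantic `ReceiptData` object returned by `extract_receipt`. A non-empty result is a signal to re-run the extraction or send the receipt for manual review.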

Business Card Scanner

python.py

# Reuses BaseModel, Optional, llm, base64, and HumanMessage from the examples above

class BusinessCard(BaseModel):
    name: str
    title: Optional[str] = None
    company: Optional[str] = None
    email: Optional[str] = None
    phone: Optional[str] = None
    address: Optional[str] = None
    website: Optional[str] = None

card_scanner = llm.with_structured_output(BusinessCard)

def scan_business_card(image_path):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    return card_scanner.invoke([
        HumanMessage(content=[
            {"type": "text", "text": "Extract the contact information from this business card."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        ])
    ])
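
Contact details read from a card often come back with inconsistent spacing, casing, or punctuation. A minimal cleanup sketch is shown below, assuming the scanned card has been converted to a plain dict (for example with `model_dump()` on Pydantic v2). The helper names and the deliberately loose email pattern are assumptions for illustration, not a full validator.

```python
import re

# Loose sanity check: something@something.something, no spaces
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_phone(raw):
    """Keep digits only, preserving a leading '+' for international numbers."""
    if not raw:
        return None
    raw = raw.strip()
    digits = re.sub(r"\D", "", raw)
    if not digits:
        return None
    return ("+" if raw.startswith("+") else "") + digits


def normalize_email(raw):
    """Lowercase and trim the address; return None if it does not look valid."""
    if not raw:
        return None
    email = raw.strip().lower()
    return email if EMAIL_PATTERN.match(email) else None


def normalize_contact(card):
    """Return a copy of the card dict with cleaned email and phone fields."""
    cleaned = dict(card)
    cleaned["email"] = normalize_email(card.get("email"))
    cleaned["phone"] = normalize_phone(card.get("phone"))
    return cleaned
```

Fields that fail the check are set to `None` rather than guessed, which keeps a bad reading from being stored as if it were valid.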

Form Data Extraction

python.py

# Reuses BaseModel, List, Optional, and llm from the examples above

class FormField(BaseModel):
    field_name: str
    value: str
    field_type: str  # text, checkbox, date, number

class FormData(BaseModel):
    form_title: str
    fields: List[FormField]
    signatures: bool  # whether the form contains a signature
    date_filled: Optional[str] = None

form_extractor = llm.with_structured_output(FormData)
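
Every `FormField.value` above is a string, so downstream code usually has to convert values according to `field_type`. The sketch below shows one way to do that, taking `(field_name, value, field_type)` tuples so it runs on its own. The function names, the set of strings treated as a checked box, and the use of a comma as thousands separator are all assumptions for illustration.

```python
# Assumption: strings the model might return for a ticked checkbox
CHECKED_MARKERS = {"true", "yes", "checked", "x", "1"}


def coerce_field(value, field_type):
    """Convert a raw string value to a Python type based on field_type."""
    text = value.strip()
    if field_type == "checkbox":
        return text.lower() in CHECKED_MARKERS
    if field_type == "number":
        try:
            # Assumption: a comma is a thousands separator, as in "1,250.5"
            return float(text.replace(",", ""))
        except ValueError:
            return text  # leave unparseable numbers as text
    return text  # "text" and "date" values stay as trimmed strings


def fields_to_dict(fields):
    """Build {field_name: typed_value} from (name, value, type) tuples."""
    return {name: coerce_field(value, field_type)
            for name, value, field_type in fields}
```

With the `FormData` result, the tuples could be built as `[(f.field_name, f.value, f.field_type) for f in result.fields]`.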

Checkpoint

Do you know how to extract structured data from receipts, business cards, and forms?

📊 Table Extraction

python.py

import pandas as pd

# Reuses BaseModel, List, llm, base64, and HumanMessage from the examples above

class TableData(BaseModel):
    headers: List[str]
    rows: List[List[str]]
    total_rows: int

table_extractor = llm.with_structured_output(TableData)

def extract_table(image_path):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    result = table_extractor.invoke([
        HumanMessage(content=[
            {"type": "text", "text": "Extract the data table from this image. Preserve every column and row exactly."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        ])
    ])

    # Convert to a DataFrame: one column per header, one row per extracted row
    df = pd.DataFrame(result.rows, columns=result.headers)
    return df
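
The model may return rows with more or fewer cells than there are headers, for example when the table has merged cells, and `pd.DataFrame` can raise a `ValueError` when the row width does not match the number of columns. A minimal sketch of a guard that pads or truncates rows first is shown below. The name `normalize_rows` and the choice of an empty string as the fill value are assumptions for illustration.

```python
def normalize_rows(headers, rows, fill=""):
    """Pad short rows and truncate long rows so each has len(headers) cells."""
    width = len(headers)
    normalized = []
    for row in rows:
        cells = list(row)[:width]                # drop cells beyond the last header
        cells += [fill] * (width - len(cells))   # pad missing cells
        normalized.append(cells)
    return normalized
```

The result could be passed to `pd.DataFrame(normalize_rows(result.headers, result.rows), columns=result.headers)`. Truncation silently discards data, so logging the affected rows for review may be a better fit for some pipelines.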

Checkpoint

Do you understand how to extract tabular data from an image and convert it to a DataFrame?

📐 Multi-page Document Processing

python.py

from pathlib import Path

# Reuses llm, base64, and HumanMessage from the examples above

async def process_document(pages_dir):
    # One PNG per page, sorted by file name
    pages = sorted(Path(pages_dir).glob("*.png"))
    all_data = []

    # Pages are sent one at a time: each request finishes before the next starts
    for page in pages:
        with open(page, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()

        result = await llm.ainvoke([
            HumanMessage(content=[
                {"type": "text", "text": f"Extract the text and data from page {page.name}."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
            ])
        ])
        all_data.append({"page": page.name, "content": result.content})

    return all_data
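
Two details of the loop above are worth a closer look. A plain `sorted()` orders file names lexicographically, so `page_10.png` comes before `page_2.png`, and awaiting inside the `for` loop handles pages strictly one after another. The sketch below shows one possible variation with a natural sort key and bounded concurrency. `natural_key`, `process_pages`, `max_concurrent`, and the injected `extract_page` coroutine (a stand-in for the `llm.ainvoke` call) are all assumptions for illustration.

```python
import asyncio
import re


def natural_key(name):
    """Sort key that compares digit runs as numbers, so page_2 < page_10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


async def process_pages(page_names, extract_page, max_concurrent=3):
    """Run extract_page on every page, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(name):
        async with semaphore:  # wait here if too many requests are in flight
            return {"page": name, "content": await extract_page(name)}

    ordered = sorted(page_names, key=natural_key)
    # gather returns results in argument order, so page order is preserved
    return await asyncio.gather(*(run_one(name) for name in ordered))
```

A caller would supply its own `extract_page` coroutine and run the whole thing with `asyncio.run(process_pages(names, extract_page))`. The concurrency limit is there because sending every page at once may run into provider rate limits.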

Checkpoint

Do you know how to process multi-page documents by iterating over each page?

🎯 Summary

Practice Exercises

Hands-on Exercise

Build a Visual QA chatbot for product images
Implement a receipt scanner with structured output
Extract information from business cards
Build a table extractor for screenshots

Challenge: Build a document processing pipeline for multi-page PDFs (convert pages to images, then extract data)

Self-Check Questions

How does a Visual QA system answer questions about the content of an image?
How do you accurately extract tables from images using structured output?
How does multi-page document processing differ from handling single pages, in both complexity and technique?
What are practical business applications of document vision, such as receipt scanning and form extraction?

🎉 Great work! You have completed the Visual QA and Document Vision lesson!

Up next: we will build Multimodal Pipelines, combining text and images to create end-to-end applications.

🚀 Next Lesson

Multimodal Pipelines →
