Vision Models - GPT-4V and Claude

🎯 Lesson Objectives

Vision models let AI "see" and understand images, from describing a picture and answering questions about it to analyzing documents.

After this lesson, you will be able to:

✅ Use GPT-4 Vision and Claude Vision to analyze images
✅ Extract structured data from images with Pydantic
✅ Implement OCR and document vision
✅ Compare and analyze multiple images

🔍 Vision Model Landscape

[Diagram: vision model landscape (not rendered)]

Checkpoint

Do you now have a clear picture of the main vision models and their capabilities?

💻 GPT-4 Vision

Basic Image Analysis

python.py

from openai import OpenAI
import base64

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("photo.png")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {
                    "type": "image_url",
                    "image_url": {
                        # The MIME type in the data URL must match the file format
                        "url": f"data:image/png;base64,{image_b64}",
                        "detail": "high"  # low, high, auto
                    }
                }
            ]
        }
    ],
    max_tokens=1000
)

print(response.choices[0].message.content)
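
The example above hard-codes `image/png` in the data URL. As a minimal sketch, assuming local files with a recognizable extension, the helper below guesses the MIME type with the standard library. The names `to_data_url` and `base64_length` are hypothetical and not part of the OpenAI SDK. The second function shows that base64 text is about one third larger than the raw bytes, which is worth keeping in mind when a request has a size limit.

```python
import base64
import mimetypes


def to_data_url(image_path: str) -> str:
    """Return a base64 data URL for a local image (hypothetical helper)."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {image_path!r}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def base64_length(num_bytes: int) -> int:
    """Length of the padded base64 text for num_bytes of input (4 chars per 3 bytes)."""
    return 4 * ((num_bytes + 2) // 3)
```

With this helper, the `"url"` value in the request could be written as `to_data_url("photo.png")` instead of a hand-built f-string.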

Image from URL

python.py

# Reuses `client` from the previous example; the image must be publicly reachable
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"}
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

Checkpoint

Have you tried using the GPT-4 Vision API to analyze an image?

💻 Claude Vision

python.py

import base64

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

with open("image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    # Claude takes the raw base64 string plus an explicit media type
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Analyze this image in detail."
                }
            ]
        }
    ]
)

print(response.content[0].text)
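
The two APIs describe the same image differently: the OpenAI example wraps it in a data URL, while the Claude example passes the media type and base64 data as separate fields. A minimal sketch of that mapping follows, assuming the content parts look exactly like the ones in the examples above. Both function names are hypothetical, only base64 data URLs are handled, and the OpenAI-only `detail` field is dropped.

```python
def data_url_to_claude_block(data_url: str) -> dict:
    """Convert a base64 data URL (OpenAI style) into a Claude-style image block."""
    prefix, sep, payload = data_url.partition(",")
    if not sep or not prefix.startswith("data:") or not prefix.endswith(";base64"):
        raise ValueError("Expected a URL of the form data:<media type>;base64,<data>")
    media_type = prefix[len("data:"):-len(";base64")]
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": payload},
    }


def openai_content_to_claude(content: list) -> list:
    """Map a list of OpenAI-style content parts to Claude-style content blocks."""
    converted = []
    for part in content:
        if part["type"] == "text":
            converted.append({"type": "text", "text": part["text"]})
        elif part["type"] == "image_url":
            converted.append(data_url_to_claude_block(part["image_url"]["url"]))
        else:
            raise ValueError(f"Unsupported content part type: {part['type']!r}")
    return converted
```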

Checkpoint

Have you compared how the GPT-4 Vision and Claude Vision APIs are used?

🛠️ LangChain Vision

python.py

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o")

# image_b64 is the base64 string produced by encode_image() in the GPT-4 Vision example
message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the product in the image."},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_b64}"}
        }
    ]
)

response = llm.invoke([message])
print(response.content)

Multi-Image Analysis

python.py

# Compare multiple images
# img1_b64 and img2_b64 are base64 strings, e.g. produced by encode_image()
message = HumanMessage(
    content=[
        {"type": "text", "text": "Compare these 2 products. What are the pros and cons of each?"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{img1_b64}"}
        },
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{img2_b64}"}
        }
    ]
)

comparison = llm.invoke([message])
print(comparison.content)
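
The message above repeats one `image_url` part per image. As a rough sketch, assuming every image shares the same MIME type, a small builder can produce that list for any number of images. The name `build_multi_image_content` is hypothetical and not a LangChain function.

```python
def build_multi_image_content(prompt: str, images_b64: list, mime_type: str = "image/png") -> list:
    """Return a content list with one text part followed by one image_url part per image."""
    if not images_b64:
        raise ValueError("At least one image is required")
    content = [{"type": "text", "text": prompt}]
    for image_b64 in images_b64:
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{image_b64}"},
            }
        )
    return content
```

The result can be passed as `HumanMessage(content=build_multi_image_content(prompt, [img1_b64, img2_b64]))`.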

Checkpoint

Do you know how to use LangChain to analyze and compare multiple images?

📊 Structured Image Analysis

python.py

from pydantic import BaseModel, Field
from typing import List

class ImageAnalysis(BaseModel):
    description: str
    objects: List[str]
    colors: List[str]
    mood: str
    text_content: List[str] = Field(default_factory=list)  # text visible in the image, if any
    quality_score: int = Field(ge=1, le=10)  # must be between 1 and 10

# llm and HumanMessage come from the LangChain Vision example
structured_llm = llm.with_structured_output(ImageAnalysis)

analysis = structured_llm.invoke([
    HumanMessage(content=[
        {"type": "text", "text": "Analyze this image in detail."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
    ])
])

print(f"Objects: {analysis.objects}")
print(f"Mood: {analysis.mood}")
print(f"Quality: {analysis.quality_score}/10")
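
The constraints on the model are what make the output dependable: a field that is missing or out of range is rejected rather than silently accepted. The sketch below calls no vision model. It repeats the `ImageAnalysis` class from above so that it runs on its own, and it uses a hypothetical `parse_analysis` helper to show which dictionaries the schema accepts.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class ImageAnalysis(BaseModel):
    description: str
    objects: List[str]
    colors: List[str]
    mood: str
    text_content: List[str] = Field(default_factory=list)
    quality_score: int = Field(ge=1, le=10)


def parse_analysis(data: dict) -> Optional[ImageAnalysis]:
    """Return a validated ImageAnalysis, or None if the data violates the schema."""
    try:
        return ImageAnalysis(**data)
    except ValidationError:
        return None


# Hand-written sample data, not real model output
sample = {
    "description": "A red mug on a desk",
    "objects": ["mug", "desk"],
    "colors": ["red", "brown"],
    "mood": "calm",
    "quality_score": 8,
}
valid = parse_analysis(sample)                             # accepted
invalid = parse_analysis({**sample, "quality_score": 11})  # rejected: above le=10
```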

Checkpoint

Do you understand how to extract structured data from images with Pydantic models?

📝 OCR and Document Vision

python.py

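# BaseModel, llm and HumanMessage come from the previous examples
class DocumentExtraction(BaseModel):
    document_type: str
    text_content: str  # full text read from the document
    key_fields: dict   # field names mapped to extracted values
    language: str

doc_extractor = llm.with_structured_output(DocumentExtraction)

# doc_b64 is the base64 string of the document image, e.g. from encode_image()
result = doc_extractor.invoke([
    HumanMessage(content=[
        {"type": "text", "text": "Extract all information from this document."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{doc_b64}"}}
    ])
])

print(result.document_type)
print(result.key_fields)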
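
To get a feel for how accurate the extracted text is, one option is to compare it against a hand-typed reference for a few sample documents. The sketch below uses `difflib` from the standard library. The ratio it returns is only a rough similarity score, not a standard OCR metric such as character error rate, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text before comparing."""
    return " ".join(text.split()).lower()


def text_similarity(extracted: str, reference: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two pieces of text."""
    return SequenceMatcher(None, normalize(extracted), normalize(reference)).ratio()
```

For example, `text_similarity(result.text_content, reference_text)` gives a score close to 1.0 when the extraction matches the reference apart from spacing and letter case.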

Checkpoint

Have you tried using vision models for OCR and document extraction?

🎯 Summary

Practice Exercises

Hands-on Exercise

Analyze 5 images with GPT-4V and structured output
Build an image comparison tool
Implement an OCR pipeline with vision models
Create a product catalog analyzer

Challenge: Build a visual QA chatbot that can answer questions about images

Self-Check Questions

How do GPT-4V and Claude Vision differ in their ability to analyze and describe images?
How does structured output with Pydantic help extract information from images in a defined structure?
How accurately can vision models perform OCR and document extraction?
How do you encode an image as base64 to send it to a vision API, and what should you keep in mind about size?

🎉 Great job! You have completed the lesson Vision Models - GPT-4V and Claude!

Next: We will build an Image Analysis Pipeline covering batch processing, classification, and auto-tagging.

🚀 Next Lesson

Image Analysis Pipeline →
