Text Classification

🎯 Lesson Objectives

Text classification is the task of assigning labels to text. LLMs make zero-shot classification possible without any training data, and the set of categories is easy to customize.

After this lesson, you will be able to:

✅ Implement zero-shot and multi-label classification with LLMs
✅ Build a customer support routing pipeline
✅ Create content moderation and hierarchical classification
✅ Batch-classify documents and evaluate accuracy

🔍 Classification Types

[Diagram: text classification types]

Checkpoint

Do you understand the common types of text classification?

💻 Zero-shot Classification

python.py

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from typing import Literal, List

# temperature=0 keeps the returned labels as repeatable as possible
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

class ClassificationResult(BaseModel):
    category: str
    confidence: float = Field(ge=0, le=1)  # self-reported by the model, not calibrated
    reasoning: str

classify_chain = (
    ChatPromptTemplate.from_messages([
        ("system", """Classify the text into one of these categories: {categories}
        Return: category, confidence (0-1), reasoning."""),
        ("human", "{text}")
    ])
    | llm.with_structured_output(ClassificationResult)
)

result = classify_chain.invoke({
    "categories": "technology, business, sports, entertainment, health",
    "text": "Apple has just launched the iPhone 16 with the A18 chip and an upgraded camera"
})
print(f"{result.category} ({result.confidence:.0%})")
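Because `category` is declared as a plain `str`, the model is free to return a label outside the requested list, or the right label with different casing. The following is a minimal sketch of a post-check, not part of the original lesson: the helper name `normalize_category` and the fallback label `"unknown"` are assumptions chosen for illustration.

```python
def normalize_category(raw: str, allowed: list[str], fallback: str = "unknown") -> str:
    """Map a model-returned label onto the allowed list, ignoring case and whitespace."""
    cleaned = raw.strip().lower()
    # Index the allowed labels by their normalised form
    lookup = {label.strip().lower(): label.strip() for label in allowed}
    return lookup.get(cleaned, fallback)


# The same comma-separated string that is passed to the prompt
categories = "technology, business, sports, entertainment, health"
allowed = categories.split(",")

print(normalize_category(" Technology ", allowed))
print(normalize_category("politics", allowed))
```

In the chain above this would be applied as `normalize_category(result.category, allowed)`. Declaring `category` as a `Literal[...]` is another option when the category list is fixed in advance.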

Checkpoint

Do you understand how to implement zero-shot classification?

📐 Multi-label Classification

python.py

from typing import Dict

# Reuses llm, ChatPromptTemplate, BaseModel, Field and List from the zero-shot example
class MultiLabelResult(BaseModel):
    labels: List[str] = Field(description="All applicable labels")
    primary_label: str
    # Label -> score. Some structured-output modes may reject open-ended dicts;
    # a list of label/score objects is one alternative if that happens.
    confidence_scores: Dict[str, float]

multi_classify = (
    ChatPromptTemplate.from_messages([
        ("system", """Assign ALL applicable labels from: {labels}
        A text can have more than one label.
        Return: primary label + all applicable labels + confidence."""),
        ("human", "{text}")
    ])
    | llm.with_structured_output(MultiLabelResult)
)

result = multi_classify.invoke({
    "labels": "urgent, billing, technical, complaint, feature-request, praise",
    "text": "The app keeps crashing and I have lost important data. Please fix it urgently!"
})
# Expected labels (may vary by model): ["urgent", "technical", "complaint"]
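One way to use the per-label scores is to keep only the labels above a cut-off instead of trusting the `labels` list as returned. The sketch below assumes the scores arrive as a label-to-float mapping. The helper `select_labels`, the 0.5 threshold, and the example scores are all illustrative assumptions, not values from the lesson.

```python
def select_labels(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the labels whose score meets the threshold, highest score first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in kept]


# Made-up scores; real values would come from result.confidence_scores
example_scores = {"urgent": 0.9, "technical": 0.95, "complaint": 0.7, "billing": 0.1}
labels = select_labels(example_scores, threshold=0.5)
print(labels)
```

The first element of the returned list can also serve as a sanity check against `primary_label`.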

Checkpoint

Do you understand the difference between single-label and multi-label classification?

🛠️ Customer Support Routing

python.py

# Literal fields restrict the model to a fixed set of values per field
class TicketClassification(BaseModel):
    department: Literal["technical", "billing", "sales", "general"]
    priority: Literal["low", "medium", "high", "critical"]
    sentiment: Literal["positive", "negative", "neutral"]
    auto_reply_suggested: bool
    suggested_response: str

support_router = (
    ChatPromptTemplate.from_messages([
        ("system", """Classify the support ticket.
        Determine: department, priority, sentiment.
        Suggest an auto-reply if possible."""),
        ("human", "{ticket}")
    ])
    | llm.with_structured_output(TicketClassification)
)

ticket = "I cannot log in to my account. I tried resetting my password but never received the email."

result = support_router.invoke({"ticket": ticket})
print(f"Route to: {result.department} (Priority: {result.priority})")
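The classification only becomes a routing pipeline once the department and priority are mapped to something actionable. A minimal sketch of that last step follows. The queue names and response targets are hypothetical, and the `Ticket` dataclass stands in for the two `TicketClassification` fields it uses so the example runs without an API call.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """Stand-in for the department and priority fields of TicketClassification."""
    department: str
    priority: str


# Hypothetical queue names and response targets in hours, for illustration only
QUEUES = {
    "technical": "tech-support",
    "billing": "billing-team",
    "sales": "sales-team",
    "general": "helpdesk",
}
SLA_HOURS = {"critical": 1, "high": 4, "medium": 24, "low": 72}


def route(ticket: Ticket) -> tuple[str, int]:
    """Return (queue, response target in hours), with safe defaults for unknown values."""
    queue = QUEUES.get(ticket.department, QUEUES["general"])
    sla = SLA_HOURS.get(ticket.priority, SLA_HOURS["medium"])
    return queue, sla


print(route(Ticket(department="technical", priority="high")))
```

Keeping the mapping in plain dictionaries means the routing rules can change without touching the prompt or the schema.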

Checkpoint

Do you understand how to build a customer support routing pipeline?

🛠️ Content Moderation

python.py

class ModerationResult(BaseModel):
    is_safe: bool
    categories: List[str]  # violated categories; empty when the content is safe
    severity: Literal["none", "low", "medium", "high"]
    action: Literal["allow", "flag", "block"]

moderation_chain = (
    ChatPromptTemplate.from_messages([
        ("system", """You are a content moderator. Check for:
        - Hate speech, harassment
        - Violence, threats
        - Adult content
        - Spam, scam
        - Personal information
        Determine: is_safe, categories violated, severity, action."""),
        ("human", "{content}")
    ])
    | llm.with_structured_output(ModerationResult)
)
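The schema lets the model set `is_safe`, `severity`, and `action` independently, so the three fields can disagree (for example `is_safe=False` with `action="allow"`). One possible safeguard is a small deterministic policy applied after the model call. The sketch below is an assumption about how such a policy could look: the severity-to-action mapping and the helper `enforce_policy` are illustrative, not part of the lesson.

```python
# Illustrative policy: the minimum action required for each severity level
SEVERITY_TO_ACTION = {"none": "allow", "low": "flag", "medium": "flag", "high": "block"}
ACTION_RANK = {"allow": 0, "flag": 1, "block": 2}


def enforce_policy(is_safe: bool, severity: str, model_action: str) -> str:
    """Return the stricter of the model's action and the policy's action."""
    policy_action = SEVERITY_TO_ACTION.get(severity, "flag")
    # Unsafe content is never allowed through, whatever the severity says
    if not is_safe and policy_action == "allow":
        policy_action = "flag"
    # Treat an unrecognised model action as "flag"
    if model_action not in ACTION_RANK:
        model_action = "flag"
    return max(model_action, policy_action, key=ACTION_RANK.__getitem__)


print(enforce_policy(is_safe=False, severity="high", model_action="allow"))
```

With a `ModerationResult` in hand, the call would be `enforce_policy(result.is_safe, result.severity, result.action)`.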

Checkpoint

Do you understand how to implement a content moderation pipeline?

📐 Hierarchical Classification

python.py

class HierarchicalClass(BaseModel):
    level1: str = Field(description="Main category")
    level2: str = Field(description="Sub category")
    level3: str = Field(description="Specific topic")

taxonomy = """
Technology
  - Software: Web, Mobile, Desktop, AI/ML
  - Hardware: Laptops, Phones, Accessories
  - Cloud: AWS, Azure, GCP
Business
  - Marketing: Digital, Content, SEO
  - Finance: Investment, Banking, Crypto
  - Management: Leadership, Strategy, HR
"""

hier_chain = (
    ChatPromptTemplate.from_messages([
        # taxonomy is passed as a template variable, so braces in it cannot break the prompt
        ("system", "Classify the text according to this taxonomy:\n{taxonomy}"),
        ("human", "{text}")
    ])
    | llm.with_structured_output(HierarchicalClass)
)

result = hier_chain.invoke({
    "taxonomy": taxonomy,
    "text": "How to use LangChain to build a chatbot in a React app"
})
# Expected (may vary by model): level1: Technology, level2: Software, level3: AI/ML
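All three levels are plain strings, so nothing stops the model from returning a path that does not exist in the taxonomy, such as a Business sub-category under Technology. A minimal sketch of a path check is shown below. It assumes the taxonomy is also kept as a nested dictionary (`TAXONOMY`) mirroring the prompt text, and the helper `is_valid_path` is a hypothetical name.

```python
# The taxonomy from the prompt, restated as a nested dict for validation
TAXONOMY = {
    "Technology": {
        "Software": ["Web", "Mobile", "Desktop", "AI/ML"],
        "Hardware": ["Laptops", "Phones", "Accessories"],
        "Cloud": ["AWS", "Azure", "GCP"],
    },
    "Business": {
        "Marketing": ["Digital", "Content", "SEO"],
        "Finance": ["Investment", "Banking", "Crypto"],
        "Management": ["Leadership", "Strategy", "HR"],
    },
}


def is_valid_path(level1: str, level2: str, level3: str) -> bool:
    """True only if level3 sits under level2, and level2 under level1."""
    return level3 in TAXONOMY.get(level1, {}).get(level2, [])


print(is_valid_path("Technology", "Software", "AI/ML"))
print(is_valid_path("Technology", "Finance", "Crypto"))
```

An invalid path could be retried, or truncated to the deepest level that does validate.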

Checkpoint

Do you understand how to implement hierarchical classification?

⚡ Batch Classification Pipeline

python.py

import pandas as pd

# Classify a batch of documents
documents = [
    "How to cook traditional Hanoi beef pho",
    "Bitcoin rises 10% after the Fed's announcement",
    "Man United beat Chelsea 2-1",
    "OpenAI launches GPT-5 with new capabilities",
    "How to practise yoga at home for beginners",
]

# max_concurrency caps the number of parallel requests sent to the API
results = classify_chain.batch(
    [{"text": d, "categories": "food, finance, sports, technology, health"}
     for d in documents],
    config={"max_concurrency": 5}
)

# batch() returns results in input order, so zip pairs each document with its result
df = pd.DataFrame([
    {"text": doc, "category": res.category, "confidence": res.confidence}
    for doc, res in zip(documents, results)
])
print(df.to_string())
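By default a single failed request raises and the whole batch is lost. LangChain's `batch` accepts `return_exceptions=True`, in which case failed items come back as exception objects in the results list. The sketch below shows one way to separate the two groups afterwards. The helper `split_results` is hypothetical, and the outputs are stand-in values so the example runs without an API call.

```python
def split_results(inputs, outputs):
    """Separate successful outputs from exceptions, keeping each paired with its input."""
    ok, failed = [], []
    for item, out in zip(inputs, outputs):
        if isinstance(out, Exception):
            failed.append((item, out))
        else:
            ok.append((item, out))
    return ok, failed


# Stand-in outputs; in practice these would come from
# classify_chain.batch(inputs, return_exceptions=True)
docs = ["doc a", "doc b", "doc c"]
outputs = ["sports", TimeoutError("rate limited"), "health"]

ok, failed = split_results(docs, outputs)
print(f"{len(ok)} classified, {len(failed)} to retry")
```

The failed inputs can then be sent through `batch` again, or logged for manual review.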

Checkpoint

Do you understand how to batch-classify multiple documents?

📐 Evaluation

python.py

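from sklearn.metrics import classification_report

# Ground truth for the five documents in the batch example, in the same order,
# using the same category set that was passed to the classifier
true_labels = ["food", "finance", "sports", "technology", "health"]
# Normalise casing and whitespace so "Sports" and "sports" count as the same label
predicted = [r.category.strip().lower() for r in results]

# zero_division=0 avoids warnings for labels that were never predicted
print(classification_report(true_labels, predicted, zero_division=0))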
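To make the report easier to read, it can help to compute the same quantities by hand on a tiny example. The sketch below derives accuracy and per-class precision, recall, and F1 with the standard definitions, treating an undefined ratio as 0. The predictions are made up for illustration (one deliberate mistake), and `per_class_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
from collections import Counter


def per_class_metrics(true_labels, predicted):
    """Compute precision, recall and F1 per label from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, predicted):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[t] += 1  # t was the truth but missed
    metrics = {}
    for label in sorted(set(true_labels) | set(predicted)):
        pred_total = tp[label] + fp[label]
        true_total = tp[label] + fn[label]
        precision = tp[label] / pred_total if pred_total else 0.0
        recall = tp[label] / true_total if true_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


y_true = ["food", "finance", "sports", "technology", "health"]
y_pred = ["food", "finance", "sports", "technology", "food"]  # made-up predictions

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f}", metrics["food"])
```

Here "food" has perfect recall but only 50% precision, because the yoga document was wrongly labelled as food. The same mistake gives "health" a recall of 0.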

Checkpoint

Do you understand how to evaluate the accuracy of a classification pipeline?

🎯 Summary

Hands-on Exercise

Build a support ticket router with automatic priority assignment
Implement a content moderation pipeline
Create a multi-label news classifier
Batch-classify 100+ documents and evaluate accuracy

Challenge: Build a real-time classification API with FastAPI

Self-check questions

What advantages does zero-shot classification with LLMs have over traditional ML classification?
How does multi-label classification differ from single-label classification?
In which situations is hierarchical classification used?
How do you evaluate the accuracy of a text classification pipeline using classification_report?

🎉 Great work! You have completed the Text Classification with LLMs lesson!

Next: learn how to process text in bulk with Batch Processing and Pipelines!

🚀 Next Lesson

Batch Processing and Pipelines →
