📝 Text Summarization with LLMs

Summarization is one of the most common applications of LLMs. This lesson covers the range from basic prompting to chain-based summarization.

Types of Summarization

Diagram

graph TD
    S[Summarization] --> E[Extractive]
    S --> A[Abstractive]
    E --> |Select key sentences| O1[Original text excerpts]
    A --> |Generate new text| O2[Rewritten summary]

Extractive vs Abstractive

Extractive: selects the most important sentences verbatim from the source text.
Abstractive: generates a new summary in its own words (this is what LLMs do).
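To make the contrast concrete, here is a minimal sketch of an extractive summarizer that uses no LLM at all. It scores each sentence by the average frequency of its words and returns the top sentences unchanged. The function name `extractive_summary`, the regex-based sentence split, and the scoring rule are illustrative assumptions, not a method taken from this lesson.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 2) -> str:
    """Return the highest-scoring sentences verbatim, in document order."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # real text usually needs a proper sentence tokenizer).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document.
    freqs = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentence indices by score, keep the best, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


sample = "Cats sleep a lot. Cats eat fish. Dogs bark."
print(extractive_summary(sample, num_sentences=1))
```

Every sentence in the output appears word for word in the input, which is the defining property of the extractive approach. An abstractive summary from an LLM has no such guarantee.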

Basic Summarization

Simple Prompt

Python

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# temperature=0 keeps the output as repeatable as possible
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Basic summarization prompt with a configurable length target
template = ChatPromptTemplate.from_messages([
    ("system", "You are an expert at summarizing text. Summarize concisely and keep the main ideas."),
    ("human", "Summarize the following text in {word_count} words:\n\n{text}")
])

# Pipe the prompt into the model
chain = template | llm

text = """
Machine learning is a branch of artificial intelligence (AI) focused on building
applications that learn from data and improve their accuracy over time
without being explicitly programmed. In machine learning, algorithms are
trained to find patterns and correlations in large datasets, and then make
the best decisions and predictions based on that analysis...
"""

summary = chain.invoke({"text": text, "word_count": 50})
print(summary.content)  # the chain returns a message object; .content holds the text
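A word count in a prompt is a request rather than a guarantee, so it can be worth checking the result. The helper below is a small sketch: the name `within_word_limit` and the 20% tolerance are assumptions chosen for illustration.

```python
def within_word_limit(summary: str, limit: int, tolerance: float = 0.2) -> bool:
    """True if the whitespace-delimited word count is at most limit * (1 + tolerance)."""
    return len(summary.split()) <= limit * (1 + tolerance)


# Usage with any summary string, for example summary.content from a chain call.
print(within_word_limit("Machine learning lets software learn from data.", limit=50))
```

If the check fails, one option is to call the chain again with a stricter instruction.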

Bullet Point Summary

Python

# Assumes `llm` and ChatPromptTemplate from the previous example,
# and `long_text` holding the document to summarize.
bullet_template = ChatPromptTemplate.from_messages([
    ("system", """Summarize the text as bullet points.
    - Each point covers one main idea
    - Keep it short and concise
    - At most {num_points} points"""),
    ("human", "{text}")
])

bullet_chain = bullet_template | llm

result = bullet_chain.invoke({
    "text": long_text,
    "num_points": 5
})
print(result.content)

Summarization Strategies

1. Stuff (All at once)

The simplest approach: put the entire text into a single prompt:

Python

from langchain.chains.summarize import load_summarize_chain
from langchain_core.documents import Document

# Wrap the raw text in a Document
docs = [Document(page_content=text)]

# Stuff chain: all documents go into one prompt
stuff_chain = load_summarize_chain(llm, chain_type="stuff")
result = stuff_chain.invoke({"input_documents": docs})
summary = result["output_text"]  # the chain returns a dict, not a string

Pros: Simple, fast
Cons: Limited by context window
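The context-window limit can be turned into a rough pre-check before choosing a strategy. The sketch below rests on loudly stated assumptions: the 4-characters-per-token ratio is only a crude heuristic (it varies by language and tokenizer), and the `context_window` and `reserved_for_output` defaults are placeholder values, not figures for any particular model. Use the model's real tokenizer when accuracy matters.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; the chars-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def choose_strategy(
    text: str,
    context_window: int = 128_000,      # placeholder, check your model's limit
    reserved_for_output: int = 1_000,   # placeholder budget for prompt and answer
) -> str:
    """Return "stuff" if the text plausibly fits in one prompt, else "map_reduce"."""
    budget = context_window - reserved_for_output
    return "stuff" if estimate_tokens(text) <= budget else "map_reduce"


print(choose_strategy("A short paragraph that easily fits."))
```

The returned string matches the `chain_type` values used in this lesson, so it can be passed straight to `load_summarize_chain`.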

2. Map-Reduce

Process each chunk separately, then combine the results:

Python

from langchain.chains.summarize import load_summarize_chain
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the long text into overlapping chunks (sizes are in characters)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200
)
docs = text_splitter.create_documents([long_text])

# Map-reduce chain: summarize each chunk, then summarize the summaries
map_reduce_chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    verbose=True
)

result = map_reduce_chain.invoke({"input_documents": docs})
summary = result["output_text"]  # the chain returns a dict, not a string

Diagram

graph LR
    D1[Chunk 1] --> S1[Summary 1]
    D2[Chunk 2] --> S2[Summary 2]
    D3[Chunk 3] --> S3[Summary 3]
    S1 & S2 & S3 --> F[Final Summary]
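The flow in the diagram can be sketched in plain Python, independent of any library. Here `summarize` is any callable that maps text to a shorter text. `keep_first_sentence` is a hypothetical stand-in for an LLM call so that the sketch runs offline.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk independently, then summarize the joined results."""
    # Map step: one call per chunk. The calls do not depend on each other,
    # so they could be run in parallel.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: one final call over the concatenated partial summaries.
    return summarize("\n".join(partial_summaries))


def keep_first_sentence(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


chunks = ["Chunk one makes a point. Then adds detail.", "Chunk two makes another. More detail."]
print(map_reduce_summarize(chunks, keep_first_sentence))
```

With N chunks this makes N + 1 calls. If the joined partial summaries are themselves too long for one prompt, the reduce step has to be repeated over groups of summaries.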

3. Refine

Build the summary iteratively: summarize the first chunk, then update that summary with each subsequent chunk:

Python

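# Assumes `llm`, `docs`, and load_summarize_chain from the Map-Reduce example
refine_chain = load_summarize_chain(
    llm,
    chain_type="refine",
    verbose=True
)

result = refine_chain.invoke({"input_documents": docs})
summary = result["output_text"]  # the chain returns a dict, not a string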

Diagram

graph LR
    D1[Chunk 1] --> S1[Summary v1]
    S1 --> |+ Chunk 2| S2[Summary v2]
    S2 --> |+ Chunk 3| S3[Final Summary]
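The refine loop in the diagram can be written as a simple fold over the chunks. In this sketch, `initial` and `refine` are hypothetical callables standing in for two LLM prompts: one that summarizes the first chunk, and one that updates an existing summary with new text.

```python
from typing import Callable, List


def refine_summarize(
    chunks: List[str],
    initial: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> str:
    """Summarize the first chunk, then fold each later chunk into the summary."""
    if not chunks:
        return ""
    summary = initial(chunks[0])
    for chunk in chunks[1:]:
        # Each step sees the running summary plus exactly one new chunk.
        summary = refine(summary, chunk)
    return summary


# Toy stand-ins so the control flow is visible without an LLM.
result = refine_summarize(
    ["first part", "second part", "third part"],
    initial=lambda chunk: f"[{chunk}]",
    refine=lambda summary, chunk: f"{summary} + [{chunk}]",
)
print(result)
```

Unlike Map-Reduce, every step depends on the previous one, so the calls cannot run in parallel. In exchange, each step has the context accumulated so far.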

Custom Summarization Chains

With Custom Prompts

Python

from langchain.chains.summarize import load_summarize_chain
from langchain_core.prompts import PromptTemplate

# Map prompt: applied to each chunk
map_prompt = PromptTemplate.from_template("""
Summarize the following passage, keeping the main ideas:

{text}

SUMMARY:
""")

# Combine prompt: applied to the concatenated chunk summaries
combine_prompt = PromptTemplate.from_template("""
Below are summaries of sections of a long document:

{text}

Combine them into one complete, coherent summary:

FINAL SUMMARY:
""")

# Create the chain with both custom prompts
chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt
)

LCEL Custom Chain

Python

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Summarize each chunk; StrOutputParser turns the model message into a plain string
chunk_summarizer = ChatPromptTemplate.from_messages([
    ("system", "Summarize concisely in 2-3 sentences."),
    ("human", "{chunk}")
]) | llm | StrOutputParser()

# Combine the chunk summaries into one
combiner = ChatPromptTemplate.from_messages([
    ("system", "Combine the summaries into one complete summary."),
    ("human", "Summaries:\n{summaries}")
]) | llm | StrOutputParser()

def summarize_long_document(text, chunk_size=2000):
    # Split into fixed-size character chunks (naive: may cut mid-sentence)
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

    # Summarize each chunk
    chunk_summaries = []
    for chunk in chunks:
        summary = chunk_summarizer.invoke({"chunk": chunk})
        chunk_summaries.append(summary)

    # Combine
    combined = "\n".join(chunk_summaries)
    final_summary = combiner.invoke({"summaries": combined})

    return final_summary
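The fixed-size slicing in `summarize_long_document` can cut a word or sentence in half and shares no context between neighbouring chunks. The sketch below shows one possible alternative: it ends each chunk at a space and starts the next chunk `overlap` characters earlier. The function name `split_with_overlap` and its defaults are assumptions for illustration. It only protects chunk ends, so a chunk may still begin mid-word.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last space so the chunk does not end mid-word.
            # Searching only past start + overlap guarantees forward progress.
            cut = text.rfind(" ", start + overlap + 1, end)
            if cut != -1:
                end = cut
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = end - overlap
    return chunks


pieces = split_with_overlap("word " * 100, chunk_size=50, overlap=10)
print(len(pieces), repr(pieces[0]))
```

To use it, replace the list comprehension that builds `chunks` with a call to this function. `RecursiveCharacterTextSplitter` from the Map-Reduce section serves the same purpose with more careful boundary handling.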

Specialized Summarization

Article Summary

Python

article_template = ChatPromptTemplate.from_messages([
    ("system", """Summarize the article in this format:
    
    **Title**: [Article title]
    **Main topic**: [1-2 sentences about the topic]
    **Key points**:
    - Point 1
    - Point 2
    - Point 3
    **Conclusion**: [1 concluding sentence]
    """),
    ("human", "{article}")
])

Meeting Notes

Python

meeting_template = ChatPromptTemplate.from_messages([
    ("system", """Summarize the meeting in this format:
    
    **Meeting**: [Name/Topic]
    
    **Goals**:
    - [Meeting goal]
    
    **Decisions**:
    1. [Decision 1]
    2. [Decision 2]
    
    ✅ **Action items**:
    - [ ] [Task 1] - [Owner]
    - [ ] [Task 2] - [Owner]
    
    **Next steps**:
    - [Next step]
    """),
    ("human", "Meeting transcript:\n{transcript}")
])

Research Paper

Python

paper_template = ChatPromptTemplate.from_messages([
    ("system", """Summarize the scientific paper in this format:
    
    **Title**: [Paper title]
    **Authors**: [Authors]
    
    **Problem**: The problem the paper addresses
    
    **Method**: The proposed method
    
    **Results**: The main results
    
    **Contribution**: The paper's contribution
    
    **Limitations**: Limitations (if any)
    """),
    ("human", "{paper}")
])

Evaluation

ROUGE Scores

Python

from rouge_score import rouge_scorer

# use_stemmer=True applies English (Porter) stemming before matching
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

reference = "This is the reference summary..."
generated = "This is the generated summary..."

# score(target, prediction) returns precision, recall and F-measure per metric
scores = scorer.score(reference, generated)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.4f}")
print(f"ROUGE-2: {scores['rouge2'].fmeasure:.4f}")
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.4f}")
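To see what the ROUGE-1 number measures, the sketch below computes it by hand: unigram overlap between reference and candidate, reported as precision, recall, and F1. It uses simple lowercase whitespace tokenization with no stemming, which is a simplification, so its values will not necessarily match `rouge_score` on the same strings.

```python
from collections import Counter
from typing import Dict


def rouge1(reference: str, candidate: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall and F1 (whitespace tokens, no stemming)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge1("the cat sat on the mat", "the cat lay on the mat"))
```

In this example five of the six tokens on each side match, so precision, recall, and F1 are all 5/6. ROUGE-2 applies the same idea to word pairs, and ROUGE-L uses the longest common subsequence instead of counts.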

LLM-based Evaluation

Python

eval_template = ChatPromptTemplate.from_messages([
    ("system", """Rate the quality of the summary on a scale of 1-5:
    
    Criteria:
    - Accuracy: Is the information correct?
    - Completeness: Does it cover all the main ideas?
    - Conciseness: Is it concise?
    - Coherence: Is it coherent?
    
    Reply in JSON format:
    {{"accuracy": X, "completeness": X, "conciseness": X, "coherence": X, "overall": X, "feedback": "..."}}
    """),
    ("human", "Original:\n{original}\n\nSummary:\n{summary}")
])
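A model asked for JSON may wrap it in extra text, so the reply is worth parsing defensively. The sketch below extracts the first `{...}` span from the raw reply and checks that each score is a number from 1 to 5. The helper name `parse_evaluation` and the validation rules are assumptions built around the JSON format requested in the prompt above.

```python
import json
import re
from typing import Any, Dict

SCORE_KEYS = ("accuracy", "completeness", "conciseness", "coherence", "overall")


def parse_evaluation(raw: str) -> Dict[str, Any]:
    """Extract and validate the JSON scores from a raw model reply."""
    # Grab everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the reply")
    data = json.loads(match.group(0))
    for key in SCORE_KEYS:
        value = data.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 5:
            raise ValueError(f"invalid or missing score for {key!r}: {value!r}")
    return data


reply = (
    'Here is my rating: {"accuracy": 5, "completeness": 4, "conciseness": 4, '
    '"coherence": 5, "overall": 4, "feedback": "Clear but omits one detail."}'
)
print(parse_evaluation(reply)["overall"])
```

The input would typically be the `.content` of the message returned by `(eval_template | llm).invoke(...)`.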

Best Practices

Summarization Tips

Choose the right strategy:
- Short text → Stuff
- Long text → Map-Reduce or Refine
Specify the output format clearly
Set word or sentence limits where needed
Preserve key information:
- Names, dates, numbers
- Technical terms
Consider the audience when summarizing
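The advice about preserving numbers can be checked cheaply. The sketch below flags any number that appears in the summary but not in the source, a common sign of a distorted figure. The helper name `unsupported_numbers` and the regex are illustrative assumptions. The check is purely textual, so a legitimately reformatted number (for example "1,000" versus "1000") would also be flagged.

```python
import re
from typing import Set

# Digits with optional . or , separated groups, e.g. 2023, 4.5, 1,000
NUMBER_PATTERN = r"\d+(?:[.,]\d+)*"


def unsupported_numbers(source: str, summary: str) -> Set[str]:
    """Numbers that appear in the summary but nowhere in the source."""
    return set(re.findall(NUMBER_PATTERN, summary)) - set(re.findall(NUMBER_PATTERN, source))


source = "Revenue grew 12% in 2023 to 4.5 million."
summary = "Revenue rose 15% in 2023."
print(unsupported_numbers(source, summary))
```

A non-empty result does not prove the summary is wrong, but it marks where a manual check is worthwhile.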

Practice Exercise

Hands-on Exercise

Build Document Summarizer:

Create a summarizer for multiple document types:
- Articles
- Research papers
- Meeting transcripts
Implement all 3 strategies:
- Stuff
- Map-Reduce
- Refine
Add evaluation with ROUGE
Compare results across the strategies

Target: a summarizer that can handle long documents and produce high-quality summaries
