Prompt Optimization & Testing

0

🎯 Lesson objectives

A good prompt is rarely written in a single pass. This lesson teaches a systematic approach to iterating on, testing, and measuring prompt quality.

After this lesson, you will:

✅ Understand the prompt iteration workflow
✅ Build an evaluation framework
✅ A/B test prompts
✅ Create a prompt library for your team

1

📝 Prompt Iteration Framework

1.1 The IDEAL Loop

Example

I - Identify: Define a clear goal
D - Draft: Write prompt v1
E - Evaluate: Test against many inputs
A - Analyze: Analyze the failures
L - Level up: Improve and repeat
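The loop above can be sketched as a small driver function. This is only an illustration: `ideal_loop`, `toy_evaluate`, and `toy_improve` are hypothetical names, and the toy scorer stands in for a real evaluation so the example runs without any API call.

```python
from typing import Callable


def ideal_loop(
    draft: str,
    evaluate: Callable[[str], float],
    improve: Callable[[str, float], str],
    target: float,
    max_rounds: int = 5,
) -> tuple[str, float, int]:
    """Iterate on a prompt until it reaches the target score or rounds run out."""
    prompt = draft                       # D: draft v1
    score = evaluate(prompt)             # E: evaluate
    rounds = 0
    while score < target and rounds < max_rounds:
        prompt = improve(prompt, score)  # A + L: analyze failures, level up
        score = evaluate(prompt)
        rounds += 1
    return prompt, score, rounds


# Toy stand-ins (assumptions for illustration only):
# the "score" simply rewards prompts that state more rules.
def toy_evaluate(prompt: str) -> float:
    return min(5.0, 1.0 + prompt.count("Rule"))


def toy_improve(prompt: str, score: float) -> str:
    return prompt + "\nRule: be specific."


best, best_score, rounds = ideal_loop(
    "Summarize the text.", toy_evaluate, toy_improve, target=4.0
)
print(best_score, rounds)
```

In practice `evaluate` would run the prompt over a test set and `improve` would be you editing the prompt by hand; the value of the sketch is the explicit stopping rule (target score or round budget).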

1.2 Version Control cho Prompts

python.py

# prompts/summarizer.py
# Three iterations of the same summarization prompt.
# {text} is filled in with str.format(text=...).
SUMMARIZER_V1 = """
Summarize the following text in 3 sentences.
Text: {text}
"""

SUMMARIZER_V2 = """
You are an editor who specializes in summarizing news.
Summarize the following text using this structure:
- Headline (1 sentence, < 15 words)
- Key points (2-3 bullet points)
- Takeaway (1 sentence)

Text: {text}
"""

SUMMARIZER_V3 = """
Role: Senior news editor at VnExpress
Task: Summarize the article for newsletter subscribers

Rules:
1. Headline: 1 sentence, < 15 words, must contain an action verb
2. Body: 2-3 bullet points, each < 20 words
3. So what?: 1 sentence explaining why the reader should care
4. Tone: Professional but accessible
5. Language: Vietnamese

Text: {text}
"""

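Insight: V1 → V2 adds structure; V3 adds a role and explicit constraints. Output quality typically improves noticeably at each step, but confirm this by scoring every version rather than assuming it (see Evaluation Framework).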
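One lightweight way to notice when a prompt has actually changed between versions is to fingerprint the template text. The sketch below is an assumption about how you might do this, not part of the lesson's own tooling; `prompt_fingerprint` is a hypothetical helper built on the standard `hashlib` module.

```python
import hashlib


def prompt_fingerprint(template: str) -> str:
    """Short, stable hash of a prompt template, ignoring whitespace differences."""
    normalized = " ".join(template.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


v1 = "Summarize the following text in 3 sentences.\nText: {text}"
v1_reindented = "  Summarize the following text in 3 sentences.\n  Text: {text}  "
v2 = "You are an editor.\nSummarize the following text.\nText: {text}"

# Re-indenting a prompt does not change its fingerprint; rewording it does.
same = prompt_fingerprint(v1) == prompt_fingerprint(v1_reindented)
changed = prompt_fingerprint(v1) != prompt_fingerprint(v2)
print(same, changed)
```

Storing the fingerprint next to each evaluation result makes it clear which exact wording a score belongs to.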

Checkpoint

Do you understand the IDEAL Loop workflow and version control for prompts?

2

📊 Evaluation Framework

2.1 Evaluation Criteria

Criterion	Description	Scale
Relevance	Does the output answer the question?	1-5
Accuracy	Is the information correct?	1-5
Completeness	Is there enough detail?	1-5
Format	Does it follow the requested format?	1-5
Tone	Is the tone/language right?	1-5
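The rubric in the table can be turned into a small checked aggregation. This is a minimal sketch assuming equal weights and the five criterion names from the table; `CRITERIA` and `rubric_total` are illustrative names.

```python
CRITERIA = ("relevance", "accuracy", "completeness", "format", "tone")


def rubric_total(scores: dict[str, int]) -> int:
    """Sum the five 1-5 scores (maximum 25) after validating them."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(f"{criterion} must be between 1 and 5")
    return sum(scores[c] for c in CRITERIA)


example = {"relevance": 5, "accuracy": 4, "completeness": 3, "format": 4, "tone": 5}
total = rubric_total(example)
print(f"{total}/25")
```

Validating the range up front catches scoring mistakes early, whether the scores come from a human reviewer or from a model.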

2.2 Build Evaluation Dataset

python.py

# eval_dataset.py
# Each test case pairs an input with what a good answer should contain.
test_cases = [
    {
        "input": "What is a Python list comprehension?",
        "expected_elements": ["syntax", "example", "comparison with for loop"],
        "expected_format": "structured explanation",
        "category": "technical_explanation"
    },
    {
        "input": "Compare React vs Vue vs Angular",
        "expected_elements": ["3 frameworks", "pros/cons", "use cases"],
        "expected_format": "comparison table",
        "category": "comparison"
    },
    {
        "input": "Write an email requesting leave",
        "expected_elements": ["greeting", "reason", "dates", "closing"],
        "expected_format": "email format",
        "category": "writing"
    }
]
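A cheap first check against `expected_elements` is plain keyword coverage. The sketch below is deliberately crude (a case-insensitive substring match, so it misses paraphrases) and `element_coverage` is a hypothetical helper, but it runs instantly and needs no model call.

```python
def element_coverage(output: str, expected_elements: list[str]) -> float:
    """Fraction of expected elements whose text appears in the output."""
    if not expected_elements:
        return 1.0
    text = output.lower()
    hits = sum(1 for element in expected_elements if element.lower() in text)
    return hits / len(expected_elements)


sample_output = "Greeting: Dear manager. Reason: family event. Closing: Best regards."
coverage = element_coverage(
    sample_output, ["greeting", "reason", "dates", "closing"]
)
print(coverage)  # 3 of the 4 elements are present
```

A low coverage score is a useful signal for which outputs to read by hand or send to a more expensive judge.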

2.3 Automated Scoring

python.py

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def evaluate_response(prompt, response, criteria=None):
    """Use LLM-as-Judge to evaluate response quality.

    `criteria` is accepted but not used yet: the judge prompt below
    always scores the same five criteria.
    """

    eval_prompt = f"""
    Evaluate the following AI response on these criteria (1-5 each):

    Original prompt: {prompt}
    AI Response: {response}

    Criteria:
    1. Relevance: Does it answer the question?
    2. Accuracy: Is the information correct?
    3. Completeness: Does it cover all key points?
    4. Format: Is it well-structured?
    5. Usefulness: Would a real user find this helpful?

    Respond in JSON: {{"relevance": X, "accuracy": X, "completeness": X,
    "format": X, "usefulness": X, "total": X, "feedback": "..."}}
    """

    result = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"}  # request a JSON object back
    )

    return json.loads(result.choices[0].message.content)

# Example usage
scores = evaluate_response(
    prompt="Explain Docker",
    response="Docker is a container platform...",
    criteria=["relevance", "accuracy", "completeness"]
)
print(f"Total: {scores['total']}/25")
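A judge model can return out-of-range scores or a `total` that does not match the individual scores, so it may be worth validating its JSON before trusting it. The following is a minimal sketch under that assumption; `parse_judge_output` and `JUDGE_KEYS` are hypothetical names, and the raw string is a made-up example of a badly behaved judge reply.

```python
import json

JUDGE_KEYS = ("relevance", "accuracy", "completeness", "format", "usefulness")


def parse_judge_output(raw: str) -> dict:
    """Parse judge JSON, clamp each score to 1-5, and recompute the total."""
    data = json.loads(raw)
    scores = {}
    for key in JUDGE_KEYS:
        value = int(data[key])               # KeyError if a criterion is missing
        scores[key] = max(1, min(5, value))  # clamp to the 1-5 scale
    # Do not trust the model's arithmetic: sum the scores locally.
    scores["total"] = sum(scores[key] for key in JUDGE_KEYS)
    scores["feedback"] = str(data.get("feedback", ""))
    return scores


# Made-up judge reply with an out-of-range score and a wrong total.
raw = (
    '{"relevance": 5, "accuracy": 4, "completeness": 7, '
    '"format": 3, "usefulness": 4, "total": 99, "feedback": "ok"}'
)
parsed = parse_judge_output(raw)
print(parsed["total"])
```

Failing loudly on a missing criterion, rather than defaulting it to zero, keeps broken judge replies from silently dragging down a variant's average.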

Checkpoint

Can you use LLM-as-Judge to automatically evaluate response quality?

3

🧪 A/B Testing Prompts

3.1 Setup A/B Test

python.py

from datetime import datetime

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class PromptABTest:
    def __init__(self, name):
        self.name = name
        self.variants = {}  # variant name -> prompt template
        self.results = []   # one dict per model call

    def add_variant(self, name, prompt_template):
        self.variants[name] = prompt_template

    def run_test(self, inputs, n_runs=1):
        """Run each variant against all inputs, n_runs times each."""
        for input_text in inputs:
            for variant_name, template in self.variants.items():
                prompt = template.format(text=input_text)

                for _ in range(n_runs):
                    response = client.chat.completions.create(
                        model="gpt-4-turbo",
                        messages=[{"role": "user", "content": prompt}],
                        temperature=0.7
                    )

                    result = {
                        "variant": variant_name,
                        "input": input_text[:100],  # truncated for logging
                        "output": response.choices[0].message.content,
                        "tokens": response.usage.total_tokens,
                        "timestamp": datetime.now().isoformat()
                    }
                    self.results.append(result)

        return self.results

    def analyze(self):
        """Compare variants by evaluation scores.

        Relies on evaluate_response() from section 2.3 being in scope.
        """
        summary = {}
        for result in self.results:
            v = result["variant"]
            if v not in summary:
                summary[v] = {"scores": [], "tokens": []}

            # One extra judge call per stored result.
            score = evaluate_response(
                result["input"], result["output"], []
            )
            summary[v]["scores"].append(score["total"])
            summary[v]["tokens"].append(result["tokens"])

        for v, data in summary.items():
            avg_score = sum(data["scores"]) / len(data["scores"])
            avg_tokens = sum(data["tokens"]) / len(data["tokens"])
            print(f"{v}: Avg score={avg_score:.1f}/25, Avg tokens={avg_tokens:.0f}")

3.2 Example: Testing Summarizer Variants

python.py

# Assumes PromptABTest (section 3.1) and the SUMMARIZER_* templates
# (section 1.2) are in scope.
test = PromptABTest("summarizer")

test.add_variant("v2_structured", SUMMARIZER_V2)
test.add_variant("v3_role_based", SUMMARIZER_V3)

test_articles = [
    "Vietnam's crypto market grows 300%...",
    "Apple has just launched Vision Pro 2...",
    "Vietnamese startup successfully raises Series B...",
]

results = test.run_test(test_articles)
test.analyze()
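With only a handful of inputs and `temperature=0.7`, a gap between two averages can easily be noise. One way to gauge that is a paired bootstrap over per-input score differences. The sketch below uses only the standard library; `paired_bootstrap` is a hypothetical helper and the scores are made-up numbers, not results from the lesson.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean difference b - a, low, high) with a ~95% bootstrap interval."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    # Resample the per-input differences with replacement.
    resampled = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    low = resampled[int(0.025 * n_resamples)]
    high = resampled[int(0.975 * n_resamples) - 1]
    return observed, low, high


# Made-up judge totals for the same six inputs under two variants.
v2_scores = [18, 20, 17, 19, 21, 18]
v3_scores = [21, 22, 20, 21, 23, 22]
observed, low, high = paired_bootstrap(v2_scores, v3_scores)
print(f"mean diff={observed:.2f}, interval=[{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the difference is less likely to be a fluke; with very few inputs, treat even that as a hint rather than proof.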

Checkpoint

Do you know how to set up and analyze an A/B test for prompts?

4

📝 Common Prompt Failures & Fixes

4.1 Failure Patterns

Problem	Symptom	Fix
Vague output	Generic answers	Add specificity: "Give 3 concrete examples"
Wrong format	Ignores the requested format	Add an output template + example
Hallucination	Invents information	Add "Use only the information provided. Say 'I don't know' if unsure"
Too long	Longer than needed	Set a word/sentence limit: "Answer in 50 words"
Wrong language	Mixes English and Vietnamese	Specify: "Answer entirely in Vietnamese"
Inconsistent	Output differs on every run	Set temperature=0 (reduces, but may not fully eliminate, variation) and add a strict format
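The "Inconsistent" row can be made measurable by comparing several outputs for the same prompt and input. This is a rough sketch using character-level similarity from the standard `difflib` module; `consistency` is a hypothetical helper, and surface similarity is only a proxy for semantic agreement.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency(outputs: list[str]) -> float:
    """Mean pairwise similarity (0-1) between outputs for the same prompt."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


stable = consistency(["Docker is a container platform."] * 3)
unstable = consistency(
    ["Docker is a container platform.", "Containers package apps.", "It is a tool."]
)
print(stable, unstable)
```

Tracking this number before and after lowering the temperature or tightening the format shows whether the fix actually helped.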

4.2 Before → After

Before (weak):

Example

Summarize this article

After (strong):

Example

Role: Senior editor at VnExpress
Task: Summarize the article for the newsletter

Format:
📰 Headline: [1 sentence, < 15 words]
📌 Key points:
- [Point 1]
- [Point 2]
- [Point 3]
💡 Takeaway: [1 sentence, why it matters]

Rules:
- Vietnamese
- Do not add information that is not in the article
- Tone: Professional, accessible
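Once a prompt specifies an output template, format compliance can be checked in code instead of by eye. The sketch below assumes the model reproduces the emoji labels from the template above verbatim; `validate_summary` is a hypothetical helper and the sample outputs are invented.

```python
def validate_summary(output: str) -> list[str]:
    """Return a list of format problems; an empty list means the output complies."""
    problems = []
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    headline = next((l for l in lines if l.startswith("📰 Headline:")), None)
    takeaway = next((l for l in lines if l.startswith("💡 Takeaway:")), None)
    bullets = [l for l in lines if l.startswith("- ")]

    if headline is None:
        problems.append("missing headline")
    elif len(headline.split(":", 1)[1].split()) >= 15:
        problems.append("headline has 15 or more words")
    if len(bullets) != 3:
        problems.append(f"expected 3 key points, found {len(bullets)}")
    if takeaway is None:
        problems.append("missing takeaway")
    return problems


good = (
    "📰 Headline: Startup raises Series B funding\n"
    "📌 Key points:\n- Point one\n- Point two\n- Point three\n"
    "💡 Takeaway: Local startups are attracting capital."
)
bad = "Summary: things happened"
print(validate_summary(good), validate_summary(bad))
```

A check like this pairs well with the "Wrong format" fix: it tells you how often the template is followed across a whole test set.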

Checkpoint

Can you recognize and fix common failure patterns?

5

🛠️ Prompt Library

5.1 Organized Structure

📁 prompts/
    🐍 __init__.py
    🐍 base.py - Base templates
    🐍 summarizer.py - Summarization prompts
    🐍 analyst.py - Data analysis prompts
    🐍 coder.py - Code generation prompts
    🐍 writer.py - Content writing prompts
    🐍 evaluator.py - Evaluation prompts

5.2 Template Pattern

python.py

# prompts/base.py
class PromptTemplate:
    """A prompt string plus the version and metadata needed to track it."""

    def __init__(self, template, version="1.0", metadata=None):
        self.template = template
        self.version = version
        self.metadata = metadata or {}

    def format(self, **kwargs):
        # Raises KeyError if a placeholder in the template is not supplied.
        return self.template.format(**kwargs)

    def __repr__(self):
        return f"PromptTemplate(v{self.version})"

# prompts/analyst.py
# (in a separate file, add: from prompts.base import PromptTemplate)
DATA_ANALYST = PromptTemplate(
    template="""
    Role: Senior Data Analyst
    Dataset: {dataset_description}

    Task: Analyze the data and return:
    1. Summary statistics (top 3 metrics)
    2. Key trends (2-3 patterns)
    3. Anomalies (if any)
    4. Recommendations (2-3 actions)

    Format: Markdown with headers
    Language: Vietnamese

    Data: {data}
    """,
    version="2.1",
    metadata={"category": "analysis", "models": ["gpt-4", "claude-3.5"]}
)
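A template is easier to use safely if callers can see which fields it needs before formatting it. The sketch below uses the standard `string.Formatter` to list placeholders; `required_fields` and `safe_format` are hypothetical helpers, not part of the `PromptTemplate` class above.

```python
from string import Formatter


def required_fields(template: str) -> set[str]:
    """Names of all {placeholders} used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def safe_format(template: str, **kwargs) -> str:
    """Format the template, naming every missing field in the error."""
    missing = required_fields(template) - set(kwargs)
    if missing:
        raise KeyError(f"missing template fields: {sorted(missing)}")
    return template.format(**kwargs)


template = "Dataset: {dataset_description}\nData: {data}"
fields = required_fields(template)
rendered = safe_format(template, dataset_description="Q1 sales", data="a,b\n1,2")
print(sorted(fields))
```

Listing all missing fields at once is friendlier than the default `KeyError`, which reports only the first one it hits.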

5.3 Version Tracking

python.py

# prompt_registry.py
from prompts.summarizer import SUMMARIZER_V1, SUMMARIZER_V2, SUMMARIZER_V3

# Maps each prompt family to its active version and the score recorded per version.
REGISTRY = {
    "summarizer": {
        "current": "v3",
        "versions": {
            "v1": {"prompt": SUMMARIZER_V1, "score": 3.2},
            "v2": {"prompt": SUMMARIZER_V2, "score": 4.0},
            "v3": {"prompt": SUMMARIZER_V3, "score": 4.5},
        }
    },
    "analyst": {
        "current": "v2.1",
        "versions": {...}  # entries omitted here
    }
}
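A registry shaped like the one above lends itself to two small helpers: one to fetch the active prompt and one to promote a new version only when it scores higher. This is a sketch under that assumption; `current_prompt`, `promote_if_better`, and `demo_registry` are illustrative names.

```python
def current_prompt(registry: dict, name: str) -> str:
    """Return the prompt text of the active version for a prompt family."""
    entry = registry[name]
    return entry["versions"][entry["current"]]["prompt"]


def promote_if_better(registry, name, version, prompt, score) -> bool:
    """Record a new version; make it current only if it beats the active score."""
    entry = registry[name]
    entry["versions"][version] = {"prompt": prompt, "score": score}
    active_score = entry["versions"][entry["current"]]["score"]
    if score > active_score:
        entry["current"] = version
        return True
    return False


# Small stand-in registry with the same shape as REGISTRY.
demo_registry = {
    "summarizer": {
        "current": "v1",
        "versions": {"v1": {"prompt": "Summarize: {text}", "score": 3.2}},
    }
}
promoted = promote_if_better(
    demo_registry, "summarizer", "v2", "You are an editor. Summarize: {text}", 4.0
)
print(promoted, demo_registry["summarizer"]["current"])
```

Keeping losing versions in the registry, rather than discarding them, preserves the history of what was tried and how it scored.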

Checkpoint

Do you know how to organize and manage a prompt library with version tracking?

6

💻 Hands-on Lab

Lab 1: Prompt Iteration (30 minutes)

Start with a simple prompt, then iterate through 4 versions:

Task: Create a prompt for "an AI that explains Data Science concepts"

V1: Write a basic prompt
V2: Add role + format
V3: Add constraints (word limit, examples, audience)
V4: Add output template + anti-hallucination

Compare the quality of each version.

Lab 2: Build Mini Evaluation Pipeline

python.py

# Build a pipeline that evaluates 3 different prompts
# for the task: "Write product descriptions for 5 products"

# Steps:
# 1. Define 3 prompt variants
# 2. Create 5 test products
# 3. Run all combinations (15 total)
# 4. Score with LLM-as-Judge
# 5. Print comparison table
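As a starting scaffold for the steps above, the pipeline can first be wired together with stub functions and only later connected to a real model and judge. Everything here is a placeholder assumption: `run_pipeline`, `fake_generate`, and `fake_judge` are hypothetical, and the stub judge just counts words so the script runs offline.

```python
from statistics import mean


def run_pipeline(variants, products, generate, judge):
    """Score every (variant, product) pair and return the mean score per variant."""
    table = {}
    for name, template in variants.items():
        scores = [judge(generate(template.format(product=p))) for p in products]
        table[name] = mean(scores)
    return table


# Stubs to replace with a real model call and an LLM-as-Judge call.
def fake_generate(prompt: str) -> str:
    return prompt.upper()


def fake_judge(output: str) -> int:
    return min(25, len(output.split()))


variants = {
    "minimal": "{product}",
    "short": "Describe {product}.",
    "guided": "Describe {product} in two sentences for online shoppers.",
}
products = ["kettle", "backpack", "desk lamp", "headphones", "notebook"]

table = run_pipeline(variants, products, fake_generate, fake_judge)
for name, avg in sorted(table.items(), key=lambda item: -item[1]):
    print(f"{name:<10} {avg:5.1f}")
```

Passing `generate` and `judge` in as arguments keeps the loop testable and lets you swap in real API calls without touching the pipeline logic.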

Lab 3: Team Prompt Library

Build a prompt library for the Data Analytics team:

5 prompt templates (summarize, analyze, visualize, report, email)
Version tracking
Usage examples
Performance benchmarks

Checkpoint

Have you practiced prompt iteration and built an evaluation pipeline?
