Cost Optimization

🎯 Lesson Objectives

Avg. 5 min

API costs can climb quickly when left unmanaged. This lesson covers strategies that can cut spending by 50-90% while preserving output quality.

After this lesson, you will be able to:

✅ Understand the token pricing model
✅ Choose the right model for each task
✅ Implement caching strategies
✅ Monitor and control costs

📊 Understanding Costs

Avg. 5 min

1.1 Token Pricing (2024-2025)

Model	Input ($/1M tokens)	Output ($/1M tokens)
GPT-4 Turbo	$10.00	$30.00
GPT-4o	$2.50	$10.00
GPT-4o mini	$0.15	$0.60
GPT-3.5 Turbo	$0.50	$1.50
Claude 3.5 Sonnet	$3.00	$15.00
Claude 3 Haiku	$0.25	$1.25
Gemini 1.5 Pro	$1.25	$5.00
Gemini 1.5 Flash	$0.075	$0.30

1.2 How Much Text Is a Token?

Example

English: 1 token ≈ 0.75 words (about 4 characters)
Vietnamese: 1 token ≈ 0.5-0.6 words (Vietnamese text uses more tokens)

For example:
"Hello world" ≈ 2 tokens
"Xin chào thế giới" ≈ 6 tokens (exact counts depend on the tokenizer)
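The ratios above can be turned into a rough estimator for budgeting before a real tokenizer is wired in. This is a heuristic sketch only: the function name `rough_token_count` and the chosen ratios (0.75 words per token for English, 0.55 for Vietnamese) are assumptions for illustration, and real counts depend on the model's tokenizer.

```python
import math

# Words-per-token ratios based on the rules of thumb above (assumed values).
WORDS_PER_TOKEN = {"en": 0.75, "vi": 0.55}

def rough_token_count(text: str, lang: str = "en") -> int:
    """Estimate tokens from a whitespace word count; not a real tokenizer."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN[lang])

print(rough_token_count("Hello world", "en"))
print(rough_token_count("Xin chào thế giới", "vi"))
```

For the two sample strings this gives 3 and 8, which differs from the tokenizer counts quoted above and shows how coarse the heuristic is on short inputs.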

1.3 Cost Calculator

python.py

import tiktoken

def estimate_cost(prompt, response_est=500, model="gpt-4-turbo"):
    """Estimate the cost of one API call.

    response_est is the assumed length of the response in tokens.
    """
    # USD per 1M tokens, from the pricing table in 1.1
    pricing = {
        "gpt-4-turbo": {"input": 10.0, "output": 30.0},
        "gpt-4o": {"input": 2.5, "output": 10.0},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
    }

    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Older tiktoken releases may not recognize newer model names
        enc = tiktoken.get_encoding("cl100k_base")

    input_tokens = len(enc.encode(prompt))
    output_tokens = response_est

    price = pricing[model]
    input_cost = (input_tokens / 1_000_000) * price["input"]
    output_cost = (output_tokens / 1_000_000) * price["output"]
    total = input_cost + output_cost

    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_cost": f"${input_cost:.6f}",
        "output_cost": f"${output_cost:.6f}",
        "total_cost": f"${total:.6f}"
    }

# Example: the same Vietnamese prompt ("Analyze Q4 sales data...") on two models
print(estimate_cost("Phân tích dữ liệu sales Q4...", model="gpt-4-turbo"))
print(estimate_cost("Phân tích dữ liệu sales Q4...", model="gpt-4o-mini"))
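Per-call costs look tiny, so it can help to project them over a month of traffic. The sketch below applies the same per-token formula to a hypothetical workload (1,000 requests per day, 1,000 input and 500 output tokens each); the function name `monthly_cost` and the workload figures are assumptions for illustration, and the prices come from the table in 1.1.

```python
def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 input_price, output_price, days=30):
    """Project monthly spend in USD; prices are USD per 1M tokens."""
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_call * requests_per_day * days

# Hypothetical workload: 1,000 requests/day, 1,000 input + 500 output tokens each
turbo = monthly_cost(1000, 1000, 500, 10.0, 30.0)
mini = monthly_cost(1000, 1000, 500, 0.15, 0.60)
print(f"gpt-4-turbo: ${turbo:,.2f}/month")
print(f"gpt-4o-mini: ${mini:,.2f}/month")
```

Under these assumptions the projection is about $750 per month on GPT-4 Turbo versus about $13.50 on GPT-4o mini.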

Checkpoint

Do you understand token pricing and how to calculate the cost of API calls?

📝 Model Selection Strategy

Avg. 5 min

2.1 Task → Model Mapping

Task	Recommended Model	Cost Level
Simple Q&A	GPT-4o mini / Haiku	💰
Classification	GPT-4o mini	💰
Summarization	GPT-4o mini / Haiku	💰
Translation	GPT-4o / Sonnet	💰💰
Code generation	GPT-4o / Sonnet	💰💰
Complex reasoning	GPT-4 Turbo / Opus	💰💰💰
Creative writing	GPT-4o / Sonnet	💰💰
Data analysis	GPT-4o / Sonnet	💰💰

2.2 Model Router

python.py

class ModelRouter:
    """Route each prompt to the cheapest model suited to its task type."""

    TASK_MODELS = {
        "classification": "gpt-4o-mini",
        "summarization": "gpt-4o-mini",
        "translation": "gpt-4o",
        "code": "gpt-4o",
        "reasoning": "gpt-4-turbo",
        "creative": "gpt-4o",
        "simple_qa": "gpt-4o-mini",
    }

    @classmethod
    def classify_task(cls, prompt):
        """Keyword heuristic (English and Vietnamese) to guess the task type.

        Order matters: the first matching branch wins.
        """
        prompt_lower = prompt.lower()

        if any(w in prompt_lower for w in ["classify", "phân loại", "label"]):
            return "classification"
        elif any(w in prompt_lower for w in ["tóm tắt", "summarize", "summary"]):
            return "summarization"
        elif any(w in prompt_lower for w in ["dịch", "translate"]):
            return "translation"
        elif any(w in prompt_lower for w in ["code", "function", "class", "viết code"]):
            return "code"
        elif any(w in prompt_lower for w in ["phân tích", "analyze", "so sánh", "reasoning"]):
            return "reasoning"
        # Added so the "creative" mapping is reachable; keywords are illustrative
        elif any(w in prompt_lower for w in ["story", "poem", "sáng tác"]):
            return "creative"
        else:
            return "simple_qa"

    @classmethod
    def get_model(cls, prompt):
        """Return (model, task) for the given prompt."""
        task = cls.classify_task(prompt)
        model = cls.TASK_MODELS[task]
        return model, task

# Usage ("Tóm tắt bài viết này" is Vietnamese for "Summarize this article")
model, task = ModelRouter.get_model("Tóm tắt bài viết này")
print(f"Task: {task}, Model: {model}")  # Task: summarization, Model: gpt-4o-mini
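To see what routing can be worth, the average cost per call can be computed for a given traffic split. The sketch below is illustrative only: the 70/25/5 traffic mix and the 1,000-input/500-output token sizes are assumed numbers, and `blended_cost` is a hypothetical helper; the prices are those from the table in 1.1.

```python
# USD per 1M tokens (input, output), from the pricing table in 1.1
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.5, 10.0),
    "gpt-4-turbo": (10.0, 30.0),
}

def cost_per_call(model, input_tokens=1000, output_tokens=500):
    """Cost in USD of one call with the assumed token sizes."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

def blended_cost(traffic_mix):
    """Average cost per call; traffic_mix maps model -> share of requests."""
    return sum(share * cost_per_call(model) for model, share in traffic_mix.items())

# Hypothetical mix: 70% simple tasks, 25% mid-tier, 5% complex reasoning
routed = blended_cost({"gpt-4o-mini": 0.70, "gpt-4o": 0.25, "gpt-4-turbo": 0.05})
baseline = cost_per_call("gpt-4-turbo")
savings = 1 - routed / baseline
print(f"Routed: ${routed:.5f}/call, baseline: ${baseline:.5f}/call")
print(f"Savings: {savings:.0%}")
```

With this assumed mix, routing comes out roughly 86% cheaper than sending everything to GPT-4 Turbo, which is where figures in the 50-90% range come from.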

2.3 Cascade Pattern

python.py

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def cascade_call(prompt, models=None):
    """Try the cheapest model first; escalate if the answer looks weak."""
    if models is None:
        models = ["gpt-4o-mini", "gpt-4o", "gpt-4-turbo"]

    result = None
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )

        result = response.choices[0].message.content or ""

        # Crude quality check: long enough and not an explicit "I don't know"
        if len(result) > 50 and "I don't know" not in result:
            print(f"✅ Answered by {model}")
            return result

        print(f"⚠️ {model} insufficient, escalating...")

    return result  # Last result even if not ideal (None if models was empty)
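The escalation logic can be separated from the API client so that it is testable offline and the quality check is swappable. The following is a minimal sketch under that assumption: `cascade`, `looks_sufficient`, and `fake_call` are hypothetical names, and the refusal phrases are examples rather than an exhaustive list.

```python
from typing import Callable, Optional, Sequence, Tuple

def looks_sufficient(text: str, min_chars: int = 50) -> bool:
    """Heuristic check: non-trivial length and no obvious refusal phrase."""
    refusals = ("i don't know", "i'm not sure", "cannot answer")
    lowered = text.lower()
    return len(text) >= min_chars and not any(r in lowered for r in refusals)

def cascade(prompt: str,
            models: Sequence[str],
            call_model: Callable[[str, str], str],
            is_good: Callable[[str], bool] = looks_sufficient
            ) -> Tuple[Optional[str], Optional[str]]:
    """Return (answer, model_used); falls back to the last answer if none pass."""
    answer, used = None, None
    for model in models:
        answer, used = call_model(model, prompt), model
        if is_good(answer):
            break
    return answer, used

# Stub standing in for a real API call, so the control flow runs offline
def fake_call(model: str, prompt: str) -> str:
    if model == "small":
        return "I don't know."
    return "A detailed answer that is comfortably longer than fifty characters in total."

answer, used = cascade("Explain X", ["small", "large"], fake_call)
print(used)
```

In a real pipeline, `call_model` would wrap `client.chat.completions.create`, and `is_good` could be replaced by a stricter validator such as a schema check.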

Checkpoint

Can you apply the Model Router and the Cascade Pattern to choose an appropriate model?

🛠️ Caching Strategies

Avg. 5 min

3.1 Exact-Match Cache

python.py

import hashlib
import json
import os

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class PromptCache:
    """Exact-match cache persisted to a JSON file."""

    def __init__(self, cache_file="cache.json"):
        self.cache_file = cache_file
        self.cache = self._load()

    def _load(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, encoding="utf-8") as f:
                return json.load(f)
        return {}

    def _save(self):
        # utf-8 is required because ensure_ascii=False writes raw non-ASCII text
        with open(self.cache_file, "w", encoding="utf-8") as f:
            json.dump(self.cache, f, ensure_ascii=False, indent=2)

    def _hash(self, prompt, model):
        # MD5 is used only as a compact cache key, not for security
        key = f"{model}:{prompt}"
        return hashlib.md5(key.encode("utf-8")).hexdigest()

    def get(self, prompt, model):
        key = self._hash(prompt, model)
        return self.cache.get(key)

    def set(self, prompt, model, response):
        key = self._hash(prompt, model)
        self.cache[key] = response
        self._save()

# Usage
cache = PromptCache()

def cached_call(prompt, model="gpt-4o-mini"):
    cached = cache.get(prompt, model)
    if cached is not None:  # an empty string is still a valid cached response
        print("💾 Cache hit!")
        return cached

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
    cache.set(prompt, model, result)
    return result
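An exact-match cache misses whenever two prompts differ only in whitespace, and it can return a stale answer if the key ignores call parameters such as temperature. One possible refinement, sketched here as an assumption rather than as part of the class above, is a key that normalizes whitespace and includes the parameters; `cache_key` is a hypothetical helper.

```python
import hashlib
import json

def cache_key(prompt: str, model: str, **params) -> str:
    """Build a stable key from the model, normalized prompt, and call parameters."""
    normalized = " ".join(prompt.split())  # collapse whitespace differences
    payload = json.dumps(
        {"model": model, "prompt": normalized, "params": params},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

a = cache_key("Summarize   this article\n", "gpt-4o-mini", temperature=0.3)
b = cache_key("Summarize this article", "gpt-4o-mini", temperature=0.3)
c = cache_key("Summarize this article", "gpt-4o-mini", temperature=0.9)
print(a == b, a == c)
```

The first two keys match because only whitespace differs, while the third differs because the temperature changed.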

3.2 Semantic Cache (Advanced)

python.py

import numpy as np

class SemanticCache:
    """Cache responses for similar (not just identical) prompts."""

    def __init__(self, client, threshold=0.95):
        self.client = client        # an OpenAI client instance
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # [(embedding, prompt, response)]

    def get_embedding(self, text):
        resp = self.client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return np.array(resp.data[0].embedding)

    def similarity(self, emb1, emb2):
        """Cosine similarity between two embedding vectors."""
        return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

    def get(self, prompt):
        # Each lookup makes one embedding API call, then scans all entries
        prompt_emb = self.get_embedding(prompt)
        best_score, best_response = 0.0, None
        for emb, cached_prompt, response in self.entries:
            score = self.similarity(prompt_emb, emb)
            if score > best_score:
                best_score, best_response = score, response
        # Return the closest entry, but only if it clears the threshold
        return best_response if best_score > self.threshold else None

    def set(self, prompt, response):
        emb = self.get_embedding(prompt)
        self.entries.append((emb, prompt, response))
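The threshold decides how aggressive a semantic cache is, and its effect is easiest to see with a toy example. The sketch below replaces real embeddings with a bag-of-words vector so it runs offline; `toy_embedding` and `best_match` are hypothetical stand-ins, and similarity scores from a real embedding model will be different.

```python
import math
from collections import Counter

def toy_embedding(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_match(query: str, cached_prompts: list, threshold: float):
    """Return the most similar cached prompt at or above threshold, else None."""
    q = toy_embedding(query)
    scored = [(cosine(q, toy_embedding(p)), p) for p in cached_prompts]
    score, prompt = max(scored, default=(0.0, None))
    return prompt if score >= threshold else None

cached = ["what is the refund policy", "how do i reset my password"]
print(best_match("what is your refund policy", cached, threshold=0.75))
print(best_match("what is your refund policy", cached, threshold=0.95))
```

The query shares four of five words with the first cached prompt (similarity about 0.8), so it hits at a threshold of 0.75 and misses at 0.95. A lower threshold saves more calls but raises the risk of returning an answer to a different question.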

Checkpoint

Do you understand the difference between exact-match and semantic caching?

⚡ Prompt Optimization for Cost

Avg. 5 min

4.1 Reduce Input Tokens

python.py

# ❌ Verbose prompt (~150 tokens)
verbose_prompt = """
I would like you to please analyze the following data and provide 
a comprehensive summary of the key findings, trends, and patterns 
that you can identify. The data is a CSV file containing sales 
information from the past quarter...
"""

# ✅ Concise prompt (~40 tokens, plus the data itself)
concise_template = """
Analyze this Q4 sales CSV. Return:
1. Top 3 findings
2. Key trends
3. Anomalies

Data: {csv_data}
"""

# Fill in the placeholder at call time, for example:
# prompt = concise_template.format(csv_data=csv_text)

4.2 Batch Processing

python.py

# Assumes `client = OpenAI()` and a list of strings `items` are already defined

# ❌ 10 separate API calls: instructions and per-request overhead repeated 10 times
for item in items:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Classify: {item}"}]
    )

# ✅ 1 API call for the whole batch: instructions are sent only once
batch_prompt = "Classify each item (positive/negative):\n"
batch_prompt += "\n".join(f"{i+1}. {item}" for i, item in enumerate(items))
# JSON mode returns an object rather than a bare array, so ask for a wrapper key
batch_prompt += '\n\nReturn a JSON object of the form {"results": ["positive", ...]}.'

result = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": batch_prompt}],
    response_format={"type": "json_object"}
)
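A batched call only pays off if the reply can be mapped back to the inputs reliably. The sketch below shows one way to build the prompt, split a long list into batches, and validate the reply; the helper names and the simulated reply are assumptions for illustration, and no API call is made.

```python
import json

def build_batch_prompt(items):
    """Number the items and ask for one label per item in a JSON object."""
    lines = [f"{i + 1}. {item}" for i, item in enumerate(items)]
    return (
        "Classify each item (positive/negative):\n"
        + "\n".join(lines)
        + '\n\nReturn a JSON object: {"results": ["positive" or "negative", ...]}'
    )

def parse_batch_response(raw: str, expected: int):
    """Parse the model's JSON reply and check there is one label per item."""
    labels = json.loads(raw)["results"]
    if len(labels) != expected:
        raise ValueError(f"expected {expected} labels, got {len(labels)}")
    return labels

def chunked(items, size):
    """Split items into batches so a single prompt does not grow too large."""
    return [items[i:i + size] for i in range(0, len(items), size)]

reviews = ["Great product", "Arrived broken", "Works as described"]
print(build_batch_prompt(reviews))

# Simulated model reply, used here instead of a real API call
fake_reply = '{"results": ["positive", "negative", "positive"]}'
print(parse_batch_response(fake_reply, expected=len(reviews)))
```

Checking the label count catches the common failure where the model skips or merges items; a mismatched batch can then be retried or split into smaller chunks.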

4.3 Use max_tokens

python.py

# ❌ No limit: the model might generate 2000 tokens
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this article"}]
)

# ✅ Set a limit: cap the summary at 200 tokens
# max_tokens cuts the output off, so also ask for a short answer in the prompt
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this article in 3 sentences"}],
    max_tokens=200
)
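Because `max_tokens` truncates rather than shortens, it is worth checking whether a reply was cut off. In the Chat Completions response, a choice that stopped at the token limit has `finish_reason` set to `"length"`. The sketch below tests that check against stub objects that only mimic the response shape; `was_truncated` is a hypothetical helper.

```python
from types import SimpleNamespace

def was_truncated(response) -> bool:
    """True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"

# Stub objects mimicking the shape of a chat completion response
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
complete = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])
print(was_truncated(cut_off), was_truncated(complete))
```

If truncation happens often, tightening the prompt ("in 3 sentences") is usually a better fix than raising the limit.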

Checkpoint

Have you practiced reducing input tokens, batch processing, and using max_tokens?

📊 Cost Monitoring

Avg. 5 min

5.1 Usage Tracker

python.py

class CostTracker:
    """Track API spend against a monthly budget."""

    # USD per 1M tokens: (input, output)
    PRICING = {
        "gpt-4-turbo": (10.0, 30.0),
        "gpt-4o": (2.5, 10.0),
        "gpt-4o-mini": (0.15, 0.60),
    }
    DEFAULT_RATES = (1.0, 3.0)  # fallback guess for models not listed above

    def __init__(self, monthly_budget=10.0):
        self.budget = monthly_budget
        self.total_cost = 0.0
        self.calls = []

    def log_call(self, model, input_tokens, output_tokens):
        """Record one API call and return its cost in USD."""
        in_rate, out_rate = self.PRICING.get(model, self.DEFAULT_RATES)
        cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

        self.total_cost += cost
        self.calls.append({
            "model": model,
            "tokens": input_tokens + output_tokens,
            "cost": cost
        })

        # Alerts at 75% and 90% of the budget
        usage_pct = (self.total_cost / self.budget) * 100
        if usage_pct > 90:
            print(f"🚨 ALERT: {usage_pct:.0f}% of budget used!")
        elif usage_pct > 75:
            print(f"⚠️ WARNING: {usage_pct:.0f}% of budget used")

        return cost

    def report(self):
        """Print call counts and cost per model, plus the remaining budget."""
        by_model = {}
        for call in self.calls:
            m = call["model"]
            if m not in by_model:
                by_model[m] = {"calls": 0, "cost": 0}
            by_model[m]["calls"] += 1
            by_model[m]["cost"] += call["cost"]

        print("\n📊 Cost Report")
        print(f"{'Model':<20} {'Calls':>6} {'Cost':>10}")
        print("-" * 38)
        for model, data in by_model.items():
            print(f"{model:<20} {data['calls']:>6} ${data['cost']:>8.4f}")
        print(f"\nTotal: ${self.total_cost:.4f} / ${self.budget:.2f}")
        print(f"Budget remaining: ${self.budget - self.total_cost:.4f}")

# Usage
tracker = CostTracker(monthly_budget=10.0)
# After each API call:
tracker.log_call("gpt-4o-mini", 100, 200)
tracker.report()
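Printed alerts only inform; they do not stop spending. If a hard limit is wanted, a guard can refuse calls whose estimated cost would push spend over the budget. The following is a minimal sketch of that idea; `BudgetGuard` and `BudgetExceeded` are hypothetical names and are not part of the tracker above.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the budget."""

class BudgetGuard:
    """Refuse new calls once estimated spend would pass the budget."""

    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0

    def check(self, estimated_cost: float) -> None:
        """Call before an API request with its estimated cost."""
        if self.spent + estimated_cost > self.budget:
            raise BudgetExceeded(
                f"${self.spent + estimated_cost:.4f} would exceed ${self.budget:.2f}"
            )

    def record(self, actual_cost: float) -> None:
        """Call after the request with the cost computed from real usage."""
        self.spent += actual_cost

guard = BudgetGuard(budget_usd=0.01)
for _ in range(3):
    try:
        guard.check(0.004)
        guard.record(0.004)  # in practice, computed from the response's token usage
        print("call allowed")
    except BudgetExceeded as exc:
        print(f"call blocked: {exc}")
```

With a $0.01 budget and $0.004 calls, the first two calls pass and the third is blocked.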

5.2 Quick Reference: Cost Saving Checklist

Example

✅ 1. Use the smallest model that can handle the task
✅ 2. Cache responses to repeated prompts
✅ 3. Batch similar requests
✅ 4. Set a max_tokens limit
✅ 5. Keep prompts concise
✅ 6. Use GPT-4o mini/Haiku for internal tasks (summarize, classify)
✅ 7. Reserve GPT-4/Sonnet for user-facing or complex tasks
✅ 8. Monitor usage daily
✅ 9. Set budget alerts
✅ 10. Review logs weekly for optimization opportunities

Checkpoint

Have you built the CostTracker, and do you know the cost-saving checklist?

💻 Hands-on Lab

Avg. 5 min

Lab 1: Cost Comparison

For the same task ("Summarize 10 articles"), compare the cost of:

GPT-4 Turbo vs GPT-4o vs GPT-4o mini
Individual calls vs batch

Lab 2: Build Smart Router

Implement ModelRouter + CostTracker for a chatbot:

Auto-route to the cheapest suitable model
Track cost per conversation
Alert when budget usage exceeds 80%

Lab 3: Cache + Router Demo

Build complete cost-optimized pipeline:

Check cache → return if hit
Route to cheapest model
If quality low → escalate
Cache response
Log cost
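As a starting point for this lab, the five steps can be laid out as a skeleton with the API call and the quality check passed in as functions. This is only a sketch under those assumptions: `run_pipeline`, `fake_call`, and the in-memory dict cache are hypothetical, and a full solution would plug in the cache, router, and tracker from earlier sections.

```python
from typing import Callable, Dict, List

def run_pipeline(prompt: str,
                 cache: Dict[str, str],
                 models: List[str],
                 call_model: Callable[[str, str], str],
                 is_good: Callable[[str], bool],
                 cost_log: List[dict]) -> str:
    # 1. Check cache and return on a hit
    if prompt in cache:
        return cache[prompt]

    # 2-3. Start with the cheapest model; escalate while quality is low
    answer = ""
    for model in models:
        answer = call_model(model, prompt)
        # 5. Log every paid call, including the ones that were escalated
        cost_log.append({"model": model, "prompt": prompt})
        if is_good(answer):
            break

    # 4. Cache the final response
    cache[prompt] = answer
    return answer

# Stub model call so the flow can be exercised without an API key
def fake_call(model: str, prompt: str) -> str:
    return "short" if model == "cheap" else "a sufficiently long and useful answer"

cache, log = {}, []
is_good = lambda text: len(text) > 20
first = run_pipeline("q", cache, ["cheap", "strong"], fake_call, is_good, log)
second = run_pipeline("q", cache, ["cheap", "strong"], fake_call, is_good, log)
print(first == second, len(log))
```

The first request costs two calls (one escalation), and the repeat is served from the cache, so the log still holds two entries.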

Checkpoint

Have you practiced building the Smart Router and a cost-optimized pipeline?

🚀 Next Lesson

Safety & Ethics: content moderation, bias detection, and responsible AI!
