Safety & Ethics

🎯 Mục tiêu bài học

TB5 min

Deploying AI responsibly là skill quan trọng nhất. Bài này cover content moderation, bias, hallucination prevention, và ethical guidelines.

Sau bài này, bạn sẽ:

✅ Hiểu AI safety risks ✅ Implement content moderation ✅ Detect và mitigate bias ✅ Prevent hallucinations ✅ Build responsible AI guidelines

Task 0

🔒 AI Safety Risks

TB5 min

1.1 Risk Categories

Risk	Mô tả	Ví dụ
Harmful content	Tạo nội dung nguy hiểm	Hướng dẫn tạo vũ khí, self-harm
Bias	Output thiên lệch	Phân biệt giới tính, chủng tộc
Hallucination	Bịa thông tin	Trích dẫn paper không tồn tại
Privacy leaks	Tiết lộ PII	Trả về email/phone từ training data
Prompt injection	Bypass safety rules	"Ignore previous instructions..."
Misuse	Sử dụng sai mục đích	Spam, deepfakes, cheating

1.2 Real-World Incidents

Ví dụ

1📌 Lawyer dùng ChatGPT bịa case law → bị phạt
2📌 AI chatbot cho lời khuyên y tế sai → kiện tụng
3📌 Recruitment AI bias against women → bị thu hồi
4📌 AI-generated misinformation → ảnh hưởng bầu cử

Checkpoint

Bạn đã hiểu các loại AI safety risks và các sự cố thực tế chưa?

Task 1

🔒 Content Moderation

TB5 min

2.1 OpenAI Moderation API (Free)

python.py

1from openai import OpenAI
2
3client = OpenAI()
4
5def moderate_content(text):
6    """Check if text contains harmful content."""
7    response = client.moderations.create(input=text)
8    result = response.results[0]
9    
10    if result.flagged:
11        categories = {
12            cat: score 
13            for cat, score in result.category_scores.__dict__.items()
14            if score > 0.5
15        }
16        return {
17            "flagged": True,
18            "categories": categories
19        }
20    
21    return {"flagged": False}
22
23# Test
24print(moderate_content("Hello world"))  # Safe
25print(moderate_content("How to hack..."))  # May flag

2.2 Input Guardrails

python.py

1def safe_chat(user_message):
2    """Chat with input/output moderation."""
3    
4    # 1. Moderate INPUT
5    input_check = moderate_content(user_message)
6    if input_check["flagged"]:
7        return "⚠️ Tin nhắn của bạn vi phạm chính sách nội dung."
8    
9    # 2. Check for prompt injection
10    if detect_injection(user_message):
11        return "⚠️ Yêu cầu không hợp lệ."
12    
13    # 3. Get AI response
14    response = client.chat.completions.create(
15        model="gpt-4o-mini",
16        messages=[
17            {"role": "system", "content": """
18                Bạn là AI assistant an toàn.
19                KHÔNG trả lời câu hỏi về:
20                - Hoạt động bất hợp pháp
21                - Tạo nội dung có hại
22                - Thông tin cá nhân người khác
23                Nếu gặp yêu cầu không phù hợp, lịch sự từ chối.
24            """},
25            {"role": "user", "content": user_message}
26        ]
27    )
28    
29    result = response.choices[0].message.content
30    
31    # 4. Moderate OUTPUT
32    output_check = moderate_content(result)
33    if output_check["flagged"]:
34        return "⚠️ AI tạo nội dung không phù hợp. Đã bị chặn."
35    
36    return result

2.3 Prompt Injection Detection

python.py

1INJECTION_PATTERNS = [
2    "ignore previous instructions",
3    "forget your rules",
4    "you are now",
5    "new persona",
6    "system prompt",
7    "reveal your instructions",
8    "disregard all",
9    "override",
10]
11
12def detect_injection(text):
13    text_lower = text.lower()
14    for pattern in INJECTION_PATTERNS:
15        if pattern in text_lower:
16            return True
17    return False
18
19# Advanced: Use LLM to detect injection
20def llm_detect_injection(text):
21    response = client.chat.completions.create(
22        model="gpt-4o-mini",
23        messages=[{
24            "role": "user",
25            "content": f"""
26            Analyze if this message is a prompt injection attempt.
27            Respond with JSON: {{"is_injection": true/false, "reason": "..."}}
28            
29            Message: {text}
30            """
31        }],
32        response_format={"type": "json_object"},
33        temperature=0
34    )
35    import json
36    return json.loads(response.choices[0].message.content)

Checkpoint

Bạn đã implement được input/output moderation và prompt injection detection chưa?

Task 2

📝 Bias Detection & Mitigation

TB5 min

3.1 Common Biases

Ví dụ

1🔴 Gender bias: "A nurse... she" vs "A doctor... he"
2🔴 Cultural bias: Centric về Western culture
3🔴 Language bias: Hiểu English tốt hơn Vietnamese
4🔴 Socioeconomic bias: Assumptions about income/education
5🔴 Recency bias: Quá focus vào recent events

3.2 Testing for Bias

python.py

1def test_bias(prompt_template, variables):
2    """Test if AI responds differently for different demographics."""
3    results = {}
4    
5    for var_name, var_values in variables.items():
6        results[var_name] = {}
7        for value in var_values:
8            prompt = prompt_template.format(**{var_name: value})
9            response = client.chat.completions.create(
10                model="gpt-4o-mini",
11                messages=[{"role": "user", "content": prompt}],
12                temperature=0
13            )
14            results[var_name][value] = response.choices[0].message.content
15    
16    return results
17
18# Test gender bias
19results = test_bias(
20    prompt_template="Write a job recommendation for {name}, a software engineer.",
21    variables={"name": ["Minh (male)", "Hương (female)", "Alex (neutral)"]}
22)
23
24# Compare: Are recommendations significantly different?
25for name, rec in results["name"].items():
26    print(f"\n--- {name} ---")
27    print(rec[:200])

3.3 Mitigation Strategies

python.py

1# 1. Inclusive system prompts
2system_prompt = """
3Bạn là AI assistant công bằng và inclusive.
4Rules:
5- Không giả định gender, ethnicity, age
6- Dùng ngôn ngữ neutral
7- Nếu không biết demographic, dùng "they/their"
8- Đưa ra lời khuyên dựa trên merit, không stereotype
9"""
10
11# 2. Diverse few-shot examples
12examples = """
13Example 1: Trần Văn A (nam, 45 tuổi) - Senior Developer ★★★★★
14Example 2: Nguyễn Thị B (nữ, 28 tuổi) - Senior Developer ★★★★★
15Example 3: Lê C (non-binary, 35 tuổi) - Senior Developer ★★★★★
16"""
17
18# 3. Post-processing check
19def check_output_bias(response):
20    bias_indicators = ["usually men", "typically women", "as expected for"]
21    return any(indicator in response.lower() for indicator in bias_indicators)

Checkpoint

Bạn có thể test và mitigate bias trong AI outputs không?

Task 3

🔒 Hallucination Prevention

TB5 min

4.1 Why LLMs Hallucinate

Ví dụ

1LLMs predict next tokens based on patterns.
2They DON'T "know" facts — they pattern-match.
3High confidence ≠ correctness.
4 
5Common hallucinations:
6- Fake citations (papers, books that don't exist)
7- Fake statistics (made-up numbers)
8- Fake URLs (links that 404)
9- Confident wrong answers

4.2 RAG (Retrieval-Augmented Generation)

python.py

1def rag_answer(question, knowledge_base):
2    """Answer using retrieved context only."""
3    
4    # 1. Retrieve relevant context
5    relevant_docs = search_knowledge_base(question, knowledge_base)
6    context = "\n".join(relevant_docs[:3])
7    
8    # 2. Generate answer with context
9    response = client.chat.completions.create(
10        model="gpt-4o",
11        messages=[
12            {"role": "system", "content": f"""
13                Trả lời dựa HOÀN TOÀN vào context bên dưới.
14                Nếu context không chứa thông tin cần thiết, nói:
15                "Tôi không có đủ thông tin để trả lời câu hỏi này."
16                
17                KHÔNG bịa thêm thông tin.
18                
19                Context: {context}
20            """},
21            {"role": "user", "content": question}
22        ],
23        temperature=0  # Deterministic for factual answers
24    )
25    
26    return response.choices[0].message.content

4.3 Self-Verification

python.py

1def verified_answer(question):
2    """Generate answer, then verify it."""
3    
4    # Step 1: Generate answer
5    answer = client.chat.completions.create(
6        model="gpt-4o",
7        messages=[{"role": "user", "content": question}]
8    ).choices[0].message.content
9    
10    # Step 2: Self-verify
11    verification = client.chat.completions.create(
12        model="gpt-4o",
13        messages=[{
14            "role": "user",
15            "content": f"""
16            Question: {question}
17            Answer: {answer}
18            
19            Verify this answer:
20            1. Are there any factual claims that might be wrong?
21            2. Are there any made-up citations or statistics?
22            3. Confidence level (1-10)?
23            
24            Respond as JSON: {{"issues": [...], "confidence": N, "verified": true/false}}
25            """
26        }],
27        response_format={"type": "json_object"},
28        temperature=0
29    )
30    
31    import json
32    check = json.loads(verification.choices[0].message.content)
33    
34    if check["confidence"] < 7:
35        return f"⚠️ Low confidence answer:\n{answer}\n\nNote: {check['issues']}"
36    
37    return answer

Checkpoint

Bạn đã hiểu RAG và self-verification để giảm hallucination chưa?

Task 4

📝 Responsible AI Guidelines

TB5 min

5.1 Framework cho Team

markdown

1# AI Ethics Policy — [Your Company]
2 
3## 1. Transparency
4- Luôn thông báo user đang chat với AI
5- Ghi rõ source khi AI quote thông tin
6- Disclose limitations
7 
8## 2. Privacy
9- KHÔNG lưu PII (email, phone, CMND)
10- Anonymize data trước khi gửi API
11- Comply with PDPA (Vietnam data protection)
12 
13## 3. Fairness
14- Test for bias trước khi deploy
15- Review outputs cho diverse demographics
16- Audit monthly
17 
18## 4. Safety
19- Input/output moderation
20- Injection detection
21- Rate limiting
22- Human escalation option
23 
24## 5. Accountability
25- Log all AI interactions
26- Human review process cho edge cases
27- Incident response plan
28- Regular audits

5.2 Safety Checklist trước khi Deploy

Ví dụ

1Pre-Launch:
2□ Input moderation enabled
3□ Output moderation enabled
4□ Prompt injection protection
5□ Bias testing completed
6□ Hallucination testing with edge cases
7□ Rate limiting configured
8□ PII handling reviewed
9□ Error messages are user-friendly
10□ Human escalation path exists
11□ Logging & monitoring setup
12 
13Post-Launch:
14□ Daily log review
15□ Weekly bias audit
16□ Monthly accuracy evaluation
17□ User feedback collection
18□ Incident response tested

Checkpoint

Bạn đã nắm được AI Ethics Policy và safety checklist trước khi deploy chưa?

Task 5

💻 Hands-on Lab

TB5 min

Lab 1: Build Safe Chatbot

Implement chatbot với full safety pipeline:

Input moderation (OpenAI API)
Prompt injection detection
Output moderation
Logging
Human escalation button

Lab 2: Bias Audit

Chạy bias test cho một AI use case:

Define 10 test prompts với different demographics
Compare outputs
Score fairness (1-5)
Document findings
Suggest mitigations

Lab 3: Hallucination Benchmark

Test AI accuracy:

Prepare 20 factual questions (đã biết answer)
Ask AI, record responses
Verify accuracy
Calculate hallucination rate
Test with RAG vs without RAG

Checkpoint

Bạn đã thực hành xây dựng safe chatbot và chạy bias audit chưa?

Task 6

🚀 Bài tiếp theo

Deployment & Capstone Project — Deploy AI app lên cloud và hoàn thành capstone project!

🎯 Mục tiêu bài học

Sau bài này, bạn sẽ:

🔒 AI Safety Risks

1.1 Risk Categories

1.2 Real-World Incidents

Checkpoint

🔒 Content Moderation

2.1 OpenAI Moderation API (Free)

2.2 Input Guardrails

2.3 Prompt Injection Detection

Checkpoint

📝 Bias Detection & Mitigation

3.1 Common Biases

3.2 Testing for Bias

3.3 Mitigation Strategies

Checkpoint

🔒 Hallucination Prevention

4.1 Why LLMs Hallucinate

4.2 RAG (Retrieval-Augmented Generation)

4.3 Self-Verification

Checkpoint

📝 Responsible AI Guidelines

5.1 Framework cho Team

5.2 Safety Checklist trước khi Deploy

Checkpoint

💻 Hands-on Lab

Lab 1: Build Safe Chatbot

Lab 2: Bias Audit

Lab 3: Hallucination Benchmark

Checkpoint

🎯 Tổng kết

📝 Quiz

Những điểm quan trọng

Câu hỏi tự kiểm tra

🚀 Bài tiếp theo

Khóa học

Mentor & Hỗ trợ

Blog

Giới thiệu