RAG Evaluation

🎯 Lesson Objectives


A RAG system needs to be measured along several dimensions: retrieval accuracy, answer quality, and faithfulness. The RAGAS framework provides standard metrics for each of them.

After this lesson, you will be familiar with:

✅ The RAGAS evaluation framework
✅ Faithfulness, relevancy, precision, and recall metrics
✅ Custom evaluation functions
✅ Monitoring in production


🔍 RAG Evaluation Dimensions


📊 RAG Evaluation

🔍 RETRIEVAL Quality
  🎯 Context Precision: Are retrieved docs relevant?
  📋 Context Recall: Did we find all relevant docs?

✍️ GENERATION Quality
  ✅ Faithfulness: Is the answer grounded in context?
  💬 Answer Relevancy: Does the answer address the question?

🏁 END-TO-END Quality
  ☑️ Answer Correctness: Is the answer factually correct?
  📦 Answer Completeness: Does it cover all aspects?

What to Measure

Metric	Measures	Problem it catches
Faithfulness	Answer grounded in context?	Hallucination
Answer Relevancy	Answer addresses question?	Off-topic responses
Context Precision	Retrieved docs relevant?	Noise in context
Context Recall	All relevant docs found?	Missing information
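
The table can be read as a lookup from a low score to the failure it signals. The sketch below encodes that mapping; the `diagnose` helper and the 0.7 threshold are assumptions made for illustration, not part of RAGAS.

```python
# Hypothetical helper: maps each metric to the problem it catches (from the table above).
PROBLEM_BY_METRIC = {
    "faithfulness": "hallucination",
    "answer_relevancy": "off-topic response",
    "context_precision": "noise in context",
    "context_recall": "missing information",
}

def diagnose(scores, threshold=0.7):
    """Return the likely problem for every known metric scoring below the threshold.

    The 0.7 default is an illustrative cutoff, not a standard value.
    """
    return [
        PROBLEM_BY_METRIC[name]
        for name, value in scores.items()
        if name in PROBLEM_BY_METRIC and value < threshold
    ]

problems = diagnose({
    "faithfulness": 0.55,
    "answer_relevancy": 0.91,
    "context_precision": 0.88,
    "context_recall": 0.40,
})
print(problems)  # ['hallucination', 'missing information']
```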

Checkpoint

Do you understand the three dimensions of RAG evaluation: retrieval, generation, and end-to-end?


📐 RAGAS Framework


Setup

python.py

# pip install ragas datasets

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation data: four parallel columns, one entry per question.
# Newer RAGAS releases may expect different column names; check your installed version.
eval_data = {
    "question": [
        "What is the region 1 minimum wage?",
        "How long is annual leave?"
    ],
    "answer": [
        "The region 1 minimum wage is 4,680,000 VND/month.",
        "Employees get 12 days of annual leave."
    ],
    # One list of retrieved passages per question
    "contexts": [
        ["Decree 38/2022: the region 1 minimum wage is 4,680,000 VND/month"],
        ["Under the Labor Code, an employee who has worked a full 12 months gets 12 days of annual leave"]
    ],
    # Reference answers, needed by context_precision and context_recall
    "ground_truth": [
        "The region 1 minimum wage is 4,680,000 VND/month under Decree 38/2022.",
        "Employees get 12 days of leave each year."
    ]
}

dataset = Dataset.from_dict(eval_data)
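
Misaligned columns are an easy mistake when assembling evaluation data by hand. As a minimal sketch, a check like the one below could run before building the dataset; `validate_eval_data` is a hypothetical helper, and it assumes the four column names used in this lesson.

```python
def validate_eval_data(data):
    """Check the layout used in this lesson: four parallel lists, contexts as lists of strings.

    Returns the number of evaluation rows, or raises ValueError on a malformed layout.
    """
    required = ["question", "answer", "contexts", "ground_truth"]
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    lengths = {key: len(data[key]) for key in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Columns differ in length: {lengths}")

    for i, ctx in enumerate(data["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"contexts[{i}] must be a list of strings")
    return lengths["question"]

sample = {
    "question": ["What is the region 1 minimum wage?"],
    "answer": ["4,680,000 VND/month."],
    "contexts": [["Decree 38/2022: the region 1 minimum wage is 4,680,000 VND/month"]],
    "ground_truth": ["4,680,000 VND/month under Decree 38/2022."],
}
print(validate_eval_data(sample))  # 1
```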

Run Evaluation

python.py

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# RAGAS uses an LLM and embeddings as the judge (requires OPENAI_API_KEY to be set)
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ],
    llm=ChatOpenAI(model="gpt-4o-mini"),
    embeddings=OpenAIEmbeddings()
)

print(result)
# Illustrative output (actual scores vary by run and model):
# {'faithfulness': 0.95, 'answer_relevancy': 0.92,
#  'context_precision': 0.88, 'context_recall': 0.90}

Per-Question Analysis

python.py

df = result.to_pandas()
# Column names depend on the RAGAS version; inspect df.columns if "question" is missing
print(df[["question", "faithfulness", "answer_relevancy"]])

# Find the worst-performing questions (0.7 is an arbitrary cutoff; tune it for your data)
low_faith = df[df["faithfulness"] < 0.7]
print(f"\n⚠️ Low faithfulness ({len(low_faith)} questions):")
for _, row in low_faith.iterrows():
    print(f"  Q: {row['question']}")
    print(f"  Score: {row['faithfulness']:.2f}")

Checkpoint

Do you understand how to use the RAGAS framework to evaluate RAG with its four main metrics?


📐 Understanding Each Metric


Faithfulness

python.py

"""
Faithfulness = Are claims in the answer supported by context?

Process:
1. Extract claims from the answer
2. Check each claim against context
3. Score = supported_claims / total_claims

Example:
  Answer: "The region 1 minimum wage is 4,680,000 VND, effective from 7/2022"
  Claims: ["Region 1 wage = 4,680,000 VND", "Effective from 7/2022"]
  Context supports: ["Region 1 wage = 4,680,000 VND" ✅, "7/2022" ❌ not in context]
  Faithfulness = 1/2 = 0.5
"""
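
Step 3 of the process above reduces to a ratio once each claim has a verdict. The sketch below covers only that final step; claim extraction and verification (steps 1 and 2) are done by an LLM in practice, so the verdicts here are supplied by hand, and `faithfulness_score` is a hypothetical name.

```python
def faithfulness_score(verdicts):
    """verdicts maps each extracted claim to True if the context supports it."""
    if not verdicts:
        return 0.0
    # True counts as 1 and False as 0, so the sum is the number of supported claims
    return sum(verdicts.values()) / len(verdicts)

# Verdicts taken from the worked example above
verdicts = {
    "Region 1 wage = 4,680,000 VND": True,
    "Effective from 7/2022": False,  # not stated in the context
}
print(faithfulness_score(verdicts))  # 0.5
```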

Context Precision & Recall

python.py

"""
Context Precision = What fraction of retrieved docs are relevant?
  Precision = relevant_retrieved / total_retrieved
  High = No noise in results

Context Recall = What fraction of relevant docs were retrieved?
  Recall = relevant_retrieved / total_relevant
  High = No missing information

Example:
  Retrieved: [doc1✅, doc2❌, doc3✅, doc4❌, doc5✅]
  Relevant but missed: [doc6✅, doc7✅]

  Precision = 3/5 = 0.6
  Recall = 3/(3+2) = 0.6

Note: these are the classic set-based definitions. RAGAS estimates both with an
LLM judge, and its context precision also takes ranking order into account.
"""
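
The set-based definitions above can be checked directly when document IDs and relevance labels are available. This is a minimal sketch with a hypothetical `precision_recall` helper, reproducing the worked example.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall over document IDs."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Same setup as the example: 3 of 5 retrieved docs are relevant, 2 relevant docs were missed
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = ["doc1", "doc3", "doc5", "doc6", "doc7"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.6 0.6
```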

Answer Relevancy

python.py

"""
Answer Relevancy = Does the answer actually address the question?

Process:
1. Generate N questions from the answer
2. Compute similarity between generated questions and original
3. Score = average similarity

Example:
  Question: "Region 1 minimum wage?"
  Answer: "The region 1 minimum wage is 4,680,000 VND/month"
  Generated Q: "What is the region 1 minimum wage?"
  Similarity: 0.95 → High relevancy ✅
"""
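
Steps 2 and 3 of the process above amount to averaging cosine similarities between embeddings. The sketch below uses hand-made 3-dimensional vectors as stand-ins for real embeddings (an assumption for illustration); in practice an embedding model would produce them, and an LLM would generate the questions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer_relevancy_score(question_vec, generated_vecs):
    """Average similarity between the original question and the generated questions."""
    if not generated_vecs:
        return 0.0
    return sum(cosine(question_vec, g) for g in generated_vecs) / len(generated_vecs)

# Toy embeddings: the first generated question matches exactly, the second is unrelated
question_vec = [1.0, 0.0, 0.0]
generated_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
score = answer_relevancy_score(question_vec, generated_vecs)
print(score)  # 0.5
```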

Checkpoint

Can you tell faithfulness, context precision, context recall, and answer relevancy apart?


🛠️ Custom Evaluation


LLM-as-Judge

python.py

import json

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# temperature=0 keeps the judge's scores as repeatable as possible
judge_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

eval_prompt = ChatPromptTemplate.from_template(
    """You are an expert evaluator. Rate the following answer on a scale of 1-5.
    
    Question: {question}
    Context: {context}
    Answer: {answer}
    
    Evaluate on:
    1. Accuracy (1-5): Is the answer factually correct based on context?
    2. Completeness (1-5): Does it cover all relevant aspects?
    3. Clarity (1-5): Is it clear and well-structured?
    
    Respond in JSON format:
    {{"accuracy": N, "completeness": N, "clarity": N, "explanation": "..."}}"""
)

def llm_judge(question, context, answer):
    """Score one answer; raises json.JSONDecodeError if the reply is not plain JSON."""
    response = (eval_prompt | judge_llm).invoke({
        "question": question,
        "context": context,
        "answer": answer
    })
    return json.loads(response.content)

# Usage
scores = llm_judge(
    "Region 1 minimum wage?",
    "Decree 38/2022: the region 1 minimum wage is 4,680,000 VND/month",
    "The region 1 minimum wage is 4,680,000 VND/month under Decree 38/2022."
)
print(scores)
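
Judge models sometimes wrap their JSON in a Markdown code fence or add surrounding text, which makes a bare `json.loads` fail. The sketch below shows one way to parse more defensively; `parse_judge_output` is a hypothetical helper, and it assumes the three score keys requested by the prompt above.

```python
import json
import re

EXPECTED_KEYS = ("accuracy", "completeness", "clarity")

def parse_judge_output(text):
    """Extract the JSON object from a judge reply and validate the 1-5 scores."""
    # Greedy match from the first "{" to the last "}" tolerates fences and stray prose
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in judge output")
    scores = json.loads(match.group(0))
    for key in EXPECTED_KEYS:
        value = scores.get(key)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"Invalid score for {key!r}: {value!r}")
    return scores

# Simulated reply wrapped in a Markdown fence (built with "`" * 3 to keep this block readable)
fence = "`" * 3
raw = (
    fence + "json\n"
    + '{"accuracy": 5, "completeness": 4, "clarity": 5, "explanation": "Grounded."}\n'
    + fence
)
parsed = parse_judge_output(raw)
print(parsed["completeness"])  # 4
```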

Retrieval Metrics (No LLM needed)

python.py

import numpy as np

def hit_rate(retrieved_ids, relevant_ids):
    """Did we retrieve at least one relevant document? Returns 1 or 0."""
    return int(bool(set(retrieved_ids) & set(relevant_ids)))

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (average over queries for MRR)."""
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0

def ndcg_at_k(retrieved_ids, relevant_ids, k=5):
    """Normalized Discounted Cumulative Gain with binary relevance."""
    # DCG: each relevant doc contributes 1/log2(rank + 1), with rank starting at 1
    dcg = sum(
        1.0 / np.log2(i + 2)
        for i, doc_id in enumerate(retrieved_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal DCG: all relevant docs ranked first
    ideal_dcg = sum(
        1.0 / np.log2(i + 2)
        for i in range(min(len(relevant_ids), k))
    )
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# Evaluate retrieval
retrieved = ["doc1", "doc3", "doc5", "doc2", "doc4"]
relevant = ["doc1", "doc2", "doc6"]

print(f"Hit Rate: {hit_rate(retrieved, relevant)}")      # 1
print(f"MRR: {mrr(retrieved, relevant):.3f}")             # 1.000
print(f"NDCG@5: {ndcg_at_k(retrieved, relevant, 5):.3f}") # 0.671
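
The functions above score a single query. The "mean" in Mean Reciprocal Rank comes from averaging over an evaluation set, which could be sketched as follows; `reciprocal_rank` and `mean_reciprocal_rank` are hypothetical names, and the document IDs are made up for illustration.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs is a list of (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

runs = [
    (["doc1", "doc3", "doc5"], ["doc1"]),  # first hit at rank 1 -> 1.0
    (["doc4", "doc2", "doc9"], ["doc2"]),  # first hit at rank 2 -> 0.5
    (["doc7", "doc8", "doc9"], ["doc6"]),  # no hit            -> 0.0
]
print(mean_reciprocal_rank(runs))  # 0.5
```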

Checkpoint

Do you understand how to write custom evaluation functions for RAG?


💻 Evaluation Pipeline


Complete RAG Evaluator

python.py

import numpy as np

# Assumes hit_rate() from the retrieval metrics section is in scope, and that
# pipeline.query() returns {"answer": str, "contexts": [docs with .page_content and .metadata]}
class RAGEvaluator:
    """Complete RAG evaluation pipeline."""
    
    def __init__(self, rag_pipeline, eval_data):
        self.pipeline = rag_pipeline
        self.eval_data = eval_data
        self.results = []
    
    def evaluate(self):
        """Run full evaluation."""
        self.results = []  # reset so repeated calls do not double-count
        for item in self.eval_data:
            # Get RAG response
            response = self.pipeline.query(item["question"])
            
            # Evaluate
            result = {
                "question": item["question"],
                "expected": item["expected_answer"],
                "actual": response["answer"],
                "retrieved_docs": len(response["contexts"]),
                "hit_rate": hit_rate(
                    [d.metadata.get("id") for d in response["contexts"]],
                    item.get("relevant_doc_ids", [])
                ),
                "faithfulness": self._check_faithfulness(
                    response["answer"], response["contexts"]
                )
            }
            self.results.append(result)
        
        return self._summarize()
    
    def _summarize(self):
        """Summarize evaluation results."""
        return {
            "total_questions": len(self.results),
            "avg_hit_rate": np.mean([r["hit_rate"] for r in self.results]),
            "avg_faithfulness": np.mean([r["faithfulness"] for r in self.results]),
            "low_quality": [
                r for r in self.results 
                if r["faithfulness"] < 0.7
            ]
        }
    
    def _check_faithfulness(self, answer, contexts):
        """Crude faithfulness proxy: share of answer words that appear in the context."""
        context_text = " ".join([c.page_content for c in contexts])
        # Simple word overlap score; cheap, but blind to negation and paraphrase
        answer_words = set(answer.lower().split())
        context_words = set(context_text.lower().split())
        overlap = len(answer_words & context_words)
        return overlap / len(answer_words) if answer_words else 0.0
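
The word-overlap check is cheap but easy to fool, which is worth seeing before relying on it. The standalone sketch below reimplements the same overlap logic (as a hypothetical `word_overlap` function) and scores an answer that contradicts its context.

```python
def word_overlap(answer, context_text):
    """Share of distinct answer words that also appear in the context."""
    answer_words = set(answer.lower().split())
    context_words = set(context_text.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

context = "the region 1 minimum wage is 4,680,000 VND per month"
contradiction = "the region 1 minimum wage is not 4,680,000 VND per month"

# 10 of the 11 distinct answer words occur in the context, so the score is high
# even though the answer says the opposite of the source.
score = word_overlap(contradiction, context)
print(round(score, 3))  # 0.909
```

A score like this passes the 0.7 cutoff used in the evaluator, so a claim-based check (LLM judge or RAGAS faithfulness) is a safer choice when negation or paraphrase matters.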

A/B Testing

python.py

import numpy as np

# Assumes llm_judge() from the LLM-as-Judge section is in scope
def compare_pipelines(pipeline_a, pipeline_b, eval_data):
    """A/B compare two RAG pipelines on the same questions."""
    results = {"A": [], "B": []}
    
    for item in eval_data:
        for name, pipeline in [("A", pipeline_a), ("B", pipeline_b)]:
            response = pipeline.query(item["question"])
            score = llm_judge(
                item["question"],
                "\n".join([c.page_content for c in response["contexts"]]),
                response["answer"]
            )
            results[name].append(score)
    
    # Compare average scores per metric; equal averages are reported as a tie
    for metric in ["accuracy", "completeness", "clarity"]:
        avg_a = np.mean([r[metric] for r in results["A"]])
        avg_b = np.mean([r[metric] for r in results["B"]])
        if avg_a == avg_b:
            winner = "Tie"
        else:
            winner = "A" if avg_a > avg_b else "B"
        print(f"{metric}: A={avg_a:.2f} vs B={avg_b:.2f} → Winner: {winner}")
    
    return results
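
With a small evaluation set, a difference in averages can be noise. One common way to gauge this is a paired bootstrap over per-question score differences. The sketch below is an illustration under stated assumptions: `paired_bootstrap_win_rate` is a hypothetical helper, the scores are made up, and the resample count is arbitrary.

```python
import random

def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which A's total beats B's on the same questions."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("Need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each A/B pair together
        sample = [rng.choice(diffs) for _ in diffs]
        if sum(sample) > 0:
            wins += 1
    return wins / n_resamples

# Made-up accuracy scores for five questions: A is better on every one of them
scores_a = [5, 4, 5, 4, 5]
scores_b = [3, 3, 4, 3, 4]
print(paired_bootstrap_win_rate(scores_a, scores_b))  # 1.0
```

A win rate near 0.5 suggests the two pipelines are indistinguishable on this data, while values near 1.0 or 0.0 indicate a consistent difference.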

Checkpoint

Do you understand how to build a complete evaluation pipeline for RAG?


📊 Monitoring in Production


python.py

import time
from datetime import datetime

import numpy as np

class RAGMonitor:
    """Monitor RAG system in production."""
    
    def __init__(self):
        self.logs = []
    
    def log_query(self, query, response, contexts, latency_ms):
        self.logs.append({
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "response_length": len(response),
            "num_contexts": len(contexts),
            "latency_ms": latency_ms,
            # Crude heuristic: very short responses are treated as "no answer"
            "has_answer": len(response) > 10
        })
    
    def get_metrics(self, last_n=100):
        recent = self.logs[-last_n:]
        if not recent:
            # Nothing logged yet: skip aggregation over empty lists
            return {"total_queries": 0}
        latencies = [l["latency_ms"] for l in recent]
        return {
            "total_queries": len(recent),
            "avg_latency_ms": np.mean(latencies),
            "p95_latency_ms": np.percentile(latencies, 95),
            "avg_contexts": np.mean([l["num_contexts"] for l in recent]),
            "answer_rate": np.mean([l["has_answer"] for l in recent]),
        }
    
    def alert_check(self):
        metrics = self.get_metrics()
        alerts = []
        if metrics["total_queries"] == 0:
            return alerts
        if metrics["avg_latency_ms"] > 3000:
            alerts.append(f"⚠️ High latency: {metrics['avg_latency_ms']:.0f}ms")
        if metrics["answer_rate"] < 0.8:
            alerts.append(f"⚠️ Low answer rate: {metrics['answer_rate']:.1%}")
        return alerts

# Usage
monitor = RAGMonitor()

# Wrap RAG pipeline
def monitored_query(pipeline, query):
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    response = pipeline.query(query)
    latency = (time.perf_counter() - start) * 1000
    
    monitor.log_query(query, response["answer"], response["contexts"], latency)
    return response

# Check health (call monitored_query() for each request first)
print(monitor.get_metrics())
print(monitor.alert_check())
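
A plain list of logs grows without bound in a long-running service. As a minimal sketch, a fixed-size buffer built on `collections.deque` keeps only the most recent entries; `BoundedLog` and its `max_entries` default are assumptions made for illustration, not part of the monitor above.

```python
from collections import deque

class BoundedLog:
    """Keeps only the most recent max_entries items; older ones are dropped automatically."""

    def __init__(self, max_entries=1000):
        self._entries = deque(maxlen=max_entries)

    def append(self, entry):
        self._entries.append(entry)

    def recent(self, last_n=100):
        """Return up to last_n of the newest entries, oldest first."""
        return list(self._entries)[-last_n:]

    def __len__(self):
        return len(self._entries)

log = BoundedLog(max_entries=3)
for i in range(5):
    log.append({"latency_ms": i})

# Only the last three entries survive
print([e["latency_ms"] for e in log.recent()])  # [2, 3, 4]
```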

Checkpoint

Do you understand how to monitor a RAG system in production?


🚀 Next Lesson

Capstone Project: build a complete Document Q&A System!
