
RAG Evaluation

RAGAS framework, custom metrics, and monitoring for RAG systems

🎯 Lesson Objectives

A RAG system needs its quality measured along several dimensions: retrieval accuracy, answer quality, and faithfulness. The RAGAS framework provides standard metrics for each.

After this lesson, you will know:

✅ The RAGAS evaluation framework
✅ Faithfulness, relevancy, precision, and recall metrics
✅ Custom evaluation functions
✅ Monitoring in production

🔍 RAG Evaluation Dimensions
📊 RAG Evaluation
  🔍 RETRIEVAL Quality
    🎯 Context Precision: Are retrieved docs relevant?
    📋 Context Recall: Did we find all relevant docs?
  ✍️ GENERATION Quality
    Faithfulness: Is the answer grounded in the context?
    💬 Answer Relevancy: Does the answer address the question?
  🏁 END-TO-END Quality
    ☑️ Answer Correctness: Is the answer factually correct?
    📦 Answer Completeness: Does it cover all aspects?

What to Measure

Metric            | Measures                    | Problem it catches
Faithfulness      | Answer grounded in context? | Hallucination
Answer Relevancy  | Answer addresses question?  | Off-topic responses
Context Precision | Retrieved docs relevant?    | Noise in context
Context Recall    | All relevant docs found?    | Missing information
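The metric-to-problem mapping in the table can be turned into a small triage helper. This is only a sketch: the `diagnose` function, the `DIAGNOSES` table, and the 0.7 threshold are illustrative assumptions, not part of RAGAS.

```python
# Map each metric from the table to the failure mode it catches.
DIAGNOSES = {
    "faithfulness": "Hallucination: answer contains claims not grounded in context",
    "answer_relevancy": "Off-topic: answer does not address the question",
    "context_precision": "Noise: irrelevant documents in the retrieved context",
    "context_recall": "Missing information: relevant documents were not retrieved",
}

def diagnose(scores: dict, threshold: float = 0.7) -> list:
    """Return the likely problem for every metric scoring below the threshold."""
    return [DIAGNOSES[m] for m, s in scores.items() if m in DIAGNOSES and s < threshold]

print(diagnose({"faithfulness": 0.5, "answer_relevancy": 0.9,
                "context_precision": 0.95, "context_recall": 0.6}))
```

With the scores above, the helper flags hallucination and missing information, matching the last column of the table.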

Checkpoint

Do you understand the 3 dimensions of RAG evaluation: retrieval, generation, and end-to-end?

📐 RAGAS Framework

Setup

python.py
# pip install ragas datasets

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": [
        "Mức lương tối thiểu vùng 1 là bao nhiêu?",
        "Thời gian nghỉ phép năm là bao lâu?"
    ],
    "answer": [
        "Mức lương tối thiểu vùng 1 là 4.680.000 đồng/tháng.",
        "Người lao động được nghỉ 12 ngày phép/năm."
    ],
    "contexts": [
        ["Nghị định 38/2022: Mức lương tối thiểu vùng 1 là 4.680.000 đồng/tháng"],
        ["Theo Luật Lao động, NLĐ làm đủ 12 tháng được nghỉ 12 ngày phép năm"]
    ],
    "ground_truth": [
        "Mức lương tối thiểu vùng 1 là 4.680.000 đồng/tháng theo Nghị định 38/2022.",
        "Người lao động được nghỉ 12 ngày phép mỗi năm."
    ]
}

dataset = Dataset.from_dict(eval_data)

Run Evaluation

python.py
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# RAGAS uses an LLM (plus embeddings) to judge each sample
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ],
    llm=ChatOpenAI(model="gpt-4o-mini"),
    embeddings=OpenAIEmbeddings()
)

print(result)
# {'faithfulness': 0.95, 'answer_relevancy': 0.92,
#  'context_precision': 0.88, 'context_recall': 0.90}

Per-Question Analysis

python.py
df = result.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy"]])

# Find the worst-performing questions
low_faith = df[df["faithfulness"] < 0.7]
print(f"\n⚠️ Low faithfulness ({len(low_faith)} questions):")
for _, row in low_faith.iterrows():
    print(f"  Q: {row['question']}")
    print(f"  Score: {row['faithfulness']:.2f}")

Checkpoint

Do you understand how to use the RAGAS framework to evaluate RAG with its 4 core metrics?

📐 Understanding Each Metric

Faithfulness

python.py
"""
Faithfulness = Are claims in the answer supported by the context?

Process:
1. Extract claims from the answer
2. Check each claim against the context
3. Score = supported_claims / total_claims

Example:
    Answer: "Lương tối thiểu vùng 1 là 4.680.000đ, áp dụng từ 7/2022"
    Claims: ["Lương vùng 1 = 4.680.000đ", "Áp dụng từ 7/2022"]
    Supported by context: ["Lương vùng 1 = 4.680.000đ"]  ("7/2022" is not in the context)
    Faithfulness = 1/2 = 0.5
"""
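The arithmetic in the example above, as a runnable toy. In RAGAS, claim extraction and verification are done by an LLM; here the per-claim verdicts are hardcoded so only the scoring step is shown.

```python
def faithfulness_score(claims_supported: list) -> float:
    """Faithfulness = supported claims / total claims."""
    if not claims_supported:
        return 0.0
    return sum(claims_supported) / len(claims_supported)

# Claim 1 ("Lương vùng 1 = 4.680.000đ") is in the context; claim 2 ("7/2022") is not.
print(faithfulness_score([True, False]))  # 0.5
```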

Context Precision & Recall

python.py
"""
Context Precision = What fraction of retrieved docs are relevant?
    Precision = relevant_retrieved / total_retrieved
    High = no noise in the results

Context Recall = What fraction of relevant docs were retrieved?
    Recall = relevant_retrieved / total_relevant
    High = no missing information

Example:
    Retrieved: [doc1, doc2, doc3, doc4, doc5]
    Relevant among retrieved: [doc1, doc2, doc3]
    Relevant but missed: [doc6, doc7]

    Precision = 3/5 = 0.6
    Recall = 3/(3+2) = 0.6
"""
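The same example as plain set arithmetic. The doc IDs are illustrative; in RAGAS these relevance judgments come from an LLM rather than hand labels.

```python
def context_precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved docs that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def context_recall(retrieved: set, relevant: set) -> float:
    """Fraction of all relevant docs that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"doc1", "doc2", "doc3", "doc4", "doc5"}
relevant = {"doc1", "doc2", "doc3", "doc6", "doc7"}  # doc6, doc7 were missed

print(context_precision(retrieved, relevant))  # 0.6
print(context_recall(retrieved, relevant))     # 0.6
```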

Answer Relevancy

python.py
"""
Answer Relevancy = Does the answer actually address the question?

Process:
1. Generate N questions from the answer
2. Compute similarity between the generated questions and the original
3. Score = average similarity

Example:
    Question: "Lương tối thiểu vùng 1?"
    Answer: "Lương tối thiểu vùng 1 là 4.680.000đ/tháng"
    Generated Q: "Mức lương tối thiểu vùng 1 là bao nhiêu?"
    Similarity: 0.95 → high relevancy
"""
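The averaging step can be sketched with plain cosine similarity. The embedding vectors below are made up for illustration; in practice the questions are generated by an LLM and embedded with an embedding model.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy_score(original_q_emb, generated_q_embs) -> float:
    """Mean cosine similarity between the original question and the
    questions regenerated from the answer."""
    return float(np.mean([cosine(original_q_emb, g) for g in generated_q_embs]))

# Made-up 3-d embeddings for the original and two generated questions
original = np.array([1.0, 0.0, 0.0])
generated = [np.array([0.9, 0.1, 0.0]), np.array([1.0, 0.0, 0.2])]

print(round(answer_relevancy_score(original, generated), 3))  # ~0.99 → high relevancy
```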

Checkpoint

Can you distinguish faithfulness, context precision, context recall, and answer relevancy?

🛠️ Custom Evaluation

LLM-as-Judge

python.py
import json

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

judge_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

eval_prompt = ChatPromptTemplate.from_template(
    """You are an expert evaluator. Rate the following answer on a scale of 1-5.

Question: {question}
Context: {context}
Answer: {answer}

Evaluate on:
1. Accuracy (1-5): Is the answer factually correct based on the context?
2. Completeness (1-5): Does it cover all relevant aspects?
3. Clarity (1-5): Is it clear and well-structured?

Respond in JSON format:
{{"accuracy": N, "completeness": N, "clarity": N, "explanation": "..."}}"""
)

def llm_judge(question, context, answer):
    response = (eval_prompt | judge_llm).invoke({
        "question": question,
        "context": context,
        "answer": answer
    })
    return json.loads(response.content)

# Usage
scores = llm_judge(
    "Lương tối thiểu vùng 1?",
    "Nghị định 38/2022: Mức lương tối thiểu vùng 1 là 4.680.000 đồng/tháng",
    "Mức lương tối thiểu vùng 1 là 4.680.000 đồng/tháng theo Nghị định 38/2022."
)
print(scores)
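One practical caveat: `json.loads` fails if the judge wraps its JSON in markdown fences or adds extra text. A small helper like the one below (hypothetical, not part of LangChain or any library) makes the parsing more robust.

```python
import json
import re

def parse_judge_json(text: str) -> dict:
    """Extract and parse the first JSON object found in an LLM response."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in judge response")
    return json.loads(match.group(0))

# Simulated judge output wrapped in a markdown code fence
fence = chr(96) * 3  # ``` built without literal backticks
raw = (fence + "json\n"
       + '{"accuracy": 5, "completeness": 4, "clarity": 5, "explanation": "ok"}\n'
       + fence)

print(parse_judge_json(raw)["accuracy"])  # 5
```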

Retrieval Metrics (No LLM needed)

python.py
import numpy as np

def hit_rate(retrieved_ids, relevant_ids):
    """Did we retrieve at least one relevant document?"""
    return int(bool(set(retrieved_ids) & set(relevant_ids)))

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank: position of the first relevant result (MRR = mean over queries)."""
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0

def ndcg_at_k(retrieved_ids, relevant_ids, k=5):
    """Normalized Discounted Cumulative Gain."""
    dcg = sum(
        1.0 / np.log2(i + 2)
        for i, doc_id in enumerate(retrieved_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_dcg = sum(
        1.0 / np.log2(i + 2)
        for i in range(min(len(relevant_ids), k))
    )
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# Evaluate retrieval
retrieved = ["doc1", "doc3", "doc5", "doc2", "doc4"]
relevant = ["doc1", "doc2", "doc6"]

print(f"Hit Rate: {hit_rate(retrieved, relevant)}")        # 1
print(f"MRR: {mrr(retrieved, relevant):.3f}")              # 1.000
print(f"NDCG@5: {ndcg_at_k(retrieved, relevant, 5):.3f}")  # 0.671
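Two more LLM-free metrics worth tracking alongside the ones above are precision@k and recall@k, sketched here on the same example lists.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return len(set(top_k) & set(relevant_ids)) / len(top_k)

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of all relevant docs found in the top-k."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

retrieved = ["doc1", "doc3", "doc5", "doc2", "doc4"]
relevant = ["doc1", "doc2", "doc6"]

print(f"Precision@5: {precision_at_k(retrieved, relevant):.3f}")  # 0.400
print(f"Recall@5:    {recall_at_k(retrieved, relevant):.3f}")     # 0.667
```

Hit rate only tells you whether anything relevant came back; precision@k and recall@k tell you how much of the context budget is spent on relevant documents.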

Checkpoint

Do you understand how to write custom evaluation functions for RAG?

💻 Evaluation Pipeline

Complete RAG Evaluator

python.py
import numpy as np  # hit_rate() is defined in the previous section

class RAGEvaluator:
    """Complete RAG evaluation pipeline."""

    def __init__(self, rag_pipeline, eval_data):
        self.pipeline = rag_pipeline
        self.eval_data = eval_data
        self.results = []

    def evaluate(self):
        """Run the full evaluation."""
        for item in self.eval_data:
            # Get the RAG response
            response = self.pipeline.query(item["question"])

            # Evaluate
            result = {
                "question": item["question"],
                "expected": item["expected_answer"],
                "actual": response["answer"],
                "retrieved_docs": len(response["contexts"]),
                "hit_rate": hit_rate(
                    [d.metadata.get("id") for d in response["contexts"]],
                    item.get("relevant_doc_ids", [])
                ),
                "faithfulness": self._check_faithfulness(
                    response["answer"], response["contexts"]
                )
            }
            self.results.append(result)

        return self._summarize()

    def _summarize(self):
        """Summarize the evaluation results."""
        return {
            "total_questions": len(self.results),
            "avg_hit_rate": np.mean([r["hit_rate"] for r in self.results]),
            "avg_faithfulness": np.mean([r["faithfulness"] for r in self.results]),
            "low_quality": [
                r for r in self.results
                if r["faithfulness"] < 0.7
            ]
        }

    def _check_faithfulness(self, answer, contexts):
        """Simple word-overlap proxy (much cruder than RAGAS faithfulness)."""
        context_text = " ".join([c.page_content for c in contexts])
        answer_words = set(answer.lower().split())
        context_words = set(context_text.lower().split())
        overlap = len(answer_words & context_words)
        return overlap / len(answer_words) if answer_words else 0

A/B Testing

python.py
import numpy as np  # llm_judge() is defined in the previous section

def compare_pipelines(pipeline_a, pipeline_b, eval_data):
    """A/B compare two RAG pipelines using the LLM judge."""
    results = {"A": [], "B": []}

    for item in eval_data:
        for name, pipeline in [("A", pipeline_a), ("B", pipeline_b)]:
            response = pipeline.query(item["question"])
            score = llm_judge(
                item["question"],
                "\n".join([c.page_content for c in response["contexts"]]),
                response["answer"]
            )
            results[name].append(score)

    # Compare
    for metric in ["accuracy", "completeness", "clarity"]:
        avg_a = np.mean([r[metric] for r in results["A"]])
        avg_b = np.mean([r[metric] for r in results["B"]])
        winner = "A" if avg_a > avg_b else "B"
        print(f"{metric}: A={avg_a:.2f} vs B={avg_b:.2f} → Winner: {winner}")

Checkpoint

Do you understand how to build a complete evaluation pipeline for RAG?

📊 Monitoring in Production
python.py
import time
from datetime import datetime

import numpy as np

class RAGMonitor:
    """Monitor a RAG system in production."""

    def __init__(self):
        self.logs = []

    def log_query(self, query, response, contexts, latency_ms):
        self.logs.append({
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "response_length": len(response),
            "num_contexts": len(contexts),
            "latency_ms": latency_ms,
            "has_answer": len(response) > 10
        })

    def get_metrics(self, last_n=100):
        recent = self.logs[-last_n:]
        if not recent:  # avoid np.mean/percentile on an empty log
            return {"total_queries": 0}
        return {
            "total_queries": len(recent),
            "avg_latency_ms": np.mean([l["latency_ms"] for l in recent]),
            "p95_latency_ms": np.percentile([l["latency_ms"] for l in recent], 95),
            "avg_contexts": np.mean([l["num_contexts"] for l in recent]),
            "answer_rate": np.mean([l["has_answer"] for l in recent]),
        }

    def alert_check(self):
        metrics = self.get_metrics()
        alerts = []
        if metrics.get("avg_latency_ms", 0) > 3000:
            alerts.append(f"⚠️ High latency: {metrics['avg_latency_ms']:.0f}ms")
        if metrics.get("answer_rate", 1) < 0.8:
            alerts.append(f"⚠️ Low answer rate: {metrics['answer_rate']:.1%}")
        return alerts

# Usage
monitor = RAGMonitor()

# Wrap the RAG pipeline
def monitored_query(pipeline, query):
    start = time.time()
    response = pipeline.query(query)
    latency = (time.time() - start) * 1000

    monitor.log_query(query, response["answer"], response["contexts"], latency)
    return response

# Check health
print(monitor.get_metrics())
print(monitor.alert_check())
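Why track p95 latency as well as the average: a single slow outlier shifts the tail far more than the mean, so p95 surfaces problems the average hides. A quick standalone illustration (the latency values are made up):

```python
import numpy as np

# Four fast queries and one slow outlier, in milliseconds
latencies = [120, 150, 180, 200, 2500]

print(f"avg: {np.mean(latencies):.0f}ms")            # 630ms
print(f"p95: {np.percentile(latencies, 95):.0f}ms")  # 2040ms
```

The average still looks tolerable, while p95 makes the outlier impossible to miss.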

Checkpoint

Do you understand how to monitor a RAG system in production?

🎯 Summary

📝 Quiz

  1. What does the Faithfulness metric measure?

    • Response speed
    • Whether the answer is grounded in the context or hallucinated
    • The number of documents retrieved
    • User satisfaction
  2. What does Context Recall measure?

    • The fraction of relevant docs that were retrieved (nothing missed)
    • Retrieval speed
    • The number of tokens in the context
    • Similarity score
  3. What does MRR (Mean Reciprocal Rank) measure?

    • The position of the first relevant result (1/rank)
    • The average number of results
    • The total number of relevant documents
    • Cosine similarity

Key Takeaways

  1. RAGAS — the standard framework for RAG evaluation
  2. 4 Core Metrics — Faithfulness, Relevancy, Precision, Recall
  3. Custom Evaluation — LLM-as-Judge for domain-specific metrics
  4. Retrieval Metrics — Hit Rate, MRR, NDCG; no LLM required
  5. Monitoring — track latency, answer rate, and quality in production

Self-check Questions

  1. Which dimensions of RAG quality does the RAGAS framework measure?
  2. What does a low faithfulness score mean, and how can you fix it?
  3. Compare Context Precision vs Context Recall — when should you prioritize each?
  4. How do you monitor a RAG system in production?

🎉 Great job! You have completed the RAG Evaluation lesson!

Next up: Capstone Project — build a complete production RAG system!


🚀 Next Lesson

Capstone Project — build a complete Document Q&A System!