
Hybrid Search & Reranking

Combine keyword search, semantic search, and reranking to optimize retrieval


🎯 Mục tiêu bài học


Semantic search is powerful but not perfect. Combining it with keyword search (BM25) and reranking produces a much stronger retrieval pipeline.

After this lesson, you will know:

✅ BM25 keyword search
✅ Hybrid search (BM25 + semantic)
✅ Reranking with Cross-Encoder
✅ Production retrieval pipeline


🔍 Why Hybrid Search?

Example
Semantic Search alone:
  ✅ Understands meaning ("car" ≈ "automobile")
  ❌ May miss exact terms ("Nghị định 38/2022/NĐ-CP")
  ❌ Poor with proper nouns, codes, numbers

Keyword Search alone:
  ✅ Exact matching ("Nghị định 38/2022/NĐ-CP")
  ❌ Misses synonyms ("xe hơi" ≠ "ô tô")

Hybrid = Best of both worlds!

How Hybrid Search Works

Hybrid Search Pipeline

Query: "Nghị định 38 về lương tối thiểu"
→ 🔤 BM25 Keyword Search (exact match)  +  🧠 Semantic Search (meaning match)
→ 🔀 Reciprocal Rank Fusion (merge results)
→ 📊 Reranker (cross-encoder scoring)
→ 🏆 Top-K Results

Checkpoint

Do you understand why hybrid search, combining keyword + semantic, works better than either one alone?


📐 BM25 Keyword Search


BM25 Implementation

python.py
# pip install rank-bm25 underthesea

from rank_bm25 import BM25Okapi
import numpy as np
from typing import List
from underthesea import word_tokenize  # Vietnamese tokenizer

class BM25Retriever:
    def __init__(self, documents: List[str]):
        # Tokenize documents (important for Vietnamese!)
        self.documents = documents
        self.tokenized_docs = [
            word_tokenize(doc, format="text").split()
            for doc in documents
        ]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def search(self, query: str, k: int = 5) -> List[dict]:
        tokenized_query = word_tokenize(query, format="text").split()
        scores = self.bm25.get_scores(tokenized_query)

        # Get top-k indices
        top_indices = np.argsort(scores)[-k:][::-1]

        results = []
        for idx in top_indices:
            if scores[idx] > 0:
                results.append({
                    "content": self.documents[idx],
                    "score": float(scores[idx]),
                    "index": int(idx)
                })
        return results

# Usage
documents = [
    "Nghị định 38/2022/NĐ-CP quy định mức lương tối thiểu",
    "Luật Lao động 2019 về thời gian làm việc",
    "Mức lương tối thiểu vùng 1 là 4.680.000 đồng/tháng"
]

bm25 = BM25Retriever(documents)
results = bm25.search("lương tối thiểu Nghị định 38")
for r in results:
    print(f"Score: {r['score']:.2f} | {r['content'][:60]}")
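For intuition about what term frequency buys you, a single BM25 term contribution can be computed by hand. The sketch below is a minimal pure-Python version of the Okapi BM25 term formula (IDF times saturated TF); the collection sizes are invented, and k1=1.5, b=0.75 are the rank_bm25 defaults, used here purely for illustration.

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    """Okapi BM25 contribution of a single query term to one document.

    tf: term frequency inside the document
    df: number of documents containing the term
    n_docs: total number of documents in the collection
    doc_len / avg_len: document length vs. the collection average
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    tf_sat = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * tf_sat

# A rare term (in 1 of 10 docs) outweighs a common term (in 9 of 10)
# at the same in-document frequency -- this is the IDF part at work:
rare = bm25_term_score(tf=2, df=1, n_docs=10, doc_len=50, avg_len=60)
common = bm25_term_score(tf=2, df=9, n_docs=10, doc_len=50, avg_len=60)
print(rare > common)  # True
```

This rarity weighting is exactly why BM25 is so strong on identifiers like "38/2022/NĐ-CP": such tokens appear in very few documents, so matching them dominates the score.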

Checkpoint

Do you understand how BM25 keyword search scores documents based on term frequency?


💻 Hybrid Search Implementation


Reciprocal Rank Fusion

python.py
def reciprocal_rank_fusion(results_list, k=60):
    """Merge multiple result lists using RRF.

    RRF score = sum over lists of 1 / (k + rank), with 1-based ranks.
    """
    fused_scores = {}
    doc_map = {}

    for results in results_list:
        for rank, doc in enumerate(results):
            doc_id = hash(doc["content"])
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
                doc_map[doc_id] = doc
            fused_scores[doc_id] += 1 / (k + rank + 1)

    # Sort by fused score
    sorted_docs = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)

    return [
        {**doc_map[doc_id], "rrf_score": score}
        for doc_id, score in sorted_docs
    ]
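A quick sanity check on this function: a document that both retrievers rank should beat documents that appear in only one list. The snippet restates reciprocal_rank_fusion so it runs standalone, on made-up documents:

```python
def reciprocal_rank_fusion(results_list, k=60):
    """Merge ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    fused_scores, doc_map = {}, {}
    for results in results_list:
        for rank, doc in enumerate(results):
            doc_id = hash(doc["content"])
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
                doc_map[doc_id] = doc
            fused_scores[doc_id] += 1 / (k + rank + 1)
    ranked = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return [{**doc_map[i], "rrf_score": s} for i, s in ranked]

# Toy rankings: "doc A" is ranked by both retrievers, the others by one each
bm25_hits = [{"content": "doc A"}, {"content": "doc B"}]
semantic_hits = [{"content": "doc C"}, {"content": "doc A"}]

fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
print([d["content"] for d in fused])  # ['doc A', 'doc C', 'doc B']
```

"doc A" wins despite never being ranked first in either list, because RRF rewards agreement between retrievers rather than any single retriever's raw score.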

HybridRetriever Class

python.py
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

class HybridRetriever:
    def __init__(self, documents, vectorstore):
        self.bm25 = BM25Retriever(documents)
        self.vectorstore = vectorstore

    def search(self, query, k=5):
        # Get BM25 results (over-fetch so RRF has candidates to merge)
        bm25_results = self.bm25.search(query, k=k * 2)

        # Get semantic results
        semantic_docs = self.vectorstore.similarity_search_with_score(query, k=k * 2)
        semantic_results = [
            {"content": doc.page_content, "score": float(score), "metadata": doc.metadata}
            for doc, score in semantic_docs
        ]

        # Merge with RRF (rank-based, so no per-retriever weights needed here)
        fused = reciprocal_rank_fusion([bm25_results, semantic_results])

        return fused[:k]

LangChain EnsembleRetriever

python.py
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever as LCBm25

# Setup retrievers (from_texts, since documents is a list of strings)
bm25_retriever = LCBm25.from_texts(documents)
bm25_retriever.k = 5
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Ensemble with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever],
    weights=[0.4, 0.6]  # Adjust based on your use case
)

results = ensemble_retriever.invoke("lương tối thiểu 2024")
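Under the hood, EnsembleRetriever applies these weights roughly as multipliers on each retriever's RRF contribution. A minimal weighted-RRF sketch in pure Python (the rankings below are invented for illustration):

```python
def weighted_rrf(results_list, weights, k=60):
    """Weighted RRF: each list's 1/(k + rank) contribution is scaled by its weight."""
    scores = {}
    for results, weight in zip(results_list, weights):
        for rank, content in enumerate(results):
            scores[content] = scores.get(content, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["điều 3", "điều 91"]
semantic_ranking = ["điều 91", "mức lương vùng 1"]

# With the semantic retriever weighted higher, "điều 91" (ranked by both
# retrievers) comes out on top:
order = weighted_rrf([bm25_ranking, semantic_ranking], weights=[0.4, 0.6])
print(order)  # ['điều 91', 'mức lương vùng 1', 'điều 3']
```

Raising the semantic weight is a reasonable default for natural-language questions; raise the BM25 weight instead when queries are dominated by codes and identifiers.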

Checkpoint

Do you understand how Reciprocal Rank Fusion merges results from BM25 and semantic search?


⚡ Reranking


Why Rerank?

Example
Initial retrieval (fast, less accurate):
  → 20 candidates from hybrid search

Reranking (slower, more accurate):
  → Score each candidate with cross-encoder
  → Return top 5

Cross-Encoder vs Bi-Encoder:
  Bi-Encoder: encode(query) · encode(doc) → fast, separate
  Cross-Encoder: encode(query + doc) → slow, joint attention, more accurate
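The speed gap can be illustrated without any real model: a bi-encoder's document vectors are embedded once offline, so query-time scoring is just dot products, while a cross-encoder must run a forward pass per (query, document) pair. A toy sketch (the vectors and the dummy model_forward are made up):

```python
# Toy illustration of the bi- vs cross-encoder cost difference (not real models).

def bi_encoder_scores(query_vec, doc_vecs):
    # Document vectors were embedded once, offline; per query we only
    # need one dot product per document.
    return [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]

def cross_encoder_scores(query, docs, model_forward):
    # One joint forward pass per (query, doc) pair: cost grows with len(docs).
    return [model_forward(query, doc) for doc in docs]

doc_vecs = [[0.1, 0.9], [0.8, 0.2]]  # "precomputed" embeddings
print(bi_encoder_scores([1.0, 0.0], doc_vecs))  # [0.1, 0.8]

# Stand-in for an expensive cross-encoder forward pass:
print(cross_encoder_scores("q", ["doc a", "doc bb"], lambda q, d: len(d)))
```

This is why production pipelines retrieve a broad candidate set with the cheap bi-encoder first and reserve the cross-encoder for reranking only the top candidates.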

Cohere Reranker (API)

python.py
# pip install cohere

import cohere

co = cohere.Client("your-api-key")

def rerank_cohere(query, documents, top_n=5):
    results = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model="rerank-multilingual-v3.0"  # Supports Vietnamese!
    )

    reranked = []
    for r in results.results:
        reranked.append({
            "content": documents[r.index],
            "relevance_score": r.relevance_score,
            "index": r.index
        })
    return reranked

Cross-Encoder Reranker (Free)

python.py
# pip install sentence-transformers

from sentence_transformers import CrossEncoder

# Note: this model is English-trained; for Vietnamese, consider a
# multilingual reranker such as "BAAI/bge-reranker-v2-m3"
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_cross_encoder(query, documents, top_n=5):
    # Score each (query, document) pair
    pairs = [(query, doc) for doc in documents]
    scores = model.predict(pairs)

    # Sort by score
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return [
        {"content": doc, "score": float(score)}
        for doc, score in scored_docs[:top_n]
    ]

LangChain + Reranker

python.py
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Setup reranker
compressor = CohereRerank(
    model="rerank-multilingual-v3.0",
    top_n=5
)

# Wrap retriever with reranker
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever
)

results = compression_retriever.invoke("lương tối thiểu theo nghị định mới")

Checkpoint

Do you understand that Cross-Encoder reranking is more accurate than a Bi-Encoder but slower?


💻 Complete Pipeline

python.py
class ProductionRetriever:
    """Production-grade retrieval pipeline."""

    def __init__(self, vectorstore, documents, cross_encoder_model=None):
        self.vectorstore = vectorstore
        self.bm25 = BM25Retriever(documents)
        self.cross_encoder = cross_encoder_model or CrossEncoder(
            "cross-encoder/ms-marco-MiniLM-L-6-v2"
        )

    def retrieve(self, query, top_k=5, initial_k=20):
        """Full retrieval pipeline: Hybrid Search → Rerank → Top-K"""

        # Stage 1: BM25 search
        bm25_results = self.bm25.search(query, k=initial_k)

        # Stage 2: Semantic search
        semantic_results = self.vectorstore.similarity_search_with_score(
            query, k=initial_k
        )
        semantic_formatted = [
            {"content": doc.page_content, "score": score, "metadata": doc.metadata}
            for doc, score in semantic_results
        ]

        # Stage 3: Reciprocal Rank Fusion
        fused = reciprocal_rank_fusion([bm25_results, semantic_formatted])

        # Stage 4: Rerank with cross-encoder
        if len(fused) > 0:
            contents = [doc["content"] for doc in fused]
            pairs = [(query, content) for content in contents]
            scores = self.cross_encoder.predict(pairs)

            for doc, score in zip(fused, scores):
                doc["rerank_score"] = float(score)

            fused.sort(key=lambda x: x["rerank_score"], reverse=True)

        return fused[:top_k]

# Usage
retriever = ProductionRetriever(vectorstore, documents)
results = retriever.retrieve("lương tối thiểu vùng 1 năm 2024")

for i, r in enumerate(results):
    print(f"\n--- Result {i+1} (score: {r['rerank_score']:.3f}) ---")
    print(r["content"][:200])

Checkpoint

Do you understand the full pipeline: Hybrid Search → RRF → Rerank → Top-K?


🎯 Summary


📝 Quiz

  1. What does hybrid search combine?

    • Two LLM models
    • Keyword search (BM25) + Semantic search
    • Two databases
    • Query + Document
  2. What is Reciprocal Rank Fusion used for?

    • Merging results from multiple retrievers based on ranking position
    • Generating new queries
    • Training a new model
    • Removing duplicate documents
  3. How does a Cross-Encoder reranker work?

    • Encodes the (query, document) pair jointly through attention → relevance score
    • Encodes query and document separately
    • Only counts keyword matches
    • Random ranking

Key Takeaways

  1. Hybrid Search — BM25 + Semantic covers both exact matches and meaning
  2. RRF — Simple yet effective fusion algorithm
  3. Reranking — Cross-Encoder improves precision significantly
  4. Pipeline — Retrieve many → Fuse → Rerank → Return top-k

Self-check questions

  1. Why does BM25 keyword search still matter when you already have semantic search?
  2. How does Reciprocal Rank Fusion merge results from multiple retrievers?
  3. How does a Cross-Encoder reranker differ from a Bi-Encoder, and what is the trade-off?
  4. Describe the full production retrieval pipeline from query to final results.

🎉 Great job! You have completed the Hybrid Search & Reranking lesson!

Next up: RAG Evaluation, where we learn to measure quality!


🚀 Next lesson

RAG Evaluation — RAGAS framework, custom metrics, and monitoring pipeline!