🎯 Lesson Objectives
Semantic search is good but not perfect. Combining it with keyword search (BM25) and reranking produces a much stronger retrieval pipeline.
After this lesson, you will know:
✅ BM25 keyword search ✅ Hybrid search (BM25 + semantic) ✅ Reranking with a Cross-Encoder ✅ A production retrieval pipeline
🔍 Why Hybrid Search?
Semantic search alone:
  ✅ Understands meaning ("car" ≈ "automobile")
  ❌ May miss exact terms ("Nghị định 38/2022/NĐ-CP")
  ❌ Poor with proper nouns, codes, numbers

Keyword search alone:
  ✅ Exact matching ("Nghị định 38/2022/NĐ-CP")
  ❌ Misses synonyms ("xe hơi" ≠ "ô tô")

Hybrid = best of both worlds!
How Hybrid Search Works
Hybrid Search Pipeline
Checkpoint
Do you understand why hybrid search, combining keyword + semantic, beats either one used alone?
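As a toy illustration (plain Python, no models; `keyword_match` is a made-up helper, not real BM25): naive keyword matching nails exact legal codes but cannot bridge synonyms, which is exactly the gap semantic search fills.

```python
def keyword_match(query: str, doc: str) -> bool:
    """Naive keyword search: every query token must appear in the document."""
    doc_tokens = doc.lower().split()
    return all(tok in doc_tokens for tok in query.lower().split())

docs = [
    "Nghị định 38/2022/NĐ-CP quy định mức lương tối thiểu",
    "Giá ô tô nhập khẩu tăng trong quý 1",
]

# Exact codes: keyword search works perfectly
print(keyword_match("38/2022/NĐ-CP", docs[0]))  # True

# Synonyms: "xe hơi" means the same as "ô tô", but keyword search misses it
print(keyword_match("xe hơi", docs[1]))  # False
```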
📐 BM25 Keyword Search
BM25 Implementation
```python
# pip install rank-bm25

from rank_bm25 import BM25Okapi
import numpy as np
from typing import List
from underthesea import word_tokenize  # Vietnamese tokenizer

class BM25Retriever:
    def __init__(self, documents: List[str]):
        # Tokenize documents (important for Vietnamese!)
        self.documents = documents
        self.tokenized_docs = [
            word_tokenize(doc, format="text").split()
            for doc in documents
        ]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def search(self, query: str, k: int = 5) -> List[dict]:
        tokenized_query = word_tokenize(query, format="text").split()
        scores = self.bm25.get_scores(tokenized_query)

        # Get top-k indices
        top_indices = np.argsort(scores)[-k:][::-1]

        results = []
        for idx in top_indices:
            if scores[idx] > 0:
                results.append({
                    "content": self.documents[idx],
                    "score": float(scores[idx]),
                    "index": int(idx)
                })
        return results

# Usage
documents = [
    "Nghị định 38/2022/NĐ-CP quy định mức lương tối thiểu",
    "Luật Lao động 2019 về thời gian làm việc",
    "Mức lương tối thiểu vùng 1 là 4.680.000 đồng/tháng"
]

bm25 = BM25Retriever(documents)
results = bm25.search("lương tối thiểu Nghị định 38")
for r in results:
    print(f"Score: {r['score']:.2f} | {r['content'][:60]}")
```
Checkpoint
Do you understand how BM25 keyword search scores documents based on term frequency?
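To see the term-frequency behavior concretely, here is a minimal, self-contained sketch of the Okapi BM25 formula (`bm25_score` and the tiny `corpus` are illustrative, not the internals of `rank_bm25`): each extra occurrence of a query term adds less score than the previous one.

```python
import math

corpus = [
    ["lương", "tối", "thiểu"],
    ["thời", "gian", "làm", "việc"],
]

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Minimal Okapi BM25 score of one document, for illustration only."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    N = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)                  # term frequency in this doc
        df = sum(1 for d in corpus if term in d)    # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        # Length normalization: longer docs need more matches for the same score
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

one = bm25_score(["lương"], ["lương", "tối", "thiểu"], corpus)
two = bm25_score(["lương"], ["lương", "lương", "thiểu"], corpus)
print(two > one, two < 2 * one)  # frequency helps, but with diminishing returns
```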
💻 Hybrid Search Implementation
Reciprocal Rank Fusion
```python
def reciprocal_rank_fusion(results_list, k=60):
    """Merge multiple result lists using RRF.

    RRF score = sum(1 / (k + rank_i)) for each result list
    """
    fused_scores = {}
    doc_map = {}

    for results in results_list:
        for rank, doc in enumerate(results):
            doc_id = hash(doc["content"])
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
                doc_map[doc_id] = doc
            fused_scores[doc_id] += 1 / (k + rank + 1)

    # Sort by fused score
    sorted_docs = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)

    return [
        {**doc_map[doc_id], "rrf_score": score}
        for doc_id, score in sorted_docs
    ]
```
HybridRetriever Class
```python
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Assumes `vectorstore` is a Chroma store built with OpenAIEmbeddings
class HybridRetriever:
    def __init__(self, documents, vectorstore):
        self.bm25 = BM25Retriever(documents)
        self.vectorstore = vectorstore

    def search(self, query, k=5):
        # Fetch extra candidates from each retriever before fusing
        bm25_results = self.bm25.search(query, k=k * 2)

        semantic_docs = self.vectorstore.similarity_search_with_score(query, k=k * 2)
        semantic_results = [
            {"content": doc.page_content, "score": float(score), "metadata": doc.metadata}
            for doc, score in semantic_docs
        ]

        # Merge with RRF (rank-based, so no per-retriever weights are needed)
        fused = reciprocal_rank_fusion([bm25_results, semantic_results])
        return fused[:k]
```
LangChain EnsembleRetriever
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever as LCBm25

# Setup retrievers (from_texts, since `documents` here is a list of strings)
bm25_retriever = LCBm25.from_texts(documents, k=5)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Ensemble with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever],
    weights=[0.4, 0.6]  # Adjust based on your use case
)

results = ensemble_retriever.invoke("lương tối thiểu 2024")
```
Checkpoint
Do you understand how Reciprocal Rank Fusion merges results from BM25 and semantic search?
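A quick worked example of the RRF arithmetic with the standard k = 60: a document that appears in both result lists beats one that only tops a single list.

```python
K = 60  # standard RRF smoothing constant

# Document D: ranked 1st by BM25 and 3rd by semantic search
# (ranks are 0-indexed here, matching enumerate() in the fusion code)
score_d = 1 / (K + 0 + 1) + 1 / (K + 2 + 1)  # 1/61 + 1/63 ≈ 0.0323

# Document E: top of the BM25 list, but absent from the semantic list
score_e = 1 / (K + 0 + 1)                    # 1/61 ≈ 0.0164

# D wins: agreement between retrievers beats a single high rank
print(score_d > score_e)  # True
```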
⚡ Reranking
Why Rerank?
Initial retrieval (fast, less accurate):
  → 20 candidates from hybrid search

Reranking (slower, more accurate):
  → Score each candidate with a cross-encoder
  → Return top 5

Cross-Encoder vs Bi-Encoder:
  Bi-Encoder: encode(query) · encode(doc) → fast, separate
  Cross-Encoder: encode(query + doc) → slow, joint attention, more accurate
Cohere Reranker (API)
```python
# pip install cohere

import cohere

co = cohere.Client("your-api-key")

def rerank_cohere(query, documents, top_n=5):
    results = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model="rerank-multilingual-v3.0"  # Supports Vietnamese!
    )

    reranked = []
    for r in results.results:
        reranked.append({
            "content": documents[r.index],
            "relevance_score": r.relevance_score,
            "index": r.index
        })
    return reranked
```
Cross-Encoder Reranker (Free)
```python
# pip install sentence-transformers

from sentence_transformers import CrossEncoder

# English MS MARCO cross-encoder; for Vietnamese queries,
# swap in a multilingual cross-encoder checkpoint
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_cross_encoder(query, documents, top_n=5):
    # Score each (query, document) pair jointly
    pairs = [(query, doc) for doc in documents]
    scores = model.predict(pairs)

    # Sort by score, highest first
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return [
        {"content": doc, "score": float(score)}
        for doc, score in scored_docs[:top_n]
    ]
```
LangChain + Reranker
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Setup reranker
compressor = CohereRerank(
    model="rerank-multilingual-v3.0",
    top_n=5
)

# Wrap retriever with reranker
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever
)

results = compression_retriever.invoke("lương tối thiểu theo nghị định mới")
```
Checkpoint
Do you understand why Cross-Encoder reranking is more accurate than a Bi-Encoder but slower?
💻 Complete Pipeline
```python
class ProductionRetriever:
    """Production-grade retrieval pipeline."""

    def __init__(self, vectorstore, documents, cross_encoder_model=None):
        self.vectorstore = vectorstore
        self.bm25 = BM25Retriever(documents)
        self.cross_encoder = cross_encoder_model or CrossEncoder(
            "cross-encoder/ms-marco-MiniLM-L-6-v2"
        )

    def retrieve(self, query, top_k=5, initial_k=20):
        """Full retrieval pipeline: Hybrid Search → Rerank → Top-K"""

        # Stage 1: BM25 search
        bm25_results = self.bm25.search(query, k=initial_k)

        # Stage 2: Semantic search
        # (Chroma returns distances, but RRF only uses rank order, so this is fine)
        semantic_results = self.vectorstore.similarity_search_with_score(
            query, k=initial_k
        )
        semantic_formatted = [
            {"content": doc.page_content, "score": float(score), "metadata": doc.metadata}
            for doc, score in semantic_results
        ]

        # Stage 3: Reciprocal Rank Fusion
        fused = reciprocal_rank_fusion([bm25_results, semantic_formatted])

        # Stage 4: Rerank with cross-encoder
        if fused:
            pairs = [(query, doc["content"]) for doc in fused]
            scores = self.cross_encoder.predict(pairs)

            for doc, score in zip(fused, scores):
                doc["rerank_score"] = float(score)

            fused.sort(key=lambda x: x["rerank_score"], reverse=True)

        return fused[:top_k]

# Usage
retriever = ProductionRetriever(vectorstore, documents)
results = retriever.retrieve("lương tối thiểu vùng 1 năm 2024")

for i, r in enumerate(results):
    print(f"\n--- Result {i+1} (score: {r['rerank_score']:.3f}) ---")
    print(r["content"][:200])
```
Checkpoint
Do you understand the full pipeline: Hybrid Search → RRF → Rerank → Top-K?
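To make the control flow runnable without any models, here is a toy end-to-end sketch: `keyword_rank` stands in for BM25 and `char_ngram_rank` stands in for embedding search (both are hypothetical helpers for illustration only); the fusion step is genuine RRF over rank positions.

```python
def rrf(result_lists, k=60):
    """Reciprocal Rank Fusion over lists of doc ids (best first)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def keyword_rank(query, docs):
    """Rank doc ids by shared-token count (toy stand-in for BM25)."""
    q = set(query.lower().split())
    return sorted(range(len(docs)),
                  key=lambda i: len(q & set(docs[i].lower().split())),
                  reverse=True)

def char_ngram_rank(query, docs, n=3):
    """Rank doc ids by character trigram overlap (toy stand-in for embeddings)."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    q = grams(query.lower())
    return sorted(range(len(docs)),
                  key=lambda i: len(q & grams(docs[i].lower())),
                  reverse=True)

docs = [
    "Nghị định 38/2022/NĐ-CP quy định mức lương tối thiểu",
    "Luật Lao động 2019 về thời gian làm việc",
    "Mức lương tối thiểu vùng 1 là 4.680.000 đồng/tháng",
]

query = "lương tối thiểu"
fused = rrf([keyword_rank(query, docs), char_ngram_rank(query, docs)])
print([docs[i][:30] for i in fused[:2]])  # the minimum-wage docs rank above the unrelated one
```

In a real deployment a cross-encoder pass over the fused candidates would follow, exactly as in `ProductionRetriever` above.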
🎯 Summary
📝 Quiz
1. What does hybrid search combine?
- Two LLM models
- Keyword search (BM25) + semantic search
- Two databases
- Query + document
2. What is Reciprocal Rank Fusion used for?
- Merging results from multiple retrievers based on ranking position
- Generating new queries
- Training a new model
- Removing duplicate documents
3. How does a Cross-Encoder reranker work?
- It encodes the (query, document) pair jointly through attention → relevance score
- It encodes the query and document separately
- It only counts keyword matches
- It ranks randomly
Key Takeaways
- Hybrid Search: BM25 + semantic covers both exact matches and meaning
- RRF: a simple yet effective fusion algorithm
- Reranking: a Cross-Encoder significantly improves precision
- Pipeline: retrieve many → fuse → rerank → return top-k
Self-Check Questions
- Why does BM25 keyword search still matter when you already have semantic search?
- How does Reciprocal Rank Fusion merge results from multiple retrievers?
- How does a Cross-Encoder differ from a Bi-Encoder, and what is the trade-off?
- Describe the full production retrieval pipeline from query to final results.
🎉 Great work! You have completed the Hybrid Search & Reranking lesson!
Up next: let's explore RAG Evaluation to measure retrieval quality!
🚀 Next Lesson
RAG Evaluation: the RAGAS framework, custom metrics, and a monitoring pipeline!
