🎯 Lesson Objectives
Semantic search is good but not perfect. Combining it with keyword search (BM25) and reranking produces a much stronger retrieval pipeline.
After this lesson, you will know:
✅ BM25 keyword search ✅ Hybrid search (BM25 + semantic) ✅ Reranking with a Cross-Encoder ✅ A production retrieval pipeline
🔍 Why Hybrid Search?
Semantic search alone:
  ✅ Understands meaning ("car" ≈ "automobile")
  ❌ May miss exact terms ("Nghị định 38/2022/NĐ-CP")
  ❌ Poor with proper nouns, codes, numbers

Keyword search alone:
  ✅ Exact matching ("Nghị định 38/2022/NĐ-CP")
  ❌ Misses synonyms ("xe hơi" ≠ "ô tô")

Hybrid = best of both worlds!
How Hybrid Search Works
Hybrid Search Pipeline
Checkpoint
Do you understand why hybrid search, combining keyword + semantic, beats either one used alone?
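As a toy illustration (plain Python, no models; `keyword_match` is a made-up helper, not real BM25): naive keyword matching nails exact legal codes but cannot bridge synonyms, which is exactly the gap semantic search fills.

```python
def keyword_match(query: str, doc: str) -> bool:
    """Naive keyword search: every query token must appear in the document."""
    doc_tokens = doc.lower().split()
    return all(tok in doc_tokens for tok in query.lower().split())

docs = [
    "Nghị định 38/2022/NĐ-CP quy định mức lương tối thiểu",
    "Giá ô tô nhập khẩu tăng trong quý 1",
]

# Exact codes: keyword search works perfectly
print(keyword_match("38/2022/NĐ-CP", docs[0]))  # True

# Synonyms: "xe hơi" means the same as "ô tô", but keyword search misses it
print(keyword_match("xe hơi", docs[1]))  # False
```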
📐 BM25 Keyword Search
BM25 Implementation
```python
# pip install rank-bm25

from rank_bm25 import BM25Okapi
import numpy as np
from typing import List
from underthesea import word_tokenize  # Vietnamese tokenizer

class BM25Retriever:
    def __init__(self, documents: List[str]):
        # Tokenize documents (important for Vietnamese!)
        self.documents = documents
        self.tokenized_docs = [
            word_tokenize(doc, format="text").split()
            for doc in documents
        ]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def search(self, query: str, k: int = 5) -> List[dict]:
        tokenized_query = word_tokenize(query, format="text").split()
        scores = self.bm25.get_scores(tokenized_query)

        # Get top-k indices
        top_indices = np.argsort(scores)[-k:][::-1]

        results = []
        for idx in top_indices:
            if scores[idx] > 0:
                results.append({
                    "content": self.documents[idx],
                    "score": float(scores[idx]),
                    "index": int(idx)
                })
        return results

# Usage
documents = [
    "Nghị định 38/2022/NĐ-CP quy định mức lương tối thiểu",
    "Luật Lao động 2019 về thời gian làm việc",
    "Mức lương tối thiểu vùng 1 là 4.680.000 đồng/tháng"
]

bm25 = BM25Retriever(documents)
results = bm25.search("lương tối thiểu Nghị định 38")
for r in results:
    print(f"Score: {r['score']:.2f} | {r['content'][:60]}")
```
Checkpoint
Do you understand how BM25 keyword search scores documents based on term frequency?
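To see the term-frequency behavior concretely, here is a minimal, self-contained sketch of the Okapi BM25 formula (`bm25_score` and the tiny `corpus` are illustrative, not the internals of `rank_bm25`): each extra occurrence of a query term adds less score than the previous one.

```python
import math

corpus = [
    ["lương", "tối", "thiểu"],
    ["thời", "gian", "làm", "việc"],
]

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Minimal Okapi BM25 score of one document, for illustration only."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    N = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)                  # term frequency in this doc
        df = sum(1 for d in corpus if term in d)    # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        # Length normalization: longer docs need more matches for the same score
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

one = bm25_score(["lương"], ["lương", "tối", "thiểu"], corpus)
two = bm25_score(["lương"], ["lương", "lương", "thiểu"], corpus)
print(two > one, two < 2 * one)  # frequency helps, but with diminishing returns
```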
💻 Hybrid Search Implementation
Reciprocal Rank Fusion
```python
def reciprocal_rank_fusion(results_list, k=60):
    """Merge multiple result lists using RRF.

    RRF score = sum(1 / (k + rank_i)) for each result list
    """
    fused_scores = {}
    doc_map = {}

    for results in results_list:
        for rank, doc in enumerate(results):
            doc_id = hash(doc["content"])
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
                doc_map[doc_id] = doc
            fused_scores[doc_id] += 1 / (k + rank + 1)

    # Sort by fused score
    sorted_docs = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)

    return [
        {**doc_map[doc_id], "rrf_score": score}
        for doc_id, score in sorted_docs
    ]
```
HybridRetriever Class
```python
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Assumes `vectorstore` is a Chroma store built with OpenAIEmbeddings
class HybridRetriever:
    def __init__(self, documents, vectorstore):
        self.bm25 = BM25Retriever(documents)
        self.vectorstore = vectorstore

    def search(self, query, k=5):
        # Fetch extra candidates from each retriever before fusing
        bm25_results = self.bm25.search(query, k=k * 2)

        semantic_docs = self.vectorstore.similarity_search_with_score(query, k=k * 2)
        semantic_results = [
            {"content": doc.page_content, "score": float(score), "metadata": doc.metadata}
            for doc, score in semantic_docs
        ]

        # Merge with RRF (rank-based, so no per-retriever weights are needed)
        fused = reciprocal_rank_fusion([bm25_results, semantic_results])
        return fused[:k]
```
LangChain EnsembleRetriever
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever as LCBm25

# Setup retrievers (from_texts, since `documents` here is a list of strings)
bm25_retriever = LCBm25.from_texts(documents, k=5)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Ensemble with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever],
    weights=[0.4, 0.6]  # Adjust based on your use case
)

results = ensemble_retriever.invoke("lương tối thiểu 2024")
```
Checkpoint
Do you understand how Reciprocal Rank Fusion merges results from BM25 and semantic search?
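A quick worked example of the RRF arithmetic with the standard k = 60: a document that appears in both result lists beats one that only tops a single list.

```python
K = 60  # standard RRF smoothing constant

# Document D: ranked 1st by BM25 and 3rd by semantic search
# (ranks are 0-indexed here, matching enumerate() in the fusion code)
score_d = 1 / (K + 0 + 1) + 1 / (K + 2 + 1)  # 1/61 + 1/63 ≈ 0.0323

# Document E: top of the BM25 list, but absent from the semantic list
score_e = 1 / (K + 0 + 1)                    # 1/61 ≈ 0.0164

# D wins: agreement between retrievers beats a single high rank
print(score_d > score_e)  # True
```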
⚡ Reranking
Why Rerank?
Initial retrieval (fast, less accurate):
  → 20 candidates from hybrid search

Reranking (slower, more accurate):
  → Score each candidate with a cross-encoder
  → Return top 5

Cross-Encoder vs Bi-Encoder:
  Bi-Encoder: encode(query) · encode(doc) → fast, separate
  Cross-Encoder: encode(query + doc) → slow, joint attention, more accurate
Cohere Reranker (API)
```python
# pip install cohere

import cohere

co = cohere.Client("your-api-key")

def rerank_cohere(query, documents, top_n=5):
    results = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model="rerank-multilingual-v3.0"  # Supports Vietnamese!
    )

    reranked = []
    for r in results.results:
        reranked.append({
            "content": documents[r.index],
            "relevance_score": r.relevance_score,
            "index": r.index
        })
    return reranked
```
Cross-Encoder Reranker (Free)
```python
# pip install sentence-transformers

from sentence_transformers import CrossEncoder

# English MS MARCO cross-encoder; for Vietnamese queries,
# swap in a multilingual cross-encoder checkpoint
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_cross_encoder(query, documents, top_n=5):
    # Score each (query, document) pair jointly
    pairs = [(query, doc) for doc in documents]
    scores = model.predict(pairs)

    # Sort by score, highest first
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return [
        {"content": doc, "score": float(score)}
        for doc, score in scored_docs[:top_n]
    ]
```
LangChain + Reranker
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Setup reranker
compressor = CohereRerank(
    model="rerank-multilingual-v3.0",
    top_n=5
)

# Wrap retriever with reranker
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever
)

results = compression_retriever.invoke("lương tối thiểu theo nghị định mới")
```
Checkpoint
Do you understand why Cross-Encoder reranking is more accurate than a Bi-Encoder but slower?
💻 Complete Pipeline
```python
class ProductionRetriever:
    """Production-grade retrieval pipeline."""

    def __init__(self, vectorstore, documents, cross_encoder_model=None):
        self.vectorstore = vectorstore
        self.bm25 = BM25Retriever(documents)
        self.cross_encoder = cross_encoder_model or CrossEncoder(
            "cross-encoder/ms-marco-MiniLM-L-6-v2"
        )

    def retrieve(self, query, top_k=5, initial_k=20):
        """Full retrieval pipeline: Hybrid Search → Rerank → Top-K"""

        # Stage 1: BM25 search
        bm25_results = self.bm25.search(query, k=initial_k)

        # Stage 2: Semantic search
        # (Chroma returns distances, but RRF only uses rank order, so this is fine)
        semantic_results = self.vectorstore.similarity_search_with_score(
            query, k=initial_k
        )
        semantic_formatted = [
            {"content": doc.page_content, "score": float(score), "metadata": doc.metadata}
            for doc, score in semantic_results
        ]

        # Stage 3: Reciprocal Rank Fusion
        fused = reciprocal_rank_fusion([bm25_results, semantic_formatted])

        # Stage 4: Rerank with cross-encoder
        if fused:
            pairs = [(query, doc["content"]) for doc in fused]
            scores = self.cross_encoder.predict(pairs)

            for doc, score in zip(fused, scores):
                doc["rerank_score"] = float(score)

            fused.sort(key=lambda x: x["rerank_score"], reverse=True)

        return fused[:top_k]

# Usage
retriever = ProductionRetriever(vectorstore, documents)
results = retriever.retrieve("lương tối thiểu vùng 1 năm 2024")

for i, r in enumerate(results):
    print(f"\n--- Result {i+1} (score: {r['rerank_score']:.3f}) ---")
    print(r["content"][:200])
```
Checkpoint
Do you understand the full pipeline: Hybrid Search → RRF → Rerank → Top-K?
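To make the control flow runnable without any models, here is a toy end-to-end sketch: `keyword_rank` stands in for BM25 and `char_ngram_rank` stands in for embedding search (both are hypothetical helpers for illustration only); the fusion step is genuine RRF over rank positions.

```python
def rrf(result_lists, k=60):
    """Reciprocal Rank Fusion over lists of doc ids (best first)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def keyword_rank(query, docs):
    """Rank doc ids by shared-token count (toy stand-in for BM25)."""
    q = set(query.lower().split())
    return sorted(range(len(docs)),
                  key=lambda i: len(q & set(docs[i].lower().split())),
                  reverse=True)

def char_ngram_rank(query, docs, n=3):
    """Rank doc ids by character trigram overlap (toy stand-in for embeddings)."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    q = grams(query.lower())
    return sorted(range(len(docs)),
                  key=lambda i: len(q & grams(docs[i].lower())),
                  reverse=True)

docs = [
    "Nghị định 38/2022/NĐ-CP quy định mức lương tối thiểu",
    "Luật Lao động 2019 về thời gian làm việc",
    "Mức lương tối thiểu vùng 1 là 4.680.000 đồng/tháng",
]

query = "lương tối thiểu"
fused = rrf([keyword_rank(query, docs), char_ngram_rank(query, docs)])
print([docs[i][:30] for i in fused[:2]])  # the minimum-wage docs rank above the unrelated one
```

In a real deployment a cross-encoder pass over the fused candidates would follow, exactly as in `ProductionRetriever` above.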
🎯 Summary
📝 Quiz
1. What does hybrid search combine?
- Two LLM models
- Keyword search (BM25) + semantic search
- Two databases
- Query + document
2. What is Reciprocal Rank Fusion used for?
- Merging results from multiple retrievers based on ranking position
- Generating new queries
- Training a new model
- Removing duplicate documents
3. How does a Cross-Encoder reranker work?
- It encodes the (query, document) pair jointly through attention → relevance score
- It encodes the query and document separately
- It only counts keyword matches
- It ranks randomly
Key Takeaways
- Hybrid Search: BM25 + semantic covers both exact matches and meaning
- RRF: a simple yet effective fusion algorithm
- Reranking: a Cross-Encoder significantly improves precision
- Pipeline: retrieve many → fuse → rerank → return top-k
Self-Check Questions
- Why does BM25 keyword search still matter when you already have semantic search?
- How does Reciprocal Rank Fusion merge results from multiple retrievers?
- How does a Cross-Encoder differ from a Bi-Encoder, and what is the trade-off?
- Describe the full production retrieval pipeline from query to final results.
🎉 Great work! You have completed the Hybrid Search & Reranking lesson!
Up next: let's explore RAG Evaluation to measure retrieval quality!
🚀 Next Lesson
RAG Evaluation: the RAGAS framework, custom metrics, and a monitoring pipeline!
