🎯 Lesson Objectives
ChromaDB works well for development, but production calls for cloud-managed solutions. This lesson covers Pinecone, Weaviate, and how to choose the right vector DB for your use case.
After this lesson, you will be able to:
✅ Pinecone: setup, indexing, querying
✅ Weaviate: hybrid search, schema design
✅ Compare vector DBs and choose one for your use case
🛠️ Pinecone
Setup
```python
# pip install pinecone-client
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-api-key")

# Create index
pc.create_index(
    name="rag-knowledge-base",
    dimension=1536,  # OpenAI text-embedding-3-small
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Connect to index
index = pc.Index("rag-knowledge-base")
print(index.describe_index_stats())
```
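Note that `create_index` raises an error if the index already exists. A minimal guard, assuming the pinecone-client v3+ SDK (where `pc.list_indexes().names()` returns the existing index names):

```python
# Only create the index if it doesn't exist yet (pinecone-client v3+)
if "rag-knowledge-base" not in pc.list_indexes().names():
    pc.create_index(
        name="rag-knowledge-base",
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
```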
Upsert Vectors
```python
from openai import OpenAI

openai_client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    response = openai_client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

# Prepare documents
documents = [
    {"id": "doc1", "text": "RAG architecture combines retrieval and generation.", "source": "rag-guide"},
    {"id": "doc2", "text": "Pinecone is a managed vector database service.", "source": "pinecone-docs"},
    {"id": "doc3", "text": "Embeddings represent text as numerical vectors.", "source": "ml-basics"},
]

# Upsert with embeddings + metadata
vectors = []
for doc in documents:
    embedding = get_embedding(doc["text"])
    vectors.append({
        "id": doc["id"],
        "values": embedding,
        "metadata": {
            "text": doc["text"],
            "source": doc["source"]
        }
    })

index.upsert(vectors=vectors)
print(f"Upserted {len(vectors)} vectors")
```
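For anything beyond a handful of documents, upsert in batches to keep request sizes manageable; Pinecone's docs suggest batches of roughly 100 vectors. A minimal sketch over the `vectors` list built above:

```python
# Upsert in batches to avoid oversized requests
BATCH_SIZE = 100
for i in range(0, len(vectors), BATCH_SIZE):
    index.upsert(vectors=vectors[i:i + BATCH_SIZE])
```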
Query
```python
# Semantic search
query = "How does RAG work?"
query_embedding = get_embedding(query)

results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)

for match in results['matches']:
    print(f"Score: {match['score']:.4f}")
    print(f"Text: {match['metadata']['text']}")
    print(f"Source: {match['metadata']['source']}")
    print()
```
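Queries can also be restricted by metadata. Pinecone supports a MongoDB-style filter syntax (`$eq`, `$in`, `$gte`, ...); for example, to search only within one source:

```python
# Restrict the search to vectors whose metadata matches the filter
results = index.query(
    vector=query_embedding,
    top_k=3,
    filter={"source": {"$eq": "rag-guide"}},
    include_metadata=True
)
```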
Namespaces
```python
# Namespaces = logical partitions (multi-tenant)
# Upsert to specific namespace
index.upsert(vectors=vectors, namespace="company-a")
index.upsert(vectors=vectors, namespace="company-b")

# Query within namespace
results = index.query(
    vector=query_embedding,
    top_k=5,
    namespace="company-a",
    include_metadata=True
)
```
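Namespaces also scope deletes, which is handy for tenant offboarding:

```python
# Delete specific vectors from one tenant's namespace
index.delete(ids=["doc1"], namespace="company-a")

# Or remove a tenant's data entirely
index.delete(delete_all=True, namespace="company-b")
```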
Checkpoint
Do you know how to set up Pinecone, upsert vectors, and query with namespaces?
🛠️ Weaviate
Setup
```python
# pip install weaviate-client
import weaviate
from weaviate.classes.config import Configure, Property, DataType

# Connect to Weaviate Cloud
client = weaviate.connect_to_weaviate_cloud(
    cluster_url="your-cluster-url",
    auth_credentials=weaviate.auth.AuthApiKey("your-api-key")
)

# Or local Docker instance
# client = weaviate.connect_to_local()
```
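The v4 client keeps connections open, so close it when you're finished (or use it as a context manager), a pattern like:

```python
# Option 1: close explicitly when you're done with the client
client.close()

# Option 2: a context manager closes the connection automatically
with weaviate.connect_to_local() as client:
    print(client.is_ready())
```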
Schema & Collection
```python
# Create collection with vectorizer
client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    ),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
        Property(name="page_number", data_type=DataType.INT),
    ]
)

collection = client.collections.get("Document")
```
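As with Pinecone, creating a collection that already exists fails. During development it's common to drop and re-create; a sketch (note: this deletes all data in the collection):

```python
# Recreate from scratch during development -- destroys existing data!
if client.collections.exists("Document"):
    client.collections.delete("Document")
```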
Add & Query
```python
# Add objects (auto-vectorized)
collection.data.insert({
    "content": "RAG pipelines use vector search to find relevant documents.",
    "source": "rag-handbook",
    "category": "architecture",
    "page_number": 15
})

# Batch insert (each dict's keys should match the collection's properties)
with collection.batch.dynamic() as batch:
    for doc in documents:
        batch.add_object(properties=doc)

# Semantic search
response = collection.query.near_text(
    query="vector database comparison",
    limit=5,
    return_metadata=["distance"]
)

for obj in response.objects:
    print(f"Distance: {obj.metadata.distance:.4f}")
    print(f"Content: {obj.properties['content'][:100]}")
```
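Semantic search can be combined with structured filters on properties, using `Filter` from `weaviate.classes.query`:

```python
from weaviate.classes.query import Filter

# Semantic search restricted to one category
response = collection.query.near_text(
    query="vector database comparison",
    limit=5,
    filters=Filter.by_property("category").equal("architecture")
)
```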
Hybrid Search (Keyword + Semantic)
```python
# Weaviate's killer feature: hybrid search
response = collection.query.hybrid(
    query="RAG chunking strategy",
    alpha=0.5,  # 0 = pure keyword (BM25), 1 = pure vector
    limit=5,
    return_metadata=["score"]
)

for obj in response.objects:
    print(f"Score: {obj.metadata.score:.4f}")
    print(f"Content: {obj.properties['content'][:100]}")
    print()
```
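Conceptually, `alpha` weights the two result sets. A simplified illustration of the blend (Weaviate's actual fusion normalizes and merges ranked result lists; this is not its exact internals):

```python
# Toy illustration: alpha blends normalized keyword and vector scores
def blend(vector_score: float, bm25_score: float, alpha: float = 0.5) -> float:
    return alpha * vector_score + (1 - alpha) * bm25_score

print(blend(0.9, 0.4, alpha=0.5))  # ~0.65 -- balanced
print(blend(0.9, 0.4, alpha=1.0))  # 0.9   -- pure vector
print(blend(0.9, 0.4, alpha=0.0))  # 0.4   -- pure BM25
```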
Checkpoint
Do you understand Weaviate hybrid search and how to create a schema/collection?
📊 Vector DB Comparison
Feature Matrix
| Feature | ChromaDB | Pinecone | Weaviate | Qdrant |
|---|---|---|---|---|
| Type | Local/embedded | Cloud managed | Self-host/cloud | Self-host/cloud |
| Pricing | Free | Free tier + paid | Free (open-source) | Free (open-source) |
| Setup | pip install | API key | Docker/cloud | Docker/cloud |
| Hybrid Search | No | Partial (sparse-dense vectors) | Yes (BM25 + vector) | Yes |
| Scalability | Small-medium | High | High | High |
| Best For | Prototyping | Production SaaS | Hybrid search | Performance |
Decision Matrix
| Use Case | Recommended | Why |
|---|---|---|
| Learning/prototyping | ChromaDB | Zero setup, free |
| Multi-tenant SaaS | Pinecone | Namespaces, managed |
| Keyword + semantic search | Weaviate | Native hybrid search |
| High performance on-prem | Qdrant | Rust-based, fast |
| Cost-sensitive production | Weaviate/Qdrant | Open-source, self-host |
| Vietnamese text search | Weaviate | Hybrid search handles Vietnamese well |
Embedding Model Comparison
| Model | Dims | Speed | Quality | Cost | Vietnamese |
|---|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | Good | $ | Good |
| text-embedding-3-large | 3072 | Medium | Best | $$ | Best |
| all-MiniLM-L6-v2 | 384 | Fastest | OK | Free | Limited |
| multilingual-e5-large | 1024 | Medium | Great | Free | Great |
| paraphrase-multilingual | 384 | Fast | Good | Free | Good |
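The open multilingual models run locally via sentence-transformers. A quick sketch with multilingual-e5-large (assumes the library is installed and the model downloads on first use; the E5 family expects "query: "/"passage: " prefixes):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# E5 models expect "query: " / "passage: " prefixes
query_emb = model.encode("query: RAG là gì?", normalize_embeddings=True)
doc_emb = model.encode("passage: RAG kết hợp retrieval và generation.", normalize_embeddings=True)

print(query_emb.shape)             # (1024,)
print(float(query_emb @ doc_emb))  # cosine similarity (embeddings are normalized)
```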
Checkpoint
Can you compare the cloud vector databases and pick the right one for a given use case?
💻 LangChain Integration
ChromaDB with LangChain
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store
vectorstore = Chroma.from_texts(
    texts=["doc1 content", "doc2 content", "doc3 content"],
    embedding=embeddings,
    metadatas=[{"source": "a"}, {"source": "b"}, {"source": "c"}],
    persist_directory="./chroma_langchain"
)

# Search
docs = vectorstore.similarity_search("query text", k=3)
for doc in docs:
    print(doc.page_content[:100])
```
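If you need the raw scores (e.g., to drop weak matches below a threshold), LangChain vector stores also expose `similarity_search_with_score`:

```python
# Returns (Document, score) pairs; for Chroma the score is a distance (lower = closer)
docs_and_scores = vectorstore.similarity_search_with_score("query text", k=3)
for doc, score in docs_and_scores:
    print(f"{score:.4f}  {doc.page_content[:80]}")
```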
Pinecone with LangChain
```python
from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore(
    index_name="rag-knowledge-base",
    embedding=embeddings,
    pinecone_api_key="your-key"
)

# Add documents
vectorstore.add_texts(
    texts=["content1", "content2"],
    metadatas=[{"source": "a"}, {"source": "b"}]
)

# Search
docs = vectorstore.similarity_search("query", k=5)
```
As Retriever (for RAG chain)
```python
# Convert to retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# Or MMR (Maximal Marginal Relevance) for diversity
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.7}
)

# Use in RAG chain
relevant_docs = retriever.invoke("How to implement RAG?")
```
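Under the hood, MMR trades relevance against redundancy: each step picks the candidate most similar to the query but least similar to what's already been selected. A simplified sketch (an illustration, not LangChain's exact implementation; assumes unit-normalized embeddings so dot product = cosine similarity):

```python
import numpy as np

def mmr_select(query_vec, doc_vecs, k=5, lambda_mult=0.7):
    """Greedy MMR: balance query relevance against redundancy with picked docs."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = float(np.dot(query_vec, doc_vecs[i]))
            redundancy = max((float(np.dot(doc_vecs[i], doc_vecs[j])) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of chosen documents, relevant-but-diverse
```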
Checkpoint
Do you know how to integrate vector databases with LangChain and use MMR retrieval?
🎯 Summary
📝 Quiz
1. What are Pinecone namespaces used for?
   - Speeding up queries
   - Logical partitioning (multi-tenant, separating data)
   - Backing up data
   - Replacing metadata
2. What does Weaviate hybrid search combine?
   - Two vector models
   - BM25 keyword search + vector semantic search
   - Semantic + graph search
   - It isn't truly "hybrid"
3. When should you choose a self-hosted vector DB?
   - Cost-sensitive, need control over your data, on-premise requirements
   - When you're just starting to learn
   - It's always better than cloud
   - When you have little data
Key Takeaways
- Pinecone — Easiest managed cloud option, great for SaaS
- Weaviate — Best hybrid search, open-source
- Choose based on: scale, cost, search type, hosting preference
- LangChain — Unified API for all vector stores
- MMR retrieval — Balances relevance and diversity
Self-Check Questions
- What are Pinecone namespaces for, and when should you use them?
- How does Weaviate hybrid search combine BM25 and vector search?
- When should you choose a self-hosted vector database over a cloud-managed service?
- How does MMR (Maximal Marginal Relevance) retrieval work, and why is it needed?
🎉 Great job! You've completed the Cloud Vector Databases lesson!
Up next: we'll explore Document Loaders & Formats in the next lesson!
🚀 Next Lesson
Document Loaders & Formats — Load PDF, Word, and web documents, and handle multiple document formats!
