Lesson 2: Vector Databases Deep Dive
1. Understanding Vector Databases
1.1 Architecture Overview
```text
┌─────────────────────────────────────────────────────────────┐
│                VECTOR DATABASE ARCHITECTURE                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │  INGESTION  │ →  │  INDEXING   │ →  │   STORAGE   │      │
│  │             │    │             │    │             │      │
│  │  Documents  │    │  HNSW/IVF/  │    │  Vectors +  │      │
│  │  → Vectors  │    │  PQ Indexes │    │  Metadata   │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│                                               │             │
│                                               ▼             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   RESULTS   │ ←  │   RANKING   │ ←  │   SEARCH    │      │
│  │             │    │             │    │             │      │
│  │  Top K docs │    │ Re-ranking/ │    │ ANN Search  │      │
│  │  + scores   │    │  Filtering  │    │  + Filters  │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

1.2 Key Concepts
ANN (Approximate Nearest Neighbor):
- Exact search: O(n) - compares the query against every vector
- ANN: ~O(log n) - uses smart indexing to skip most comparisons
- Trades a little accuracy for speed (99%+ recall is typical); see the baseline sketch below
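For intuition, exact search is just a full scan that scores the query against every stored vector. A minimal NumPy sketch of that O(n) baseline (illustrative; every database below replaces this scan with an ANN index):

```python
import numpy as np

def exact_search(query, vectors, k=5):
    """Brute-force nearest neighbors: score the query against every vector, O(n)."""
    # Cosine similarity = dot product of L2-normalized vectors
    vectors_n = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = vectors_n @ query_n
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k most similar vectors
    return top_k, scores[top_k]

vectors = np.random.rand(100_000, 384).astype(np.float32)  # toy corpus
query = np.random.rand(384).astype(np.float32)
ids, scores = exact_search(query, vectors)
```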
Index Types:
| Index | Speed | Memory | Accuracy | Best For |
|---|---|---|---|---|
| Flat | Slow | Low | 100% | Small datasets (under 10K) |
| IVF | Medium | Medium | ~95% | Medium datasets |
| HNSW | Fast | High | ~99% | Production systems |
| PQ | Fast | Very Low | ~90% | Memory constrained |
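To make the HNSW row concrete, below is a minimal standalone sketch using the hnswlib library (an illustrative choice; none of the databases in this lesson require it). M and ef are the knobs behind the speed/memory/accuracy trade-offs in the table:

```python
import hnswlib
import numpy as np

dim, num = 384, 10_000
data = np.random.rand(num, dim).astype(np.float32)

# M = graph connectivity (higher = more memory, better recall);
# ef_construction = build-time search width
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num, ef_construction=200, M=16)
index.add_items(data, np.arange(num))

index.set_ef(50)  # query-time recall/speed trade-off
labels, distances = index.knn_query(data[:1], k=5)
```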
2. ChromaDB
2.1 Overview
Best for: Prototyping, local development, small projects
```text
Pros:
✅ Zero config - just pip install
✅ Runs locally, no cloud needed
✅ Python-native API
✅ Free and open source
✅ Persistent storage option

Cons:
❌ Not designed for scale
❌ Limited filtering capabilities
❌ Single-node only
❌ No managed cloud option
```

2.2 Quick Start
```python
import chromadb

# In-memory (development)
client = chromadb.Client()

# Persistent storage
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}  # or "l2", "ip"
)

# Add documents
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "Machine learning is a subset of AI",
        "Deep learning uses neural networks",
        "Python is popular for data science"
    ],
    metadatas=[
        {"category": "AI", "year": 2024},
        {"category": "AI", "year": 2024},
        {"category": "Programming", "year": 2023}
    ]
)

# Query with filters
results = collection.query(
    query_texts=["What is AI?"],
    n_results=2,
    where={"category": "AI"}
)

print(results["documents"])
print(results["distances"])
```

2.3 Using with External Embeddings
```python
from openai import OpenAI

openai_client = OpenAI()

def get_embeddings(texts):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

# Create collection without auto-embedding
collection = client.create_collection("custom_embeddings")

# Add with your own embeddings
texts = ["Document 1", "Document 2"]
embeddings = get_embeddings(texts)

collection.add(
    ids=["id1", "id2"],
    embeddings=embeddings,
    documents=texts
)

# Query with your embedding
query_embedding = get_embeddings(["My query"])[0]
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)
```

3. Pinecone
3.1 Overview
Best for: Production RAG, enterprise scale, serverless deployment
```text
Pros:
✅ Fully managed, serverless
✅ Scales to billions of vectors
✅ Real-time updates
✅ Hybrid search (vector + keyword)
✅ Excellent documentation
✅ Free tier available

Cons:
❌ Vendor lock-in
❌ Can get expensive at scale
❌ Data leaves your infrastructure
❌ No self-hosted option
```

3.2 Quick Start
```python
from pinecone import Pinecone, ServerlessSpec

# Initialize
pc = Pinecone(api_key="your-api-key")

# Create index
pc.create_index(
    name="my-index",
    dimension=1536,  # Match your embedding model
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Connect to index
index = pc.Index("my-index")

# Upsert vectors
index.upsert(
    vectors=[
        {
            "id": "doc1",
            "values": embedding1,  # 1536-dim vector
            "metadata": {"category": "AI", "source": "blog"}
        },
        {
            "id": "doc2",
            "values": embedding2,
            "metadata": {"category": "ML", "source": "paper"}
        }
    ],
    namespace="documents"  # Optional logical partition
)

# Query
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"category": {"$eq": "AI"}}
)

for match in results.matches:
    print(f"Score: {match.score}, ID: {match.id}")
```

3.3 Hybrid Search
```python
# Pinecone with sparse vectors for keyword matching
from pinecone_text.sparse import BM25Encoder

# Initialize BM25 for keyword search
bm25 = BM25Encoder()
bm25.fit(corpus)  # Fit on your documents

# Create hybrid vectors
def create_hybrid_vector(text, dense_embedding):
    sparse = bm25.encode_documents([text])[0]
    return {
        "values": dense_embedding,
        "sparse_values": sparse
    }

# Pinecone's query API has no alpha parameter; weight dense vs. sparse
# client-side by scaling both vectors (alpha=1 -> dense only, 0 -> sparse only)
def hybrid_scale(dense, sparse, alpha=0.5):
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    scaled_dense = [v * alpha for v in dense]
    return scaled_dense, scaled_sparse

dense_q, sparse_q = hybrid_scale(
    dense_query_embedding, bm25.encode_queries([query])[0], alpha=0.5
)

# Query with hybrid search
results = index.query(
    vector=dense_q,
    sparse_vector=sparse_q,
    top_k=10
)
```

4. Weaviate
4.1 Overview
Best for: Multimodal search, complex filtering, self-hosted production
```text
Pros:
✅ Multi-modal (text, images, etc.)
✅ GraphQL API
✅ Built-in vectorizers
✅ Hybrid search out of the box
✅ Self-hosted or cloud
✅ Strong filtering

Cons:
❌ More complex setup
❌ Steeper learning curve
❌ Resource intensive
❌ GraphQL can be verbose
```

4.2 Quick Start
```python
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import MetadataQuery

# Connect to Weaviate Cloud
client = weaviate.connect_to_weaviate_cloud(
    cluster_url="your-cluster-url",
    auth_credentials=weaviate.auth.AuthApiKey("your-api-key")
)

# Create schema
client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
        Property(name="year", data_type=DataType.INT)
    ]
)

# Get collection
documents = client.collections.get("Document")

# Insert data
documents.data.insert_many([
    {"content": "AI is transforming industries", "category": "AI", "year": 2024},
    {"content": "Machine learning models improve predictions", "category": "ML", "year": 2024}
])

# Vector search
response = documents.query.near_text(
    query="artificial intelligence applications",
    limit=5,
    return_metadata=MetadataQuery(distance=True)
)

for obj in response.objects:
    print(obj.properties["content"])
    print(f"Distance: {obj.metadata.distance}")
```

4.3 Hybrid Search in Weaviate
```python
from weaviate.classes.query import HybridFusion, Filter

response = documents.query.hybrid(
    query="machine learning applications",
    alpha=0.5,  # 0 = keyword only, 1 = vector only
    fusion_type=HybridFusion.RELATIVE_SCORE,
    limit=5,
    filters=Filter.by_property("year").greater_than(2023)
)
```

5. Qdrant
5.1 Overview
Best for: High performance, advanced filtering, Rust-powered speed
```text
Pros:
✅ Extremely fast (written in Rust)
✅ Advanced payload filtering
✅ Quantization for memory efficiency (see sketch below)
✅ Self-hosted or cloud
✅ Active development

Cons:
❌ Newer, smaller community
❌ Less documentation
❌ Fewer integrations
```

5.2 Quick Start
```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue
)

# Connect (local or cloud)
client = QdrantClient(path="./qdrant_db")  # Local
# client = QdrantClient(url="https://xxx.qdrant.io", api_key="...")

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Upsert points
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding1,
            payload={"content": "AI document", "category": "AI", "year": 2024}
        ),
        PointStruct(
            id=2,
            vector=embedding2,
            payload={"content": "ML document", "category": "ML", "year": 2024}
        )
    ]
)

# Search with filters
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="AI"))
        ]
    )
)

for result in results:
    print(f"Score: {result.score}, Content: {result.payload['content']}")
```
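The quantization mentioned in the pros list is configured per collection. A minimal sketch (the INT8 scalar settings here are illustrative, not tuned recommendations):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient(path="./qdrant_db")

# Store compressed INT8 copies of the vectors to cut memory roughly 4x,
# at a small cost in accuracy
client.create_collection(
    collection_name="documents_quantized",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,    # clip outliers before quantizing
            always_ram=True,  # keep quantized vectors in RAM
        )
    ),
)
```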
5.3 Batch Operations

```python
from qdrant_client.models import Batch

# Efficient batch upsert
client.upsert(
    collection_name="documents",
    points=Batch(
        ids=list(range(1000)),
        vectors=[embedding for _ in range(1000)],  # your precomputed vectors
        payloads=[{"content": f"Doc {i}"} for i in range(1000)]
    )
)
```

6. pgvector (PostgreSQL)
6.1 Overview
Best for: Existing PostgreSQL users, transactional data + vectors
```text
Pros:
✅ Use existing PostgreSQL skills
✅ ACID transactions
✅ Combine with relational data
✅ No new infrastructure
✅ SQL joins with vector search

Cons:
❌ Not optimized for vector-only workloads
❌ Scaling is more complex
❌ Fewer vector-specific features
❌ Performance ceiling
```

6.2 Quick Start
```sql
-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    category VARCHAR(50),
    embedding vector(1536)
);

-- Create index (HNSW recommended)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops);

-- Insert data
INSERT INTO documents (content, category, embedding)
VALUES ('AI document', 'AI', '[0.1, 0.2, ...]'::vector);

-- Vector search
SELECT content, category,
       embedding <=> '[query_vector]'::vector AS distance
FROM documents
WHERE category = 'AI'
ORDER BY embedding <=> '[query_vector]'::vector
LIMIT 5;
```

6.3 Python Integration
```python
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(...)
register_vector(conn)

cursor = conn.cursor()

# Insert
cursor.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
    ("My document", embedding)
)
conn.commit()  # psycopg2 does not autocommit by default

# Search
cursor.execute("""
    SELECT content, embedding <=> %s AS distance
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_embedding))

results = cursor.fetchall()
```
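The "SQL joins with vector search" advantage from the pros list deserves a quick sketch. Continuing with the cursor above, and assuming a hypothetical users table that documents references via an owner_id column:

```python
# Hypothetical schema: users(id, name, plan); documents.owner_id -> users.id
cursor.execute("""
    SELECT u.name, d.content,
           d.embedding <=> %s AS distance
    FROM documents d
    JOIN users u ON u.id = d.owner_id
    WHERE u.plan = 'premium'
    ORDER BY d.embedding <=> %s
    LIMIT 5
""", (query_embedding, query_embedding))

for name, content, distance in cursor.fetchall():
    print(name, content, distance)
```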
7. Comparison Matrix
7.1 Feature Comparison
| Feature | Chroma | Pinecone | Weaviate | Qdrant | pgvector |
|---|---|---|---|---|---|
| Deployment | Local | Cloud | Both | Both | Self-host |
| Max Vectors | 1M | Billions | Billions | Billions | 10M+ |
| Hybrid Search | ❌ | ✅ | ✅ | ✅ | Limited |
| Multi-modal | ❌ | ❌ | ✅ | ❌ | ❌ |
| Free Tier | ✅ | ✅ | ✅ | ✅ | ✅ |
| Filtering | Basic | Advanced | Advanced | Advanced | SQL |
| Ease of Use | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
7.2 Performance Comparison (Approximate)
| Database | QPS (1M vectors) | p99 Latency | Memory per 1M vectors |
|---|---|---|---|
| Chroma | ~100 | ~50ms | ~2GB |
| Pinecone | ~1000 | ~10ms | Managed |
| Weaviate | ~500 | ~20ms | ~4GB |
| Qdrant | ~1500 | ~5ms | ~1.5GB |
| pgvector | ~200 | ~30ms | ~3GB |
7.3 Cost Comparison (Approximate)
| Database | Free Tier | Production (1M vectors) |
|---|---|---|
| Chroma | Unlimited (local) | Self-host costs |
| Pinecone | 100K vectors | ~$70/month |
| Weaviate | 1M vectors | ~$50/month (cloud) |
| Qdrant | 1M vectors | ~$40/month (cloud) |
| pgvector | N/A | Postgres hosting cost |
8. Decision Framework
8.1 Flowchart
```text
START
  │
  ▼
Prototyping/Learning?
  │
  ├── YES → ChromaDB
  │
  NO
  │
  ▼
Need managed cloud?
  │
  ├── YES → Pinecone or Weaviate Cloud
  │
  NO
  │
  ▼
Existing PostgreSQL?
  │
  ├── YES → pgvector
  │
  NO
  │
  ▼
Maximum performance needed?
  │
  ├── YES → Qdrant
  │
  NO
  │
  ▼
Multi-modal data?
  │
  ├── YES → Weaviate
  │
  NO
  │
  ▼
Default → Qdrant or Pinecone
```
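The same flowchart expressed as a small helper function, purely illustrative:

```python
def choose_vector_db(prototyping=False, managed_cloud=False,
                     existing_postgres=False, max_performance=False,
                     multimodal=False):
    """Encodes the decision flowchart above (illustrative only)."""
    if prototyping:
        return "ChromaDB"
    if managed_cloud:
        return "Pinecone or Weaviate Cloud"
    if existing_postgres:
        return "pgvector"
    if max_performance:
        return "Qdrant"
    if multimodal:
        return "Weaviate"
    return "Qdrant or Pinecone"

print(choose_vector_db(existing_postgres=True))  # -> pgvector
```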
8.2 Use Case Recommendations
| Use Case | Recommended | Why |
|---|---|---|
| Learning RAG | Chroma | Zero config, easy |
| Production RAG | Pinecone | Managed, scalable |
| Self-hosted prod | Qdrant | Performance, features |
| Image + Text search | Weaviate | Multi-modal |
| Existing Postgres app | pgvector | No new infra |
| Cost sensitive | Qdrant/Chroma | Self-host |
9. Hands-on: Multi-DB Comparison
Setup
```python
# Install all clients:
# pip install chromadb pinecone-client weaviate-client qdrant-client

from time import time
import numpy as np

# Generate test data
num_vectors = 10000
dim = 1536
vectors = np.random.rand(num_vectors, dim).astype(np.float32)
query = np.random.rand(dim).astype(np.float32)
```

Benchmark Function
```python
def benchmark_db(name, insert_fn, search_fn, n_searches=100):
    # Measure insert time
    start = time()
    insert_fn()
    insert_time = time() - start

    # Measure average search latency
    start = time()
    for _ in range(n_searches):
        search_fn()
    search_time = (time() - start) / n_searches

    print(f"{name}:")
    print(f"  Insert {num_vectors} vectors: {insert_time:.2f}s")
    print(f"  Average search latency: {search_time*1000:.2f}ms")
```

Compare Results
```python
# Run benchmarks for each database
# (Implement insert_fn and search_fn for each)

# Expected output:
# Chroma:
#   Insert 10000 vectors: 5.23s
#   Average search latency: 15.42ms
#
# Qdrant:
#   Insert 10000 vectors: 2.15s
#   Average search latency: 3.21ms
```
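As a concrete starting point, here is a hypothetical insert_fn/search_fn pair for Chroma, reusing num_vectors, vectors, and query from the setup above (collection name and chunk size are illustrative):

```python
import chromadb

chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("bench")

def chroma_insert():
    # Chroma caps the batch size per add() call, so insert in chunks
    step = 1000
    for start_i in range(0, num_vectors, step):
        end_i = min(start_i + step, num_vectors)
        chroma_collection.add(
            ids=[str(i) for i in range(start_i, end_i)],
            embeddings=vectors[start_i:end_i].tolist(),
        )

def chroma_search():
    chroma_collection.query(query_embeddings=[query.tolist()], n_results=5)

benchmark_db("Chroma", chroma_insert, chroma_search)
```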
10. Homework
Exercise 1: Local Comparison
- Install Chroma and Qdrant locally
- Insert 1000 vectors into each database
- Compare search latency
- Compare API ease-of-use
Exercise 2: Cloud Setup
- Sign up for the Pinecone free tier
- Create an index
- Build a simple RAG pipeline with Pinecone
- Test it with 100 queries
Exercise 3: Decision Making
For each scenario below, choose the right database:
- Startup MVP with a limited budget
- Enterprise with 10M documents
- Research project that needs multimodal search
- E-commerce site with an existing PostgreSQL database
Summary
In this lesson you learned:
- ✅ Vector database architecture and index types
- ✅ ChromaDB for prototyping
- ✅ Pinecone for production scale
- ✅ Weaviate for multi-modal search and flexibility
- ✅ Qdrant for maximum performance
- ✅ pgvector for PostgreSQL users
- ✅ Feature, performance, and cost comparisons
- ✅ A decision framework for choosing the right database
Next: Lesson 3 - Chunking Strategies - optimal ways to split documents
