Theory
Lesson 2/5

Vector Databases Deep Dive

Comparing Pinecone, Weaviate, Chroma, and Qdrant, and how to choose the right database

Lesson 2: Vector Databases Deep Dive

1. Understanding Vector Databases

1.1 Architecture Overview

Text
┌──────────────────────────────────────────────────────────┐
│               VECTOR DATABASE ARCHITECTURE               │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │  INGESTION  │ →  │  INDEXING   │ →  │   STORAGE   │   │
│  │             │    │             │    │             │   │
│  │  Documents  │    │  HNSW/IVF/  │    │  Vectors +  │   │
│  │  → Vectors  │    │  PQ Indexes │    │  Metadata   │   │
│  └─────────────┘    └─────────────┘    └─────────────┘   │
│                                               │          │
│                                               ▼          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │   RESULTS   │ ←  │   RANKING   │ ←  │   SEARCH    │   │
│  │             │    │             │    │             │   │
│  │  Top K docs │    │ Re-ranking/ │    │  ANN Search │   │
│  │  + scores   │    │  Filtering  │    │  + Filters  │   │
│  └─────────────┘    └─────────────┘    └─────────────┘   │
│                                                          │
└──────────────────────────────────────────────────────────┘

1.2 Key Concepts

ANN (Approximate Nearest Neighbor):

  • Exact search: O(n) - compares the query against every stored vector
  • ANN: ~O(log n) - uses index structures to skip most comparisons
  • Trades a little accuracy for speed (99%+ recall is typical)
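To make the trade-off concrete, here is a minimal NumPy sketch of exact (brute-force) search, the O(n) baseline that ANN indexes avoid. The dataset size and dimension are arbitrary illustrative values:

Python
import numpy as np

def exact_top_k(query, vectors, k=5):
    # Cosine similarity = dot product of L2-normalized vectors
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                 # one score per stored vector: O(n)
    top = np.argsort(-scores)[:k]  # full sort for clarity
    return top, scores[top]

# 10,000 random vectors, 384 dimensions (illustrative)
vectors = np.random.rand(10_000, 384).astype(np.float32)
query = np.random.rand(384).astype(np.float32)

ids, scores = exact_top_k(query, vectors)
print(ids, scores)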

Index Types:

Index | Speed  | Memory   | Accuracy | Best For
------|--------|----------|----------|---------------------------
Flat  | Slow   | Low      | 100%     | Small datasets (under 10K)
IVF   | Medium | Medium   | ~95%     | Medium datasets
HNSW  | Fast   | High     | ~99%     | Production systems
PQ    | Fast   | Very Low | ~90%     | Memory constrained
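Most of these indexes expose tuning knobs. As one concrete example, here is a sketch of setting HNSW parameters when creating a Qdrant collection (Qdrant is covered in section 5); the values shown are common starting points, not recommendations from this lesson:

Python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

client = QdrantClient(path="./qdrant_hnsw_demo")

client.create_collection(
    collection_name="tuned_hnsw",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(
        m=16,              # links per node: higher = better recall, more memory
        ef_construct=100,  # build-time beam width: higher = better index, slower build
    ),
)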

2. ChromaDB

2.1 Overview

Best for: Prototyping, local development, small projects

Text
Pros:
✅ Zero config - just pip install
✅ Runs locally, no cloud needed
✅ Python-native API
✅ Free and open source
✅ Persistent storage option

Cons:
❌ Not designed for scale
❌ Limited filtering capabilities
❌ Single-node only
❌ No managed cloud option

2.2 Quick Start

Python
import chromadb

# In-memory (development)
client = chromadb.Client()

# Persistent storage
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}  # or "l2", "ip"
)

# Add documents
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "Machine learning is a subset of AI",
        "Deep learning uses neural networks",
        "Python is popular for data science"
    ],
    metadatas=[
        {"category": "AI", "year": 2024},
        {"category": "AI", "year": 2024},
        {"category": "Programming", "year": 2023}
    ]
)

# Query with filters
results = collection.query(
    query_texts=["What is AI?"],
    n_results=2,
    where={"category": "AI"}
)

print(results["documents"])
print(results["distances"])

2.3 Using with External Embeddings

Python
from openai import OpenAI

openai_client = OpenAI()

def get_embeddings(texts):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

# Create collection without auto-embedding
collection = client.create_collection("custom_embeddings")

# Add with your own embeddings
texts = ["Document 1", "Document 2"]
embeddings = get_embeddings(texts)

collection.add(
    ids=["id1", "id2"],
    embeddings=embeddings,
    documents=texts
)

# Query with your embedding
query_embedding = get_embeddings(["My query"])[0]
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)

3. Pinecone

3.1 Overview

Best for: Production RAG, enterprise scale, serverless deployment

Text
Pros:
✅ Fully managed, serverless
✅ Scales to billions of vectors
✅ Real-time updates
✅ Hybrid search (vector + keyword)
✅ Excellent documentation
✅ Free tier available

Cons:
❌ Vendor lock-in
❌ Can get expensive at scale
❌ Data leaves your infrastructure
❌ Limited self-hosted option

3.2 Quick Start

Python
from pinecone import Pinecone, ServerlessSpec

# Initialize
pc = Pinecone(api_key="your-api-key")

# Create index
pc.create_index(
    name="my-index",
    dimension=1536,  # Match your embedding model
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Connect to index
index = pc.Index("my-index")

# Upsert vectors
index.upsert(
    vectors=[
        {
            "id": "doc1",
            "values": embedding1,  # 1536-dim vector
            "metadata": {"category": "AI", "source": "blog"}
        },
        {
            "id": "doc2",
            "values": embedding2,
            "metadata": {"category": "ML", "source": "paper"}
        }
    ],
    namespace="documents"  # Optional logical partition
)

# Query
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"category": {"$eq": "AI"}}
)

for match in results.matches:
    print(f"Score: {match.score}, ID: {match.id}")

3.3 Hybrid Search

Python
# Pinecone with sparse vectors for keyword matching
# Note: hybrid queries require an index created with metric="dotproduct"
from pinecone_text.sparse import BM25Encoder

# Initialize BM25 for keyword search
bm25 = BM25Encoder()
bm25.fit(corpus)  # Fit on your documents

# Create hybrid vectors for upsert
def create_hybrid_vector(text, dense_embedding):
    sparse = bm25.encode_documents([text])[0]
    return {
        "values": dense_embedding,
        "sparse_values": sparse
    }

# Pinecone's query() has no alpha parameter; the dense/sparse balance
# is applied client-side as a convex combination before querying
def hybrid_scale(dense, sparse, alpha):
    # alpha=1 → dense (semantic) only, alpha=0 → sparse (keyword) only
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

# Query with hybrid search, balancing dense and sparse at alpha=0.5
dense_q, sparse_q = hybrid_scale(
    dense_query_embedding,
    bm25.encode_queries([query])[0],
    alpha=0.5,
)

results = index.query(
    vector=dense_q,
    sparse_vector=sparse_q,
    top_k=10
)

4. Weaviate

4.1 Overview

Best for: Multimodal search, complex filtering, self-hosted production

Text
Pros:
✅ Multi-modal (text, images, etc.)
✅ GraphQL API
✅ Built-in vectorizers
✅ Hybrid search out-of-box
✅ Self-hosted or cloud
✅ Strong filtering

Cons:
❌ More complex setup
❌ Steeper learning curve
❌ Resource intensive
❌ GraphQL can be verbose

4.2 Quick Start

Python
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import MetadataQuery

# Connect to Weaviate Cloud
client = weaviate.connect_to_weaviate_cloud(
    cluster_url="your-cluster-url",
    auth_credentials=weaviate.auth.AuthApiKey("your-api-key")
)

# Create schema
client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
        Property(name="year", data_type=DataType.INT)
    ]
)

# Get collection
documents = client.collections.get("Document")

# Insert data
documents.data.insert_many([
    {"content": "AI is transforming industries", "category": "AI", "year": 2024},
    {"content": "Machine learning models improve predictions", "category": "ML", "year": 2024}
])

# Vector search
response = documents.query.near_text(
    query="artificial intelligence applications",
    limit=5,
    return_metadata=MetadataQuery(distance=True)
)

for obj in response.objects:
    print(obj.properties["content"])
    print(f"Distance: {obj.metadata.distance}")

4.3 Hybrid Search in Weaviate

Python
from weaviate.classes.query import Filter, HybridFusion

response = documents.query.hybrid(
    query="machine learning applications",
    alpha=0.5,  # 0 = keyword only, 1 = vector only
    fusion_type=HybridFusion.RELATIVE_SCORE,
    limit=5,
    filters=Filter.by_property("year").greater_than(2023)
)

5. Qdrant

5.1 Overview

Best for: High performance, advanced filtering, Rust-powered speed

Text
Pros:
✅ Extremely fast (written in Rust)
✅ Advanced payload filtering
✅ Quantization for memory efficiency
✅ Self-hosted or cloud
✅ Active development

Cons:
❌ Newer, smaller community
❌ Less documentation
❌ Fewer integrations

5.2 Quick Start

Python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue
)

# Connect (local or cloud)
client = QdrantClient(path="./qdrant_db")  # Local
# client = QdrantClient(url="https://xxx.qdrant.io", api_key="...")

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Upsert points
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding1,
            payload={"content": "AI document", "category": "AI", "year": 2024}
        ),
        PointStruct(
            id=2,
            vector=embedding2,
            payload={"content": "ML document", "category": "ML", "year": 2024}
        )
    ]
)

# Search with filters
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="AI"))
        ]
    )
)

for result in results:
    print(f"Score: {result.score}, Content: {result.payload['content']}")

5.3 Batch Operations

Python
from qdrant_client.models import Batch

# Efficient batch upsert
client.upsert(
    collection_name="documents",
    points=Batch(
        ids=list(range(1000)),
        vectors=[embedding for _ in range(1000)],
        payloads=[{"content": f"Doc {i}"} for i in range(1000)]
    )
)

6. pgvector (PostgreSQL)

6.1 Overview

Best for: Existing PostgreSQL users, transactional data + vectors

Text
Pros:
✅ Use existing PostgreSQL skills
✅ ACID transactions
✅ Combine with relational data
✅ No new infrastructure
✅ SQL joins with vector search

Cons:
❌ Not optimized for vector-only workloads
❌ Scaling more complex
❌ Fewer vector-specific features
❌ Performance ceiling

6.2 Quick Start

SQL
-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    category VARCHAR(50),
    embedding vector(1536)
);

-- Create index (HNSW recommended)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops);

-- Insert data
INSERT INTO documents (content, category, embedding)
VALUES ('AI document', 'AI', '[0.1, 0.2, ...]'::vector);

-- Vector search
SELECT content, category,
       embedding <=> '[query_vector]'::vector AS distance
FROM documents
WHERE category = 'AI'
ORDER BY embedding <=> '[query_vector]'::vector
LIMIT 5;

6.3 Python Integration

Python
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(...)
register_vector(conn)

cursor = conn.cursor()

# Insert (register_vector adapts numpy arrays to the vector type)
cursor.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
    ("My document", np.array(embedding))
)
conn.commit()  # psycopg2 does not autocommit

# Search
cursor.execute("""
    SELECT content, embedding <=> %s AS distance
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 5
""", (np.array(query_embedding), np.array(query_embedding)))

results = cursor.fetchall()

7. Comparison Matrix

7.1 Feature Comparison

Feature       | Chroma     | Pinecone | Weaviate | Qdrant   | pgvector
--------------|------------|----------|----------|----------|----------
Deployment    | Local      | Cloud    | Both     | Both     | Self-host
Max Vectors   | ~1M        | Billions | Billions | Billions | 10M+
Hybrid Search | Limited    | ✅       | ✅       | ✅       | Manual
Multi-modal   | ❌         | ❌       | ✅       | ❌       | ❌
Free Tier     | ✅ (local) | ✅       | ✅       | ✅       | N/A
Filtering     | Basic      | Advanced | Advanced | Advanced | SQL
Ease of Use   | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐   | ⭐⭐⭐⭐ | ⭐⭐⭐

7.2 Performance Comparison (Approximate)

Database | QPS (1M vectors) | Latency (p99) | Memory per 1M vectors
---------|------------------|---------------|----------------------
Chroma   | ~100             | ~50ms         | ~2GB
Pinecone | ~1000            | ~10ms         | Managed
Weaviate | ~500             | ~20ms         | ~4GB
Qdrant   | ~1500            | ~5ms          | ~1.5GB
pgvector | ~200             | ~30ms         | ~3GB

7.3 Cost Comparison (Approximate)

Database | Free Tier         | Production (1M vectors)
---------|-------------------|------------------------
Chroma   | Unlimited (local) | Self-hosting costs
Pinecone | 100K vectors      | ~$70/month
Weaviate | 1M vectors        | ~$50/month (cloud)
Qdrant   | 1M vectors        | ~$40/month (cloud)
pgvector | N/A               | Postgres hosting cost

8. Decision Framework

8.1 Flowchart

Text
START
  │
  ▼
Prototyping/Learning?
  ├── YES → ChromaDB
  │
  NO
  │
  ▼
Need managed cloud?
  ├── YES → Pinecone or Weaviate Cloud
  │
  NO
  │
  ▼
Existing PostgreSQL?
  ├── YES → pgvector
  │
  NO
  │
  ▼
Maximum performance needed?
  ├── YES → Qdrant
  │
  NO
  │
  ▼
Multi-modal data?
  ├── YES → Weaviate
  │
  NO
  │
  ▼
Default → Qdrant or Pinecone
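The same flow, expressed as a small Python helper. This is purely illustrative; the flag names are invented for this sketch, and first match wins, mirroring the flowchart:

Python
def choose_vector_db(prototyping=False, managed_cloud=False,
                     existing_postgres=False, max_performance=False,
                     multimodal=False):
    """Encode the decision flowchart above; first matching branch wins."""
    if prototyping:
        return "ChromaDB"
    if managed_cloud:
        return "Pinecone or Weaviate Cloud"
    if existing_postgres:
        return "pgvector"
    if max_performance:
        return "Qdrant"
    if multimodal:
        return "Weaviate"
    return "Qdrant or Pinecone"

print(choose_vector_db(existing_postgres=True))  # → pgvector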

8.2 Use Case Recommendations

Use Case               | Recommended   | Why
-----------------------|---------------|----------------------
Learning RAG           | Chroma        | Zero config, easy
Production RAG         | Pinecone      | Managed, scalable
Self-hosted production | Qdrant        | Performance, features
Image + text search    | Weaviate      | Multi-modal
Existing Postgres app  | pgvector      | No new infra
Cost sensitive         | Qdrant/Chroma | Self-host

9. Hands-on: Multi-DB Comparison

Setup

Python
# Install all clients:
# pip install chromadb pinecone-client weaviate-client qdrant-client

from time import time
import numpy as np

# Generate test data
num_vectors = 10000
dim = 1536
vectors = np.random.rand(num_vectors, dim).astype(np.float32)
query = np.random.rand(dim).astype(np.float32)

Benchmark Function

Python
def benchmark_db(name, insert_fn, search_fn, n_searches=100):
    # Measure insert time
    start = time()
    insert_fn()
    insert_time = time() - start

    # Measure average search time
    start = time()
    for _ in range(n_searches):
        search_fn()
    search_time = (time() - start) / n_searches

    print(f"{name}:")
    print(f"  Insert {num_vectors} vectors: {insert_time:.2f}s")
    print(f"  Average search latency: {search_time*1000:.2f}ms")

Compare Results

Python
# Run benchmarks for each database
# (implement insert_fn and search_fn for each; see the Chroma sketch below)

# Expected output:
# Chroma:
#   Insert 10000 vectors: 5.23s
#   Average search latency: 15.42ms
#
# Qdrant:
#   Insert 10000 vectors: 2.15s
#   Average search latency: 3.21ms
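As a starting point, here is one possible pair of insert_fn / search_fn for Chroma, reusing the vectors, query, and num_vectors variables from the setup above. This is a sketch, not a rigorous benchmark: warm-up runs, batch sizes, and index settings all affect the numbers.

Python
import chromadb

chroma_client = chromadb.Client()
chroma_col = chroma_client.create_collection("bench")

def chroma_insert():
    # Insert in batches; very large single .add() calls can hit client limits
    batch = 1000
    for start in range(0, num_vectors, batch):
        end = min(start + batch, num_vectors)
        chroma_col.add(
            ids=[str(i) for i in range(start, end)],
            embeddings=vectors[start:end].tolist(),
        )

def chroma_search():
    chroma_col.query(query_embeddings=[query.tolist()], n_results=5)

benchmark_db("Chroma", chroma_insert, chroma_search)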

10. Homework

Exercise 1: Local Comparison

  1. Install Chroma and Qdrant locally
  2. Insert 1,000 vectors into each database
  3. Compare search latency
  4. Compare API ease of use

Exercise 2: Cloud Setup

  1. Sign up for the Pinecone free tier
  2. Create an index
  3. Build a simple RAG pipeline with Pinecone
  4. Test it with 100 queries

Exercise 3: Decision Making

For each scenario below, choose the appropriate database:

  1. Startup MVP with a limited budget
  2. Enterprise with 10M documents
  3. Research project requiring multimodal search
  4. E-commerce site with an existing PostgreSQL database

Summary

In this lesson you learned:

  • ✅ Vector database architecture and index types
  • ✅ ChromaDB for prototyping
  • ✅ Pinecone for production scale
  • ✅ Weaviate for multi-modal search and flexibility
  • ✅ Qdrant for maximum performance
  • ✅ pgvector for PostgreSQL users
  • ✅ Feature, performance, and cost comparisons
  • ✅ A decision framework for choosing the right database

Next: Lesson 3 - Chunking Strategies - optimal ways to split documents