Lesson 2: Vector Databases Deep Dive
1. Understanding Vector Databases
1.1 Architecture Overview
```text
┌─────────────────────────────────────────────────────────────┐
│                VECTOR DATABASE ARCHITECTURE                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │  INGESTION  │ →  │  INDEXING   │ →  │   STORAGE   │      │
│  │             │    │             │    │             │      │
│  │  Documents  │    │  HNSW/IVF/  │    │  Vectors +  │      │
│  │  → Vectors  │    │  PQ Indexes │    │  Metadata   │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│                                               │             │
│                                               ▼             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   RESULTS   │ ←  │   RANKING   │ ←  │   SEARCH    │      │
│  │             │    │             │    │             │      │
│  │  Top K docs │    │ Re-ranking/ │    │ ANN Search  │      │
│  │  + scores   │    │  Filtering  │    │  + Filters  │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

1.2 Key Concepts
ANN (Approximate Nearest Neighbor):
- Exact search: O(n) - compares the query against every vector
- ANN: ~O(log n) - uses smart indexing to skip most comparisons
- Trades a little accuracy for speed (99%+ recall is typical); see the baseline sketch below
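For intuition, exact search is just a full scan that scores the query against every stored vector. A minimal NumPy sketch of that O(n) baseline (illustrative; every database below replaces this scan with an ANN index):

```python
import numpy as np

def exact_search(query, vectors, k=5):
    """Brute-force nearest neighbors: score the query against every vector, O(n)."""
    # Cosine similarity = dot product of L2-normalized vectors
    vectors_n = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = vectors_n @ query_n
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k most similar vectors
    return top_k, scores[top_k]

vectors = np.random.rand(100_000, 384).astype(np.float32)  # toy corpus
query = np.random.rand(384).astype(np.float32)
ids, scores = exact_search(query, vectors)
```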
Index Types:
| Index | Speed | Memory | Accuracy | Best For |
|---|---|---|---|---|
| Flat | Slow | Low | 100% | Small datasets (under 10K) |
| IVF | Medium | Medium | ~95% | Medium datasets |
| HNSW | Fast | High | ~99% | Production systems |
| PQ | Fast | Very Low | ~90% | Memory constrained |
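To make the HNSW row concrete, below is a minimal standalone sketch using the hnswlib library (an illustrative choice; none of the databases in this lesson require it). M and ef are the knobs behind the speed/memory/accuracy trade-offs in the table:

```python
import hnswlib
import numpy as np

dim, num = 384, 10_000
data = np.random.rand(num, dim).astype(np.float32)

# M = graph connectivity (higher = more memory, better recall);
# ef_construction = build-time search width
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num, ef_construction=200, M=16)
index.add_items(data, np.arange(num))

index.set_ef(50)  # query-time recall/speed trade-off
labels, distances = index.knn_query(data[:1], k=5)
```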
2. ChromaDB
2.1 Overview
Best for: Prototyping, local development, small projects
```text
Pros:
✅ Zero config - just pip install
✅ Runs locally, no cloud needed
✅ Python-native API
✅ Free and open source
✅ Persistent storage option

Cons:
❌ Not designed for scale
❌ Limited filtering capabilities
❌ Single-node only
❌ No managed cloud option
```

2.2 Quick Start
```python
import chromadb

# In-memory (development)
client = chromadb.Client()

# Persistent storage
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}  # or "l2", "ip"
)

# Add documents
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "Machine learning is a subset of AI",
        "Deep learning uses neural networks",
        "Python is popular for data science"
    ],
    metadatas=[
        {"category": "AI", "year": 2024},
        {"category": "AI", "year": 2024},
        {"category": "Programming", "year": 2023}
    ]
)

# Query with filters
results = collection.query(
    query_texts=["What is AI?"],
    n_results=2,
    where={"category": "AI"}
)

print(results["documents"])
print(results["distances"])
```

2.3 Using with External Embeddings
```python
from openai import OpenAI

openai_client = OpenAI()

def get_embeddings(texts):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

# Create collection without auto-embedding
collection = client.create_collection("custom_embeddings")

# Add with your own embeddings
texts = ["Document 1", "Document 2"]
embeddings = get_embeddings(texts)

collection.add(
    ids=["id1", "id2"],
    embeddings=embeddings,
    documents=texts
)

# Query with your embedding
query_embedding = get_embeddings(["My query"])[0]
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)
```

3. Pinecone
3.1 Overview
Best for: Production RAG, enterprise scale, serverless deployment
```text
Pros:
✅ Fully managed, serverless
✅ Scales to billions of vectors
✅ Real-time updates
✅ Hybrid search (vector + keyword)
✅ Excellent documentation
✅ Free tier available

Cons:
❌ Vendor lock-in
❌ Can get expensive at scale
❌ Data leaves your infrastructure
❌ No self-hosted option
```

3.2 Quick Start
```python
from pinecone import Pinecone, ServerlessSpec

# Initialize
pc = Pinecone(api_key="your-api-key")

# Create index
pc.create_index(
    name="my-index",
    dimension=1536,  # Match your embedding model
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Connect to index
index = pc.Index("my-index")

# Upsert vectors
index.upsert(
    vectors=[
        {
            "id": "doc1",
            "values": embedding1,  # 1536-dim vector
            "metadata": {"category": "AI", "source": "blog"}
        },
        {
            "id": "doc2",
            "values": embedding2,
            "metadata": {"category": "ML", "source": "paper"}
        }
    ],
    namespace="documents"  # Optional logical partition
)

# Query
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"category": {"$eq": "AI"}}
)

for match in results.matches:
    print(f"Score: {match.score}, ID: {match.id}")
```

3.3 Hybrid Search
```python
# Pinecone with sparse vectors for keyword matching
from pinecone_text.sparse import BM25Encoder

# Initialize BM25 for keyword search
bm25 = BM25Encoder()
bm25.fit(corpus)  # Fit on your documents

# Create hybrid vectors
def create_hybrid_vector(text, dense_embedding):
    sparse = bm25.encode_documents([text])[0]
    return {
        "values": dense_embedding,
        "sparse_values": sparse
    }

# Pinecone's query API has no alpha parameter; weight dense vs. sparse
# client-side by scaling both vectors (alpha=1 -> dense only, 0 -> sparse only)
def hybrid_scale(dense, sparse, alpha=0.5):
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    scaled_dense = [v * alpha for v in dense]
    return scaled_dense, scaled_sparse

dense_q, sparse_q = hybrid_scale(
    dense_query_embedding, bm25.encode_queries([query])[0], alpha=0.5
)

# Query with hybrid search
results = index.query(
    vector=dense_q,
    sparse_vector=sparse_q,
    top_k=10
)
```

4. Weaviate
4.1 Overview
Best for: Multimodal search, complex filtering, self-hosted production
```text
Pros:
✅ Multi-modal (text, images, etc.)
✅ GraphQL API
✅ Built-in vectorizers
✅ Hybrid search out of the box
✅ Self-hosted or cloud
✅ Strong filtering

Cons:
❌ More complex setup
❌ Steeper learning curve
❌ Resource intensive
❌ GraphQL can be verbose
```

4.2 Quick Start
```python
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import MetadataQuery

# Connect to Weaviate Cloud
client = weaviate.connect_to_weaviate_cloud(
    cluster_url="your-cluster-url",
    auth_credentials=weaviate.auth.AuthApiKey("your-api-key")
)

# Create schema
client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
        Property(name="year", data_type=DataType.INT)
    ]
)

# Get collection
documents = client.collections.get("Document")

# Insert data
documents.data.insert_many([
    {"content": "AI is transforming industries", "category": "AI", "year": 2024},
    {"content": "Machine learning models improve predictions", "category": "ML", "year": 2024}
])

# Vector search
response = documents.query.near_text(
    query="artificial intelligence applications",
    limit=5,
    return_metadata=MetadataQuery(distance=True)
)

for obj in response.objects:
    print(obj.properties["content"])
    print(f"Distance: {obj.metadata.distance}")
```

4.3 Hybrid Search in Weaviate
```python
from weaviate.classes.query import HybridFusion, Filter

response = documents.query.hybrid(
    query="machine learning applications",
    alpha=0.5,  # 0 = keyword only, 1 = vector only
    fusion_type=HybridFusion.RELATIVE_SCORE,
    limit=5,
    filters=Filter.by_property("year").greater_than(2023)
)
```

5. Qdrant
5.1 Overview
Best for: High performance, advanced filtering, Rust-powered speed
```text
Pros:
✅ Extremely fast (written in Rust)
✅ Advanced payload filtering
✅ Quantization for memory efficiency (see sketch below)
✅ Self-hosted or cloud
✅ Active development

Cons:
❌ Newer, smaller community
❌ Less documentation
❌ Fewer integrations
```

5.2 Quick Start
```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue
)

# Connect (local or cloud)
client = QdrantClient(path="./qdrant_db")  # Local
# client = QdrantClient(url="https://xxx.qdrant.io", api_key="...")

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Upsert points
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding1,
            payload={"content": "AI document", "category": "AI", "year": 2024}
        ),
        PointStruct(
            id=2,
            vector=embedding2,
            payload={"content": "ML document", "category": "ML", "year": 2024}
        )
    ]
)

# Search with filters
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="AI"))
        ]
    )
)

for result in results:
    print(f"Score: {result.score}, Content: {result.payload['content']}")
```
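The quantization mentioned in the pros list is configured per collection. A minimal sketch (the INT8 scalar settings here are illustrative, not tuned recommendations):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient(path="./qdrant_db")

# Store compressed INT8 copies of the vectors to cut memory roughly 4x,
# at a small cost in accuracy
client.create_collection(
    collection_name="documents_quantized",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,    # clip outliers before quantizing
            always_ram=True,  # keep quantized vectors in RAM
        )
    ),
)
```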
5.3 Batch Operations

```python
from qdrant_client.models import Batch

# Efficient batch upsert
client.upsert(
    collection_name="documents",
    points=Batch(
        ids=list(range(1000)),
        vectors=[embedding for _ in range(1000)],  # your precomputed vectors
        payloads=[{"content": f"Doc {i}"} for i in range(1000)]
    )
)
```

6. pgvector (PostgreSQL)
6.1 Overview
Best for: Existing PostgreSQL users, transactional data + vectors
```text
Pros:
✅ Use existing PostgreSQL skills
✅ ACID transactions
✅ Combine with relational data
✅ No new infrastructure
✅ SQL joins with vector search

Cons:
❌ Not optimized for vector-only workloads
❌ Scaling is more complex
❌ Fewer vector-specific features
❌ Performance ceiling
```

6.2 Quick Start
```sql
-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    category VARCHAR(50),
    embedding vector(1536)
);

-- Create index (HNSW recommended)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops);

-- Insert data
INSERT INTO documents (content, category, embedding)
VALUES ('AI document', 'AI', '[0.1, 0.2, ...]'::vector);

-- Vector search
SELECT content, category,
       embedding <=> '[query_vector]'::vector AS distance
FROM documents
WHERE category = 'AI'
ORDER BY embedding <=> '[query_vector]'::vector
LIMIT 5;
```

6.3 Python Integration
```python
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(...)
register_vector(conn)

cursor = conn.cursor()

# Insert
cursor.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
    ("My document", embedding)
)
conn.commit()  # psycopg2 does not autocommit by default

# Search
cursor.execute("""
    SELECT content, embedding <=> %s AS distance
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_embedding))

results = cursor.fetchall()
```
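The "SQL joins with vector search" advantage from the pros list deserves a quick sketch. Continuing with the cursor above, and assuming a hypothetical users table that documents references via an owner_id column:

```python
# Hypothetical schema: users(id, name, plan); documents.owner_id -> users.id
cursor.execute("""
    SELECT u.name, d.content,
           d.embedding <=> %s AS distance
    FROM documents d
    JOIN users u ON u.id = d.owner_id
    WHERE u.plan = 'premium'
    ORDER BY d.embedding <=> %s
    LIMIT 5
""", (query_embedding, query_embedding))

for name, content, distance in cursor.fetchall():
    print(name, content, distance)
```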
7. Comparison Matrix
7.1 Feature Comparison
| Feature | Chroma | Pinecone | Weaviate | Qdrant | pgvector |
|---|---|---|---|---|---|
| Deployment | Local | Cloud | Both | Both | Self-host |
| Max Vectors | 1M | Billions | Billions | Billions | 10M+ |
| Hybrid Search | ❌ | ✅ | ✅ | ✅ | Limited |
| Multi-modal | ❌ | ❌ | ✅ | ❌ | ❌ |
| Free Tier | ✅ | ✅ | ✅ | ✅ | ✅ |
| Filtering | Basic | Advanced | Advanced | Advanced | SQL |
| Ease of Use | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
7.2 Performance Comparison (Approximate)
| Database | QPS (1M vectors) | p99 Latency | Memory per 1M vectors |
|---|---|---|---|
| Chroma | ~100 | ~50ms | ~2GB |
| Pinecone | ~1000 | ~10ms | Managed |
| Weaviate | ~500 | ~20ms | ~4GB |
| Qdrant | ~1500 | ~5ms | ~1.5GB |
| pgvector | ~200 | ~30ms | ~3GB |
7.3 Cost Comparison (Approximate)
| Database | Free Tier | Production (1M vectors) |
|---|---|---|
| Chroma | Unlimited (local) | Self-host costs |
| Pinecone | 100K vectors | ~$70/month |
| Weaviate | 1M vectors | ~$50/month (cloud) |
| Qdrant | 1M vectors | ~$40/month (cloud) |
| pgvector | N/A | Postgres hosting cost |
8. Decision Framework
8.1 Flowchart
```text
START
  │
  ▼
Prototyping/Learning?
  │
  ├── YES → ChromaDB
  │
  NO
  │
  ▼
Need managed cloud?
  │
  ├── YES → Pinecone or Weaviate Cloud
  │
  NO
  │
  ▼
Existing PostgreSQL?
  │
  ├── YES → pgvector
  │
  NO
  │
  ▼
Maximum performance needed?
  │
  ├── YES → Qdrant
  │
  NO
  │
  ▼
Multi-modal data?
  │
  ├── YES → Weaviate
  │
  NO
  │
  ▼
Default → Qdrant or Pinecone
```
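The same flowchart expressed as a small helper function, purely illustrative:

```python
def choose_vector_db(prototyping=False, managed_cloud=False,
                     existing_postgres=False, max_performance=False,
                     multimodal=False):
    """Encodes the decision flowchart above (illustrative only)."""
    if prototyping:
        return "ChromaDB"
    if managed_cloud:
        return "Pinecone or Weaviate Cloud"
    if existing_postgres:
        return "pgvector"
    if max_performance:
        return "Qdrant"
    if multimodal:
        return "Weaviate"
    return "Qdrant or Pinecone"

print(choose_vector_db(existing_postgres=True))  # -> pgvector
```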
8.2 Use Case Recommendations
| Use Case | Recommended | Why |
|---|---|---|
| Learning RAG | Chroma | Zero config, easy |
| Production RAG | Pinecone | Managed, scalable |
| Self-hosted prod | Qdrant | Performance, features |
| Image + Text search | Weaviate | Multi-modal |
| Existing Postgres app | pgvector | No new infra |
| Cost sensitive | Qdrant/Chroma | Self-host |
9. Hands-on: Multi-DB Comparison
Setup
```python
# Install all clients:
# pip install chromadb pinecone-client weaviate-client qdrant-client

from time import time
import numpy as np

# Generate test data
num_vectors = 10000
dim = 1536
vectors = np.random.rand(num_vectors, dim).astype(np.float32)
query = np.random.rand(dim).astype(np.float32)
```

Benchmark Function
```python
def benchmark_db(name, insert_fn, search_fn, n_searches=100):
    # Measure insert time
    start = time()
    insert_fn()
    insert_time = time() - start

    # Measure average search latency
    start = time()
    for _ in range(n_searches):
        search_fn()
    search_time = (time() - start) / n_searches

    print(f"{name}:")
    print(f"  Insert {num_vectors} vectors: {insert_time:.2f}s")
    print(f"  Average search latency: {search_time*1000:.2f}ms")
```

Compare Results
```python
# Run benchmarks for each database
# (Implement insert_fn and search_fn for each)

# Expected output:
# Chroma:
#   Insert 10000 vectors: 5.23s
#   Average search latency: 15.42ms
#
# Qdrant:
#   Insert 10000 vectors: 2.15s
#   Average search latency: 3.21ms
```
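As a concrete starting point, here is a hypothetical insert_fn/search_fn pair for Chroma, reusing num_vectors, vectors, and query from the setup above (collection name and chunk size are illustrative):

```python
import chromadb

chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("bench")

def chroma_insert():
    # Chroma caps the batch size per add() call, so insert in chunks
    step = 1000
    for start_i in range(0, num_vectors, step):
        end_i = min(start_i + step, num_vectors)
        chroma_collection.add(
            ids=[str(i) for i in range(start_i, end_i)],
            embeddings=vectors[start_i:end_i].tolist(),
        )

def chroma_search():
    chroma_collection.query(query_embeddings=[query.tolist()], n_results=5)

benchmark_db("Chroma", chroma_insert, chroma_search)
```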
10. Homework
Exercise 1: Local Comparison
- Install Chroma and Qdrant locally
- Insert 1000 vectors into each database
- Compare search latency
- Compare API ease-of-use
Exercise 2: Cloud Setup
- Sign up for the Pinecone free tier
- Create an index
- Build a simple RAG pipeline with Pinecone
- Test it with 100 queries
Exercise 3: Decision Making
For each scenario below, choose the right database:
- Startup MVP with a limited budget
- Enterprise with 10M documents
- Research project that needs multimodal search
- E-commerce site with an existing PostgreSQL database
Summary
In this lesson you learned:
- ✅ Vector database architecture and index types
- ✅ ChromaDB for prototyping
- ✅ Pinecone for production scale
- ✅ Weaviate for multi-modal search and flexibility
- ✅ Qdrant for maximum performance
- ✅ pgvector for PostgreSQL users
- ✅ Feature, performance, and cost comparisons
- ✅ A decision framework for choosing the right database
Next: Lesson 3 - Chunking Strategies - optimal ways to split documents
