ChromaDB - Local Vector Database

🎯 Lesson Objectives

ChromaDB is one of the most popular open-source vector databases for development and prototyping. It is lightweight, fast, and runs locally, which makes it a good fit for RAG applications.

After this lesson, you will be able to:

✅ Set up ChromaDB (in-memory and persistent)
✅ Perform CRUD operations on collections
✅ Use embedding functions (OpenAI, Sentence Transformers)
✅ Apply metadata filtering and advanced queries
✅ Build a complete search pipeline

🛠️ Setup & Quick Start

Installation

python.py

# pip install chromadb
import chromadb

# In-memory (development): data is lost when the process exits
client = chromadb.Client()

# Persistent (production-like): data is stored on disk under ./chroma_db
client = chromadb.PersistentClient(path="./chroma_db")

print("ChromaDB version:", chromadb.__version__)

Collections

python.py

# Create collection (raises an error if the name already exists)
collection = client.create_collection(
    name="documents",
    metadata={"description": "Company knowledge base"}
)

# Or get the existing one, creating it only if it is missing
collection = client.get_or_create_collection("documents")

# List all collections
print(client.list_collections())

Checkpoint

Have you set up ChromaDB with both Client and PersistentClient?

📝 Adding Documents

Basic Add

python.py

# Add documents with auto-generated embeddings
collection.add(
    documents=[
        "RAG kết hợp retrieval và generation để trả lời câu hỏi.",
        "Vector database lưu trữ embeddings cho semantic search.",
        "LangChain là framework phổ biến để xây dựng LLM applications.",
        "Chunking chia documents thành các phần nhỏ hơn để indexing.",
        "Embeddings chuyển text thành vectors số trong không gian nhiều chiều."
    ],
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
    metadatas=[
        {"source": "rag-guide", "category": "concept", "language": "vi"},
        {"source": "vector-db-101", "category": "database", "language": "vi"},
        {"source": "langchain-docs", "category": "framework", "language": "vi"},
        {"source": "rag-guide", "category": "technique", "language": "vi"},
        {"source": "ml-basics", "category": "concept", "language": "vi"}
    ]
)

print(f"Collection count: {collection.count()}")
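
`collection.add` takes parallel lists, so position i in `documents`, `ids`, and `metadatas` must describe the same record. The sketch below is a hypothetical pre-flight check (`validate_batch` is an illustrative name, not part of ChromaDB) that catches mismatched lengths and duplicate ids before anything is sent to the collection.

```python
def validate_batch(documents, ids, metadatas=None):
    """Hypothetical helper: check parallel lists before calling collection.add."""
    if len(documents) != len(ids):
        raise ValueError(f"{len(documents)} documents but {len(ids)} ids")
    if metadatas is not None and len(metadatas) != len(ids):
        raise ValueError(f"{len(metadatas)} metadatas but {len(ids)} ids")

    # Ids should be unique within a batch
    seen = set()
    duplicates = []
    for doc_id in ids:
        if doc_id in seen and doc_id not in duplicates:
            duplicates.append(doc_id)
        seen.add(doc_id)
    if duplicates:
        raise ValueError(f"duplicate ids: {duplicates}")
    return True


docs = ["first text", "second text"]
batch_ok = validate_batch(
    docs,
    ["doc1", "doc2"],
    [{"source": "a"}, {"source": "b"}],
)
```

Running a check like this first turns a confusing failure inside the database call into a clear error message at the call site.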

Custom Embedding Functions

python.py

# Option 1: Default (all-MiniLM-L6-v2; free, local)
collection_default = client.create_collection("default_embed")

# Option 2: OpenAI Embeddings
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

openai_ef = OpenAIEmbeddingFunction(
    api_key="your-openai-key",
    model_name="text-embedding-3-small"
)

collection_openai = client.create_collection(
    name="openai_embed",
    embedding_function=openai_ef
)

# Option 3: Sentence Transformers (free, local, multilingual)
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

st_ef = SentenceTransformerEmbeddingFunction(
    model_name="paraphrase-multilingual-MiniLM-L12-v2"
)

collection_multi = client.create_collection(
    name="multilingual",
    embedding_function=st_ef
)
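
Conceptually, every embedding function above is a callable that takes a list of texts and returns one fixed-length vector per text. The toy class below illustrates that contract with the standard library only. `ToyHashEmbedding` is a hypothetical name, its vectors carry no semantic meaning, and the argument name `input` mirrors what recent ChromaDB versions appear to use, so check your installed version before subclassing the real interface.

```python
import hashlib
import math


class ToyHashEmbedding:
    """Toy stand-in for an embedding function (NOT semantically meaningful)."""

    def __init__(self, dim=8):
        self.dim = dim

    def __call__(self, input):
        # One vector per input text, all with the same dimension
        return [self._embed(text) for text in input]

    def _embed(self, text):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            # Hash each token into one of `dim` buckets
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[digest[0] % self.dim] += 1.0
        # L2-normalize so every non-empty text maps to a unit vector
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]


toy_ef = ToyHashEmbedding(dim=8)
vectors = toy_ef(["vector database", "semantic search with embeddings"])
```

A real model replaces the hashing step with a neural network, but the shape of the input and output stays the same.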

Add with Pre-computed Embeddings

python.py

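# If you already have embeddings, pass them directly.
# Use a separate collection: every vector in a collection must have the
# same dimension, and "documents" already holds 384-dimensional vectors
# from the default model.
precomputed = client.get_or_create_collection("precomputed")
precomputed.add(
    ids=["custom1", "custom2"],
    embeddings=[
        [0.1, 0.2, 0.3, 0.4],  # pre-computed embedding
        [0.5, 0.6, 0.7, 0.8]
    ],
    documents=["Document A", "Document B"],
    metadatas=[{"type": "custom"}, {"type": "custom"}]
)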
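
Pre-computed vectors only work if they all share one dimension, and that dimension matches whatever the target collection already stores (for example, the default all-MiniLM-L6-v2 model produces 384-dimensional vectors). A small check like the one below can catch a mismatch early. `check_dimensions` is a hypothetical helper written for this lesson, not a ChromaDB function.

```python
def check_dimensions(embeddings, expected_dim=None):
    """Hypothetical helper: return the shared dimension or raise ValueError."""
    if not embeddings:
        raise ValueError("no embeddings given")
    # Without an expected size, use the first vector as the reference
    dim = expected_dim if expected_dim is not None else len(embeddings[0])
    for index, vector in enumerate(embeddings):
        if len(vector) != dim:
            raise ValueError(
                f"embedding {index} has {len(vector)} dimensions, expected {dim}"
            )
    return dim


dim = check_dimensions([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])
```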

Checkpoint

Have you learned how to add documents with auto-embedding and custom embedding functions?

🔍 Querying

Semantic Search

python.py

# Query by text (auto-embedded)
results = collection.query(
    query_texts=["vector database là gì?"],
    n_results=3
)

for doc, distance, metadata in zip(
    results['documents'][0],
    results['distances'][0],
    results['metadatas'][0]
):
    print(f"[{distance:.4f}] {doc[:80]}...")
    print(f"  Metadata: {metadata}")
    print()
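
The `distance` printed above is a dissimilarity score: lower means closer. To make that concrete, here is a brute-force sketch of nearest-neighbor ranking over toy 2-dimensional vectors. It assumes squared L2 as the metric (to my knowledge ChromaDB's default unless the collection sets `hnsw:space`) and also shows cosine distance. ChromaDB itself uses an approximate HNSW index rather than a full scan, and the function names here are illustrative only.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_query(query_vec, vectors, n_results=3, metric=squared_l2):
    """Rank ids by ascending distance to the query vector."""
    scored = [(metric(query_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort()
    return [(doc_id, dist) for dist, doc_id in scored[:n_results]]


toy_vectors = {
    "doc1": [1.0, 0.0],
    "doc2": [0.0, 1.0],
    "doc3": [0.8, 0.6],
}
ranked = brute_force_query([1.0, 0.0], toy_vectors, n_results=2)
```

Here `doc1` ranks first because it coincides with the query, and `doc3` ranks second because it points in a similar direction.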

Metadata Filtering

python.py

# Filter by metadata
results = collection.query(
    query_texts=["retrieval techniques"],
    n_results=5,
    where={"category": "concept"}  # Exact match
)

# Complex filters
results = collection.query(
    query_texts=["database setup"],
    n_results=5,
    where={
        "$and": [
            {"category": {"$in": ["database", "concept"]}},
            {"language": {"$eq": "vi"}}
        ]
    }
)

# Filter by document content
results = collection.query(
    query_texts=["RAG"],
    n_results=5,
    where_document={"$contains": "LangChain"}
)

Filter Operators

Operator	Meaning	Example
`$eq`	Equals	`{"field": {"$eq": "value"}}`
`$ne`	Not equals	`{"field": {"$ne": "value"}}`
`$gt`	Greater than	`{"page": {"$gt": 5}}`
`$gte`	Greater or equal	`{"score": {"$gte": 0.8}}`
`$lt`	Less than	`{"price": {"$lt": 100}}`
`$lte`	Less or equal	`{"price": {"$lte": 100}}`
`$in`	In list	`{"cat": {"$in": ["a", "b"]}}`
`$nin`	Not in list	`{"cat": {"$nin": ["x"]}}`
`$and`	All conditions	`{"$and": [...]}`
`$or`	Any condition	`{"$or": [...]}`
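
To see how these operators combine, the sketch below evaluates a `where`-style filter against a single metadata dict in plain Python. It is a reading aid for the semantics in the table, not ChromaDB's implementation. `matches` is a hypothetical function, and treating a missing key as "no match" is an assumption of this sketch.

```python
_OPS = {
    "$eq": lambda value, target: value == target,
    "$ne": lambda value, target: value != target,
    "$gt": lambda value, target: value > target,
    "$gte": lambda value, target: value >= target,
    "$lt": lambda value, target: value < target,
    "$lte": lambda value, target: value <= target,
    "$in": lambda value, target: value in target,
    "$nin": lambda value, target: value not in target,
}


def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            # Shorthand {"field": value} means {"field": {"$eq": value}}
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            if key not in metadata:
                return False  # assumption of this sketch
            for op, target in condition.items():
                if not _OPS[op](metadata[key], target):
                    return False
    return True


meta = {"category": "concept", "language": "vi", "page": 7}
where = {
    "$and": [
        {"category": {"$in": ["database", "concept"]}},
        {"language": {"$eq": "vi"}},
    ]
}
result = matches(meta, where)
```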

Checkpoint

Can you now query with semantic search and metadata filtering?

🛠️ Update & Delete

Update Documents

python.py

# Update document content
collection.update(
    ids=["doc1"],
    documents=["RAG (Retrieval-Augmented Generation) kết hợp search và LLM để trả lời chính xác."],
    metadatas=[{"source": "rag-guide-v2", "category": "concept", "language": "vi", "version": 2}]
)

# Upsert (insert if not exists, update if exists)
collection.upsert(
    ids=["doc1", "doc6"],
    documents=[
        "Updated: RAG architecture overview",
        "New: Fine-tuning vs RAG comparison"
    ],
    metadatas=[
        {"category": "concept", "version": 3},
        {"category": "comparison", "version": 1}
    ]
)
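
The difference between update and upsert is easiest to see on a toy model. The dict-backed class below is only an illustration of "insert if not exists, update if exists". `ToyStore` is a hypothetical name, it tracks document text only, and its choice to skip unknown ids in `update` is an assumption rather than a statement about how ChromaDB behaves.

```python
class ToyStore:
    """Dict-backed toy model of update vs. upsert semantics."""

    def __init__(self):
        self.docs = {}

    def upsert(self, ids, documents):
        # Insert new ids and overwrite existing ones
        for doc_id, text in zip(ids, documents):
            self.docs[doc_id] = text

    def update(self, ids, documents):
        # Only touch ids that already exist; report the rest
        skipped = []
        for doc_id, text in zip(ids, documents):
            if doc_id in self.docs:
                self.docs[doc_id] = text
            else:
                skipped.append(doc_id)
        return skipped


store = ToyStore()
store.upsert(["doc1"], ["v1"])
skipped = store.update(["doc1", "doc6"], ["v2", "new"])
after_update = dict(store.docs)
store.upsert(["doc1", "doc6"], ["v3", "new"])
after_upsert = dict(store.docs)
```

After the `update` call only `doc1` has changed, while the final `upsert` both rewrites `doc1` and creates `doc6`.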

Delete

python.py

# Delete by ID
collection.delete(ids=["doc5"])

# Delete by filter
collection.delete(where={"category": "temporary"})

# Delete collection
client.delete_collection("old_collection")

Checkpoint

Do you know how to update, upsert, and delete documents in ChromaDB?

💻 Complete RAG Search Pipeline

python.py

import uuid

import chromadb
# Requires: pip install sentence-transformers
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

class VectorSearchPipeline:
    def __init__(self, collection_name="knowledge_base", persist_dir="./chroma_db"):
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.embedding_fn = SentenceTransformerEmbeddingFunction(
            model_name="paraphrase-multilingual-MiniLM-L12-v2"
        )
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=self.embedding_fn,
            metadata={"hnsw:space": "cosine"}  # Use cosine distance
        )

    def add_documents(self, documents, metadatas=None):
        """Add documents to the collection."""
        # Random ids avoid the collisions that count()-based ids cause after deletions
        ids = [f"doc_{uuid.uuid4().hex}" for _ in documents]
        self.collection.add(
            documents=documents,
            ids=ids,
            metadatas=metadatas
        )
        print(f"Added {len(documents)} documents. Total: {self.collection.count()}")

    def search(self, query, n_results=5, category=None):
        """Semantic search with optional filtering."""
        where_filter = None
        if category:
            where_filter = {"category": category}

        results = self.collection.query(
            query_texts=[query],
            n_results=n_results,
            where=where_filter
        )

        # Results are nested per query; index 0 is our single query
        output = []
        for i in range(len(results['documents'][0])):
            output.append({
                "content": results['documents'][0][i],
                "distance": results['distances'][0][i],
                "metadata": results['metadatas'][0][i],
                "id": results['ids'][0][i]
            })
        return output

    def get_context(self, query, n_results=3):
        """Get formatted context for LLM prompt."""
        results = self.search(query, n_results=n_results)

        context_parts = []
        for i, r in enumerate(results, 1):
            # Metadata can be None for documents added without any
            source = (r['metadata'] or {}).get('source', 'unknown')
            context_parts.append(
                f"[Source {i}: {source}]\n{r['content']}"
            )

        return "\n\n---\n\n".join(context_parts)

# Usage
pipeline = VectorSearchPipeline()
pipeline.add_documents(
    documents=["...", "...", "..."],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}, {"source": "doc3"}]
)

context = pipeline.get_context("Cách setup RAG pipeline?")
print(context)
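
The context string is only half of a RAG call; it still has to be placed in a prompt. The sketch below shows one possible way to do that with mock search results, so it runs without a database. `format_context` repeats the formatting used by `get_context`, while `build_prompt` and its template wording are hypothetical and should be adapted to the LLM you use.

```python
def format_context(results):
    """Format search results the same way get_context does."""
    parts = []
    for i, r in enumerate(results, 1):
        source = (r["metadata"] or {}).get("source", "unknown")
        parts.append(f"[Source {i}: {source}]\n{r['content']}")
    return "\n\n---\n\n".join(parts)


def build_prompt(question, context):
    """Hypothetical prompt template; adjust the wording for your LLM."""
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Mock results shaped like the dicts returned by search()
mock_results = [
    {"content": "RAG combines retrieval and generation.", "metadata": {"source": "rag-guide"}},
    {"content": "Vector databases store embeddings.", "metadata": None},
]
prompt = build_prompt("What is RAG?", format_context(mock_results))
```

Keeping the source labels in the prompt lets the model cite where each statement came from.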

Checkpoint

Have you built a complete search pipeline with ChromaDB?
