Vector Database Guide
Table of Contents
- What is a Vector Database?
- Core Concepts
- How Vector Search Works
- Choosing a Vector Database
- Getting Started
- Embedding Models
- Indexing & Storage
- Querying & Filtering
- RAG: Retrieval-Augmented Generation
- Performance Tuning
- Security Considerations
- Real-World Examples
What is a Vector Database?
A vector database is a database optimized for storing and searching high-dimensional numerical vectors — called embeddings — that represent the semantic meaning of data (text, images, audio, etc.).
Unlike traditional databases that match exact values, vector databases find records that are semantically similar to a query, even if no words overlap.
Why Vector Databases?
Traditional keyword search fails at meaning:
| Query | Traditional Search | Vector Search |
|---|---|---|
| “affordable car” | Finds “affordable car” only | Also finds “cheap vehicle”, “budget sedan” |
| “heart attack symptoms” | Misses “myocardial infarction” | Matches medical synonyms |
| An image of a dog | Cannot search images | Finds visually similar dog photos |
Vector databases power the semantic layer behind modern AI applications: RAG pipelines, recommendation engines, duplicate detection, image search, and more.
Core Concepts
| Concept | Description |
|---|---|
| Embedding | A fixed-length array of floats representing an item’s meaning in high-dimensional space |
| Embedding Model | A ML model (e.g., OpenAI text-embedding-3-small) that converts raw data into embeddings |
| Dimension | The length of the embedding vector (e.g., 384, 1536, 3072) |
| Index | A data structure enabling fast approximate nearest-neighbor (ANN) search |
| Collection / Namespace | A logical group of vectors, like a table in a relational database |
| Metadata | Structured fields stored alongside each vector for filtering (e.g., author, date) |
| Similarity Metric | How “closeness” is measured: cosine similarity, dot product, or Euclidean distance |
| ANN | Approximate Nearest Neighbor — trades tiny accuracy loss for massive speed gains |
How Vector Search Works
Step 1: Embed Your Data
Every piece of data is converted into a vector by an embedding model.
"The quick brown fox" ──► embedding model ──► [0.12, -0.87, 0.34, ... 1536 values]
Step 2: Store Vectors in an Index
Vectors are stored in a specialized index (e.g., HNSW, IVF) that organizes them spatially for fast retrieval.
Step 3: Query by Similarity
At query time, embed the search query and find the k vectors closest to it in the index.
Query: "fast animal" ──► [0.10, -0.85, 0.31, ...]
│
ANN search
│
┌───────────▼───────────┐
│ "The quick brown fox" │ similarity: 0.94
│ "A swift gazelle runs" │ similarity: 0.91
│ "Slow turtle crossing" │ similarity: 0.42
└───────────────────────┘
Similarity Metrics
Cosine Similarity — measures the angle between vectors; ignores magnitude. Best for text.
cosine(A, B) = (A · B) / (|A| × |B|) range: -1 to 1
Dot Product — like cosine but magnitude-sensitive. Good when embeddings are normalized.
Euclidean Distance (L2) — straight-line distance. Better for image embeddings.
Choosing a Vector Database
Comparison of Popular Options
| Database | Best For | Hosting | Highlights |
|---|---|---|---|
| Pinecone | Production SaaS, minimal ops | Managed cloud | Fully managed, simple API, auto-scaling |
| Weaviate | Hybrid search, GraphQL API | Self-hosted / Cloud | Built-in BM25 + vector hybrid search |
| Qdrant | High performance, filtering | Self-hosted / Cloud | Payload filtering, Rust core, fast |
| Chroma | Local dev, prototyping | Self-hosted / In-memory | Zero-config, great for RAG prototypes |
| Milvus | Billion-scale deployments | Self-hosted / Cloud | Enterprise-grade, highly scalable |
| pgvector | Already using PostgreSQL | Self-hosted | Adds vector search to your existing Postgres DB |
| Redis VSS | Low-latency caching + search | Self-hosted / Cloud | Sub-millisecond, good for real-time apps |
Decision Guide
Do you already use PostgreSQL?
└─ Yes ──► pgvector (least friction)
└─ No
├─ Need fully managed, zero ops? ──► Pinecone
├─ Prototyping locally? ──► Chroma
├─ Need hybrid (keyword + vector)? ──► Weaviate
├─ Need ultra-fast filtering? ──► Qdrant
└─ Billion-scale? ──► Milvus
Getting Started
Chroma (Local / Prototype)
pip install chromadb openai
import chromadb
from openai import OpenAI
client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("docs")
def embed(text: str) -> list[float]:
res = client.embeddings.create(model="text-embedding-3-small", input=text)
return res.data[0].embedding
# Add documents
texts = ["The sky is blue.", "Python is a programming language.", "Dogs are loyal pets."]
collection.add(
ids=[f"doc_{i}" for i in range(len(texts))],
embeddings=[embed(t) for t in texts],
documents=texts,
)
# Query
results = collection.query(query_embeddings=[embed("What color is the sky?")], n_results=2)
print(results["documents"])
# [['The sky is blue.', 'Dogs are loyal pets.']]
Pinecone (Managed Cloud)
pip install pinecone-client openai
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
name="my-index",
dimension=1536,
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("my-index")
# Upsert vectors
index.upsert(vectors=[
{"id": "doc1", "values": embed("Hello world"), "metadata": {"source": "blog"}},
{"id": "doc2", "values": embed("Vector databases are fast"), "metadata": {"source": "docs"}},
])
# Query
results = index.query(vector=embed("fast database"), top_k=3, include_metadata=True)
for match in results["matches"]:
print(match["id"], match["score"], match["metadata"])
Qdrant (Self-Hosted)
docker run -p 6333:6333 qdrant/qdrant
pip install qdrant-client
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
client = QdrantClient("localhost", port=6333)
client.create_collection(
collection_name="articles",
vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
client.upsert(
collection_name="articles",
points=[
PointStruct(id=1, vector=embed("AI is transforming industries"), payload={"category": "tech"}),
PointStruct(id=2, vector=embed("Recipe for chocolate cake"), payload={"category": "food"}),
],
)
results = client.search(
collection_name="articles",
query_vector=embed("machine learning applications"),
limit=5,
)
Embedding Models
Choosing an Embedding Model
| Model | Dimensions | Best For | Cost |
|---|---|---|---|
text-embedding-3-small (OpenAI) |
1536 | General text, good price/perf | Low |
text-embedding-3-large (OpenAI) |
3072 | Higher accuracy tasks | Medium |
all-MiniLM-L6-v2 (Sentence Transformers) |
384 | Local, fast, free | Free |
bge-large-en-v1.5 (BAAI) |
1024 | High accuracy, free | Free |
embed-english-v3.0 (Cohere) |
1024 | Production, multilingual | Low |
nomic-embed-text (Nomic) |
768 | Local open-source | Free |
Running Embeddings Locally
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["Vector databases are fast.", "I love machine learning."]
embeddings = model.encode(texts) # shape: (2, 384)
Batching for Efficiency
Always batch embedding calls to avoid rate limits and reduce latency:
def embed_batch(texts: list[str], batch_size: int = 100) -> list[list[float]]:
all_embeddings = []
for i in range(0, len(texts), batch_size):
batch = texts[i : i + batch_size]
res = openai_client.embeddings.create(model="text-embedding-3-small", input=batch)
all_embeddings.extend([r.embedding for r in res.data])
return all_embeddings
Indexing & Storage
Index Algorithms
HNSW (Hierarchical Navigable Small World) — the most widely used ANN algorithm. Builds a multi-layer graph for fast greedy search. High recall, fast queries, higher memory use.
IVF (Inverted File Index) — clusters vectors into Voronoi cells; searches only nearby clusters. Lower memory than HNSW, slightly lower recall.
Flat (Brute Force) — exact nearest neighbor by comparing every vector. Perfect recall, slow at scale. Good for datasets under ~100k vectors.
Dataset Size Recommended Index
─────────────────────────────────
< 100k Flat (exact)
100k – 10M HNSW
> 10M IVF-PQ or HNSW + quantization
Chunking Strategies for Text
Long documents must be split into chunks before embedding. Chunk size significantly affects retrieval quality.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=512, # characters per chunk
chunk_overlap=64, # overlap to preserve context across chunks
separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_text(long_document)
Chunking tips:
- Use overlapping chunks (10–15%) to avoid cutting context at boundaries
- Smaller chunks (256–512 tokens) improve precision; larger chunks (1024+) improve recall
- Store the chunk’s parent document ID in metadata for context retrieval
Querying & Filtering
Basic Similarity Search
# Pinecone
results = index.query(vector=query_embedding, top_k=10)
# Qdrant
results = client.search(collection_name="docs", query_vector=query_embedding, limit=10)
# Chroma
results = collection.query(query_embeddings=[query_embedding], n_results=10)
Metadata Filtering
Filter by structured fields while searching by vector — crucial for multi-tenant apps or time-scoped search.
# Pinecone: filter during query
results = index.query(
vector=query_embedding,
top_k=10,
filter={"source": {"$eq": "blog"}, "year": {"$gte": 2023}},
)
# Qdrant: filter with payload conditions
from qdrant_client.models import Filter, FieldCondition, MatchValue
results = client.search(
collection_name="articles",
query_vector=query_embedding,
query_filter=Filter(
must=[FieldCondition(key="category", match=MatchValue(value="tech"))]
),
limit=10,
)
Hybrid Search (Vector + Keyword)
Combine dense vector search with sparse BM25 keyword search for best-of-both-worlds results.
# Weaviate hybrid search
result = (
client.query
.get("Article", ["title", "body"])
.with_hybrid(query="transformer neural network", alpha=0.5) # 0=BM25, 1=vector
.with_limit(10)
.do()
)
alpha controls the blend: 0.0 = pure keyword, 1.0 = pure vector, 0.5 = equal mix.
RAG: Retrieval-Augmented Generation
RAG is the most common use case for vector databases: retrieve relevant context from a knowledge base, then pass it to an LLM to generate a grounded answer.
RAG Pipeline
User Question
│
▼
[Embed Question]
│
▼
[Vector Search] ──► Top-K relevant chunks
│
▼
[Build Prompt] ──► "Answer using this context: {chunks}\n\nQuestion: {question}"
│
▼
[LLM Generate] ──► Grounded Answer
End-to-End RAG Example
from openai import OpenAI
import chromadb
openai_client = OpenAI()
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("knowledge_base")
def embed(text: str) -> list[float]:
return openai_client.embeddings.create(
model="text-embedding-3-small", input=text
).data[0].embedding
def index_documents(docs: list[dict]):
"""docs: list of {"id": str, "text": str, "metadata": dict}"""
collection.upsert(
ids=[d["id"] for d in docs],
embeddings=[embed(d["text"]) for d in docs],
documents=[d["text"] for d in docs],
metadatas=[d["metadata"] for d in docs],
)
def rag_query(question: str, k: int = 5) -> str:
# Retrieve
results = collection.query(query_embeddings=[embed(question)], n_results=k)
context = "\n\n".join(results["documents"][0])
# Generate
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "Answer using only the provided context. If unsure, say so.",
},
{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {question}",
},
],
)
return response.choices[0].message.content
# Usage
index_documents([
{"id": "1", "text": "MCP was created by Anthropic in 2024.", "metadata": {"source": "blog"}},
{"id": "2", "text": "Vector databases store high-dimensional embeddings.", "metadata": {"source": "docs"}},
])
answer = rag_query("Who created MCP?")
print(answer)
Performance Tuning
Reducing Embedding Dimensions
OpenAI’s text-embedding-3 models support dimension reduction with minimal accuracy loss:
res = openai_client.embeddings.create(
model="text-embedding-3-small",
input="Hello world",
dimensions=256, # reduce from 1536 → 256
)
Quantization
Quantization compresses vectors from 32-bit floats to 8-bit integers, reducing memory by 4x with ~1% recall loss.
# Qdrant scalar quantization
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType
client.create_collection(
collection_name="compressed",
vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
quantization_config=ScalarQuantization(
scalar=ScalarQuantizationConfig(type=ScalarType.INT8, quantile=0.99, always_ram=True)
),
)
HNSW Tuning Parameters
| Parameter | Effect | Trade-off |
|---|---|---|
m (connections per node) |
Higher = better recall | More memory |
ef_construction |
Higher = better index quality | Slower indexing |
ef (search) |
Higher = better recall at query time | Slower queries |
Start with m=16, ef_construction=200, ef=100 and tune from there.
Caching Embeddings
Avoid re-embedding the same text repeatedly — cache results by content hash:
import hashlib
_cache: dict[str, list[float]] = {}
def cached_embed(text: str) -> list[float]:
key = hashlib.sha256(text.encode()).hexdigest()
if key not in _cache:
_cache[key] = embed(text)
return _cache[key]
For production, back the cache with Redis or a persistent key-value store.
Security Considerations
Multi-Tenancy
Never mix vectors from different users or tenants in the same namespace without isolation:
# Use per-tenant namespaces (Pinecone) or collections (Qdrant/Chroma)
index.upsert(vectors=vectors, namespace=f"tenant_{user_id}")
results = index.query(vector=query_vec, top_k=10, namespace=f"tenant_{user_id}")
Metadata Sanitization
Metadata is returned directly to callers — never store sensitive PII in vector metadata:
# Bad: storing sensitive data in metadata
payload = {"user_email": "alice@example.com", "ssn": "123-45-6789", "text": chunk}
# Good: store only non-sensitive, lookupable fields
payload = {"document_id": "doc_42", "source": "handbook", "section": "onboarding"}
Access Control
Vector databases themselves have limited built-in RBAC. Enforce access at the application layer:
- Validate user permissions before issuing queries
- Use separate collections or namespaces per access level
- Rotate API keys regularly and store them in a secrets manager (never in code)
Prompt Injection via Retrieved Content
Malicious content in your vector store can hijack your RAG pipeline:
# Sanitize retrieved chunks before inserting into prompts
import html
def safe_context(chunks: list[str]) -> str:
sanitized = [html.escape(c) for c in chunks]
return "\n\n---\n\n".join(sanitized)
Consider adding a content moderation step when indexing user-generated content.
Real-World Examples
Example 1: Semantic Document Search
from pathlib import Path
from openai import OpenAI
import chromadb
openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_store")
collection = chroma.get_or_create_collection("documents")
def ingest_folder(folder: str):
for path in Path(folder).rglob("*.txt"):
text = path.read_text()
chunks = [text[i:i+500] for i in range(0, len(text), 450)]
for i, chunk in enumerate(chunks):
vec = openai_client.embeddings.create(
model="text-embedding-3-small", input=chunk
).data[0].embedding
collection.upsert(
ids=[f"{path.stem}_{i}"],
embeddings=[vec],
documents=[chunk],
metadatas=[{"file": path.name, "chunk": i}],
)
def search(query: str, k: int = 5):
vec = openai_client.embeddings.create(
model="text-embedding-3-small", input=query
).data[0].embedding
return collection.query(query_embeddings=[vec], n_results=k)
ingest_folder("./my_docs")
results = search("quarterly revenue targets")
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
print(f"[{meta['file']}] {doc[:120]}...")
Example 2: Image Similarity Search
from PIL import Image
import torch
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
def embed_image(image_path: str) -> list[float]:
image = Image.open(image_path)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
features = model.get_image_features(**inputs)
return features[0].numpy().tolist()
def embed_text_clip(text: str) -> list[float]:
inputs = processor(text=[text], return_tensors="pt", padding=True)
with torch.no_grad():
features = model.get_text_features(**inputs)
return features[0].numpy().tolist()
# Index images into Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
qdrant = QdrantClient(":memory:")
qdrant.create_collection("images", vectors_config=VectorParams(size=512, distance=Distance.COSINE))
image_paths = ["cat.jpg", "dog.jpg", "car.jpg"]
qdrant.upsert("images", points=[
PointStruct(id=i, vector=embed_image(p), payload={"path": p})
for i, p in enumerate(image_paths)
])
# Text-to-image search
results = qdrant.search("images", query_vector=embed_text_clip("a fluffy animal"), limit=3)
for r in results:
print(r.payload["path"], r.score)
Example 3: Duplicate Detection
Find near-duplicate records in a dataset using vector similarity:
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
records = [
"Customer complaint: product arrived broken",
"Feedback: item was damaged on delivery",
"New feature request: dark mode support",
"Bug report: app crashes on startup",
"Issue: received a broken product",
]
embeddings = model.encode(records, normalize_embeddings=True)
similarity_matrix = np.dot(embeddings, embeddings.T)
threshold = 0.85
for i in range(len(records)):
for j in range(i + 1, len(records)):
if similarity_matrix[i][j] >= threshold:
score = float(similarity_matrix[i][j])
print(f"Duplicate (score={score:.2f}):")
print(f" A: {records[i]}")
print(f" B: {records[j]}\n")
Further Reading
- Pinecone Learning Center
- Weaviate Documentation
- Qdrant Documentation
- Chroma Documentation
- pgvector on GitHub
- HNSW Paper (Malkov & Yashunin, 2016)
- Sentence Transformers
- LangChain Vector Store Integrations
Guide covers vector database tooling as of early 2025. Embedding models and database APIs evolve quickly — always check official docs for the latest versions.
Vector databases don’t replace traditional databases—they sit beside them, acting as a semantic layer over raw data. Once that clicks, the design space opens up: systems stop being rigid and start behaving more like adaptive retrieval engines, where meaning, not structure, drives everything. And just to tie it back to the bigger picture—the rise of vector databases is tightly connected to systems like MCP , where context is no longer static. It’s retrieved, ranked, and injected dynamically. Vector search ends up being the quiet engine behind that entire loop, even if it rarely gets the spotlight.