Vector Database Guide

March 27, 2026

What is a Vector Database?
Core Concepts
How Vector Search Works
Choosing a Vector Database
Getting Started
Embedding Models
Indexing & Storage
Querying & Filtering
RAG: Retrieval-Augmented Generation
Performance Tuning
Security Considerations
Real-World Examples

What is a Vector Database?

A vector database is a database optimized for storing and searching high-dimensional numerical vectors — called embeddings — that represent the semantic meaning of data (text, images, audio, etc.).

Unlike traditional databases that match exact values, vector databases find records that are semantically similar to a query, even if no words overlap.

Why Vector Databases?

Traditional keyword search fails at meaning:

Query	Traditional Search	Vector Search
“affordable car”	Finds “affordable car” only	Also finds “cheap vehicle”, “budget sedan”
“heart attack symptoms”	Misses “myocardial infarction”	Matches medical synonyms
An image of a dog	Cannot search images	Finds visually similar dog photos

Vector databases power the semantic layer behind modern AI applications: RAG pipelines, recommendation engines, duplicate detection, image search, and more.

Core Concepts

Concept	Description
Embedding	A fixed-length array of floats representing an item’s meaning in high-dimensional space
Embedding Model	A ML model (e.g., OpenAI `text-embedding-3-small`) that converts raw data into embeddings
Dimension	The length of the embedding vector (e.g., 384, 1536, 3072)
Index	A data structure enabling fast approximate nearest-neighbor (ANN) search
Collection / Namespace	A logical group of vectors, like a table in a relational database
Metadata	Structured fields stored alongside each vector for filtering (e.g., `author`, `date`)
Similarity Metric	How “closeness” is measured: cosine similarity, dot product, or Euclidean distance
ANN	Approximate Nearest Neighbor — trades tiny accuracy loss for massive speed gains

How Vector Search Works

Step 1: Embed Your Data

Every piece of data is converted into a vector by an embedding model.

"The quick brown fox" ──► embedding model ──► [0.12, -0.87, 0.34, ... 1536 values]

Step 2: Store Vectors in an Index

Vectors are stored in a specialized index (e.g., HNSW, IVF) that organizes them spatially for fast retrieval.

Step 3: Query by Similarity

At query time, embed the search query and find the k vectors closest to it in the index.

Query: "fast animal" ──► [0.10, -0.85, 0.31, ...]
                                     │
                              ANN search
                                     │
                         ┌───────────▼───────────┐
                         │ "The quick brown fox"  │  similarity: 0.94
                         │ "A swift gazelle runs" │  similarity: 0.91
                         │ "Slow turtle crossing" │  similarity: 0.42
                         └───────────────────────┘

Similarity Metrics

Cosine Similarity — measures the angle between vectors; ignores magnitude. Best for text.

cosine(A, B) = (A · B) / (|A| × |B|)     range: -1 to 1

Dot Product — like cosine but magnitude-sensitive. Good when embeddings are normalized.

Euclidean Distance (L2) — straight-line distance. Better for image embeddings.

Choosing a Vector Database

Comparison of Popular Options

Database	Best For	Hosting	Highlights
Pinecone	Production SaaS, minimal ops	Managed cloud	Fully managed, simple API, auto-scaling
Weaviate	Hybrid search, GraphQL API	Self-hosted / Cloud	Built-in BM25 + vector hybrid search
Qdrant	High performance, filtering	Self-hosted / Cloud	Payload filtering, Rust core, fast
Chroma	Local dev, prototyping	Self-hosted / In-memory	Zero-config, great for RAG prototypes
Milvus	Billion-scale deployments	Self-hosted / Cloud	Enterprise-grade, highly scalable
pgvector	Already using PostgreSQL	Self-hosted	Adds vector search to your existing Postgres DB
Redis VSS	Low-latency caching + search	Self-hosted / Cloud	Sub-millisecond, good for real-time apps

Decision Guide

Do you already use PostgreSQL?
  └─ Yes ──► pgvector (least friction)
  └─ No
      ├─ Need fully managed, zero ops? ──► Pinecone
      ├─ Prototyping locally? ──► Chroma
      ├─ Need hybrid (keyword + vector)? ──► Weaviate
      ├─ Need ultra-fast filtering? ──► Qdrant
      └─ Billion-scale? ──► Milvus

Getting Started

Chroma (Local / Prototype)

pip install chromadb openai

import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("docs")

def embed(text: str) -> list[float]:
    res = client.embeddings.create(model="text-embedding-3-small", input=text)
    return res.data[0].embedding

# Add documents
texts = ["The sky is blue.", "Python is a programming language.", "Dogs are loyal pets."]
collection.add(
    ids=[f"doc_{i}" for i in range(len(texts))],
    embeddings=[embed(t) for t in texts],
    documents=texts,
)

# Query
results = collection.query(query_embeddings=[embed("What color is the sky?")], n_results=2)
print(results["documents"])
# [['The sky is blue.', 'Dogs are loyal pets.']]

Pinecone (Managed Cloud)

pip install pinecone-client openai

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
    name="my-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("my-index")

# Upsert vectors
index.upsert(vectors=[
    {"id": "doc1", "values": embed("Hello world"), "metadata": {"source": "blog"}},
    {"id": "doc2", "values": embed("Vector databases are fast"), "metadata": {"source": "docs"}},
])

# Query
results = index.query(vector=embed("fast database"), top_k=3, include_metadata=True)
for match in results["matches"]:
    print(match["id"], match["score"], match["metadata"])

Qdrant (Self-Hosted)

docker run -p 6333:6333 qdrant/qdrant
pip install qdrant-client

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

client.upsert(
    collection_name="articles",
    points=[
        PointStruct(id=1, vector=embed("AI is transforming industries"), payload={"category": "tech"}),
        PointStruct(id=2, vector=embed("Recipe for chocolate cake"), payload={"category": "food"}),
    ],
)

results = client.search(
    collection_name="articles",
    query_vector=embed("machine learning applications"),
    limit=5,
)

Embedding Models

Choosing an Embedding Model

Model	Dimensions	Best For	Cost
`text-embedding-3-small` (OpenAI)	1536	General text, good price/perf	Low
`text-embedding-3-large` (OpenAI)	3072	Higher accuracy tasks	Medium
`all-MiniLM-L6-v2` (Sentence Transformers)	384	Local, fast, free	Free
`bge-large-en-v1.5` (BAAI)	1024	High accuracy, free	Free
`embed-english-v3.0` (Cohere)	1024	Production, multilingual	Low
`nomic-embed-text` (Nomic)	768	Local open-source	Free

Running Embeddings Locally

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["Vector databases are fast.", "I love machine learning."]
embeddings = model.encode(texts)  # shape: (2, 384)

Batching for Efficiency

Always batch embedding calls to avoid rate limits and reduce latency:

def embed_batch(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        res = openai_client.embeddings.create(model="text-embedding-3-small", input=batch)
        all_embeddings.extend([r.embedding for r in res.data])
    return all_embeddings

Indexing & Storage

Index Algorithms

HNSW (Hierarchical Navigable Small World) — the most widely used ANN algorithm. Builds a multi-layer graph for fast greedy search. High recall, fast queries, higher memory use.

IVF (Inverted File Index) — clusters vectors into Voronoi cells; searches only nearby clusters. Lower memory than HNSW, slightly lower recall.

Flat (Brute Force) — exact nearest neighbor by comparing every vector. Perfect recall, slow at scale. Good for datasets under ~100k vectors.

Dataset Size    Recommended Index
─────────────────────────────────
< 100k          Flat (exact)
100k – 10M      HNSW
> 10M           IVF-PQ or HNSW + quantization

Chunking Strategies for Text

Long documents must be split into chunks before embedding. Chunk size significantly affects retrieval quality.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # characters per chunk
    chunk_overlap=64,      # overlap to preserve context across chunks
    separators=["\n\n", "\n", ".", " "],
)

chunks = splitter.split_text(long_document)

Chunking tips:

Use overlapping chunks (10–15%) to avoid cutting context at boundaries
Smaller chunks (256–512 tokens) improve precision; larger chunks (1024+) improve recall
Store the chunk’s parent document ID in metadata for context retrieval

Querying & Filtering

Basic Similarity Search

# Pinecone
results = index.query(vector=query_embedding, top_k=10)

# Qdrant
results = client.search(collection_name="docs", query_vector=query_embedding, limit=10)

# Chroma
results = collection.query(query_embeddings=[query_embedding], n_results=10)

Metadata Filtering

Filter by structured fields while searching by vector — crucial for multi-tenant apps or time-scoped search.

# Pinecone: filter during query
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"source": {"$eq": "blog"}, "year": {"$gte": 2023}},
)

# Qdrant: filter with payload conditions
from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.search(
    collection_name="articles",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="tech"))]
    ),
    limit=10,
)

Hybrid Search (Vector + Keyword)

Combine dense vector search with sparse BM25 keyword search for best-of-both-worlds results.

# Weaviate hybrid search
result = (
    client.query
    .get("Article", ["title", "body"])
    .with_hybrid(query="transformer neural network", alpha=0.5)  # 0=BM25, 1=vector
    .with_limit(10)
    .do()
)

alpha controls the blend: 0.0 = pure keyword, 1.0 = pure vector, 0.5 = equal mix.

RAG: Retrieval-Augmented Generation

RAG is the most common use case for vector databases: retrieve relevant context from a knowledge base, then pass it to an LLM to generate a grounded answer.

RAG Pipeline

User Question
     │
     ▼
[Embed Question]
     │
     ▼
[Vector Search] ──► Top-K relevant chunks
     │
     ▼
[Build Prompt]  ──► "Answer using this context: {chunks}\n\nQuestion: {question}"
     │
     ▼
[LLM Generate]  ──► Grounded Answer

End-to-End RAG Example

from openai import OpenAI
import chromadb

openai_client = OpenAI()
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("knowledge_base")

def embed(text: str) -> list[float]:
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def index_documents(docs: list[dict]):
    """docs: list of {"id": str, "text": str, "metadata": dict}"""
    collection.upsert(
        ids=[d["id"] for d in docs],
        embeddings=[embed(d["text"]) for d in docs],
        documents=[d["text"] for d in docs],
        metadatas=[d["metadata"] for d in docs],
    )

def rag_query(question: str, k: int = 5) -> str:
    # Retrieve
    results = collection.query(query_embeddings=[embed(question)], n_results=k)
    context = "\n\n".join(results["documents"][0])

    # Generate
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer using only the provided context. If unsure, say so.",
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content

# Usage
index_documents([
    {"id": "1", "text": "MCP was created by Anthropic in 2024.", "metadata": {"source": "blog"}},
    {"id": "2", "text": "Vector databases store high-dimensional embeddings.", "metadata": {"source": "docs"}},
])

answer = rag_query("Who created MCP?")
print(answer)

Performance Tuning

Reducing Embedding Dimensions

OpenAI’s text-embedding-3 models support dimension reduction with minimal accuracy loss:

res = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
    dimensions=256,  # reduce from 1536 → 256
)

Quantization

Quantization compresses vectors from 32-bit floats to 8-bit integers, reducing memory by 4x with ~1% recall loss.

# Qdrant scalar quantization
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType

client.create_collection(
    collection_name="compressed",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, quantile=0.99, always_ram=True)
    ),
)

HNSW Tuning Parameters

Parameter	Effect	Trade-off
`m` (connections per node)	Higher = better recall	More memory
`ef_construction`	Higher = better index quality	Slower indexing
`ef` (search)	Higher = better recall at query time	Slower queries

Start with m=16, ef_construction=200, ef=100 and tune from there.

Caching Embeddings

Avoid re-embedding the same text repeatedly — cache results by content hash:

import hashlib

_cache: dict[str, list[float]] = {}

def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)
    return _cache[key]

For production, back the cache with Redis or a persistent key-value store.

Security Considerations

Multi-Tenancy

Never mix vectors from different users or tenants in the same namespace without isolation:

# Use per-tenant namespaces (Pinecone) or collections (Qdrant/Chroma)
index.upsert(vectors=vectors, namespace=f"tenant_{user_id}")
results = index.query(vector=query_vec, top_k=10, namespace=f"tenant_{user_id}")

Metadata Sanitization

Metadata is returned directly to callers — never store sensitive PII in vector metadata:

# Bad: storing sensitive data in metadata
payload = {"user_email": "alice@example.com", "ssn": "123-45-6789", "text": chunk}

# Good: store only non-sensitive, lookupable fields
payload = {"document_id": "doc_42", "source": "handbook", "section": "onboarding"}

Access Control

Vector databases themselves have limited built-in RBAC. Enforce access at the application layer:

Validate user permissions before issuing queries
Use separate collections or namespaces per access level
Rotate API keys regularly and store them in a secrets manager (never in code)

Prompt Injection via Retrieved Content

Malicious content in your vector store can hijack your RAG pipeline:

# Sanitize retrieved chunks before inserting into prompts
import html

def safe_context(chunks: list[str]) -> str:
    sanitized = [html.escape(c) for c in chunks]
    return "\n\n---\n\n".join(sanitized)

Consider adding a content moderation step when indexing user-generated content.

Real-World Examples

Example 1: Semantic Document Search

from pathlib import Path
from openai import OpenAI
import chromadb

openai_client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_store")
collection = chroma.get_or_create_collection("documents")

def ingest_folder(folder: str):
    for path in Path(folder).rglob("*.txt"):
        text = path.read_text()
        chunks = [text[i:i+500] for i in range(0, len(text), 450)]
        for i, chunk in enumerate(chunks):
            vec = openai_client.embeddings.create(
                model="text-embedding-3-small", input=chunk
            ).data[0].embedding
            collection.upsert(
                ids=[f"{path.stem}_{i}"],
                embeddings=[vec],
                documents=[chunk],
                metadatas=[{"file": path.name, "chunk": i}],
            )

def search(query: str, k: int = 5):
    vec = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    return collection.query(query_embeddings=[vec], n_results=k)

ingest_folder("./my_docs")
results = search("quarterly revenue targets")
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(f"[{meta['file']}] {doc[:120]}...")

Example 2: Image Similarity Search

from PIL import Image
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(image_path: str) -> list[float]:
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features[0].numpy().tolist()

def embed_text_clip(text: str) -> list[float]:
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features[0].numpy().tolist()

# Index images into Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

qdrant = QdrantClient(":memory:")
qdrant.create_collection("images", vectors_config=VectorParams(size=512, distance=Distance.COSINE))

image_paths = ["cat.jpg", "dog.jpg", "car.jpg"]
qdrant.upsert("images", points=[
    PointStruct(id=i, vector=embed_image(p), payload={"path": p})
    for i, p in enumerate(image_paths)
])

# Text-to-image search
results = qdrant.search("images", query_vector=embed_text_clip("a fluffy animal"), limit=3)
for r in results:
    print(r.payload["path"], r.score)

Example 3: Duplicate Detection

Find near-duplicate records in a dataset using vector similarity:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

records = [
    "Customer complaint: product arrived broken",
    "Feedback: item was damaged on delivery",
    "New feature request: dark mode support",
    "Bug report: app crashes on startup",
    "Issue: received a broken product",
]

embeddings = model.encode(records, normalize_embeddings=True)
similarity_matrix = np.dot(embeddings, embeddings.T)

threshold = 0.85
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if similarity_matrix[i][j] >= threshold:
            score = float(similarity_matrix[i][j])
            print(f"Duplicate (score={score:.2f}):")
            print(f"  A: {records[i]}")
            print(f"  B: {records[j]}\n")

Table of Contents