#vector-database #rag #llm #embeddings #postgresql

Vector Databases Compared: pgvector vs Pinecone vs Weaviate

BackendBytes Engineering Team

Mar 2, 2026

19 min read

Vector Databases Compared: pgvector vs Pinecone vs Weaviate

Part of Series: AI Engineering in Production

Lesson 2 of 6

Prev Next

Key Takeaways

→pgvector HNSW index rebuilds block writes — one team's routine REINDEX turned into an 8-hour ingestion outage
→Choose by infrastructure fit: already have Postgres → pgvector; need zero-ops → Pinecone; want hybrid BM25+vector → Weaviate; 100M+ vectors → Milvus
→Switching embedding models requires re-embedding your entire corpus — use dual-index migration with a feature flag, not a maintenance window
→Under 100K vectors with infrequent queries, brute-force NumPy is fine — you don't need a vector database until you need sub-100ms latency at 1M+ vectors

A plain in-place REINDEX INDEX on a 12M-row pgvector table can stall a RAG ingestion pipeline for hours. A non-concurrent HNSW rebuild takes an AccessExclusiveLock^{[PostgreSQL — Explicit Locking]} on the index — SELECTs served from cached plans stay green, so the dashboards look fine, but the INSERTs and UPDATEs writing new embeddings block and the tickets start arriving: "Why can't the chatbot see the document I just uploaded?" The fix — REINDEX INDEX CONCURRENTLY, whose ShareUpdateExclusiveLock lets writes through, or a build-new-and-swap — is exactly the kind of thing a feature matrix never tells you.

The point: vector databases have operational characteristics no feature comparison captures. Each one trades off scalability, cost, and operational burden differently, and the sharp edges only surface under production load. This guide compares them with real code, the documented performance envelopes, and the operational gotchas — anchored to each database's own documentation rather than vibes.

Quick Take

Choose based on infrastructure fit: already have Postgres → start with pgvector (zero new infrastructure); need zero-ops managed service → Pinecone; want hybrid BM25+vector search → Weaviate; have 100M+ vectors at high QPS → Milvus. Don't migrate prematurely; measure first.

pgvector wins if you already run PostgreSQL and don't need separate scaling
Pinecone wins if you want fully managed with no infrastructure overhead
Weaviate wins if users need both keyword precision and semantic understanding
Milvus wins if you have 100M+ vectors and need thousands of QPS

The quick start: Vector DB comparison matrix

Choose based on your infrastructure and constraints, not by checking feature boxes:

Dimension	pgvector	Pinecone	Weaviate	Milvus
Type	PostgreSQL extension	Fully managed SaaS	Open-source DB	Open-source distributed DB
Best for	Already use Postgres	Zero-ops, fast prototyping	Hybrid BM25+vector search	100M+ vectors, 1000+ QPS
Deployment	Your existing Postgres	API endpoint only	Self-hosted or managed	Self-hosted or Zilliz Cloud
Scaling	Read replicas / Citus	Auto-scales, no tuning	Native sharding	Distributed (query, data, index nodes)
Hybrid search	Manual SQL join	Sparse-dense in one query	Native BM25 + alpha tuning	Sparse + dense vectors
ACID / SQL joins	Yes (full Postgres)	No	No	No
Max vectors (single node)	~10M well-tuned	N/A (managed)	~100M	Unlimited (distributed)
Index build time (1M vectors)	8–15 min	Seconds	3–5 min	2–4 min
Cold start latency	None	2–5 sec (serverless)	None	None
Cost (10M vectors, 100 QPS)	$50–450/mo self-hosted; $25–200/mo managed	$100–350/mo	$250–680/mo	$200–900/mo
Operational overhead	Moderate (index rebalancing)	None	Medium (sharding ops)	High (etcd, MinIO, multi-node)
Latest version (mid-2026)	pgvector 0.8.2	Serverless (managed)	Weaviate 1.37	Milvus 2.6.x

The cost row blends self-hosted infra and managed-tier list prices into an order-of-magnitude band — treat it as a planning estimate, not a quote. The managed pricing models diverge in ways the band hides, and all four moved in the last year: Pinecone serverless bills usage as storage ($0.33/GB-month), write units ($4–$4.50/M), and read units ($16–$18/M), with plan minimums of $0 (Starter), $20/mo (the newer Builder tier), $50/mo (Standard), and $500/mo (Enterprise). Weaviate Cloud moved to a vector-dimensions-stored model in October 2025: a free tier, then Flex from $45/mo, Plus from $280/mo, and Premium from $400/mo. Milvus is self-hosted (Apache-2.0); the managed path is Zilliz Cloud, whose serverless storage dropped to $0.04/GB-month in January 2026. Re-price against each vendor's calculator before committing — list prices change quarterly.

Two databases not broken out above are worth knowing in 2026. Qdrant (Rust, latest 1.18) competes directly with the four here: native payload filtering, a free managed tier (0.5 vCPU / 1 GB RAM), and resource-based Cloud pricing. LanceDB ($30M Series A in mid-2025) takes a different shape — an embedded, serverless engine on the columnar Lance format, aimed at multimodal and lakehouse-scale data rather than a long-running server. If your fit is "Postgres-adjacent but standalone," evaluate Qdrant; if it is "embedded, file-based, multimodal," evaluate LanceDB.

When you DON'T need a vector database: under 100K vectors with infrequent queries, brute-force NumPy is fine. You need an index when sub-100ms latency at 1M+ vectors with concurrent users matters.

pgvector: Start here if you have PostgreSQL

^{[pgvector README]}

pgvector adds a vector column type and HNSW indexes to Postgres. If you already run Postgres, you get vector search with ACID transactions and SQL joins — no new infrastructure.

CREATE EXTENSION IF NOT EXISTS vector;
 
CREATE TABLE documents (
    id         BIGINT PRIMARY KEY,
    content    TEXT NOT NULL,
    metadata   JSONB DEFAULT '{}',
    embedding  vector(1536),  -- OpenAI text-embedding-3-small dimensions
    created_at TIMESTAMPTZ DEFAULT now()
);
 
-- HNSW index: m=16, ef_construction=200 is production sweet spot
CREATE INDEX idx_embedding ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);
 
-- Vector search with metadata filters and SQL joins
SELECT d.id, d.content, 1 - (d.embedding <=> $1::vector) AS similarity
FROM documents d
JOIN categories c ON d.category_id = c.id
WHERE c.slug = 'architecture'
  AND d.created_at > now() - INTERVAL '90 days'
ORDER BY d.embedding <=> $1::vector
LIMIT 10;

The <=> operator is cosine distance. You get full SQL expressiveness — WHERE clauses, JOINs, window functions — all with indexes. HNSW tuning matters: m=16 balances memory and recall, and raising m to 32 increases memory substantially for diminishing recall gains — the classic HNSW connectivity-vs-memory trade. Validate the exact recall delta on your own vectors, since it's dataset-dependent. Higher-dimensional vectors and larger collections push memory hard; plan for large-RAM instances at multi-million-vector scale. See ^{[pgvector README]} for the full HNSW parameter reference.

The release worth knowing about is 0.8.0 (October 2024), carried through the current 0.8.2: it added iterative index scans (SET hnsw.iterative_scan = 'relaxed_order') plus improved cost estimation for filtered queries. Before 0.8.0, a selective WHERE clause on top of an HNSW scan could "overfilter" — the index returned its ef_search candidates, the filter discarded most of them, and you got back fewer than LIMIT rows. Iterative scans keep walking the index until the filtered result set is full, which makes the metadata-filtered queries above behave the way you'd expect. If you're on a pre-0.8.0 build (some managed Postgres lagged for months), upgrading is the single highest-leverage change for filtered RAG queries.

The sharp edge — the one from this article's opening: a plain, in-place REINDEX INDEX takes an AccessExclusiveLock, which blocks INSERTs, UPDATEs, and index-using reads until it finishes (per PostgreSQL's lock-conflict table^{[PostgreSQL — Explicit Locking]}). The CONCURRENTLY variant instead takes a ShareUpdateExclusiveLock — which does not conflict with the RowExclusiveLock that writes hold, so writes keep flowing — but it runs two passes and is much slower. Index builds on large HNSW tables are not quick either way, so a naive in-place rebuild that looks harmless can stall ingestion for a long window. Mitigation: use REINDEX INDEX CONCURRENTLY, or build into a new index and swap with the CREATE → DROP → RENAME pattern instead of rebuilding in place, and time any rebuild against a measured build rate on your own data.

Pinecone: Zero-ops at massive scale

^{[Pinecone documentation]}

Pinecone is a fully managed SaaS. No infrastructure to provision — just an API endpoint. Scales automatically to billions of vectors.

from pinecone import Pinecone
 
pc = Pinecone(api_key="your-api-key")
index = pc.Index("documents")
 
# Upsert vectors with metadata
index.upsert(vectors=[
    {"id": "doc-001", "values": embedding, "metadata": {"category": "arch"}},
    {"id": "doc-002", "values": embedding_2, "metadata": {"category": "devops"}},
])
 
# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"category": {"$eq": "arch"}},
    include_metadata=True
)

The tradeoff: serverless cold-starts after idle can eat several seconds before the first query returns (documented behavior of Pinecone's serverless tier ^{[Pinecone Serverless Docs]}). The serverless model bills three dimensions independently — storage ($0.33/GB-month), write units ($4–$4.50/M), and read units ($16–$18/M) — on top of plan minimums ($0 Starter, $20/mo Builder, $50/mo Standard, $500/mo Enterprise) ^{[Pinecone documentation]}. Because read units are priced far above storage, the bill scales with query volume rather than corpus size: read-heavy RAG traffic, not stored vectors, is what moves it. Model your own spend from those three rates before committing — the dominant term is almost always reads. Use namespaces for bulk deletes (instant, O(1)); avoid individual filter-based deletes.

Wins when: you want zero-ops and can accept cold-start latency; your team has no appetite for database operations. Struggles when: cost matters at massive scale; vendor lock-in concerns; you need SQL-grade filtering.

Weaviate: Native hybrid search (BM25 + vector)

^{[Weaviate hybrid search docs]}

Weaviate's killer feature: native BM25 + vector hybrid search in a single query with tunable alpha (0.0 = pure keyword, 1.0 = pure vector) — see ^{[Weaviate Hybrid Search Docs]} for the full alpha semantics. Workloads that mix factual lookup with semantic retrieval usually favor a vector-leaning blend over either extreme, but the right alpha is query-dependent — sweep it against a labelled set of your own queries rather than adopting a default blind.

import weaviate
client = weaviate.connect_to_local()
documents = client.collections.get("Document")
 
response = documents.query.hybrid(
    query="distributed consensus",
    alpha=0.75,  # 75% vector, 25% keyword
    limit=10,
)

Pure vector search misses exact keyword matches (searching "OAuth2 PKCE" drifts toward "authentication"). Hybrid search fixes this.

Two 2026 changes matter for sizing. Recent Weaviate releases ship rotational quantization (RQ) and turn vector compression on by default, which is what keeps the per-million-dimension storage cost in check at scale. And Weaviate Cloud's billing model changed in October 2025 from the old per-resource Serverless tier to a vector-dimensions-stored model (free tier, then Flex from $45/mo) ^{[Weaviate hybrid search docs]} — budget against dimensions × objects × replication factor, not raw vector count, and re-check the calculator if you priced this before late 2025.

Wins when: users need both semantic understanding and keyword precision. Struggles when: operational overhead (shard rebalancing is not transparent).

Milvus: Maximum throughput at 100M+ vectors

Open-source, built for massive scale. Distributed architecture (query, data, index nodes) lets each layer scale independently. Purpose-built for hundreds of millions of vectors at thousands of QPS.

from pymilvus import MilvusClient, CollectionSchema, FieldSchema, DataType
 
client = MilvusClient(uri="http://localhost:19530")
 
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536),
]
client.create_collection(collection_name="documents", schema=CollectionSchema(fields))
 
results = client.search(collection_name="documents", data=[query_embedding], limit=10)

The current stable line is Milvus 2.6 (GA mid-2025), which leans hard into cost reduction at billion scale — its headline feature is RaBitQ 1-bit quantization that compresses an index to roughly 1/32 of its float size, paired with SQ8 reranking to hold recall. (A 3.0 line with multi-vector and geospatial types began landing in 2026; 2.6 remains the production-stable choice.) The cost lever at this scale is quantization, not node count.

Sharp edge: etcd is the most fragile component. A platform colocated etcd on query nodes; traffic spike starved it of CPU, lost quorum, and Milvus stopped ingesting. Fix: dedicate machines for etcd.

Wins when: > 100M vectors at 1000+ QPS; your team can operate a distributed system. Struggles when: operational overhead exceeds value for small datasets.

Embedding model migrations: Dual-index strategy

Every team eventually swaps models — ada-002 → text-embedding-3-small, or to self-hosted nomic-embed-text to cut cost. Embeddings from different models are incompatible — you cannot mix them in one index. Re-embedding 5M documents naively (stop ingestion, re-embed, swap) takes 28 hours and leaves the system down or stale.

The dual-index migration sequence — zero downtime, reversible until the very last step:

graph TD
    Start[Production on model v1<br/>idx_v1 hot] --> Step1[Step 1<br/>ALTER TABLE add embedding_v2 column<br/>CREATE INDEX CONCURRENTLY idx_v2]
    Step1 --> Step2[Step 2<br/>Backfill embedding_v2<br/>via batched workers + replication-lag throttling]
    Step2 --> Step3[Step 3<br/>Dual-write: writes update both v1 and v2<br/>reads still hit v1]
    Step3 --> Step4[Step 4<br/>Feature-flag cutover<br/>5 percent of reads → v2<br/>compare quality metrics]
    Step4 --> Quality{Quality<br/>regression?}
    Quality -->|Yes| Rollback[Flip flag back to v1<br/>investigate offline]
    Quality -->|No| Step5[Step 5<br/>Ramp 5 → 25 → 50 → 100 percent<br/>over 1-2 weeks]
    Step5 --> Step6[Step 6<br/>Stop dual-write<br/>drop idx_v1 + embedding_v1 column]
    Rollback --> Step3
    style Step6 fill:#dfd
    style Rollback fill:#ffd

The diagram captures the safety property: every step before Step 6 is reversible. The cutover is gradual, the rollback is one feature-flag flip, and you never have a window where queries hit a half-built index.

Use dual indexes with a feature-flag cutover instead:

Step 1: Add a second embedding column (pgvector) or namespace (Pinecone); build the index concurrently.

ALTER TABLE documents ADD COLUMN embedding_v3 vector(1536);
CREATE INDEX CONCURRENTLY idx_documents_embedding_v3small
    ON documents USING hnsw (embedding_v3 vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

Step 2: Dual-write new documents; embed with both models.

def upsert(doc_id: str, text: str):
    ada_emb = embed_client.embed(text, model="text-embedding-ada-002")
    v3_emb = embed_client.embed(text, model="text-embedding-3-small")
 
    conn.execute(
        "INSERT INTO documents (id, embedding, embedding_v3) VALUES (%s, %s, %s)",
        (doc_id, ada_emb, v3_emb)
    )

Step 3: Backfill existing docs (restartable via WHERE embedding_v3 IS NULL); validate recall parity; flip feature flag to serve from new index.

Step 4: Monitor query latency and recall; if parity holds after 1 week, drop the old embedding column.

Dimension Changes

Changing dimensions (1536 → 3072) requires a new column — PostgreSQL vector is fixed-dimension per column. On Pinecone, a new index. Plan dimension changes as a separate migration from model changes.

Production checklist

Before going live with any vector DB:

Measure baseline performance — p50, p99, recall@10 on your actual data and query patterns. Published benchmarks use synthetic data.
Plan index rebuild windows — HNSW rebuilds block writes. For pgvector: use CREATE → DROP → RENAME pattern. For Pinecone/Weaviate: automatic but plan shard rebalancing.
Monitor index health — Track index bloat (pgvector), shard imbalance (Weaviate/Milvus), cold-start latency (Pinecone).
Budget for embedding migrations — OpenAI's text-embedding-3-small costs $0.02 / 1M tokens. Re-embedding 10M docs at 500 tokens each = $100. Plan model migrations with dual-index strategy.
Test with your actual scale — Don't migrate prematurely. A well-tuned pgvector instance comfortably serves multi-million-vector workloads at low-latency p99 on adequate RAM, and many sub-billion-vector deployments never need to leave it. Confirm the ceiling on your own data and query mix before assuming you've outgrown it — the migration is rarely as urgent as it feels.

pgvector index health: the queries you run before users notice

Two queries that surface most pgvector production issues — bloat-driven recall regressions and the lock-contention failure mode behind the "REINDEX took the chatbot offline" incident this article opened with:

-- 1. Index size, bloat estimate, and last-VACUUM timestamp. Run weekly; if
--    pg_relation_size grows by >50% without matching row growth, schedule
--    REINDEX CONCURRENTLY in the next maintenance window.
SELECT
    i.relname                                                    AS index_name,
    pg_size_pretty(pg_relation_size(i.oid))                      AS index_size,
    s.n_live_tup                                                 AS table_rows,
    s.last_vacuum,
    s.last_autovacuum
FROM pg_class i
JOIN pg_index x          ON x.indexrelid = i.oid
JOIN pg_class t          ON t.oid = x.indrelid
JOIN pg_stat_user_tables s ON s.relid = t.oid
WHERE i.relkind = 'i'
  AND pg_get_indexdef(i.oid) ILIKE '%USING hnsw%'
ORDER BY pg_relation_size(i.oid) DESC;
 
-- 2. Live maintenance/blocking-lock detector. Paste during an incident:
--    mode='AccessExclusiveLock' on a vector index means a plain in-place
--    REINDEX is blocking your write path right now; mode='ShareUpdateExclusiveLock'
--    means a CONCURRENTLY reindex/VACUUM is running (slow, but writes still flow).
--    Rows with granted=false are sessions already stuck waiting on a lock.
SELECT
    a.pid,
    a.usename,
    a.query_start,
    now() - a.query_start                                        AS waiting_for,
    l.mode,
    l.relation::regclass                                         AS blocked_object,
    a.query
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE NOT l.granted
   OR l.mode IN ('ShareUpdateExclusiveLock', 'AccessExclusiveLock')
ORDER BY a.query_start;

The first query is your "is bloat eating recall?" check; the second is your "why are inserts hanging right now?" check. Together they cover ~90% of pgvector incidents. ^{[pgvector README]}

Decision Framework

Start from your existing infrastructure, not from a feature matrix:

┌─────────────────────────────────────┐
│     Already running PostgreSQL?     │
└──────────────┬──────────────────────┘
               │
          ┌────▼────┐
          │   Yes   │──────────────────▶  Start with pgvector
          └─────────┘                     (zero new infrastructure)
               │ No                              │
               ▼                                 │
     ┌─────────────────┐                         │
     │ Need hybrid      │                        ▼
     │ search (BM25 +   │               Dataset > 10M vectors
     │ vector)?         │               or need > 1000 QPS?
     └────────┬────────┘                         │
              │                             ┌────▼────┐
         ┌────▼────┐                        │   Yes   │──▶ Migrate to
         │   Yes   │──▶ Weaviate            └─────────┘    dedicated DB
         └─────────┘                             │ No
              │ No                               │
              ▼                                  ▼
     ┌─────────────────┐                  Stay on pgvector
     │ Want zero-ops    │
     │ managed service? │
     └────────┬────────┘
              │
         ┌────▼────┐
         │   Yes   │──▶ Pinecone
         └─────────┘
              │ No
              ▼
     ┌─────────────────┐
     │ Need max QPS on  │
     │ 100M+ vectors?   │
     └────────┬────────┘
              │
         ┌────▼────┐
         │   Yes   │──▶ Milvus (distributed)
         └─────────┘
              │ No
              ▼
       Start with pgvector
       (simplest default)

The same routing in Mermaid form — keep this as the cheat-sheet for "which vector database":

graph TD
    Start[Need vector search] --> PG{Already running<br/>PostgreSQL?}
    PG -->|Yes| Hybrid1{Need hybrid search<br/>BM25 plus vector?}
    PG -->|No| Hybrid2{Need hybrid search<br/>BM25 plus vector?}
    Hybrid1 -->|Yes| Wv1[Weaviate]
    Hybrid1 -->|No| Scale1{Over 10M vectors<br/>or over 1000 QPS?}
    Scale1 -->|No| PGv[pgvector<br/>zero new infrastructure]
    Scale1 -->|Yes| Ops1{Want zero-ops<br/>managed service?}
    Ops1 -->|Yes| Pine[Pinecone]
    Ops1 -->|No| Throughput1{Need max QPS on<br/>over 100M vectors?}
    Throughput1 -->|Yes| Mil[Milvus<br/>distributed]
    Throughput1 -->|No| PGv
    Hybrid2 -->|Yes| Wv1
    Hybrid2 -->|No| Ops1
    style PGv fill:#dfd
    style Wv1 fill:#dfd
    style Pine fill:#ffd
    style Mil fill:#ffd

Start with pgvector, measure, and migrate only when you hit a concrete limitation — not a theoretical one.

Abstract your vector store interface

Build migration flexibility from day one. Use an abstraction so swapping backends is a new class, not a pipeline rewrite:

from abc import ABC, abstractmethod
from dataclasses import dataclass
 
@dataclass
class SearchResult:
    id: str
    score: float
    metadata: dict
 
class VectorStore(ABC):
    @abstractmethod
    def upsert(self, id: str, vector: list[float], metadata: dict) -> None: ...
 
    @abstractmethod
    def search(self, vector: list[float], top_k: int = 10, filter: dict = None) -> list[SearchResult]: ...
 
    @abstractmethod
    def delete(self, ids: list[str]) -> None: ...

Implement pgvector, then Pinecone, then Weaviate. A config flag selects which backend to use. When you outgrow pgvector, you swap implementations without rewriting your application logic.

Hybrid retrieval: BM25 + vector + cross-encoder rerank

Pure vector search loses on rare tokens — product SKUs, error codes, library names, version numbers. BM25 still owns those. The production-grade pattern is a three-stage funnel: a wide BM25 + dense recall pass over the corpus, a fusion step that merges the two ranked lists, then a cross-encoder reranker that re-scores the top 50–100 candidates with a much stronger model. The reranker is too expensive to apply to the full index, but cheap enough to run on a short shortlist, and it consistently lifts NDCG@10 by 5–15 points over either signal alone.

In Postgres you can build the hybrid stage without leaving the database. Add a tsvector column alongside the embedding, run BM25-style ranking with ts_rank_cd, and combine the two scores with Reciprocal Rank Fusion (RRF) — which is robust to score-scale mismatch in a way that simple weighted sums are not.

-- One CTE per signal, fused by rank position (k = 60 is the standard RRF
-- constant). RRF avoids having to normalise BM25 against cosine distance.
WITH bm25 AS (
    SELECT id,
           ts_rank_cd(content_tsv, plainto_tsquery('english', $1)) AS score,
           row_number() OVER (ORDER BY ts_rank_cd(content_tsv,
               plainto_tsquery('english', $1)) DESC) AS rnk
    FROM documents
    WHERE content_tsv @@ plainto_tsquery('english', $1)
    LIMIT 100
),
dense AS (
    SELECT id,
           1 - (embedding <=> $2::vector) AS score,
           row_number() OVER (ORDER BY embedding <=> $2::vector) AS rnk
    FROM documents
    ORDER BY embedding <=> $2::vector
    LIMIT 100
)
SELECT d.id, d.content,
       COALESCE(1.0 / (60 + b.rnk), 0) + COALESCE(1.0 / (60 + v.rnk), 0) AS rrf
FROM documents d
LEFT JOIN bm25  b ON b.id = d.id
LEFT JOIN dense v ON v.id = d.id
WHERE b.id IS NOT NULL OR v.id IS NOT NULL
ORDER BY rrf DESC
LIMIT 50;

Feed the 50 RRF survivors into a cross-encoder — a bge-reranker-v2-m3 or cohere-rerank-3 call — and return the top 10. Budget the reranker call at ~80 ms p99 for 50 candidates on a single GPU; on CPU, cap the shortlist at 25.

Embedding model migration playbook

A complete dual-index migration looks like a four-week rollout, not a weekend job. The realistic timeline for a 10M-document corpus on text-embedding-3-small:

Week 1 — shadow build. Add the new vector column, build the index CONCURRENTLY, enable dual-write so every fresh document hits both columns. No read traffic touches the new index yet. Smoke-test the embedding client at 100 docs/sec sustained.
Week 2 — backfill. Run batched workers (1k docs per batch, 4–8 workers) to embed historical rows. At an OpenAI rate of roughly 3k requests/min, 10M docs takes ~55 hours of wall-clock. Throttle on replication lag — pause the workers if pg_replication_slots.confirmed_flush_lsn falls more than 30 seconds behind primary.
Week 3 — shadow eval. Mirror 5% of read traffic to both indexes, log the top-10 ID sets, and compute Jaccard overlap and recall@10 against a held-out judgment set. Block the cutover if recall regresses by more than 2 absolute points.
Week 4 — ramp and drop. Flip the feature flag from 5% → 25% → 50% → 100% over four days, watching p99 latency and answer-quality eval scores. Wait one full week with 100% traffic on the new index before dropping the old column and reclaiming disk with VACUUM FULL during an off-peak window. ^{[pgvector README]}

Budget the dollars up front: re-embedding 10M docs at 500 tokens each is 5B tokens; at $0.02 per 1M tokens, that is $100 of compute plus whatever your egress and worker hours cost. The expensive part is the eval judgment set, not the embeddings.

Capacity planning: memory and disk per million 1536-dim vectors

Three formulas cover 90% of pgvector capacity questions. For 1536-dimension float32 vectors with HNSW at m = 16: ^{[pgvector README]}

Raw vector bytes per row = dim * 4 = 1536 × 4 = 6,144 B (~6 KB).
HNSW index bytes per row ≈ dim * 4 + m * 2 * 8 + 32 = 6144 + 256 + 32 = 6,432 B. Roughly the same as raw storage — HNSW essentially doubles your footprint.
Heap row overhead = ~28 B (tuple header) + content TEXT toasted out-of-line + JSONB metadata. Budget 200 B per row for typical RAG payloads.

Interactive Vector Memory & DB Estimator

RAM Calculator

Vector Count: 1,000,00010k to 10M

Dimensions (1536)

Index Type

Quantisation

Estimated Memory Footprint

0.0 MBTotal RAM

Raw payload: 0.0 MB

Index overhead: 0.0 MB

Self-Hosted or Cloud Dedicated

Recommended: Weaviate / Qdrant

For medium-to-large datasets requiring high search precision without quantisation, Weaviate or Qdrant are excellent. They are optimized for HNSW and provide sub-10ms search times with rich metadata filtering.

Developer Tip:For Qdrant, enable memmap for the HNSW index to keep graph links on disk if memory is tight, or use scalar quantisation.

Worked example for 5M documents with 1.5 KB of average text content per row:

Raw vectors:        5_000_000 * 6144   = 30.7 GB
HNSW index:         5_000_000 * 6432   = 32.2 GB
Heap (text+meta):   5_000_000 * 1700   =  8.5 GB
TOAST + WAL slack:                       ~10 GB
                                        --------
Total disk:                              ~81 GB
 
Working set in RAM (HNSW must fit hot):
  index pages           ~32 GB
  Postgres shared_buffers (25% RAM)
  OS page cache (25% RAM)
                                        --------
  Right-sized instance:    96-128 GB RAM

Rule of thumb: provision RAM ≥ 1.3 × HNSW index size so the entire index plus a working set of heap pages stays resident. Spilling HNSW to disk turns a 5 ms p99 query into a 200 ms one — cold pages on NVMe are 40× slower than RAM. Quantizing to halfvec (float16) cuts vector and index size in half at the cost of ~1 point of recall@10; quantizing to bit (binary) cuts another 4× but only works as a first-stage filter before a full-precision rerank.

Frequently Asked Questions

Should I use pgvector or a dedicated vector database?

Use pgvector if you already run PostgreSQL and need ACID, SQL joins, and want to avoid a new infrastructure dependency. Choose Pinecone for zero-ops at billions of vectors; Weaviate for hybrid BM25+vector search; Milvus for 100M+ vectors at high QPS.

What is HNSW and why does it matter?

HNSW (Hierarchical Navigable Small World) is the dominant ANN index algorithm. It gives sub-millisecond query times but requires significant memory and has expensive index builds — plan rebuild windows carefully in production.

How do you migrate embedding models without downtime?

Use a dual-index strategy: add a new embedding column, dual-write new documents to both old and new indexes, backfill existing docs in batches, validate recall parity, then swap traffic with a feature flag. Never switch models without re-embedding.

How many vectors can pgvector handle?

pgvector handles 5–10M vectors well on a single instance with tuned HNSW (m=16 uses ~12GB RAM per million 1536-dim vectors). Beyond 10M vectors or for sub-10ms p99 at high concurrency, migrate to a distributed option (Milvus, Pinecone) or add Citus sharding.

Keep Reading

Building Production RAG Pipelines in Go — Chunking, embeddings, pgvector operations, and retrieval evaluation for the pipeline that feeds your vector store
Spring AI in Production — The Java/Spring equivalent: RAG pipelines with pgvector, circuit breakers, and quality evaluation
Database Indexing Strategies — B-tree, GIN, and GiST index internals for the metadata indexes that sit alongside your HNSW vectors in pgvector
LLM API Integration Patterns — Token budgets, retries, and streaming for the embedding-API client that fills your vector index
Securing AI Agent Infrastructure — Tenant isolation in pgvector / Pinecone / Weaviate so a poisoned RAG corpus does not leak across customers

Coming Next

Coming Next: Building Production RAG Pipelines in Go

Selecting the right vector database is only the first step in building a resilient retrieval pipeline. In our next deep dive, we walk through constructing a production-grade RAG pipeline in Go, detailing chunking strategies, embedding requests, hybrid search execution, and retrieval evaluation metrics. Read the RAG Pipelines Guide.

Was this article helpful?

Your feedback directly shapes our editorial depth and technical accuracy.