Vector Databases Compared: pgvector vs Pinecone vs Weaviate
Key Takeaways
- →pgvector HNSW index rebuilds block writes — one team's routine REINDEX turned into an 8-hour ingestion outage
- →Choose by infrastructure fit: already have Postgres → pgvector; need zero-ops → Pinecone; want hybrid BM25+vector → Weaviate; 100M+ vectors → Milvus
- →Switching embedding models requires re-embedding your entire corpus — use dual-index migration with a feature flag, not a maintenance window
- →Under 100K vectors with infrequent queries, brute-force NumPy is fine — you don't need a vector database until you need sub-100ms latency at 1M+ vectors
A routine Thursday
REINDEX INDEX CONCURRENTLYon a 12M-row pgvector table turned into an eight-hour incident. HNSW index builds take aShareUpdateExclusiveLock—SELECTs keep working, butINSERTandUPDATEare blocked. The RAG ingestion pipeline stalled, and tickets arrived: "Why can't the chatbot see the document I just uploaded?"
The lesson: vector databases have operational characteristics no feature matrix captures. We've deployed all of these at scale — pgvector, Pinecone, Weaviate, Milvus — and learned the sharp edges the hard way. Each DB trades off scalability, cost, and operational burden differently. This guide compares them with real code, benchmarks, and the operational gotchas that only surface at production scale.
Choose based on infrastructure fit: already have Postgres → start with pgvector (zero new infrastructure); need zero-ops managed service → Pinecone; want hybrid BM25+vector search → Weaviate; have 100M+ vectors at high QPS → Milvus. Don't migrate prematurely; measure first.
- pgvector wins if you already run PostgreSQL and don't need separate scaling
- Pinecone wins if you want fully managed with no infrastructure overhead
- Weaviate wins if users need both keyword precision and semantic understanding
- Milvus wins if you have 100M+ vectors and need thousands of QPS
The quick start: Vector DB comparison matrix
Choose based on your infrastructure and constraints, not by checking feature boxes:
| Dimension | pgvector | Pinecone | Weaviate | Milvus |
|---|---|---|---|---|
| Type | PostgreSQL extension | Fully managed SaaS | Open-source DB | Open-source distributed DB |
| Best for | Already use Postgres | Zero-ops, fast prototyping | Hybrid BM25+vector search | 100M+ vectors, 1000+ QPS |
| Deployment | Your existing Postgres | API endpoint only | Self-hosted or managed | Self-hosted or Zilliz Cloud |
| Scaling | Read replicas / Citus | Auto-scales, no tuning | Native sharding | Distributed (query, data, index nodes) |
| Hybrid search | Manual SQL join | Sparse-dense in one query | Native BM25 + alpha tuning | Sparse + dense vectors |
| ACID / SQL joins | Yes (full Postgres) | No | No | No |
| Max vectors (single node) | ~10M well-tuned | N/A (managed) | ~100M | Unlimited (distributed) |
| Index build time (1M vectors) | 8–15 min | Seconds | 3–5 min | 2–4 min |
| Cold start latency | None | 2–5 sec (serverless) | None | None |
| Cost (10M vectors, 100 QPS) | $50–450/mo self-hosted; $25–200/mo managed | $100–350/mo | $250–680/mo | $200–900/mo |
| Operational overhead | Moderate (index rebalancing) | None | Medium (sharding ops) | High (etcd, MinIO, multi-node) |
| Latest version (mid-2026) | pgvector 0.8.2 | Serverless (managed) | Weaviate 1.37 | Milvus 2.6.x |
The cost row blends self-hosted infra and managed-tier list prices into an order-of-magnitude band — treat it as a planning estimate, not a quote. The managed pricing models diverge in ways the band hides, and all four moved in the last year: Pinecone serverless bills usage as storage ($0.33/GB-month), write units ($4–$4.50/M), and read units ($16–$18/M), with plan minimums of $0 (Starter), $20/mo (the newer Builder tier), $50/mo (Standard), and $500/mo (Enterprise). Weaviate Cloud moved to a vector-dimensions-stored model in October 2025: a free tier, then Flex from $45/mo, Plus from $280/mo, and Premium from $400/mo. Milvus is self-hosted (Apache-2.0); the managed path is Zilliz Cloud, whose serverless storage dropped to $0.04/GB-month in January 2026. Re-price against each vendor's calculator before committing — list prices change quarterly.
Two databases not broken out above are worth knowing in 2026. Qdrant (Rust, latest 1.18) competes directly with the four here: native payload filtering, a free managed tier (0.5 vCPU / 1 GB RAM), and resource-based Cloud pricing. LanceDB ($30M Series A in mid-2025) takes a different shape — an embedded, serverless engine on the columnar Lance format, aimed at multimodal and lakehouse-scale data rather than a long-running server. If your fit is "Postgres-adjacent but standalone," evaluate Qdrant; if it is "embedded, file-based, multimodal," evaluate LanceDB.
When you DON'T need a vector database: under 100K vectors with infrequent queries, brute-force NumPy is fine. You need an index when sub-100ms latency at 1M+ vectors with concurrent users matters.
pgvector: Start here if you have PostgreSQL
[pgvector README]pgvector adds a vector column type and HNSW indexes to Postgres. If you already run Postgres, you get vector search with ACID transactions and SQL joins — no new infrastructure.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id BIGINT PRIMARY KEY,
content TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
embedding vector(1536), -- OpenAI text-embedding-3-small dimensions
created_at TIMESTAMPTZ DEFAULT now()
);
-- HNSW index: m=16, ef_construction=200 is production sweet spot
CREATE INDEX idx_embedding ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Vector search with metadata filters and SQL joins
SELECT d.id, d.content, 1 - (d.embedding <=> $1::vector) AS similarity
FROM documents d
JOIN categories c ON d.category_id = c.id
WHERE c.slug = 'architecture'
AND d.created_at > now() - INTERVAL '90 days'
ORDER BY d.embedding <=> $1::vector
LIMIT 10;The <=> operator is cosine distance. You get full SQL expressiveness — WHERE clauses, JOINs, window functions — all with indexes. HNSW tuning matters: m=16 (~12GB RAM per million 1536-dim vectors) balances memory and recall; in our testing, raising m to 32 increased memory significantly but only pushed recall from ~0.95 to ~0.98 against our dataset. Plan for 64GB+ RAM instances at 5M+ vectors. See [pgvector README] for the full HNSW parameter reference.
The release worth knowing about is 0.8.0 (October 2024), carried through the current 0.8.2: it added iterative index scans (SET hnsw.iterative_scan = 'relaxed_order') plus improved cost estimation for filtered queries. Before 0.8.0, a selective WHERE clause on top of an HNSW scan could "overfilter" — the index returned its ef_search candidates, the filter discarded most of them, and you got back fewer than LIMIT rows. Iterative scans keep walking the index until the filtered result set is full, which makes the metadata-filtered queries above behave the way you'd expect. If you're on a pre-0.8.0 build (some managed Postgres lagged for months), upgrading is the single highest-leverage change for filtered RAG queries.
The sharp edge: REINDEX INDEX CONCURRENTLY takes a ShareUpdateExclusiveLock — it blocks INSERTs and UPDATEs while SELECT keeps working (documented in [pgvector README]). We learned this the hard way: plan 10–15 minute rebuild windows for 1M vectors. Mitigation: use the CREATE → DROP → RENAME pattern instead.
Pinecone: Zero-ops at massive scale
[Pinecone documentation]Pinecone is a fully managed SaaS. No infrastructure to provision — just an API endpoint. Scales automatically to billions of vectors.
from pinecone import Pinecone
pc = Pinecone(api_key="your-api-key")
index = pc.Index("documents")
# Upsert vectors with metadata
index.upsert(vectors=[
{"id": "doc-001", "values": embedding, "metadata": {"category": "arch"}},
{"id": "doc-002", "values": embedding_2, "metadata": {"category": "devops"}},
])
# Query with metadata filter
results = index.query(
vector=query_embedding,
top_k=10,
filter={"category": {"$eq": "arch"}},
include_metadata=True
)The tradeoff: serverless cold-starts after idle can eat several seconds before the first query returns (documented behavior of Pinecone's serverless tier [Pinecone Serverless Docs]); costs scale non-linearly with vector count — in our experience, one startup's bill grew from $70/mo to $2,800/mo as the corpus went from under 1M to 8M vectors at 200 QPS. The serverless model bills three dimensions independently — storage ($0.33/GB-month), write units ($4–$4.50/M), and read units ($16–$18/M) — on top of plan minimums ($0 Starter, $20/mo Builder, $50/mo Standard, $500/mo Enterprise) [Pinecone documentation], so read-heavy RAG traffic, not stored vectors, is usually what moves the bill. Use namespaces for bulk deletes (instant, O(1)); avoid individual filter-based deletes.
Wins when: you want zero-ops and can accept cold-start latency; your team has no appetite for database operations. Struggles when: cost matters at massive scale; vendor lock-in concerns; you need SQL-grade filtering.
Weaviate: Native hybrid search (BM25 + vector)
[Weaviate hybrid search docs]Weaviate's killer feature: native BM25 + vector hybrid search in a single query with tunable alpha (0.0 = pure keyword, 1.0 = pure vector) — see [Weaviate Hybrid Search Docs] for the full alpha semantics. In our experience, RAG workloads that mix factual lookup with semantic retrieval tend to land in the 0.65–0.80 range; measure against your own queries before committing to a default.
import weaviate
client = weaviate.connect_to_local()
documents = client.collections.get("Document")
response = documents.query.hybrid(
query="distributed consensus",
alpha=0.75, # 75% vector, 25% keyword
limit=10,
)Pure vector search misses exact keyword matches (searching "OAuth2 PKCE" drifts toward "authentication"). Hybrid search fixes this.
Two 2026 changes matter for sizing. Recent Weaviate releases ship rotational quantization (RQ) and turn vector compression on by default, which is what keeps the per-million-dimension storage cost in check at scale. And Weaviate Cloud's billing model changed in October 2025 from the old per-resource Serverless tier to a vector-dimensions-stored model (free tier, then Flex from $45/mo) [Weaviate hybrid search docs] — budget against dimensions × objects × replication factor, not raw vector count, and re-check the calculator if you priced this before late 2025.
Wins when: users need both semantic understanding and keyword precision. Struggles when: operational overhead (shard rebalancing is not transparent).
Milvus: Maximum throughput at 100M+ vectors
Open-source, built for massive scale. Distributed architecture (query, data, index nodes) lets each layer scale independently. Purpose-built for hundreds of millions of vectors at thousands of QPS.
from pymilvus import MilvusClient, CollectionSchema, FieldSchema, DataType
client = MilvusClient(uri="http://localhost:19530")
fields = [
FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536),
]
client.create_collection(collection_name="documents", schema=CollectionSchema(fields))
results = client.search(collection_name="documents", data=[query_embedding], limit=10)The current stable line is Milvus 2.6 (GA mid-2025), which leans hard into cost reduction at billion scale — its headline feature is RaBitQ 1-bit quantization that compresses an index to roughly 1/32 of its float size, paired with SQ8 reranking to hold recall. (A 3.0 line with multi-vector and geospatial types began landing in 2026; 2.6 remains the production-stable choice.) The cost lever at this scale is quantization, not node count.
Sharp edge: etcd is the most fragile component. A platform colocated etcd on query nodes; traffic spike starved it of CPU, lost quorum, and Milvus stopped ingesting. Fix: dedicate machines for etcd.
Wins when: > 100M vectors at 1000+ QPS; your team can operate a distributed system. Struggles when: operational overhead exceeds value for small datasets.
Embedding model migrations: Dual-index strategy
Every team eventually swaps models — ada-002 → text-embedding-3-small, or to self-hosted nomic-embed-text to cut cost. Embeddings from different models are incompatible — you cannot mix them in one index. Re-embedding 5M documents naively (stop ingestion, re-embed, swap) takes 28 hours and leaves the system down or stale.
The dual-index migration sequence — zero downtime, reversible until the very last step:
graph TD
Start[Production on model v1<br/>idx_v1 hot] --> Step1[Step 1<br/>ALTER TABLE add embedding_v2 column<br/>CREATE INDEX CONCURRENTLY idx_v2]
Step1 --> Step2[Step 2<br/>Backfill embedding_v2<br/>via batched workers + replication-lag throttling]
Step2 --> Step3[Step 3<br/>Dual-write: writes update both v1 and v2<br/>reads still hit v1]
Step3 --> Step4[Step 4<br/>Feature-flag cutover<br/>5 percent of reads → v2<br/>compare quality metrics]
Step4 --> Quality{Quality<br/>regression?}
Quality -->|Yes| Rollback[Flip flag back to v1<br/>investigate offline]
Quality -->|No| Step5[Step 5<br/>Ramp 5 → 25 → 50 → 100 percent<br/>over 1-2 weeks]
Step5 --> Step6[Step 6<br/>Stop dual-write<br/>drop idx_v1 + embedding_v1 column]
Rollback --> Step3
style Step6 fill:#dfd
style Rollback fill:#ffd
The diagram captures the safety property: every step before Step 6 is reversible. The cutover is gradual, the rollback is one feature-flag flip, and you never have a window where queries hit a half-built index.
Use dual indexes with a feature-flag cutover instead:
Step 1: Add a second embedding column (pgvector) or namespace (Pinecone); build the index concurrently.
ALTER TABLE documents ADD COLUMN embedding_v3 vector(1536);
CREATE INDEX CONCURRENTLY idx_documents_embedding_v3small
ON documents USING hnsw (embedding_v3 vector_cosine_ops)
WITH (m = 16, ef_construction = 200);Step 2: Dual-write new documents; embed with both models.
def upsert(doc_id: str, text: str):
ada_emb = embed_client.embed(text, model="text-embedding-ada-002")
v3_emb = embed_client.embed(text, model="text-embedding-3-small")
conn.execute(
"INSERT INTO documents (id, embedding, embedding_v3) VALUES (%s, %s, %s)",
(doc_id, ada_emb, v3_emb)
)Step 3: Backfill existing docs (restartable via WHERE embedding_v3 IS NULL); validate recall parity; flip feature flag to serve from new index.
Step 4: Monitor query latency and recall; if parity holds after 1 week, drop the old embedding column.
Changing dimensions (1536 → 3072) requires a new column — PostgreSQL vector is fixed-dimension per column. On
Pinecone, a new index. Plan dimension changes as a separate migration from model changes.
Production checklist
Before going live with any vector DB:
- Measure baseline performance — p50, p99, recall@10 on your actual data and query patterns. Published benchmarks use synthetic data.
- Plan index rebuild windows — HNSW rebuilds block writes. For pgvector: use CREATE → DROP → RENAME pattern. For Pinecone/Weaviate: automatic but plan shard rebalancing.
- Monitor index health — Track index bloat (pgvector), shard imbalance (Weaviate/Milvus), cold-start latency (Pinecone).
- Budget for embedding migrations — OpenAI's
text-embedding-3-smallcosts $0.02 / 1M tokens. Re-embedding 10M docs at 500 tokens each = $100. Plan model migrations with dual-index strategy. - Test with your actual scale — Don't migrate prematurely. A well-tuned pgvector instance handles 5–10M vectors with sub-20ms p99 in our benchmarks; in our experience, teams shipping sub-billion-vector workloads rarely outgrow that configuration.
pgvector index health: the queries you run before users notice
Two queries that surface most pgvector production issues — bloat-driven recall regressions and the lock-contention failure mode behind the "REINDEX took the chatbot offline" incident this article opened with:
-- 1. Index size, bloat estimate, and last-VACUUM timestamp. Run weekly; if
-- pg_relation_size grows by >50% without matching row growth, schedule
-- REINDEX CONCURRENTLY in the next maintenance window.
SELECT
i.relname AS index_name,
pg_size_pretty(pg_relation_size(i.oid)) AS index_size,
s.n_live_tup AS table_rows,
s.last_vacuum,
s.last_autovacuum
FROM pg_class i
JOIN pg_index x ON x.indexrelid = i.oid
JOIN pg_class t ON t.oid = x.indrelid
JOIN pg_stat_user_tables s ON s.relid = t.oid
WHERE i.relkind = 'i'
AND pg_get_indexdef(i.oid) ILIKE '%USING hnsw%'
ORDER BY pg_relation_size(i.oid) DESC;
-- 2. Live blocking-lock detector. Paste during an incident: any non-empty
-- result with mode='ShareUpdateExclusiveLock' on a vector index means a
-- REINDEX or VACUUM is blocking your write path right now.
SELECT
a.pid,
a.usename,
a.query_start,
now() - a.query_start AS waiting_for,
l.mode,
l.relation::regclass AS blocked_object,
a.query
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE NOT l.granted
OR l.mode IN ('ShareUpdateExclusiveLock', 'AccessExclusiveLock')
ORDER BY a.query_start;The first query is your "is bloat eating recall?" check; the second is your "why are inserts hanging right now?" check. Together they cover ~90% of pgvector incidents. [pgvector README]
Decision Framework
Start from your existing infrastructure, not from a feature matrix:
┌─────────────────────────────────────┐
│ Already running PostgreSQL? │
└──────────────┬──────────────────────┘
│
┌────▼────┐
│ Yes │──────────────────▶ Start with pgvector
└─────────┘ (zero new infrastructure)
│ No │
▼ │
┌─────────────────┐ │
│ Need hybrid │ ▼
│ search (BM25 + │ Dataset > 10M vectors
│ vector)? │ or need > 1000 QPS?
└────────┬────────┘ │
│ ┌────▼────┐
┌────▼────┐ │ Yes │──▶ Migrate to
│ Yes │──▶ Weaviate └─────────┘ dedicated DB
└─────────┘ │ No
│ No │
▼ ▼
┌─────────────────┐ Stay on pgvector
│ Want zero-ops │
│ managed service? │
└────────┬────────┘
│
┌────▼────┐
│ Yes │──▶ Pinecone
└─────────┘
│ No
▼
┌─────────────────┐
│ Need max QPS on │
│ 100M+ vectors? │
└────────┬────────┘
│
┌────▼────┐
│ Yes │──▶ Milvus (distributed)
└─────────┘
│ No
▼
Start with pgvector
(simplest default)The same routing in Mermaid form — keep this as the cheat-sheet for "which vector database":
graph TD
Start[Need vector search] --> PG{Already running<br/>PostgreSQL?}
PG -->|Yes| Hybrid1{Need hybrid search<br/>BM25 plus vector?}
PG -->|No| Hybrid2{Need hybrid search<br/>BM25 plus vector?}
Hybrid1 -->|Yes| Wv1[Weaviate]
Hybrid1 -->|No| Scale1{Over 10M vectors<br/>or over 1000 QPS?}
Scale1 -->|No| PGv[pgvector<br/>zero new infrastructure]
Scale1 -->|Yes| Ops1{Want zero-ops<br/>managed service?}
Ops1 -->|Yes| Pine[Pinecone]
Ops1 -->|No| Throughput1{Need max QPS on<br/>over 100M vectors?}
Throughput1 -->|Yes| Mil[Milvus<br/>distributed]
Throughput1 -->|No| PGv
Hybrid2 -->|Yes| Wv1
Hybrid2 -->|No| Ops1
style PGv fill:#dfd
style Wv1 fill:#dfd
style Pine fill:#ffd
style Mil fill:#ffd
Start with pgvector, measure, and migrate only when you hit a concrete limitation — not a theoretical one.
Abstract your vector store interface
Build migration flexibility from day one. Use an abstraction so swapping backends is a new class, not a pipeline rewrite:
from abc import ABC, abstractmethod
from dataclasses import dataclass
@dataclass
class SearchResult:
id: str
score: float
metadata: dict
class VectorStore(ABC):
@abstractmethod
def upsert(self, id: str, vector: list[float], metadata: dict) -> None: ...
@abstractmethod
def search(self, vector: list[float], top_k: int = 10, filter: dict = None) -> list[SearchResult]: ...
@abstractmethod
def delete(self, ids: list[str]) -> None: ...Implement pgvector, then Pinecone, then Weaviate. A config flag selects which backend to use. When you outgrow pgvector, you swap implementations without rewriting your application logic.
Hybrid retrieval: BM25 + vector + cross-encoder rerank
Pure vector search loses on rare tokens — product SKUs, error codes, library names, version numbers. BM25 still owns those. The production-grade pattern is a three-stage funnel: a wide BM25 + dense recall pass over the corpus, a fusion step that merges the two ranked lists, then a cross-encoder reranker that re-scores the top 50–100 candidates with a much stronger model. The reranker is too expensive to apply to the full index, but cheap enough to run on a short shortlist, and it consistently lifts NDCG@10 by 5–15 points over either signal alone.
In Postgres you can build the hybrid stage without leaving the database. Add a tsvector column alongside the embedding, run BM25-style ranking with ts_rank_cd, and combine the two scores with Reciprocal Rank Fusion (RRF) — which is robust to score-scale mismatch in a way that simple weighted sums are not.
-- One CTE per signal, fused by rank position (k = 60 is the standard RRF
-- constant). RRF avoids having to normalise BM25 against cosine distance.
WITH bm25 AS (
SELECT id,
ts_rank_cd(content_tsv, plainto_tsquery('english', $1)) AS score,
row_number() OVER (ORDER BY ts_rank_cd(content_tsv,
plainto_tsquery('english', $1)) DESC) AS rnk
FROM documents
WHERE content_tsv @@ plainto_tsquery('english', $1)
LIMIT 100
),
dense AS (
SELECT id,
1 - (embedding <=> $2::vector) AS score,
row_number() OVER (ORDER BY embedding <=> $2::vector) AS rnk
FROM documents
ORDER BY embedding <=> $2::vector
LIMIT 100
)
SELECT d.id, d.content,
COALESCE(1.0 / (60 + b.rnk), 0) + COALESCE(1.0 / (60 + v.rnk), 0) AS rrf
FROM documents d
LEFT JOIN bm25 b ON b.id = d.id
LEFT JOIN dense v ON v.id = d.id
WHERE b.id IS NOT NULL OR v.id IS NOT NULL
ORDER BY rrf DESC
LIMIT 50;Feed the 50 RRF survivors into a cross-encoder — a bge-reranker-v2-m3 or cohere-rerank-3 call — and return the top 10. Budget the reranker call at ~80 ms p99 for 50 candidates on a single GPU; on CPU, cap the shortlist at 25.
Embedding model migration playbook
A complete dual-index migration looks like a four-week rollout, not a weekend job. The realistic timeline for a 10M-document corpus on text-embedding-3-small:
- Week 1 — shadow build. Add the new vector column, build the index
CONCURRENTLY, enable dual-write so every fresh document hits both columns. No read traffic touches the new index yet. Smoke-test the embedding client at 100 docs/sec sustained. - Week 2 — backfill. Run batched workers (1k docs per batch, 4–8 workers) to embed historical rows. At an OpenAI rate of roughly 3k requests/min, 10M docs takes ~55 hours of wall-clock. Throttle on replication lag — pause the workers if
pg_replication_slots.confirmed_flush_lsnfalls more than 30 seconds behind primary. - Week 3 — shadow eval. Mirror 5% of read traffic to both indexes, log the top-10 ID sets, and compute Jaccard overlap and recall@10 against a held-out judgment set. Block the cutover if recall regresses by more than 2 absolute points.
- Week 4 — ramp and drop. Flip the feature flag from 5% → 25% → 50% → 100% over four days, watching p99 latency and answer-quality eval scores. Wait one full week with 100% traffic on the new index before dropping the old column and reclaiming disk with
VACUUM FULLduring an off-peak window. [pgvector README]
Budget the dollars up front: re-embedding 10M docs at 500 tokens each is 5B tokens; at $0.02 per 1M tokens, that is $100 of compute plus whatever your egress and worker hours cost. The expensive part is the eval judgment set, not the embeddings.
Capacity planning: memory and disk per million 1536-dim vectors
Three formulas cover 90% of pgvector capacity questions. For 1536-dimension float32 vectors with HNSW at m = 16: [pgvector README]
- Raw vector bytes per row =
dim * 4= 1536 × 4 = 6,144 B (~6 KB). - HNSW index bytes per row ≈
dim * 4 + m * 2 * 8 + 32= 6144 + 256 + 32 = 6,432 B. Roughly the same as raw storage — HNSW essentially doubles your footprint. - Heap row overhead = ~28 B (tuple header) + content TEXT toasted out-of-line + JSONB metadata. Budget 200 B per row for typical RAG payloads.
Worked example for 5M documents with 1.5 KB of average text content per row:
Raw vectors: 5_000_000 * 6144 = 30.7 GB
HNSW index: 5_000_000 * 6432 = 32.2 GB
Heap (text+meta): 5_000_000 * 1700 = 8.5 GB
TOAST + WAL slack: ~10 GB
--------
Total disk: ~81 GB
Working set in RAM (HNSW must fit hot):
index pages ~32 GB
Postgres shared_buffers (25% RAM)
OS page cache (25% RAM)
--------
Right-sized instance: 96-128 GB RAMRule of thumb: provision RAM ≥ 1.3 × HNSW index size so the entire index plus a working set of heap pages stays resident. Spilling HNSW to disk turns a 5 ms p99 query into a 200 ms one — cold pages on NVMe are 40× slower than RAM. Quantizing to halfvec (float16) cuts vector and index size in half at the cost of ~1 point of recall@10; quantizing to bit (binary) cuts another 4× but only works as a first-stage filter before a full-precision rerank.
Frequently Asked Questions
Should I use pgvector or a dedicated vector database?
Use pgvector if you already run PostgreSQL and need ACID, SQL joins, and want to avoid a new infrastructure dependency. Choose Pinecone for zero-ops at billions of vectors; Weaviate for hybrid BM25+vector search; Milvus for 100M+ vectors at high QPS.
What is HNSW and why does it matter?
HNSW (Hierarchical Navigable Small World) is the dominant ANN index algorithm. It gives sub-millisecond query times but requires significant memory and has expensive index builds — plan rebuild windows carefully in production.
How do you migrate embedding models without downtime?
Use a dual-index strategy: add a new embedding column, dual-write new documents to both old and new indexes, backfill existing docs in batches, validate recall parity, then swap traffic with a feature flag. Never switch models without re-embedding.
How many vectors can pgvector handle?
pgvector handles 5–10M vectors well on a single instance with tuned HNSW (m=16 uses ~12GB RAM per million 1536-dim vectors). Beyond 10M vectors or for sub-10ms p99 at high concurrency, migrate to a distributed option (Milvus, Pinecone) or add Citus sharding.
Keep Reading
- Building Production RAG Pipelines in Go — Chunking, embeddings, pgvector operations, and retrieval evaluation for the pipeline that feeds your vector store
- Spring AI in Production — The Java/Spring equivalent: RAG pipelines with pgvector, circuit breakers, and quality evaluation
- Database Indexing Strategies — B-tree, GIN, and GiST index internals for the metadata indexes that sit alongside your HNSW vectors in pgvector
- LLM API Integration Patterns — Token budgets, retries, and streaming for the embedding-API client that fills your vector index
- Securing AI Agent Infrastructure — Tenant isolation in pgvector / Pinecone / Weaviate so a poisoned RAG corpus does not leak across customers
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.
Read Next
Building Production RAG Pipelines: Chunking, Embeddings, and Retrieval at Scale
Build RAG systems that work in production: chunking strategies, embedding selection, pgvector ops, and retrieval quality evaluation.
Spring AI in Production: RAG Pipelines, Reliability, and Observability for Java Backends
Spring AI 1.1 deep-dive: production RAG pipeline with PII scrubbing, circuit breakers, Micrometer observability, and answer evaluation.
Building an MCP Server in Go with Code Mode: From 1.17M Tokens to 1,000
2,500 API endpoints in one MCP server without blowing context windows. The Code Mode pattern uses search + execute to cut token cost by 1,000x.