Skip to content

Vector Databases Compared: pgvector vs Pinecone vs Weaviate

BackendBytes Engineering Team
BackendBytes Engineering Team
9 min read
Vector Databases Compared: pgvector vs Pinecone vs Weaviate

Key Takeaways

  • pgvector HNSW index rebuilds block writes — one team's routine REINDEX turned into an 8-hour ingestion outage
  • Choose by infrastructure fit: already have Postgres → pgvector; need zero-ops → Pinecone; want hybrid BM25+vector → Weaviate; 100M+ vectors → Milvus
  • Switching embedding models requires re-embedding your entire corpus — use dual-index migration with a feature flag, not a maintenance window
  • Under 100K vectors with infrequent queries, brute-force NumPy is fine — you don't need a vector database until you need sub-100ms latency at 1M+ vectors

A routine Thursday REINDEX INDEX CONCURRENTLY on a 12M-row pgvector table turned into an eight-hour incident. HNSW index builds take a ShareUpdateExclusiveLockSELECTs keep working, but INSERT and UPDATE are blocked. The RAG ingestion pipeline stalled, and tickets arrived: "Why can't the chatbot see the document I just uploaded?"

The lesson: vector databases have operational characteristics no feature matrix captures. We've deployed all of these at scale — pgvector, Pinecone, Weaviate, Milvus — and learned the sharp edges the hard way. Each DB trades off scalability, cost, and operational burden differently. This guide compares them with real code, benchmarks, and the operational gotchas that only surface at production scale.

Quick Take

Choose based on infrastructure fit: already have Postgres → start with pgvector (zero new infrastructure); need zero-ops managed service → Pinecone; want hybrid BM25+vector search → Weaviate; have 100M+ vectors at high QPS → Milvus. Don't migrate prematurely; measure first.

  • pgvector wins if you already run PostgreSQL and don't need separate scaling
  • Pinecone wins if you want fully managed with no infrastructure overhead
  • Weaviate wins if users need both keyword precision and semantic understanding
  • Milvus wins if you have 100M+ vectors and need thousands of QPS

The quick start: Vector DB comparison matrix

Choose based on your infrastructure and constraints, not by checking feature boxes:

DimensionpgvectorPineconeWeaviateMilvus
TypePostgreSQL extensionFully managed SaaSOpen-source DBOpen-source distributed DB
Best forAlready use PostgresZero-ops, fast prototypingHybrid BM25+vector search100M+ vectors, 1000+ QPS
DeploymentYour existing PostgresAPI endpoint onlySelf-hosted or managedSelf-hosted or Zilliz Cloud
ScalingRead replicas / CitusAuto-scales, no tuningNative shardingDistributed (query, data, index nodes)
Hybrid searchManual SQL joinSparse-dense in one queryNative BM25 + alpha tuningSparse + dense vectors
ACID / SQL joinsYes (full Postgres)NoNoNo
Max vectors (single node)~10M well-tunedN/A (managed)~100MUnlimited (distributed)
Index build time (1M vectors)8–15 minSeconds3–5 min2–4 min
Cold start latencyNone2–5 sec (serverless)NoneNone
Cost (10M vectors, 100 QPS)$50–450/mo self-hosted; $25–200/mo managed$100–350/mo$250–680/mo$200–900/mo
Operational overheadModerate (index rebalancing)NoneMedium (sharding ops)High (etcd, MinIO, multi-node)
Latest version (mid-2026)pgvector 0.8.2Serverless (managed)Weaviate 1.37Milvus 2.6.x

The cost row blends self-hosted infra and managed-tier list prices into an order-of-magnitude band — treat it as a planning estimate, not a quote. The managed pricing models diverge in ways the band hides, and all four moved in the last year: Pinecone serverless bills usage as storage ($0.33/GB-month), write units ($4–$4.50/M), and read units ($16–$18/M), with plan minimums of $0 (Starter), $20/mo (the newer Builder tier), $50/mo (Standard), and $500/mo (Enterprise). Weaviate Cloud moved to a vector-dimensions-stored model in October 2025: a free tier, then Flex from $45/mo, Plus from $280/mo, and Premium from $400/mo. Milvus is self-hosted (Apache-2.0); the managed path is Zilliz Cloud, whose serverless storage dropped to $0.04/GB-month in January 2026. Re-price against each vendor's calculator before committing — list prices change quarterly.

Two databases not broken out above are worth knowing in 2026. Qdrant (Rust, latest 1.18) competes directly with the four here: native payload filtering, a free managed tier (0.5 vCPU / 1 GB RAM), and resource-based Cloud pricing. LanceDB ($30M Series A in mid-2025) takes a different shape — an embedded, serverless engine on the columnar Lance format, aimed at multimodal and lakehouse-scale data rather than a long-running server. If your fit is "Postgres-adjacent but standalone," evaluate Qdrant; if it is "embedded, file-based, multimodal," evaluate LanceDB.

When you DON'T need a vector database: under 100K vectors with infrequent queries, brute-force NumPy is fine. You need an index when sub-100ms latency at 1M+ vectors with concurrent users matters.

pgvector: Start here if you have PostgreSQL

[pgvector README]

pgvector adds a vector column type and HNSW indexes to Postgres. If you already run Postgres, you get vector search with ACID transactions and SQL joins — no new infrastructure.

CREATE EXTENSION IF NOT EXISTS vector;
 
CREATE TABLE documents (
    id         BIGINT PRIMARY KEY,
    content    TEXT NOT NULL,
    metadata   JSONB DEFAULT '{}',
    embedding  vector(1536),  -- OpenAI text-embedding-3-small dimensions
    created_at TIMESTAMPTZ DEFAULT now()
);
 
-- HNSW index: m=16, ef_construction=200 is production sweet spot
CREATE INDEX idx_embedding ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);
 
-- Vector search with metadata filters and SQL joins
SELECT d.id, d.content, 1 - (d.embedding <=> $1::vector) AS similarity
FROM documents d
JOIN categories c ON d.category_id = c.id
WHERE c.slug = 'architecture'
  AND d.created_at > now() - INTERVAL '90 days'
ORDER BY d.embedding <=> $1::vector
LIMIT 10;

The <=> operator is cosine distance. You get full SQL expressiveness — WHERE clauses, JOINs, window functions — all with indexes. HNSW tuning matters: m=16 (~12GB RAM per million 1536-dim vectors) balances memory and recall; in our testing, raising m to 32 increased memory significantly but only pushed recall from ~0.95 to ~0.98 against our dataset. Plan for 64GB+ RAM instances at 5M+ vectors. See [pgvector README] for the full HNSW parameter reference.

The release worth knowing about is 0.8.0 (October 2024), carried through the current 0.8.2: it added iterative index scans (SET hnsw.iterative_scan = 'relaxed_order') plus improved cost estimation for filtered queries. Before 0.8.0, a selective WHERE clause on top of an HNSW scan could "overfilter" — the index returned its ef_search candidates, the filter discarded most of them, and you got back fewer than LIMIT rows. Iterative scans keep walking the index until the filtered result set is full, which makes the metadata-filtered queries above behave the way you'd expect. If you're on a pre-0.8.0 build (some managed Postgres lagged for months), upgrading is the single highest-leverage change for filtered RAG queries.

The sharp edge: REINDEX INDEX CONCURRENTLY takes a ShareUpdateExclusiveLock — it blocks INSERTs and UPDATEs while SELECT keeps working (documented in [pgvector README]). We learned this the hard way: plan 10–15 minute rebuild windows for 1M vectors. Mitigation: use the CREATE → DROP → RENAME pattern instead.

Pinecone: Zero-ops at massive scale

[Pinecone documentation]

Pinecone is a fully managed SaaS. No infrastructure to provision — just an API endpoint. Scales automatically to billions of vectors.

from pinecone import Pinecone
 
pc = Pinecone(api_key="your-api-key")
index = pc.Index("documents")
 
# Upsert vectors with metadata
index.upsert(vectors=[
    {"id": "doc-001", "values": embedding, "metadata": {"category": "arch"}},
    {"id": "doc-002", "values": embedding_2, "metadata": {"category": "devops"}},
])
 
# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"category": {"$eq": "arch"}},
    include_metadata=True
)

The tradeoff: serverless cold-starts after idle can eat several seconds before the first query returns (documented behavior of Pinecone's serverless tier [Pinecone Serverless Docs]); costs scale non-linearly with vector count — in our experience, one startup's bill grew from $70/mo to $2,800/mo as the corpus went from under 1M to 8M vectors at 200 QPS. The serverless model bills three dimensions independently — storage ($0.33/GB-month), write units ($4–$4.50/M), and read units ($16–$18/M) — on top of plan minimums ($0 Starter, $20/mo Builder, $50/mo Standard, $500/mo Enterprise) [Pinecone documentation], so read-heavy RAG traffic, not stored vectors, is usually what moves the bill. Use namespaces for bulk deletes (instant, O(1)); avoid individual filter-based deletes.

Wins when: you want zero-ops and can accept cold-start latency; your team has no appetite for database operations. Struggles when: cost matters at massive scale; vendor lock-in concerns; you need SQL-grade filtering.

Weaviate: Native hybrid search (BM25 + vector)

[Weaviate hybrid search docs]

Weaviate's killer feature: native BM25 + vector hybrid search in a single query with tunable alpha (0.0 = pure keyword, 1.0 = pure vector) — see [Weaviate Hybrid Search Docs] for the full alpha semantics. In our experience, RAG workloads that mix factual lookup with semantic retrieval tend to land in the 0.65–0.80 range; measure against your own queries before committing to a default.

import weaviate
client = weaviate.connect_to_local()
documents = client.collections.get("Document")
 
response = documents.query.hybrid(
    query="distributed consensus",
    alpha=0.75,  # 75% vector, 25% keyword
    limit=10,
)

Pure vector search misses exact keyword matches (searching "OAuth2 PKCE" drifts toward "authentication"). Hybrid search fixes this.

Two 2026 changes matter for sizing. Recent Weaviate releases ship rotational quantization (RQ) and turn vector compression on by default, which is what keeps the per-million-dimension storage cost in check at scale. And Weaviate Cloud's billing model changed in October 2025 from the old per-resource Serverless tier to a vector-dimensions-stored model (free tier, then Flex from $45/mo) [Weaviate hybrid search docs] — budget against dimensions × objects × replication factor, not raw vector count, and re-check the calculator if you priced this before late 2025.

Wins when: users need both semantic understanding and keyword precision. Struggles when: operational overhead (shard rebalancing is not transparent).

Milvus: Maximum throughput at 100M+ vectors

Open-source, built for massive scale. Distributed architecture (query, data, index nodes) lets each layer scale independently. Purpose-built for hundreds of millions of vectors at thousands of QPS.

from pymilvus import MilvusClient, CollectionSchema, FieldSchema, DataType
 
client = MilvusClient(uri="http://localhost:19530")
 
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536),
]
client.create_collection(collection_name="documents", schema=CollectionSchema(fields))
 
results = client.search(collection_name="documents", data=[query_embedding], limit=10)

The current stable line is Milvus 2.6 (GA mid-2025), which leans hard into cost reduction at billion scale — its headline feature is RaBitQ 1-bit quantization that compresses an index to roughly 1/32 of its float size, paired with SQ8 reranking to hold recall. (A 3.0 line with multi-vector and geospatial types began landing in 2026; 2.6 remains the production-stable choice.) The cost lever at this scale is quantization, not node count.

Sharp edge: etcd is the most fragile component. A platform colocated etcd on query nodes; traffic spike starved it of CPU, lost quorum, and Milvus stopped ingesting. Fix: dedicate machines for etcd.

Wins when: > 100M vectors at 1000+ QPS; your team can operate a distributed system. Struggles when: operational overhead exceeds value for small datasets.

Embedding model migrations: Dual-index strategy

Every team eventually swaps models — ada-002text-embedding-3-small, or to self-hosted nomic-embed-text to cut cost. Embeddings from different models are incompatible — you cannot mix them in one index. Re-embedding 5M documents naively (stop ingestion, re-embed, swap) takes 28 hours and leaves the system down or stale.

The dual-index migration sequence — zero downtime, reversible until the very last step:

graph TD
    Start[Production on model v1<br/>idx_v1 hot] --> Step1[Step 1<br/>ALTER TABLE add embedding_v2 column<br/>CREATE INDEX CONCURRENTLY idx_v2]
    Step1 --> Step2[Step 2<br/>Backfill embedding_v2<br/>via batched workers + replication-lag throttling]
    Step2 --> Step3[Step 3<br/>Dual-write: writes update both v1 and v2<br/>reads still hit v1]
    Step3 --> Step4[Step 4<br/>Feature-flag cutover<br/>5 percent of reads → v2<br/>compare quality metrics]
    Step4 --> Quality{Quality<br/>regression?}
    Quality -->|Yes| Rollback[Flip flag back to v1<br/>investigate offline]
    Quality -->|No| Step5[Step 5<br/>Ramp 5 → 25 → 50 → 100 percent<br/>over 1-2 weeks]
    Step5 --> Step6[Step 6<br/>Stop dual-write<br/>drop idx_v1 + embedding_v1 column]
    Rollback --> Step3
    style Step6 fill:#dfd
    style Rollback fill:#ffd

The diagram captures the safety property: every step before Step 6 is reversible. The cutover is gradual, the rollback is one feature-flag flip, and you never have a window where queries hit a half-built index.

Use dual indexes with a feature-flag cutover instead:

Step 1: Add a second embedding column (pgvector) or namespace (Pinecone); build the index concurrently.

ALTER TABLE documents ADD COLUMN embedding_v3 vector(1536);
CREATE INDEX CONCURRENTLY idx_documents_embedding_v3small
    ON documents USING hnsw (embedding_v3 vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

Step 2: Dual-write new documents; embed with both models.

def upsert(doc_id: str, text: str):
    ada_emb = embed_client.embed(text, model="text-embedding-ada-002")
    v3_emb = embed_client.embed(text, model="text-embedding-3-small")
 
    conn.execute(
        "INSERT INTO documents (id, embedding, embedding_v3) VALUES (%s, %s, %s)",
        (doc_id, ada_emb, v3_emb)
    )

Step 3: Backfill existing docs (restartable via WHERE embedding_v3 IS NULL); validate recall parity; flip feature flag to serve from new index.

Step 4: Monitor query latency and recall; if parity holds after 1 week, drop the old embedding column.

Dimension Changes

Changing dimensions (1536 → 3072) requires a new column — PostgreSQL vector is fixed-dimension per column. On Pinecone, a new index. Plan dimension changes as a separate migration from model changes.

Production checklist

Before going live with any vector DB:

  • Measure baseline performance — p50, p99, recall@10 on your actual data and query patterns. Published benchmarks use synthetic data.
  • Plan index rebuild windows — HNSW rebuilds block writes. For pgvector: use CREATE → DROP → RENAME pattern. For Pinecone/Weaviate: automatic but plan shard rebalancing.
  • Monitor index health — Track index bloat (pgvector), shard imbalance (Weaviate/Milvus), cold-start latency (Pinecone).
  • Budget for embedding migrations — OpenAI's text-embedding-3-small costs $0.02 / 1M tokens. Re-embedding 10M docs at 500 tokens each = $100. Plan model migrations with dual-index strategy.
  • Test with your actual scale — Don't migrate prematurely. A well-tuned pgvector instance handles 5–10M vectors with sub-20ms p99 in our benchmarks; in our experience, teams shipping sub-billion-vector workloads rarely outgrow that configuration.

pgvector index health: the queries you run before users notice

Two queries that surface most pgvector production issues — bloat-driven recall regressions and the lock-contention failure mode behind the "REINDEX took the chatbot offline" incident this article opened with:

-- 1. Index size, bloat estimate, and last-VACUUM timestamp. Run weekly; if
--    pg_relation_size grows by >50% without matching row growth, schedule
--    REINDEX CONCURRENTLY in the next maintenance window.
SELECT
    i.relname                                                    AS index_name,
    pg_size_pretty(pg_relation_size(i.oid))                      AS index_size,
    s.n_live_tup                                                 AS table_rows,
    s.last_vacuum,
    s.last_autovacuum
FROM pg_class i
JOIN pg_index x          ON x.indexrelid = i.oid
JOIN pg_class t          ON t.oid = x.indrelid
JOIN pg_stat_user_tables s ON s.relid = t.oid
WHERE i.relkind = 'i'
  AND pg_get_indexdef(i.oid) ILIKE '%USING hnsw%'
ORDER BY pg_relation_size(i.oid) DESC;
 
-- 2. Live blocking-lock detector. Paste during an incident: any non-empty
--    result with mode='ShareUpdateExclusiveLock' on a vector index means a
--    REINDEX or VACUUM is blocking your write path right now.
SELECT
    a.pid,
    a.usename,
    a.query_start,
    now() - a.query_start                                        AS waiting_for,
    l.mode,
    l.relation::regclass                                         AS blocked_object,
    a.query
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE NOT l.granted
   OR l.mode IN ('ShareUpdateExclusiveLock', 'AccessExclusiveLock')
ORDER BY a.query_start;

The first query is your "is bloat eating recall?" check; the second is your "why are inserts hanging right now?" check. Together they cover ~90% of pgvector incidents. [pgvector README]

Decision Framework

Start from your existing infrastructure, not from a feature matrix:

┌─────────────────────────────────────┐
│     Already running PostgreSQL?     │
└──────────────┬──────────────────────┘

          ┌────▼────┐
          │   Yes   │──────────────────▶  Start with pgvector
          └─────────┘                     (zero new infrastructure)
               │ No                              │
               ▼                                 │
     ┌─────────────────┐                         │
     │ Need hybrid      │                        ▼
     │ search (BM25 +   │               Dataset > 10M vectors
     │ vector)?         │               or need > 1000 QPS?
     └────────┬────────┘                         │
              │                             ┌────▼────┐
         ┌────▼────┐                        │   Yes   │──▶ Migrate to
         │   Yes   │──▶ Weaviate            └─────────┘    dedicated DB
         └─────────┘                             │ No
              │ No                               │
              ▼                                  ▼
     ┌─────────────────┐                  Stay on pgvector
     │ Want zero-ops    │
     │ managed service? │
     └────────┬────────┘

         ┌────▼────┐
         │   Yes   │──▶ Pinecone
         └─────────┘
              │ No

     ┌─────────────────┐
     │ Need max QPS on  │
     │ 100M+ vectors?   │
     └────────┬────────┘

         ┌────▼────┐
         │   Yes   │──▶ Milvus (distributed)
         └─────────┘
              │ No

       Start with pgvector
       (simplest default)

The same routing in Mermaid form — keep this as the cheat-sheet for "which vector database":

graph TD
    Start[Need vector search] --> PG{Already running<br/>PostgreSQL?}
    PG -->|Yes| Hybrid1{Need hybrid search<br/>BM25 plus vector?}
    PG -->|No| Hybrid2{Need hybrid search<br/>BM25 plus vector?}
    Hybrid1 -->|Yes| Wv1[Weaviate]
    Hybrid1 -->|No| Scale1{Over 10M vectors<br/>or over 1000 QPS?}
    Scale1 -->|No| PGv[pgvector<br/>zero new infrastructure]
    Scale1 -->|Yes| Ops1{Want zero-ops<br/>managed service?}
    Ops1 -->|Yes| Pine[Pinecone]
    Ops1 -->|No| Throughput1{Need max QPS on<br/>over 100M vectors?}
    Throughput1 -->|Yes| Mil[Milvus<br/>distributed]
    Throughput1 -->|No| PGv
    Hybrid2 -->|Yes| Wv1
    Hybrid2 -->|No| Ops1
    style PGv fill:#dfd
    style Wv1 fill:#dfd
    style Pine fill:#ffd
    style Mil fill:#ffd

Start with pgvector, measure, and migrate only when you hit a concrete limitation — not a theoretical one.

Abstract your vector store interface

Build migration flexibility from day one. Use an abstraction so swapping backends is a new class, not a pipeline rewrite:

from abc import ABC, abstractmethod
from dataclasses import dataclass
 
@dataclass
class SearchResult:
    id: str
    score: float
    metadata: dict
 
class VectorStore(ABC):
    @abstractmethod
    def upsert(self, id: str, vector: list[float], metadata: dict) -> None: ...
 
    @abstractmethod
    def search(self, vector: list[float], top_k: int = 10, filter: dict = None) -> list[SearchResult]: ...
 
    @abstractmethod
    def delete(self, ids: list[str]) -> None: ...

Implement pgvector, then Pinecone, then Weaviate. A config flag selects which backend to use. When you outgrow pgvector, you swap implementations without rewriting your application logic.

Hybrid retrieval: BM25 + vector + cross-encoder rerank

Pure vector search loses on rare tokens — product SKUs, error codes, library names, version numbers. BM25 still owns those. The production-grade pattern is a three-stage funnel: a wide BM25 + dense recall pass over the corpus, a fusion step that merges the two ranked lists, then a cross-encoder reranker that re-scores the top 50–100 candidates with a much stronger model. The reranker is too expensive to apply to the full index, but cheap enough to run on a short shortlist, and it consistently lifts NDCG@10 by 5–15 points over either signal alone.

In Postgres you can build the hybrid stage without leaving the database. Add a tsvector column alongside the embedding, run BM25-style ranking with ts_rank_cd, and combine the two scores with Reciprocal Rank Fusion (RRF) — which is robust to score-scale mismatch in a way that simple weighted sums are not.

-- One CTE per signal, fused by rank position (k = 60 is the standard RRF
-- constant). RRF avoids having to normalise BM25 against cosine distance.
WITH bm25 AS (
    SELECT id,
           ts_rank_cd(content_tsv, plainto_tsquery('english', $1)) AS score,
           row_number() OVER (ORDER BY ts_rank_cd(content_tsv,
               plainto_tsquery('english', $1)) DESC) AS rnk
    FROM documents
    WHERE content_tsv @@ plainto_tsquery('english', $1)
    LIMIT 100
),
dense AS (
    SELECT id,
           1 - (embedding <=> $2::vector) AS score,
           row_number() OVER (ORDER BY embedding <=> $2::vector) AS rnk
    FROM documents
    ORDER BY embedding <=> $2::vector
    LIMIT 100
)
SELECT d.id, d.content,
       COALESCE(1.0 / (60 + b.rnk), 0) + COALESCE(1.0 / (60 + v.rnk), 0) AS rrf
FROM documents d
LEFT JOIN bm25  b ON b.id = d.id
LEFT JOIN dense v ON v.id = d.id
WHERE b.id IS NOT NULL OR v.id IS NOT NULL
ORDER BY rrf DESC
LIMIT 50;

Feed the 50 RRF survivors into a cross-encoder — a bge-reranker-v2-m3 or cohere-rerank-3 call — and return the top 10. Budget the reranker call at ~80 ms p99 for 50 candidates on a single GPU; on CPU, cap the shortlist at 25.

Embedding model migration playbook

A complete dual-index migration looks like a four-week rollout, not a weekend job. The realistic timeline for a 10M-document corpus on text-embedding-3-small:

  • Week 1 — shadow build. Add the new vector column, build the index CONCURRENTLY, enable dual-write so every fresh document hits both columns. No read traffic touches the new index yet. Smoke-test the embedding client at 100 docs/sec sustained.
  • Week 2 — backfill. Run batched workers (1k docs per batch, 4–8 workers) to embed historical rows. At an OpenAI rate of roughly 3k requests/min, 10M docs takes ~55 hours of wall-clock. Throttle on replication lag — pause the workers if pg_replication_slots.confirmed_flush_lsn falls more than 30 seconds behind primary.
  • Week 3 — shadow eval. Mirror 5% of read traffic to both indexes, log the top-10 ID sets, and compute Jaccard overlap and recall@10 against a held-out judgment set. Block the cutover if recall regresses by more than 2 absolute points.
  • Week 4 — ramp and drop. Flip the feature flag from 5% → 25% → 50% → 100% over four days, watching p99 latency and answer-quality eval scores. Wait one full week with 100% traffic on the new index before dropping the old column and reclaiming disk with VACUUM FULL during an off-peak window. [pgvector README]

Budget the dollars up front: re-embedding 10M docs at 500 tokens each is 5B tokens; at $0.02 per 1M tokens, that is $100 of compute plus whatever your egress and worker hours cost. The expensive part is the eval judgment set, not the embeddings.

Capacity planning: memory and disk per million 1536-dim vectors

Three formulas cover 90% of pgvector capacity questions. For 1536-dimension float32 vectors with HNSW at m = 16: [pgvector README]

  • Raw vector bytes per row = dim * 4 = 1536 × 4 = 6,144 B (~6 KB).
  • HNSW index bytes per rowdim * 4 + m * 2 * 8 + 32 = 6144 + 256 + 32 = 6,432 B. Roughly the same as raw storage — HNSW essentially doubles your footprint.
  • Heap row overhead = ~28 B (tuple header) + content TEXT toasted out-of-line + JSONB metadata. Budget 200 B per row for typical RAG payloads.

Worked example for 5M documents with 1.5 KB of average text content per row:

Raw vectors:        5_000_000 * 6144   = 30.7 GB
HNSW index:         5_000_000 * 6432   = 32.2 GB
Heap (text+meta):   5_000_000 * 1700   =  8.5 GB
TOAST + WAL slack:                       ~10 GB
                                        --------
Total disk:                              ~81 GB
 
Working set in RAM (HNSW must fit hot):
  index pages           ~32 GB
  Postgres shared_buffers (25% RAM)
  OS page cache (25% RAM)
                                        --------
  Right-sized instance:    96-128 GB RAM

Rule of thumb: provision RAM ≥ 1.3 × HNSW index size so the entire index plus a working set of heap pages stays resident. Spilling HNSW to disk turns a 5 ms p99 query into a 200 ms one — cold pages on NVMe are 40× slower than RAM. Quantizing to halfvec (float16) cuts vector and index size in half at the cost of ~1 point of recall@10; quantizing to bit (binary) cuts another 4× but only works as a first-stage filter before a full-precision rerank.

Frequently Asked Questions

Should I use pgvector or a dedicated vector database?

Use pgvector if you already run PostgreSQL and need ACID, SQL joins, and want to avoid a new infrastructure dependency. Choose Pinecone for zero-ops at billions of vectors; Weaviate for hybrid BM25+vector search; Milvus for 100M+ vectors at high QPS.

What is HNSW and why does it matter?

HNSW (Hierarchical Navigable Small World) is the dominant ANN index algorithm. It gives sub-millisecond query times but requires significant memory and has expensive index builds — plan rebuild windows carefully in production.

How do you migrate embedding models without downtime?

Use a dual-index strategy: add a new embedding column, dual-write new documents to both old and new indexes, backfill existing docs in batches, validate recall parity, then swap traffic with a feature flag. Never switch models without re-embedding.

How many vectors can pgvector handle?

pgvector handles 5–10M vectors well on a single instance with tuned HNSW (m=16 uses ~12GB RAM per million 1536-dim vectors). Beyond 10M vectors or for sub-10ms p99 at high concurrency, migrate to a distributed option (Milvus, Pinecone) or add Citus sharding.

Keep Reading

BackendBytes Engineering Team
BackendBytes Engineering Team

Engineering Team

A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.

Read Next