Kafka vs RabbitMQ vs NATS vs SQS: Choosing the Right Message Broker
Key Takeaways
- →Kafka for ordered event streams with replay; RabbitMQ for flexible task routing with competing consumers; NATS for lightweight service communication; SQS for managed AWS simplicity
- →One team spent $14K/month running Kafka for password resets at 2 msg/s — ops complexity (KRaft, rebalancing, partition management) scales with setup, not throughput
- →SQS deletes messages immediately after consumption (no replay); Kafka auto-commit loses messages on crash (use manual commit); NATS JetStream stores but requires configuration — match to your failure-recovery model
- →Kafka 4.0 (March 2025) removed ZooKeeper entirely — KRaft is now the only option; adds complexity that's only justified for high-throughput streaming
Two teams. Two opposite mistakes.
One ran a six-broker Kafka cluster for password resets at 2 messages/second. $14K/month infrastructure. One SRE whose job description was "keep Kafka alive." Password reset backed up, 6,000 users couldn't log in.
Another used SQS for their order event stream and lost two days of data after a deployment. SQS deletes messages after consumption. They spent a week reconciling inventory from database snapshots.
The broker you choose constrains the guarantees your system can provide. Ordering, replay, delivery semantics — all baked into architecture decisions. This article compares Kafka, RabbitMQ, NATS, and SQS across the dimensions that matter in production: delivery guarantees, ordering semantics, throughput characteristics, operational complexity, and infrastructure cost. Each section includes working Go code and the failure scenarios that documentation often glosses over.
Match the broker to the messaging model your workload requires. Kafka excels at high-throughput ordered event streams with replay; RabbitMQ at flexible task routing; NATS at lightweight service communication; SQS at managed simplicity in AWS. Use the decision framework below to classify your workload first, check constraints, then pick the broker. Most production systems use more than one broker for different workloads.
- Event streaming (ordered, replayable): Kafka or NATS JetStream
- Task queues (fire-and-forget, competing consumers): RabbitMQ or SQS
- Service communication (request-reply, low-latency): Core NATS
The Quick Start: Decision Table
| Workload | Best Fit | Why |
|---|---|---|
| Ordered event stream, replay required | Kafka | Persistent log, consumer groups, offset seek for replay |
| Event streaming, simpler ops | NATS JetStream | Log semantics, lower infrastructure overhead |
| Task queue with competing consumers | RabbitMQ | Flexible routing (exchanges, bindings), built for job dispatch |
| Task queue in AWS, zero ops | SQS | Fully managed, scales automatically, FIFO for ordering |
| Service-to-service RPC | NATS | Built-in request-reply, sub-millisecond latency |
| Fan-out to multiple subscribers | RabbitMQ (fanout) or Core NATS | Fanout exchange (RabbitMQ) or ephemeral pub/sub (NATS) |
Apache Kafka: The Distributed Event Log
Kafka[Apache Kafka Docs] is purpose-built for high-throughput, ordered, replayable event streaming. It's overkill for task queues but unmatched for audit logs, CDC pipelines, and event sourcing.
Architecture: Clusters of brokers, data organized into topics (partitioned), each partition is an ordered append-only log. Producers write to partitions (by key for ordering or round-robin). Consumers track offsets per consumer group — enabling multiple groups to read the same data independently.
Why ordering matters: All events for the same order go to the same partition (partition key = order ID). A single consumer processes the partition sequentially, guaranteeing per-entity ordering. Different partitions process in parallel across multiple consumer instances. This is critical for workflows where the sequence matters — payment, then shipment, then delivery.
Why replay works: Consumer offsets are just integers — positions in each partition. Reset the offset to 0 and reprocess all events since the beginning. Reset to yesterday's offset and replay the last 24 hours. No other broker provides this flexibility.
Producer: Idempotent Writes
package main
import (
"context"
"encoding/json"
"time"
"github.com/segmentio/kafka-go"
)
type OrderEvent struct {
OrderID string `json:"order_id"`
Amount int64 `json:"amount_cents"`
Timestamp time.Time `json:"timestamp"`
}
func publishOrderEvent(ctx context.Context, w *kafka.Writer, event OrderEvent) error {
payload, _ := json.Marshal(event)
msg := kafka.Message{
Key: []byte(event.OrderID), // same order → same partition → ordered
Value: payload,
}
return w.WriteMessages(ctx, msg)
}
func newKafkaWriter(brokers []string, topic string) *kafka.Writer {
return &kafka.Writer{
Addr: kafka.TCP(brokers...),
Topic: topic,
Balancer: &kafka.Hash{}, // partition by key
RequiredAcks: kafka.RequireAll, // wait for all ISR replicas
MaxAttempts: 5,
Async: false, // synchronous: fail immediately on error
}
}RequiredAcks = All: Leader waits for all in-sync replicas to acknowledge before returning. Only way to guarantee no loss if leader dies immediately after.
Consumer: Manual Offset Commit
func consumeOrders(ctx context.Context, brokers []string, topic, groupID string) error {
r := kafka.NewReader(kafka.ReaderConfig{
Brokers: brokers,
Topic: topic,
GroupID: groupID,
CommitInterval: 0, // disable auto-commit — commit manually
StartOffset: kafka.LastOffset,
})
defer r.Close()
for {
msg, _ := r.FetchMessage(ctx) // does NOT auto-commit
if err := processOrder(ctx, msg); err != nil {
log.Printf("process failed: %v — will retry", err)
continue // do NOT commit
}
// Commit only after successful processing
r.CommitMessages(ctx, msg)
}
}Auto-commit (default in many clients) commits offsets on a timer. If your consumer crashes between auto-commit and finishing processing, that message is lost. Disable auto-commit for any workload where losing data is unacceptable.
RabbitMQ: Flexible Message Routing
RabbitMQ[RabbitMQ docs] is built on AMQP[AMQP 0-9-1]. Its strength is sophisticated routing: producers publish to exchanges, exchanges route to queues via bindings. This decouples producers from consumers.
Architecture: Exchanges (routing rules), bindings (how messages get routed), queues (final destination). Four exchange types: direct (exact key match), topic (wildcard patterns), fanout (broadcast to all), headers (match message attributes).
Why it excels at task queues: Competing consumers. Multiple workers consume from the same queue. Each message goes to one worker. No consumer groups, no rebalancing slowness.
Publisher with Confirms
package main
import (
"context"
"encoding/json"
"time"
amqp "github.com/rabbitmq/amqp091-go"
)
type TaskPayload struct {
TaskID string `json:"task_id"`
Type string `json:"type"`
Payload []byte `json:"payload"`
}
func publishTask(ch *amqp.Channel, exchange, routingKey string, task TaskPayload) error {
// Enable publisher confirms
ch.Confirm(false)
body, _ := json.Marshal(task)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
confirmation, _ := ch.PublishWithDeferredConfirmWithContext(
ctx, exchange, routingKey, true, false,
amqp.Publishing{
DeliveryMode: amqp.Persistent, // survive broker restart
ContentType: "application/json",
MessageId: task.TaskID,
Body: body,
},
)
return confirmation.Wait() // wait for broker ack
}Consumer with Ack
func consumeFromQueue(ch *amqp.Channel, queue string) error {
msgs, _ := ch.Consume(
queue, "", false, false, false, false, // autoAck=false: manual ack only
)
for msg := range msgs {
if err := processTask(msg.Body); err != nil {
// Negative ack: message goes back to queue
msg.Nack(false, true)
continue
}
// Positive ack: message removed
msg.Ack(false)
}
return nil
}NATS: Lightweight and Fast
NATS[NATS docs] is a minimal pub/sub system. Core NATS is ephemeral (no persistence), making it ideal for low-latency service communication. JetStream adds persistence with simpler ops than Kafka.
Core NATS: Sub-millisecond latency, no persistence. Perfect for synchronous RPC and telemetry.
JetStream: Adds persistent streams (log semantics) and durable consumers (like Kafka consumer groups) without the operational overhead.
Core NATS: Request-Reply
package main
import (
"log"
"time"
"github.com/nats-io/nats.go"
)
func rpcExample(nc *nats.Conn) error {
// Service: subscribe and respond
nc.Subscribe("api.users.get", func(msg *nats.Msg) {
response := `{"user_id": "42", "name": "Alice"}`
msg.Respond([]byte(response))
})
// Client: request with timeout
reply, err := nc.Request("api.users.get", []byte("42"), 2*time.Second)
if err != nil {
return err // timeout or no responder
}
log.Printf("response: %s", string(reply.Data))
return nil
}Built-in load balancing: multiple subscribers to the same subject means each request goes to one of N responders.
JetStream: Persistent Streams
func jetStreamPublish(js nats.JetStreamContext, subject string) error {
_, err := js.Publish(subject, []byte(`{"order_id":"123"}`))
return err
}
func jetStreamConsume(js nats.JetStreamContext) error {
// Create stream
js.AddStream(&nats.StreamConfig{
Name: "ORDERS",
Subjects: []string{"orders.>"},
Storage: nats.FileStorage,
MaxAge: 7 * 24 * time.Hour,
})
// Create durable consumer
js.AddConsumer("ORDERS", &nats.ConsumerConfig{
Durable: "order-processor",
AckPolicy: nats.AckExplicitPolicy,
AckWait: 30 * time.Second,
FilterSubject: "orders.created",
})
// Consume
sub, _ := js.PullSubscribe("orders.created", "order-processor")
msgs, _ := sub.Fetch(10)
for _, msg := range msgs {
msg.Ack()
}
return nil
}Amazon SQS: Managed Simplicity
SQS[AWS SQS docs] is fully managed: no brokers, no disks, no ops. Trade-off: less control, weaker guarantees.
Standard vs FIFO: Standard queues scale unlimited but have best-effort ordering. FIFO queues guarantee order within a message group. The default limit is 300 msg/sec per API action (≈3,000/sec with 10-message batching); enabling high-throughput FIFO spreads message groups across partitions for up to 70,000 msg/sec (700,000 batched) in supported regions.
No replay: Messages are deleted after consumption. Lost order events cannot be recovered.
Send and Consume
package main
import (
"context"
"encoding/json"
"github.com/aws/aws-sdk-go-v2/aws"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/sqs"
)
type EmailTask struct {
To string `json:"to"`
Subject string `json:"subject"`
Body string `json:"body"`
}
func sendToSQS(ctx context.Context, queueURL string, task EmailTask) error {
cfg, _ := config.LoadDefaultConfig(ctx)
client := sqs.NewFromConfig(cfg)
body, _ := json.Marshal(task)
_, err := client.SendMessage(ctx, &sqs.SendMessageInput{
QueueUrl: aws.String(queueURL),
MessageBody: aws.String(string(body)),
})
return err
}
func pollSQS(ctx context.Context, queueURL string) error {
cfg, _ := config.LoadDefaultConfig(ctx)
client := sqs.NewFromConfig(cfg)
result, _ := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
QueueUrl: aws.String(queueURL),
MaxNumberOfMessages: 10,
WaitTimeSeconds: 20, // long polling
VisibilityTimeout: 60, // 60s to process
})
for _, msg := range result.Messages {
var task EmailTask
json.Unmarshal([]byte(*msg.Body), &task)
if err := sendEmail(ctx, task); err != nil {
continue // visibility timeout will requeue
}
client.DeleteMessage(ctx, &sqs.DeleteMessageInput{
QueueUrl: aws.String(queueURL),
ReceiptHandle: msg.ReceiptHandle,
})
}
return nil
}Standard queues offer best-effort ordering only. Messages usually arrive in order but are not guaranteed. Use FIFO queues for strict per-group ordering, at the cost of lower throughput.
Head-to-Head Comparison
[Apache Kafka Docs]| Dimension | Kafka | RabbitMQ | NATS JetStream | SQS |
|---|---|---|---|---|
| Model | Log | Broker (AMQP) | Streams | Queue |
| Delivery | At-least-once (exactly-once with idempotent producers & transactions) | At-least-once | At-least-once | At-least-once (Standard); Exactly-once (FIFO) |
| Ordering | Per-partition | Per-queue | Per-stream | Best-effort (Standard); Per-group (FIFO) |
| Throughput | 1M+ msgs/sec | ~50K msgs/sec | Core NATS 1M+; JetStream 50-200K | Unlimited (Standard); FIFO 300/s, up to 70K high-throughput |
| Latency p99 | 20-100ms | 1-50ms+ (varies with load) | 10-30ms (Core); 50-200ms (JetStream) | 20-100ms |
| Replay | Yes | No | Yes | No |
| Retention | Days to months | Until consumed | Configurable | 60s–14 days |
| Operations | Medium (KRaft only, ZooKeeper removed in 4.0) | Medium | Low | None |
| Self-hosted cost (3-node) | $1,500-5,000/mo | $500-1,500/mo | $300-800/mo | N/A |
Decision Framework
[Beyer et al., 2016]graph TD
Start{"Workload shape?"} -->|Event log<br/>need replay| Replay{"Throughput?"}
Start -->|Task queue<br/>fire-and-forget| TQ{"AWS-only?"}
Start -->|Request-reply<br/>low latency| NATS["Core NATS"]
Start -->|Fan-out<br/>notifications| FanOut{"Persistence?"}
Replay -->|>100K msg/s<br/>+ ecosystem| Kafka["Kafka"]
Replay -->|simpler ops<br/>short retention| JetStream["NATS JetStream"]
TQ -->|Yes| SQS["Amazon SQS"]
TQ -->|No / multi-cloud /<br/>complex routing| RMQ["RabbitMQ"]
FanOut -->|Needed| RMQ
FanOut -->|Ephemeral OK| NATS
Step 1: Classify Your Workload
Event Streaming (ordered, replay required): Kafka or NATS JetStream. Choose Kafka if you need ecosystem (Kafka Connect, Schema Registry) or >100K msgs/sec. Choose JetStream for simpler ops and shorter retention windows.
Task Queues (competing consumers, fire-and-forget): RabbitMQ or SQS. Choose RabbitMQ for complex routing or multi-cloud. Choose SQS for AWS-native simplicity.
Service Communication (request-reply, low-latency): Core NATS. Built-in load balancing, sub-millisecond latency. gRPC is better for strongly-typed APIs.
Fan-Out Notifications: RabbitMQ fanout exchange (each subscriber gets its own queue) or Core NATS (ephemeral, millions/sec if loss is acceptable).
Step 2: Check Your Constraints
- Replay required? Eliminate RabbitMQ and SQS.
- Throughput >100K msgs/sec? Kafka or NATS.
- Latency p99
<5ms? RabbitMQ or Core NATS. - Complex routing rules? RabbitMQ only.
- Fully managed, AWS only? SQS.
- Multi-cloud or on-premises? Eliminate SQS.
Step 3: Use Hybrid Architectures
Most production systems use multiple brokers:
- Kafka: Event backbone (all domain events, 7-day retention)
- SQS or RabbitMQ: Task queues (email, PDF generation, webhooks)
- NATS: Internal service communication (request-reply between microservices)
flowchart TD
Events[Domain Events] --> Kafka[Kafka<br/>Event Log]
Kafka --> Analytics[Analytics<br/>Consumer]
Kafka --> OrderConsumer[Order<br/>Consumer]
OrderConsumer --> NATS[NATS<br/>Request-Reply]
Kafka --> Email[Email<br/>Notifier]
Email --> SQS[SQS<br/>Task Queue]
NATS --> Service[Microservice<br/>API]
Production Checklist
- Monitoring: Track consumer lag (Kafka), queue depth (RabbitMQ/SQS), message age. Alert on lag exceeding processing SLO.
- Dead letter queues: Every queue needs a DLQ. Separate consumer logs failure reason and optionally retries.
- Idempotent consumers: At-least-once delivery means duplicates possible. Deduplicate by message ID or make operations idempotent (upsert, not insert).
- Schema evolution: Use schema registry. Message format changes without coordination break consumers.
- Backpressure: Understand your broker's backpressure model (Kafka offset lag, RabbitMQ prefetch, SQS visibility timeout) and configure it.
- Graceful shutdown: Drain in-flight messages before killing consumer. Don't exit with uncommitted offsets or unacknowledged messages.
Local broker harness for integration tests
The fastest feedback loop for queue logic is a docker-compose with the four brokers wired up identically — same network, same ports, same retention defaults — so a single test suite exercises consumer-rebalance, redelivery, and DLQ behaviour without per-broker setup drift:
# docker-compose.brokers.yml — paste alongside your test runner
services:
kafka:
image: confluentinc/cp-kafka:7.6.0
ports: ["9092:9092"]
environment:
KAFKA_NODE_ID: 1
KAFKA_PROCESS_ROLES: broker,controller
KAFKA_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
# Tight retention for tests so disk doesn't bloat across runs.
KAFKA_LOG_RETENTION_HOURS: 1
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
rabbitmq:
image: rabbitmq:3.13-management
ports: ["5672:5672", "15672:15672"]
environment:
RABBITMQ_DEFAULT_USER: test
RABBITMQ_DEFAULT_PASS: test
nats:
image: nats:2.10
command: ["-js", "-sd", "/data"]
ports: ["4222:4222"]
localstack: # SQS-compatible
image: localstack/localstack:3.4
ports: ["4566:4566"]
environment:
SERVICES: sqs
DEFAULT_REGION: us-east-1The other half of the harness — a consumer-rebalance handler that survives the partition-reassignment failure mode every Kafka consumer hits in production: in-flight messages must commit before the partition is revoked, otherwise the next consumer reprocesses them. Most consumer libraries support this via a rebalance listener:
package consumer
import (
"context"
"log/slog"
"github.com/segmentio/kafka-go"
)
type RebalanceAwareConsumer struct {
reader *kafka.Reader
commit func(ctx context.Context, msg kafka.Message) error
inFlight map[int]kafka.Message // partition -> last unprocessed msg
}
func (c *RebalanceAwareConsumer) Run(ctx context.Context) error {
for {
msg, err := c.reader.FetchMessage(ctx)
if err != nil {
return err
}
c.inFlight[msg.Partition] = msg
if err := c.process(ctx, msg); err != nil {
slog.ErrorContext(ctx, "process failed; will retry on redelivery",
"partition", msg.Partition, "offset", msg.Offset, "err", err)
continue
}
// Commit AFTER processing so a crash mid-process is safely replayed.
// The redundant book-keeping in inFlight is for the rebalance hook below.
if err := c.commit(ctx, msg); err != nil {
return err
}
delete(c.inFlight, msg.Partition)
}
}
// OnPartitionsRevoked: drain in-flight messages on revoked partitions
// BEFORE acknowledging the rebalance. Otherwise the next assignee replays
// up to one batch worth of work — usually the source of the "we processed
// this order twice" reports during deploys.
func (c *RebalanceAwareConsumer) OnPartitionsRevoked(ctx context.Context, revoked []int) {
for _, p := range revoked {
if msg, ok := c.inFlight[p]; ok {
if err := c.commit(ctx, msg); err != nil {
slog.ErrorContext(ctx, "could not commit during revocation",
"partition", p, "offset", msg.Offset, "err", err)
}
delete(c.inFlight, p)
}
}
}The two patterns compose: the docker-compose lets you trigger rebalance deterministically by killing one of two consumer instances; the rebalance handler ensures the second instance picks up cleanly. Add both to your CI matrix and the "we processed this order twice during deploys" class of bugs becomes catchable in a unit test.
When None of the Big Four Fit
The four brokers above cover ninety-plus percent of production workloads, but a small set of failure modes pushes teams toward Apache Pulsar, Redpanda, or NATS JetStream at multi-tenant scale. The pattern is consistent: a constraint that Kafka, RabbitMQ, NATS, and SQS each force you to architect around becomes the dominant cost in the system, and a niche broker eliminates it. The three triggers below are the ones that show up most often in incident review docs.
Trigger 1: Geo-Replicated Multi-Region with Tenanted Topics
Kafka MirrorMaker 2 replicates topics across regions but does it asynchronously and forces you to rename topics on the destination cluster (the <source>.<topic> convention). Consumers on the failover region read from a different topic name than producers on the primary region wrote to, which makes failover testing painful and consumer offset translation lossy. Pulsar's[Apache Pulsar docs] geo-replication is built into the broker layer — same topic name, replicated synchronously or asynchronously per namespace, with bookkeepers (the storage layer) and brokers (the serving layer) scaling independently. The split-storage architecture is the core advantage: you scale CPU for fan-out without rebalancing data.
# pulsar-namespace-policy.yaml — replicate the "orders" namespace across us-east, eu-west, ap-southeast
tenant: retail
namespace: orders
replication_clusters:
- us-east-1
- eu-west-1
- ap-southeast-1
retention_policies:
retention_size_in_mb: 524288 # 512 GiB cap per cluster
retention_time_in_minutes: 10080 # 7 days
backlog_quota_map:
destination_storage:
limit_size: 1099511627776 # 1 TiB before producers throttle
policy: producer_request_hold # block producers, not silent drop
schema_compatibility_strategy: BACKWARD_TRANSITIVEThe policy applies once and every topic in the namespace inherits it. There is no equivalent in Kafka — you would manage MirrorMaker config, ACLs, and quota on each topic individually, and a missed line in a config file means the topic falls out of replication silently.
Trigger 2: Sub-10ms Tail Latency Without Tuning Page Cache
Kafka's throughput is excellent but its p99.9 latency is dominated by Linux page-cache flush behaviour, which is non-deterministic under heavy fsync load. Redpanda[Redpanda docs] reimplements the Kafka protocol in C++ on top of the Seastar thread-per-core framework, bypasses the page cache, and pins one thread per CPU core. The result is a flatter latency distribution: p99.9 stays under ten milliseconds at workloads where Kafka's p99.9 climbs into the hundreds.
The trade-off: Redpanda is wire-compatible with Kafka clients but the operational tooling diverges. You give up Kafka Streams (use materialized views or push to a separate processor), Kafka Connect (use the OSS Connect runtime pointed at Redpanda — works but is unsupported), and the deep Confluent ecosystem. For trading systems, ad bidding, and real-time fraud detection, the latency floor is worth it. For a CDC pipeline feeding a data warehouse, it is not.
// Redpanda is Kafka API compatible — the same kafka-go client works,
// you just swap brokers. The latency win shows up in the histograms.
package main
import (
"context"
"time"
"github.com/segmentio/kafka-go"
)
func newRedpandaWriter(brokers []string, topic string) *kafka.Writer {
return &kafka.Writer{
Addr: kafka.TCP(brokers...),
Topic: topic,
Balancer: &kafka.Hash{},
RequiredAcks: kafka.RequireAll,
// Redpanda's WriteTimeout floor is much lower than Kafka's;
// the broker acks faster so the client can fail-fast sooner.
WriteTimeout: 50 * time.Millisecond,
BatchTimeout: 5 * time.Millisecond, // tight batching for low latency
Async: false,
}
}
func publishLowLatency(ctx context.Context, w *kafka.Writer, key, value []byte) error {
deadline, cancel := context.WithTimeout(ctx, 20*time.Millisecond)
defer cancel()
return w.WriteMessages(deadline, kafka.Message{Key: key, Value: value})
}A decent migration smoke test: run the producer above against both clusters with identical traffic, capture the latency histograms, and verify p99.9 stays inside your SLO under sustained 80% capacity. If Kafka clears the bar, do not migrate — the operational divergence costs more than the latency savings buy.
Trigger 3: Hundreds of Tenants with Per-Tenant Streams
NATS JetStream's underrated strength is that a stream is a cheap entity — you can run tens of thousands of streams on a single cluster, each with its own retention, replication factor, and ACL. The same workload on Kafka requires either tens of thousands of topics (which destroys the metadata layer) or topic-per-tenant with namespaced subjects (which makes ACLs untenable). For SaaS backends with a "one stream per customer" model — webhook delivery, audit logs, per-tenant change feeds — JetStream wins on operational simplicity by an order of magnitude.
# Provision a per-tenant stream with bounded retention and ACLs in three commands.
# Run this from your tenant-onboarding job; it is idempotent.
TENANT_ID="$1"
nats stream add "tenant-${TENANT_ID}" \
--subjects="tenants.${TENANT_ID}.events.>" \
--storage=file \
--retention=limits \
--max-age=720h \
--max-bytes=10737418240 \
--replicas=3 \
--discard=old
nats consumer add "tenant-${TENANT_ID}" "tenant-${TENANT_ID}-webhooks" \
--filter="tenants.${TENANT_ID}.events.>" \
--ack=explicit \
--max-deliver=10 \
--wait=30s \
--backoff=linear --backoff-min=1s --backoff-max=5m
nats auth user add "tenant-${TENANT_ID}-publisher" \
--pub="tenants.${TENANT_ID}.events.>" \
--sub="_INBOX.>" \
--account=tenantsThe same provisioning on Kafka — topic creation, ACL update, consumer group quota — requires multiple admin API calls, an ACL service round-trip, and a metadata controller update that rate-limits new topics. At 10k tenants, the difference between a sub-second NATS provision and a multi-second Kafka provision shows up as onboarding latency for paying customers.
Decision Rule
The honest version: pick a niche broker only when the constraint that triggers it is in your top-three production risks. Pulsar for compliant multi-region with tight RPOs. Redpanda for sub-ten-millisecond p99.9 under sustained load. JetStream for tenant-per-stream SaaS topologies. Outside those scenarios, the smaller community and thinner ecosystem will cost more in incident response time than the technical advantage saves in steady state.
Frequently Asked Questions
When should I use Kafka vs RabbitMQ?
Use Kafka for ordered event streaming with replay and consumer groups. Use RabbitMQ for task queues with flexible routing, competing consumers, and no replay requirement.
What is the difference between queue semantics and log semantics?
Queues: one consumer per message, deleted after ack, no replay. Logs: persistent, offset-based, multiple consumer groups, replay by offset reset.
Is Kafka overkill for simple task queues?
Yes. Kafka's operational complexity (KRaft, partitions, rebalancing) justifies only high-throughput workloads or replay requirements. Use RabbitMQ or SQS for simple job dispatch. Note: Kafka 4.0 (March 2025) removed ZooKeeper entirely — KRaft is now the only option for cluster coordination.
Can NATS replace Kafka for event streaming?
NATS JetStream provides log semantics with lower ops overhead than Kafka. Good for moderate throughput; less ecosystem tooling than Kafka. Choose based on retention needs and team ops capacity.
Keep Reading
- Event-Driven Microservices in Go: Kafka, Sagas, and the Outbox Pattern — Kafka implementation with saga orchestration, idempotent consumers, and the outbox pattern
- Microservices Architecture: From Monolith to Production-Ready Services — Where message brokers fit in microservices: service boundaries, communication patterns, and the event backbone
- Scaling Redis for High-Throughput Systems — Redis Streams and Pub/Sub for lightweight patterns when a full broker is overkill
- Idempotency Patterns in Distributed Systems — Every at-least-once broker (Kafka, RabbitMQ, NATS, SQS) demands idempotent consumers; this is the consumer-side rulebook
- Kafka Producer Tuning Cheat Sheet — Once you have picked Kafka, this cheat sheet is the next 30 minutes of work
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.
Read Next
Consistent Hashing: The Algorithm Behind Every Scalable Distributed System
Adding one cache server shouldn't invalidate every key. Consistent hashing with virtual nodes and bounded loads — full Go and Java implementations.
Microservices Architecture: From Monolith to Production-Ready Services
When to decompose a monolith, how to define boundaries, and the patterns that work: API gateways, sagas, and event-driven comms.
Distributed Rate Limiting at Scale: The Probabilistic Drop Architecture
Probabilistic drop rate limiting: uncoordinated enforcement bypassing Redis for 1M+ RPS with zero coordination overhead.