#kafka #rabbitmq #nats #sqs #message-queues #distributed-systems #system-design #go

Kafka vs RabbitMQ vs NATS vs SQS: Choosing the Right Message Broker

BackendBytes Engineering Team

Mar 1, 2026

17 min read

Kafka vs RabbitMQ vs NATS vs SQS: Choosing the Right Message Broker

Part of Series: Microservices Architecture

Lesson 4 of 5

Prev Next

Key Takeaways

→Kafka for ordered event streams with replay; RabbitMQ for flexible task routing with competing consumers; NATS for lightweight service communication; SQS for managed AWS simplicity
→One team spent $14K/month running Kafka for password resets at 2 msg/s — ops complexity (KRaft, rebalancing, partition management) scales with setup, not throughput
→SQS deletes messages immediately after consumption (no replay); Kafka auto-commit loses messages on crash (use manual commit); NATS JetStream stores but requires configuration — match to your failure-recovery model
→Kafka 4.0 (March 2025) removed ZooKeeper entirely — KRaft is now the only option; adds complexity that's only justified for high-throughput streaming

Two teams. Two opposite mistakes.

One ran a six-broker Kafka cluster for password resets at 2 messages/second. $14K/month infrastructure. One SRE whose job description was "keep Kafka alive." Password reset backed up, 6,000 users couldn't log in.

Another used SQS for their order event stream and lost two days of data after a deployment. SQS deletes messages after consumption. They spent a week reconciling inventory from database snapshots.

The broker you choose constrains the guarantees your system can provide. Ordering, replay, delivery semantics — all baked into architecture decisions. This article compares Kafka, RabbitMQ, NATS, and SQS across the dimensions that matter in production: delivery guarantees, ordering semantics, throughput characteristics, operational complexity, and infrastructure cost. Each section includes working Go code and the failure scenarios that documentation often glosses over.

TL;DR

Match the broker to the messaging model your workload requires. Kafka excels at high-throughput ordered event streams with replay; RabbitMQ at flexible task routing; NATS at lightweight service communication; SQS at managed simplicity in AWS. Use the decision framework below to classify your workload first, check constraints, then pick the broker. Most production systems use more than one broker for different workloads.

Event streaming (ordered, replayable): Kafka or NATS JetStream
Task queues (fire-and-forget, competing consumers): RabbitMQ or SQS
Service communication (request-reply, low-latency): Core NATS

The Quick Start: Decision Table

Workload	Best Fit	Why
Ordered event stream, replay required	Kafka	Persistent log, consumer groups, offset seek for replay
Event streaming, simpler ops	NATS JetStream	Log semantics, lower infrastructure overhead
Task queue with competing consumers	RabbitMQ	Flexible routing (exchanges, bindings), built for job dispatch
Task queue in AWS, zero ops	SQS	Fully managed, scales automatically, FIFO for ordering
Service-to-service RPC	NATS	Built-in request-reply, sub-millisecond latency
Fan-out to multiple subscribers	RabbitMQ (fanout) or Core NATS	Fanout exchange (RabbitMQ) or ephemeral pub/sub (NATS)

Apache Kafka: The Distributed Event Log

Kafka^{[Apache Kafka Docs]} is purpose-built for high-throughput, ordered, replayable event streaming. It's overkill for task queues but unmatched for audit logs, CDC pipelines, and event sourcing.

Architecture: Clusters of brokers, data organized into topics (partitioned), each partition is an ordered append-only log. Producers write to partitions (by key for ordering or round-robin). Consumers track offsets per consumer group — enabling multiple groups to read the same data independently.

Why ordering matters: All events for the same order go to the same partition (partition key = order ID). A single consumer processes the partition sequentially, guaranteeing per-entity ordering. Different partitions process in parallel across multiple consumer instances. This is critical for workflows where the sequence matters — payment, then shipment, then delivery.

Why replay works: Consumer offsets are just integers — positions in each partition. Reset the offset to 0 and reprocess all events since the beginning. Reset to yesterday's offset and replay the last 24 hours. No other broker provides this flexibility.

Producer: Idempotent Writes

package main
 
import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"time"
 
	"github.com/segmentio/kafka-go"
)
 
type OrderEvent struct {
	OrderID   string    `json:"order_id"`
	Amount    int64     `json:"amount_cents"`
	Timestamp time.Time `json:"timestamp"`
}
 
func publishOrderEvent(ctx context.Context, w *kafka.Writer, event OrderEvent) error {
	payload, err := json.Marshal(event)
	if err != nil {
		return fmt.Errorf("marshal event: %w", err)
	}
	msg := kafka.Message{
		Key:   []byte(event.OrderID), // same order → same partition → ordered
		Value: payload,
	}
	return w.WriteMessages(ctx, msg)
}
 
func newKafkaWriter(brokers []string, topic string) *kafka.Writer {
	return &kafka.Writer{
		Addr:         kafka.TCP(brokers...),
		Topic:        topic,
		Balancer:     &kafka.Hash{},     // partition by key
		RequiredAcks: kafka.RequireAll,  // wait for all ISR replicas
		MaxAttempts:  5,
		Async:        false, // synchronous: fail immediately on error
	}
}

RequiredAcks = All: Leader waits for all in-sync replicas to acknowledge before returning. Only way to guarantee no loss if leader dies immediately after.

Consumer: Manual Offset Commit

func consumeOrders(ctx context.Context, brokers []string, topic, groupID string) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers:        brokers,
		Topic:          topic,
		GroupID:        groupID,
		CommitInterval: 0, // disable auto-commit — commit manually
		StartOffset:    kafka.LastOffset,
	})
	defer r.Close()
 
	for {
		msg, err := r.FetchMessage(ctx) // does NOT auto-commit
		if err != nil {
			return fmt.Errorf("fetch message: %w", err) // ctx cancelled or reader closed
		}
 
		if err := processOrder(ctx, msg); err != nil {
			log.Printf("process failed: %v — will retry", err)
			continue // do NOT commit
		}
 
		// Commit only after successful processing
		if err := r.CommitMessages(ctx, msg); err != nil {
			return fmt.Errorf("commit offset: %w", err)
		}
	}
}

Auto-Commit Loses Messages

Auto-commit (default in many clients) commits offsets on a timer. If your consumer crashes between auto-commit and finishing processing, that message is lost. Disable auto-commit for any workload where losing data is unacceptable.

RabbitMQ: Flexible Message Routing

RabbitMQ^{[RabbitMQ docs]} is built on AMQP^{[AMQP 0-9-1]}. Its strength is sophisticated routing: producers publish to exchanges, exchanges route to queues via bindings. This decouples producers from consumers.

Architecture: Exchanges (routing rules), bindings (how messages get routed), queues (final destination). Four exchange types: direct (exact key match), topic (wildcard patterns), fanout (broadcast to all), headers (match message attributes).

Why it excels at task queues: Competing consumers. Multiple workers consume from the same queue. Each message goes to one worker. No consumer groups, no rebalancing slowness.

Publisher with Confirms

package main
 
import (
	"context"
	"encoding/json"
	"fmt"
	"time"
 
	amqp "github.com/rabbitmq/amqp091-go"
)
 
type TaskPayload struct {
	TaskID  string `json:"task_id"`
	Type    string `json:"type"`
	Payload []byte `json:"payload"`
}
 
func publishTask(ch *amqp.Channel, exchange, routingKey string, task TaskPayload) error {
	// Enable publisher confirms
	if err := ch.Confirm(false); err != nil {
		return fmt.Errorf("enable confirms: %w", err)
	}
 
	body, err := json.Marshal(task)
	if err != nil {
		return fmt.Errorf("marshal task: %w", err)
	}
 
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
 
	confirmation, err := ch.PublishWithDeferredConfirmWithContext(
		ctx, exchange, routingKey, true, false,
		amqp.Publishing{
			DeliveryMode: amqp.Persistent, // survive broker restart
			ContentType:  "application/json",
			MessageId:    task.TaskID,
			Body:         body,
		},
	)
	if err != nil {
		return fmt.Errorf("publish: %w", err) // else confirmation is nil → Wait() panics
	}
 
	return confirmation.Wait() // wait for broker ack
}

Consumer with Ack

func consumeFromQueue(ch *amqp.Channel, queue string) error {
	msgs, err := ch.Consume(
		queue, "", false, false, false, false, // autoAck=false: manual ack only
	)
	if err != nil {
		return fmt.Errorf("consume: %w", err)
	}
 
	for msg := range msgs {
		if err := processTask(msg.Body); err != nil {
			// Negative ack: message goes back to queue
			msg.Nack(false, true)
			continue
		}
		// Positive ack: message removed
		msg.Ack(false)
	}
	return nil
}

NATS: Lightweight and Fast

NATS^{[NATS docs]} is a minimal pub/sub system. Core NATS is ephemeral (no persistence), making it ideal for low-latency service communication. JetStream adds persistence with simpler ops than Kafka.

Core NATS: Sub-millisecond latency, no persistence. Perfect for synchronous RPC and telemetry.

JetStream: Adds persistent streams (log semantics) and durable consumers (like Kafka consumer groups) without the operational overhead.

Core NATS: Request-Reply

package main
 
import (
	"log"
	"time"
 
	"github.com/nats-io/nats.go"
)
 
func rpcExample(nc *nats.Conn) error {
	// Service: subscribe and respond
	nc.Subscribe("api.users.get", func(msg *nats.Msg) {
		response := `{"user_id": "42", "name": "Alice"}`
		msg.Respond([]byte(response))
	})
 
	// Client: request with timeout
	reply, err := nc.Request("api.users.get", []byte("42"), 2*time.Second)
	if err != nil {
		return err // timeout or no responder
	}
 
	log.Printf("response: %s", string(reply.Data))
	return nil
}

Built-in load balancing: multiple subscribers to the same subject means each request goes to one of N responders.

JetStream: Persistent Streams

func jetStreamPublish(js nats.JetStreamContext, subject string) error {
	_, err := js.Publish(subject, []byte(`{"order_id":"123"}`))
	return err
}
 
func jetStreamConsume(js nats.JetStreamContext) error {
	// Create stream
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
		Storage:  nats.FileStorage,
		MaxAge:   7 * 24 * time.Hour,
	}); err != nil {
		return fmt.Errorf("add stream: %w", err)
	}
 
	// Create durable consumer
	if _, err := js.AddConsumer("ORDERS", &nats.ConsumerConfig{
		Durable:       "order-processor",
		AckPolicy:     nats.AckExplicitPolicy,
		AckWait:       30 * time.Second,
		FilterSubject: "orders.created",
	}); err != nil {
		return fmt.Errorf("add consumer: %w", err)
	}
 
	// Consume
	sub, err := js.PullSubscribe("orders.created", "order-processor")
	if err != nil {
		return fmt.Errorf("subscribe: %w", err)
	}
	msgs, err := sub.Fetch(10)
	if err != nil {
		return fmt.Errorf("fetch: %w", err)
	}
	for _, msg := range msgs {
		if err := msg.Ack(); err != nil {
			log.Printf("ack failed for msg: %v", err)
		}
	}
	return nil
}

Amazon SQS: Managed Simplicity

SQS^{[AWS SQS docs]} is fully managed: no brokers, no disks, no ops. Trade-off: less control, weaker guarantees.

Standard vs FIFO: Standard queues scale unlimited but have best-effort ordering. FIFO queues guarantee order within a message group. The default limit is 300 msg/sec per API action (≈3,000/sec with 10-message batching); enabling high-throughput FIFO spreads message groups across partitions for up to 70,000 msg/sec (700,000 batched) in supported regions.

No replay: Messages are deleted after consumption. Lost order events cannot be recovered.

Send and Consume

package main
 
import (
	"context"
	"encoding/json"
	"fmt"
 
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)
 
type EmailTask struct {
	To      string `json:"to"`
	Subject string `json:"subject"`
	Body    string `json:"body"`
}
 
func sendToSQS(ctx context.Context, queueURL string, task EmailTask) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return fmt.Errorf("load AWS config: %w", err)
	}
	client := sqs.NewFromConfig(cfg)
 
	body, err := json.Marshal(task)
	if err != nil {
		return fmt.Errorf("marshal task: %w", err)
	}
	_, err = client.SendMessage(ctx, &sqs.SendMessageInput{
		QueueUrl:    aws.String(queueURL),
		MessageBody: aws.String(string(body)),
	})
	return err
}
 
func pollSQS(ctx context.Context, queueURL string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return fmt.Errorf("load AWS config: %w", err)
	}
	client := sqs.NewFromConfig(cfg)
 
	result, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
		QueueUrl:            aws.String(queueURL),
		MaxNumberOfMessages: 10,
		WaitTimeSeconds:     20, // long polling
		VisibilityTimeout:   60, // 60s to process
	})
	if err != nil {
		return fmt.Errorf("receive: %w", err) // else result is nil → panic below
	}
 
	for _, msg := range result.Messages {
		var task EmailTask
		json.Unmarshal([]byte(*msg.Body), &task)
 
		if err := sendEmail(ctx, task); err != nil {
			continue // visibility timeout will requeue
		}
 
		client.DeleteMessage(ctx, &sqs.DeleteMessageInput{
			QueueUrl:      aws.String(queueURL),
			ReceiptHandle: msg.ReceiptHandle,
		})
	}
	return nil
}

Standard Queue Ordering

Standard queues offer best-effort ordering only. Messages usually arrive in order but are not guaranteed. Use FIFO queues for strict per-group ordering, at the cost of lower throughput.

Head-to-Head Comparison

^{[Apache Kafka Docs]}

Dimension	Kafka	RabbitMQ	NATS JetStream	SQS
Model	Log	Broker (AMQP)	Streams	Queue
Delivery	At-least-once (exactly-once with idempotent producers & transactions)	At-least-once	At-least-once	At-least-once (Standard); Exactly-once (FIFO)
Ordering	Per-partition	Per-queue	Per-stream	Best-effort (Standard); Per-group (FIFO)
Throughput	1M+ msgs/sec	~50K msgs/sec	Core NATS 1M+; JetStream 50-200K	Unlimited (Standard); FIFO 300/s, up to 70K high-throughput
Latency p99	20-100ms	1-50ms+ (varies with load)	10-30ms (Core); 50-200ms (JetStream)	20-100ms
Replay	Yes	No	Yes	No
Retention	Days to months	Until consumed	Configurable	60s–14 days
Operations	Medium (KRaft only, ZooKeeper removed in 4.0)	Medium	Low	None
Self-hosted cost (3-node)	$1,500-5,000/mo	$500-1,500/mo	$300-800/mo	N/A

Decision Framework

^{[Beyer et al., 2016]}

Interactive Broker Decision Finder

Step 1 of 2

What is your primary architectural workload shape?

graph TD
    Start{"Workload shape?"} -->|Event log<br/>need replay| Replay{"Throughput?"}
    Start -->|Task queue<br/>fire-and-forget| TQ{"AWS-only?"}
    Start -->|Request-reply<br/>low latency| NATS["Core NATS"]
    Start -->|Fan-out<br/>notifications| FanOut{"Persistence?"}

    Replay -->|>100K msg/s<br/>+ ecosystem| Kafka["Kafka"]
    Replay -->|simpler ops<br/>short retention| JetStream["NATS JetStream"]

    TQ -->|Yes| SQS["Amazon SQS"]
    TQ -->|No / multi-cloud /<br/>complex routing| RMQ["RabbitMQ"]

    FanOut -->|Needed| RMQ
    FanOut -->|Ephemeral OK| NATS

Step 1: Classify Your Workload

Event Streaming (ordered, replay required): Kafka or NATS JetStream. Choose Kafka if you need ecosystem (Kafka Connect, Schema Registry) or >100K msgs/sec. Choose JetStream for simpler ops and shorter retention windows.

Task Queues (competing consumers, fire-and-forget): RabbitMQ or SQS. Choose RabbitMQ for complex routing or multi-cloud. Choose SQS for AWS-native simplicity.

Service Communication (request-reply, low-latency): Core NATS. Built-in load balancing, sub-millisecond latency. gRPC is better for strongly-typed APIs.

Fan-Out Notifications: RabbitMQ fanout exchange (each subscriber gets its own queue) or Core NATS (ephemeral, millions/sec if loss is acceptable).

Step 2: Check Your Constraints

Replay required? Eliminate RabbitMQ and SQS.
Throughput >100K msgs/sec? Kafka or NATS.
Latency p99 <5ms? RabbitMQ or Core NATS.
Complex routing rules? RabbitMQ only.
Fully managed, AWS only? SQS.
Multi-cloud or on-premises? Eliminate SQS.

Step 3: Use Hybrid Architectures

Most production systems use multiple brokers:

Kafka: Event backbone (all domain events, 7-day retention)
SQS or RabbitMQ: Task queues (email, PDF generation, webhooks)
NATS: Internal service communication (request-reply between microservices)

flowchart TD
    Events[Domain Events] --> Kafka[Kafka<br/>Event Log]
    Kafka --> Analytics[Analytics<br/>Consumer]
    Kafka --> OrderConsumer[Order<br/>Consumer]
    OrderConsumer --> NATS[NATS<br/>Request-Reply]
    Kafka --> Email[Email<br/>Notifier]
    Email --> SQS[SQS<br/>Task Queue]
    NATS --> Service[Microservice<br/>API]

Production Checklist

Monitoring: Track consumer lag (Kafka), queue depth (RabbitMQ/SQS), message age. Alert on lag exceeding processing SLO.
Dead letter queues: Every queue needs a DLQ. Separate consumer logs failure reason and optionally retries.
Idempotent consumers: At-least-once delivery means duplicates possible. Deduplicate by message ID or make operations idempotent (upsert, not insert).
Schema evolution: Use schema registry. Message format changes without coordination break consumers.
Backpressure: Understand your broker's backpressure model (Kafka offset lag, RabbitMQ prefetch, SQS visibility timeout) and configure it.
Graceful shutdown: Drain in-flight messages before killing consumer. Don't exit with uncommitted offsets or unacknowledged messages.

Local broker harness for integration tests

The fastest feedback loop for queue logic is a docker-compose with the four brokers wired up identically — same network, same ports, same retention defaults — so a single test suite exercises consumer-rebalance, redelivery, and DLQ behaviour without per-broker setup drift:

# docker-compose.brokers.yml — paste alongside your test runner
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports: ["9092:9092"]
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
      # Tight retention for tests so disk doesn't bloat across runs.
      KAFKA_LOG_RETENTION_HOURS: 1
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
 
  rabbitmq:
    image: rabbitmq:3.13-management
    ports: ["5672:5672", "15672:15672"]
    environment:
      RABBITMQ_DEFAULT_USER: test
      RABBITMQ_DEFAULT_PASS: test
 
  nats:
    image: nats:2.10
    command: ["-js", "-sd", "/data"]
    ports: ["4222:4222"]
 
  localstack:                         # SQS-compatible
    image: localstack/localstack:3.4
    ports: ["4566:4566"]
    environment:
      SERVICES: sqs
      DEFAULT_REGION: us-east-1

The other half of the harness — a consumer-rebalance handler that survives the partition-reassignment failure mode every Kafka consumer hits in production: in-flight messages must commit before the partition is revoked, otherwise the next consumer reprocesses them. Most consumer libraries support this via a rebalance listener:

package consumer
 
import (
	"context"
	"log/slog"
 
	"github.com/segmentio/kafka-go"
)
 
type RebalanceAwareConsumer struct {
	reader   *kafka.Reader
	commit   func(ctx context.Context, msg kafka.Message) error
	inFlight map[int]kafka.Message // partition -> last unprocessed msg
}
 
func (c *RebalanceAwareConsumer) Run(ctx context.Context) error {
	for {
		msg, err := c.reader.FetchMessage(ctx)
		if err != nil {
			return err
		}
		c.inFlight[msg.Partition] = msg
 
		if err := c.process(ctx, msg); err != nil {
			slog.ErrorContext(ctx, "process failed; will retry on redelivery",
				"partition", msg.Partition, "offset", msg.Offset, "err", err)
			continue
		}
 
		// Commit AFTER processing so a crash mid-process is safely replayed.
		// The redundant book-keeping in inFlight is for the rebalance hook below.
		if err := c.commit(ctx, msg); err != nil {
			return err
		}
		delete(c.inFlight, msg.Partition)
	}
}
 
// OnPartitionsRevoked: drain in-flight messages on revoked partitions
// BEFORE acknowledging the rebalance. Otherwise the next assignee replays
// up to one batch worth of work — usually the source of the "we processed
// this order twice" reports during deploys.
func (c *RebalanceAwareConsumer) OnPartitionsRevoked(ctx context.Context, revoked []int) {
	for _, p := range revoked {
		if msg, ok := c.inFlight[p]; ok {
			if err := c.commit(ctx, msg); err != nil {
				slog.ErrorContext(ctx, "could not commit during revocation",
					"partition", p, "offset", msg.Offset, "err", err)
			}
			delete(c.inFlight, p)
		}
	}
}

The two patterns compose: the docker-compose lets you trigger rebalance deterministically by killing one of two consumer instances; the rebalance handler ensures the second instance picks up cleanly. Add both to your CI matrix and the "we processed this order twice during deploys" class of bugs becomes catchable in a unit test.

When None of the Big Four Fit

The four brokers above cover ninety-plus percent of production workloads, but a small set of failure modes pushes teams toward Apache Pulsar, Redpanda, or NATS JetStream at multi-tenant scale. The pattern is consistent: a constraint that Kafka, RabbitMQ, NATS, and SQS each force you to architect around becomes the dominant cost in the system, and a niche broker eliminates it. The three triggers below are the ones that show up most often in incident review docs.

Trigger 1: Geo-Replicated Multi-Region with Tenanted Topics

Kafka MirrorMaker 2 replicates topics across regions but does it asynchronously and forces you to rename topics on the destination cluster (the <source>.<topic> convention). Consumers on the failover region read from a different topic name than producers on the primary region wrote to, which makes failover testing painful and consumer offset translation lossy. Pulsar's^{[Apache Pulsar docs]} geo-replication is built into the broker layer — same topic name, replicated synchronously or asynchronously per namespace, with bookkeepers (the storage layer) and brokers (the serving layer) scaling independently. The split-storage architecture is the core advantage: you scale CPU for fan-out without rebalancing data.

# pulsar-namespace-policy.yaml — replicate the "orders" namespace across us-east, eu-west, ap-southeast
tenant: retail
namespace: orders
replication_clusters:
  - us-east-1
  - eu-west-1
  - ap-southeast-1
retention_policies:
  retention_size_in_mb: 524288     # 512 GiB cap per cluster
  retention_time_in_minutes: 10080 # 7 days
backlog_quota_map:
  destination_storage:
    limit_size: 1099511627776       # 1 TiB before producers throttle
    policy: producer_request_hold   # block producers, not silent drop
schema_compatibility_strategy: BACKWARD_TRANSITIVE

The policy applies once and every topic in the namespace inherits it. There is no equivalent in Kafka — you would manage MirrorMaker config, ACLs, and quota on each topic individually, and a missed line in a config file means the topic falls out of replication silently.

Trigger 2: Sub-10ms Tail Latency Without Tuning Page Cache

Kafka's throughput is excellent but its p99.9 latency is dominated by Linux page-cache flush behaviour, which is non-deterministic under heavy fsync load. Redpanda^{[Redpanda docs]} reimplements the Kafka protocol in C++ on top of the Seastar thread-per-core framework, bypasses the page cache, and pins one thread per CPU core. The result is a flatter latency distribution: p99.9 stays under ten milliseconds at workloads where Kafka's p99.9 climbs into the hundreds.

The trade-off: Redpanda is wire-compatible with Kafka clients but the operational tooling diverges. You give up Kafka Streams (use materialized views or push to a separate processor), Kafka Connect (use the OSS Connect runtime pointed at Redpanda — works but is unsupported), and the deep Confluent ecosystem. For trading systems, ad bidding, and real-time fraud detection, the latency floor is worth it. For a CDC pipeline feeding a data warehouse, it is not.

// Redpanda is Kafka API compatible — the same kafka-go client works,
// you just swap brokers. The latency win shows up in the histograms.
package main
 
import (
	"context"
	"time"
 
	"github.com/segmentio/kafka-go"
)
 
func newRedpandaWriter(brokers []string, topic string) *kafka.Writer {
	return &kafka.Writer{
		Addr:         kafka.TCP(brokers...),
		Topic:        topic,
		Balancer:     &kafka.Hash{},
		RequiredAcks: kafka.RequireAll,
		// Redpanda's WriteTimeout floor is much lower than Kafka's;
		// the broker acks faster so the client can fail-fast sooner.
		WriteTimeout: 50 * time.Millisecond,
		BatchTimeout: 5 * time.Millisecond, // tight batching for low latency
		Async:        false,
	}
}
 
func publishLowLatency(ctx context.Context, w *kafka.Writer, key, value []byte) error {
	deadline, cancel := context.WithTimeout(ctx, 20*time.Millisecond)
	defer cancel()
	return w.WriteMessages(deadline, kafka.Message{Key: key, Value: value})
}

A decent migration smoke test: run the producer above against both clusters with identical traffic, capture the latency histograms, and verify p99.9 stays inside your SLO under sustained 80% capacity. If Kafka clears the bar, do not migrate — the operational divergence costs more than the latency savings buy.

Trigger 3: Hundreds of Tenants with Per-Tenant Streams

NATS JetStream's underrated strength is that a stream is a cheap entity — you can run tens of thousands of streams on a single cluster, each with its own retention, replication factor, and ACL. The same workload on Kafka requires either tens of thousands of topics (which destroys the metadata layer) or topic-per-tenant with namespaced subjects (which makes ACLs untenable). For SaaS backends with a "one stream per customer" model — webhook delivery, audit logs, per-tenant change feeds — JetStream wins on operational simplicity by an order of magnitude.

# Provision a per-tenant stream with bounded retention and ACLs in three commands.
# Run this from your tenant-onboarding job; it is idempotent.
TENANT_ID="$1"
 
nats stream add "tenant-${TENANT_ID}" \
  --subjects="tenants.${TENANT_ID}.events.>" \
  --storage=file \
  --retention=limits \
  --max-age=720h \
  --max-bytes=10737418240 \
  --replicas=3 \
  --discard=old
 
nats consumer add "tenant-${TENANT_ID}" "tenant-${TENANT_ID}-webhooks" \
  --filter="tenants.${TENANT_ID}.events.>" \
  --ack=explicit \
  --max-deliver=10 \
  --wait=30s \
  --backoff=linear --backoff-min=1s --backoff-max=5m
 
nats auth user add "tenant-${TENANT_ID}-publisher" \
  --pub="tenants.${TENANT_ID}.events.>" \
  --sub="_INBOX.>" \
  --account=tenants

The same provisioning on Kafka — topic creation, ACL update, consumer group quota — requires multiple admin API calls, an ACL service round-trip, and a metadata controller update that rate-limits new topics. At 10k tenants, the difference between a sub-second NATS provision and a multi-second Kafka provision shows up as onboarding latency for paying customers.

Decision Rule

The honest version: pick a niche broker only when the constraint that triggers it is in your top-three production risks. Pulsar for compliant multi-region with tight RPOs. Redpanda for sub-ten-millisecond p99.9 under sustained load. JetStream for tenant-per-stream SaaS topologies. Outside those scenarios, the smaller community and thinner ecosystem will cost more in incident response time than the technical advantage saves in steady state.

Frequently Asked Questions

When should I use Kafka vs RabbitMQ?

Use Kafka for ordered event streaming with replay and consumer groups. Use RabbitMQ for task queues with flexible routing, competing consumers, and no replay requirement.

What is the difference between queue semantics and log semantics?

Queues: one consumer per message, deleted after ack, no replay. Logs: persistent, offset-based, multiple consumer groups, replay by offset reset.

Is Kafka overkill for simple task queues?

Yes. Kafka's operational complexity (KRaft, partitions, rebalancing) justifies only high-throughput workloads or replay requirements. Use RabbitMQ or SQS for simple job dispatch. Note: Kafka 4.0 (March 2025) removed ZooKeeper entirely — KRaft is now the only option for cluster coordination.

Can NATS replace Kafka for event streaming?

NATS JetStream provides log semantics with lower ops overhead than Kafka. Good for moderate throughput; less ecosystem tooling than Kafka. Choose based on retention needs and team ops capacity.

Keep Reading

Event-Driven Microservices in Go: Kafka, Sagas, and the Outbox Pattern — Kafka implementation with saga orchestration, idempotent consumers, and the outbox pattern
Microservices Architecture: From Monolith to Production-Ready Services — Where message brokers fit in microservices: service boundaries, communication patterns, and the event backbone
Scaling Redis for High-Throughput Systems — Redis Streams and Pub/Sub for lightweight patterns when a full broker is overkill
Idempotency Patterns in Distributed Systems — Every at-least-once broker (Kafka, RabbitMQ, NATS, SQS) demands idempotent consumers; this is the consumer-side rulebook
Kafka Producer Tuning Cheat Sheet — Once you have picked Kafka, this cheat sheet is the next 30 minutes of work

Coming Next

Coming Next: Event-Driven Microservices in Go: Kafka, Sagas, and the Outbox Pattern

Choosing the broker is only half the battle. Implementing it in code without losing events or running into distributed consistency issues is where the real work begins. In our next deep dive, we build a transactional outbox publisher and idempotent consumer in Go using Kafka. Read the Event-Driven Microservices Guide.

Was this article helpful?

Your feedback directly shapes our editorial depth and technical accuracy.