Event-Driven Microservices in Go: Kafka, Sagas, and the Outbox Pattern
Key Takeaways
- →Dual-write bugs surface as phantom orders — the event publishes to Kafka, the DB write fails, downstream systems (inventory, billing, fulfilment) act on an order that doesn't exist. The outbox pattern is the only durable fix.
- →The outbox pattern solves atomicity by writing events to the database in the same transaction as business data; a background relay then polls and publishes to Kafka
- →Idempotent consumers are non-negotiable — at-least-once delivery guarantees duplicates, so track processed event IDs with unique constraints to safely replay
- →Sagas orchestrate multi-service transactions with compensating transactions that undo partial failures automatically, replacing two-phase commit which doesn't scale across services
The classic dual-write production incident. An order service publishes a Kafka event, then updates PostgreSQL. When the database write fails after the event has already been published, a phantom order propagates through downstream systems — inventory decremented, fulfilment notified, payment charged — for an order that does not exist in the database. Customers get charged for nothing. Root-cause tracing is painful because the event trail looks correct; the database is the one that disagrees. We debugged variants of this on multiple production Kafka-backed services[Kreps et al., 2011].
The dual-write problem — atomically updating a database and publishing events to separate systems — is solved by the outbox pattern (write events to the database in the same transaction)[Transactional outbox], idempotent consumers (safe to replay)[Akkoyunlu et al., 1975], and saga orchestration (distributed transactions with automatic compensation). All three are essential; none are optional for high-value systems.
- Write events to the database in a single transaction using the outbox pattern; a background relay publishes them to Kafka
- Make consumers idempotent by tracking processed event IDs in the database; at-least-once delivery guarantees you'll see duplicates
- Coordinate multi-service transactions with sagas; compensating transactions automatically undo partial failures
Architecture decision: outbox vs CDC vs event sourcing
[Transactional outbox]Three patterns solve the dual-write problem. Each trades operational complexity against throughput and consistency guarantees.
| Pattern | Throughput ceiling | Operational complexity | When to use | When NOT to use |
|---|---|---|---|---|
| Outbox | ~5-10k events/sec per relay | Low — one new table, one polling job | 99% of services; you already control the writes | When you cannot change application code |
| CDC (Debezium) | 50k+ events/sec | High — Kafka Connect, connector configs, WAL retention tuning | When outbox throughput is the bottleneck OR you cannot modify the writing application | When the operational team cannot run Kafka Connect |
| Event Sourcing | Limited by event store, not Kafka | Very high — rebuilds, snapshots, projections, schema evolution | Audit-heavy domains (banking, compliance, regulated finance) | Most CRUD apps; query patterns rule it out |
| Direct dual-write (anti-pattern) | High but unsafe | Low | Never — this is the bug pattern this article exists to fix | Always |
The outbox pattern (this article) writes events to a database table in the same transaction as your business data, then a background polling process publishes them to Kafka. It requires no new infrastructure—just a new table and a simple background job. The throughput ceiling is around 5,000–10,000 events/sec per relay instance due to polling overhead. This is the right choice for most services: simpler to operate, easier to debug, and sufficient for the vast majority of production workloads.
Change Data Capture (CDC) with Debezium streams database changes directly from the PostgreSQL write-ahead log to Kafka, bypassing the application layer entirely. The application never writes to an outbox table; Debezium reads the WAL and streams changes automatically. This eliminates polling overhead and can handle 50,000+ events/sec. The trade-off is operational: you now run Kafka Connect with Debezium connectors, manage connector configurations, and ensure WAL retention is sufficient. Use CDC when you exceed the outbox ceiling and the application is already event-driven enough that "no application code changes" is an acceptable cost.
Event Sourcing makes the database itself the event log. Every state change appends an immutable event record rather than updating rows in place. Querying requires replaying events to rebuild current state. This is powerful in audit-heavy domains (banking, compliance) where the full history must always be queryable, but it's overkill for most systems and complicates query patterns significantly. Only use event sourcing if you have legal or business requirements for complete audit trails. [Transactional outbox]
The outbox pattern is the most common choice: it's simple, requires no additional infrastructure, handles 99% of production workloads, and keeps debugging and operation straightforward. [Transactional outbox]
Building the outbox pattern in Go
The outbox pattern solves the dual-write problem by writing both your business data and an event record in the same database transaction. A background relay then polls the outbox table and publishes events to Kafka. This guarantees atomicity: either both the data and event are committed, or neither are.
graph LR
App["Application"] -->|"1. BEGIN TX"| DB[(PostgreSQL)]
App -->|"2. INSERT order"| DB
App -->|"3. INSERT outbox_event"| DB
App -->|"4. COMMIT"| DB
Relay["Outbox Relay<br/>(background goroutine)"] -->|"5. Poll pending events"| DB
Relay -->|"6. Publish"| Kafka["Kafka Topic"]
Relay -->|"7. Mark published"| DB
Outbox schema:
CREATE TABLE outbox_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type VARCHAR(255) NOT NULL,
aggregate_id VARCHAR(255) NOT NULL,
event_type VARCHAR(255) NOT NULL,
payload JSONB NOT NULL,
status VARCHAR(50) DEFAULT 'PENDING',
created_at TIMESTAMPTZ DEFAULT NOW(),
published_at TIMESTAMPTZ,
retry_count INT DEFAULT 0
);
CREATE INDEX idx_outbox_pending ON outbox_events(status, created_at) WHERE status = 'PENDING';Application writes events and data in one transaction:
// OrderRepository handles both order creation and outbox writing
// in a single database transaction.
type OrderRepository struct {
db *pgx.Conn
}
type OutboxEvent struct {
AggregateType string
AggregateID string
EventType string
Payload any
}
func (r *OrderRepository) CreateOrder(
ctx context.Context,
order *Order,
) error {
tx, err := r.db.Begin(ctx)
if err != nil {
return fmt.Errorf("begin transaction: %w", err)
}
defer tx.Rollback(ctx) // no-op if already committed
// 1. Insert the order
_, err = tx.Exec(ctx, `
INSERT INTO orders (id, customer_id, status, total_amount, created_at)
VALUES ($1, $2, $3, $4, $5)
`, order.ID, order.CustomerID, order.Status, order.TotalAmount, order.CreatedAt)
if err != nil {
return fmt.Errorf("insert order: %w", err)
}
// 2. Write the event to the outbox — same transaction
payload, err := json.Marshal(OrderCreatedEvent{
OrderID: order.ID,
CustomerID: order.CustomerID,
Items: order.Items,
TotalAmount: order.TotalAmount,
CreatedAt: order.CreatedAt,
})
if err != nil {
return fmt.Errorf("marshal event payload: %w", err)
}
_, err = tx.Exec(ctx, `
INSERT INTO outbox_events
(aggregate_type, aggregate_id, event_type, payload)
VALUES ($1, $2, $3, $4)
`, "order", order.ID.String(), "order.created", payload)
if err != nil {
return fmt.Errorf("insert outbox event: %w", err)
}
// 3. Commit both together — or neither
if err := tx.Commit(ctx); err != nil {
return fmt.Errorf("commit transaction: %w", err)
}
return nil
}Background relay publishes pending events:
type OutboxRelay struct {
db *pgxpool.Pool
producer *kafka.Writer
logger *slog.Logger
}
func (r *OutboxRelay) Run(ctx context.Context) error {
ticker := time.NewTicker(500 * time.Millisecond)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
if err := r.processBatch(ctx); err != nil {
r.logger.Error("outbox relay batch failed", "error", err)
// Continue — don't exit the relay on transient errors
}
}
}
}
func (r *OutboxRelay) processBatch(ctx context.Context) error {
// Lock rows we're about to process (SKIP LOCKED is important for
// horizontal scaling — multiple relay instances won't fight over rows)
rows, err := r.db.Query(ctx, `
SELECT id, aggregate_type, aggregate_id, event_type, payload
FROM outbox_events
WHERE status = 'PENDING'
AND retry_count < 5
ORDER BY created_at
LIMIT 100
FOR UPDATE SKIP LOCKED
`)
if err != nil {
return fmt.Errorf("query outbox: %w", err)
}
defer rows.Close()
type pendingEvent struct {
ID uuid.UUID
AggregateType string
AggregateID string
EventType string
Payload json.RawMessage
}
var events []pendingEvent
for rows.Next() {
var e pendingEvent
if err := rows.Scan(&e.ID, &e.AggregateType, &e.AggregateID, &e.EventType, &e.Payload); err != nil {
return fmt.Errorf("scan row: %w", err)
}
events = append(events, e)
}
if len(events) == 0 {
return nil
}
messages := make([]kafka.Message, len(events))
for i, e := range events {
messages[i] = kafka.Message{
Topic: topicForEventType(e.EventType),
Key: []byte(e.AggregateID), // Same key → same partition → ordered delivery
Value: e.Payload,
// Carry the unique outbox row ID so consumers can dedup on it.
// This is the event's own ID, NOT the aggregate ID (partition key).
Headers: []kafka.Header{
{Key: "outbox-id", Value: []byte(e.ID.String())},
},
}
}
if err := r.producer.WriteMessages(ctx, messages...); err != nil {
// Mark for retry; the relay will pick up this batch again on the next cycle
return fmt.Errorf("write to kafka: %w", err)
}
_, err = r.db.Exec(ctx, `
UPDATE outbox_events SET status = 'PUBLISHED', published_at = NOW() WHERE id = ANY($1)
`, ids(events))
return err
}Use aggregate ID (e.g., order_id) as the message key to ensure all events for one entity land on the same partition
and stay ordered. Use cooperative rebalancing so scaling consumers doesn't pause all processing. Use static membership
with InstanceID in Kubernetes to avoid rebalances during rolling restarts.
Consuming events reliably: idempotency and resilience
[Stripe idempotency]With at-least-once delivery, you will see duplicate events. Your consumers must be idempotent: processing the same event twice must produce the same result as processing it once.
The consumer-side flow makes the discipline explicit:
graph TD
Msg[Kafka message arrives] --> Extract[Extract outbox-id<br/>from headers]
Extract --> Check{processed_events<br/>has this id?}
Check -->|Yes — duplicate| Skip[Commit offset<br/>do nothing]
Check -->|No — first time| TX[BEGIN TX]
TX --> Apply[Apply business effect<br/>e.g. decrement inventory]
Apply --> Mark[INSERT into<br/>processed_events]
Mark --> Commit[COMMIT TX]
Commit --> Offset[Commit Kafka offset]
Apply -.->|Failure mid-way| Rollback[ROLLBACK<br/>do not commit offset]
Rollback --> Retry[Kafka redelivers<br/>same outbox-id<br/>safe to retry]
style Skip fill:#dfd
style Commit fill:#dfd
style Rollback fill:#fdd
style Retry fill:#ffd
Extract an idempotency key from the message (the outbox ID is perfect), check if you have already processed it in a processed_events table, and skip if you have.
type OrderEventConsumer struct {
db *pgxpool.Pool
inventoryClient InventoryClient
logger *slog.Logger
}
func (c *OrderEventConsumer) HandleOrderCreated(
ctx context.Context,
msg kafka.Message,
) error {
// Extract the outbox event ID from headers
// This is our idempotency key
var outboxID string
for _, h := range msg.Headers {
if h.Key == "outbox-id" {
outboxID = string(h.Value)
break
}
}
if outboxID == "" {
// Fallback: use partition:offset only when the header is missing.
// WARNING: partition:offset is a log position, not a logical event — a
// republished event (relay retry, or after MirrorMaker failover) lands at a
// new offset and slips past this check. Prefer outbox IDs for reliable dedup.
outboxID = fmt.Sprintf("kafka:%d:%d", msg.Partition, msg.Offset)
}
// Check if we've already processed this event
var alreadyProcessed bool
err := c.db.QueryRow(ctx, `
SELECT EXISTS(
SELECT 1 FROM processed_events WHERE event_id = $1
)
`, outboxID).Scan(&alreadyProcessed)
if err != nil {
return fmt.Errorf("check idempotency: %w", err)
}
if alreadyProcessed {
c.logger.Info("skipping duplicate event", "event_id", outboxID)
return nil // Commit the offset — we've handled this
}
var event OrderCreatedEvent
if err := json.Unmarshal(msg.Value, &event); err != nil {
// Bad payload — dead letter it, don't retry
c.logger.Error("unmarshal failed", "error", err, "event_id", outboxID)
return c.sendToDeadLetter(ctx, msg, err)
}
// Do the actual work — call the downstream service OUTSIDE the transaction
// to avoid holding a DB connection during network I/O (pool exhaustion risk).
if err := c.inventoryClient.Reserve(ctx, event.OrderID, event.Items); err != nil {
return fmt.Errorf("reserve inventory: %w", err)
}
// Record that we processed this event — short-lived transaction for DB only
tx, err := c.db.Begin(ctx)
if err != nil {
return fmt.Errorf("begin tx: %w", err)
}
defer tx.Rollback(ctx)
_, err = tx.Exec(ctx, `
INSERT INTO processed_events (event_id, processed_at)
VALUES ($1, NOW())
ON CONFLICT (event_id) DO NOTHING
`, outboxID)
if err != nil {
return fmt.Errorf("record processed event: %w", err)
}
return tx.Commit(ctx)
}For resilience patterns including circuit breakers, see Building Resilient Distributed Systems with Go.
Saga orchestration: coordinated multi-service transactions
[Transactional outbox]When a single operation spans multiple services—order placement reserves inventory, charges payment, and notifies fulfillment—you need a distributed transaction mechanism. Traditional two-phase commit (2PC) doesn't work across microservices because each service owns its own database and cannot participate in a global transaction.
Sagas replace 2PC. A saga is a sequence of local transactions (each in a single service) coordinated by an orchestrator. The orchestrator calls inventory.Reserve(), then payment.Charge(), then fulfillment.CreateShipment(). If any step fails, the orchestrator runs compensating transactions in reverse: if fulfillment fails, it calls payment.Refund(), then inventory.Release(). This provides application-level ACID semantics without relying on transactional databases.
We use orchestration-based sagas (one central coordinator) rather than choreography (services publish events that trigger the next step). Orchestration is easier to understand, debug, and test because the entire flow lives in one place:
type OrderSagaOrchestrator struct {
inventoryClient InventoryClient
paymentClient PaymentClient
fulfillmentClient FulfillmentClient
db *pgxpool.Pool
eventPublisher EventPublisher
logger *slog.Logger
}
type SagaStep struct {
Name string
Execute func(ctx context.Context, saga *OrderSaga) error
Compensate func(ctx context.Context, saga *OrderSaga) error
}
func (o *OrderSagaOrchestrator) ExecuteOrderSaga(
ctx context.Context,
orderID uuid.UUID,
) error {
saga, err := o.loadSaga(ctx, orderID)
if err != nil {
return fmt.Errorf("load saga: %w", err)
}
steps := []SagaStep{
{
Name: "reserve-inventory",
Execute: func(ctx context.Context, s *OrderSaga) error {
return o.inventoryClient.Reserve(ctx, s.OrderID, s.Items)
},
Compensate: func(ctx context.Context, s *OrderSaga) error {
return o.inventoryClient.Release(ctx, s.OrderID)
},
},
{
Name: "charge-payment",
Execute: func(ctx context.Context, s *OrderSaga) error {
chargeID, err := o.paymentClient.Charge(
ctx,
s.CustomerID,
s.TotalAmount,
s.OrderID.String(), // Idempotency key
)
if err != nil {
return err
}
s.ChargeID = chargeID
return nil
},
Compensate: func(ctx context.Context, s *OrderSaga) error {
if s.ChargeID == "" {
return nil // Never charged
}
return o.paymentClient.Refund(ctx, s.ChargeID)
},
},
{
Name: "notify-fulfillment",
Execute: func(ctx context.Context, s *OrderSaga) error {
return o.fulfillmentClient.CreateShipment(ctx, s.OrderID)
},
Compensate: func(ctx context.Context, s *OrderSaga) error {
return o.fulfillmentClient.CancelShipment(ctx, s.OrderID)
},
},
}
// Execute steps forward
var completedSteps []SagaStep
for _, step := range steps {
if err := step.Execute(ctx, saga); err != nil {
o.logger.Error("saga step failed, compensating",
"step", step.Name,
"order_id", orderID,
"error", err,
)
// Run compensating transactions in reverse
for i := len(completedSteps) - 1; i >= 0; i-- {
compensateCtx, cancel := context.WithTimeout(
context.Background(), // Fresh context — don't use cancelled one
30*time.Second,
)
if compErr := completedSteps[i].Compensate(compensateCtx, saga); compErr != nil {
o.logger.Error("compensation failed — manual intervention required",
"step", completedSteps[i].Name,
"order_id", orderID,
"error", compErr,
)
}
cancel()
}
return fmt.Errorf("saga failed at step %s: %w", step.Name, err)
}
completedSteps = append(completedSteps, step)
}
return o.eventPublisher.Publish(ctx, "order.completed", saga.OrderID.String(), saga)
}When running compensating transactions, use a fresh context.Background() — the original might be cancelled. Compensating transactions must complete regardless.
Detecting Stuck Sagas
Sagas can get stuck: a step times out, compensation fails, and the saga remains in a partial state. Without detection, these silently accumulate.
// saga_watchdog.go — runs on a periodic schedule (e.g., every 5 minutes)
func (w *SagaWatchdog) DetectStuckSagas(ctx context.Context) error {
rows, err := w.db.Query(ctx, `
SELECT id, order_id, current_step, status, updated_at
FROM order_sagas
WHERE status = 'IN_PROGRESS'
AND updated_at < NOW() - INTERVAL '10 minutes'
`)
if err != nil {
return err
}
defer rows.Close()
for rows.Next() {
var saga StuckSaga
if err := rows.Scan(&saga.ID, &saga.OrderID, &saga.CurrentStep, &saga.Status, &saga.UpdatedAt); err != nil {
return err
}
w.logger.Error("stuck saga detected",
"saga_id", saga.ID,
"order_id", saga.OrderID,
"stuck_at_step", saga.CurrentStep,
"stuck_since", saga.UpdatedAt,
)
// Option 1: Attempt automatic compensation
if err := w.orchestrator.CompensateFrom(ctx, saga.ID, saga.CurrentStep); err != nil {
// Option 2: Escalate for manual intervention
w.alerter.Send(ctx, Alert{
Severity: "critical",
Summary: fmt.Sprintf("Saga %s stuck at step %s for order %s — compensation failed",
saga.ID, saga.CurrentStep, saga.OrderID),
})
}
}
return nil
}Persist saga state to the database and maintain an updated_at timestamp. A periodic watchdog queries for sagas in IN_PROGRESS status where updated_at is stale (e.g., >10 minutes old). For transient failures, automatically reschedule the stuck step for retry. For permanent failures (circuit breaker repeatedly open, validation error), escalate to manual intervention. This pattern prevents sagas from silently accumulating in partial states.
Dead-letter queues and exponential backoff
When a consumer fails to process an event, use a dead-letter queue (DLQ) with exponential backoff. On failure, republish to a retry topic with exponential delays (10s, then 60s, then 5min). After exhausting all retries, send to a DLQ for manual investigation.
This prevents two failure modes: the thundering herd (a service recovering from an outage is immediately flooded with millions of retries and crashes again), and poison messages (invalid data that crashes all consumers forever). Exponential backoff gives the upstream service time to recover; the DLQ surfaces toxic events that need human analysis.
Set up separate retry topics for each delay level (orders.created.retry-1, .retry-2, etc.), and have a scheduler republish from retry topics back to the main topic after the delay expires. Add headers tracking original topic, retry count, and the error that triggered the retry. Alert when any DLQ topic accumulates messages older than 15 minutes — it indicates either a systemic problem or a schema that needs updating.
Production patterns: scaling and throughput
Outbox relay throughput ceiling: The polling outbox pattern hits ~5,000–10,000 events/sec per relay instance. The bottleneck is polling overhead (1–5ms per SELECT query) and WAL write amplification (each outbox row is written twice—once on INSERT from business transaction, again on UPDATE when published). If you consistently exceed 10K events/sec, migrate to CDC (Debezium), which reads the PostgreSQL write-ahead log directly and can sustain 50K+ events/sec. CDC trades operational simplicity (no polling code to maintain) for infrastructure complexity (Kafka Connect cluster to operate).
Scaling consumers:
Use cooperative rebalancing (the default in franz-go) so only moving partitions pause during rebalancing—others keep processing. Use static membership with InstanceID in Kubernetes so pod restarts within the SessionTimeout rejoin without rebalancing. Scale consumer count gradually (2x increments, not 10x jumps) to minimize rebalance duration and impact. [Transactional outbox]
Tuning: Enable LZ4 or zstd compression (60–80% savings on typical JSON payloads), set producer acks=all for durability, and tune batch.size (256KB for throughput) and consumer fetch.max.wait.ms (100ms for low p50 latency). Monitor lag and be aggressive about tuning—even small parameter changes often yield 30–50% throughput improvements. [Kafka producer config]
Production checklist
- Partitioning: Set partition count to at least 2x your consumer count, targeting 10MB/s per partition for throughput. Repartitioning is expensive and disruptive.
- Routing: Use aggregate ID as the message key; never publish with null keys. Same key guarantees same partition and preserves ordering within an entity.
- Compression: Enable LZ4 or zstd to reduce network and storage costs 60–80% for JSON payloads. The CPU cost of compression is negligible on modern hardware.
- Durability: Set producer
acks=allandenable.idempotence=truefor the outbox relay. For higher throughput, tuneBatchSizeto 256KB andBatchTimeoutto 10ms. - Monitoring: Monitor consumer lag; alert when any group falls >5 minutes behind. Monitor outbox backpressure (pending rows); alert when pending rows exceed your relay's throughput. Monitor DLQ accumulation; alert when any DLQ topic has messages >15 minutes old.
- Testing: End-to-end test with testcontainers-go. Use
FOR UPDATE SKIP LOCKEDin the relay to safely run multiple instances horizontally. - Sagas: Establish a runbook for stuck sagas. A saga stuck for >10 minutes should trigger an alert for human investigation.
- Scaling: If you consistently exceed 10K events/sec, migrate to CDC (Debezium) to eliminate polling overhead and achieve 50K+ events/sec throughput. [Transactional outbox]
Event schema evolution
Add a schema_version field to event payloads. When changing schemas, deploy consumers with support for both old and new versions first, then roll out producers with the new version. This prevents deserialization failures from version mismatches.
Safe schema changes: adding optional fields, removing unused fields. Unsafe changes: removing required fields, renaming fields, changing types without migration. At scale (>100K events/day), a schema registry (Confluent, AWS Glue, or Protobuf) enforces compatibility rules at publish time, preventing incompatible producers from pushing events to Kafka. Without a registry, schema evolution mistakes can silently corrupt your event stream and break all downstream consumers.
Disaster recovery and replay
Monitoring for failures:
Three metrics reveal problems. Consumer lag: if a consumer group falls 50K+ messages behind for >5 minutes, something is stuck. Outbox backpressure: if status='PENDING' rows exceed your relay's throughput, the relay or broker is down. DLQ accumulation: messages in DLQ >15 minutes indicate poison data.
Event replay:
When you discover a bug, reset your consumer group's offset to a specific timestamp using kafka-consumer-groups --reset-offsets --to-datetime, redeploy with the fix, and idempotent consumers will safely reprocess. The outbox table serves as an audit log for reconstructing events older than Kafka's retention window.
Cross-region failover: Replicate topics using MirrorMaker 2. On failover, point consumers to the secondary and reset to the last committed offset; idempotent consumers tolerate duplicates. Bound sync lag (RPO) based on your business requirements.
Real-world gotchas
Connection pool exhaustion:
The outbox relay and API share a database pool. If the relay holds connections during large batches, the API starves. Give the relay its own small, dedicated pool (2–3 connections). Use SKIP LOCKED so the relay skips rows held by the API, preventing blocking.
Saga compensation:
When a saga step fails, distinguish between transient failures (rate limiting, network timeout) and permanent ones (validation error). For transient failures, park the saga (set status = PARKED, retry_after = NOW() + 60 seconds) rather than compensating immediately. A periodic watchdog job resumes parked sagas when retry_after expires. This prevents false rollbacks while keeping the pipeline moving.
Conclusion: when to build this, when to skip it
Event-driven systems are differently reliable — they shift failures from synchronous (fail immediately) to asynchronous (fail silently for days or weeks). Implement the outbox pattern, idempotent consumers, saga compensation, and DLQ monitoring from the start. Learning these patterns through production outages is expensive and painful.
For analytical events (logging, metrics), simple async publish to Kafka is fine. For transactional data (financial orders, inventory, payments): build the outbox pattern and idempotent consumers from day one. Don't prematurely optimize simple CRUD services with event-driven complexity; scale to these patterns once you have >500 requests/sec and true multi-service dependencies. The complexity is real, but so is the reliability payoff.
Frequently Asked Questions
What is the dual-write problem in event-driven systems?
The dual-write problem occurs when you need to atomically update a database and publish an event to a message broker, but they are separate systems with no shared transaction. If either operation fails independently, your database state and published events become inconsistent.
How does the outbox pattern solve dual-write consistency?
The outbox pattern writes events to an outbox table in the same database transaction as the business data change. A separate poller or CDC process reads the outbox and publishes to Kafka, guaranteeing events are published if and only if the transaction committed.
When should I use Kafka vs RabbitMQ vs SQS?
Use Kafka for event sourcing, stream processing, and high-throughput workloads requiring message retention and replay. Use RabbitMQ for complex routing, task queues, and priority queues. Use SQS for simple pub/sub in serverless AWS environments where operational simplicity matters more than replay capability.
What is the saga pattern for distributed transactions?
A saga coordinates a multi-service transaction as a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the previous steps. This replaces two-phase commit, which doesn't scale across microservices.
Keep Reading
- Message Queue Comparison — Kafka vs RabbitMQ vs SQS vs Redis Streams, with throughput benchmarks and decision criteria
- Idempotency Patterns for Distributed Systems — The idempotency key pattern, dedup tables, and the database constraints that make at-least-once delivery safe
- Microservices Architecture Patterns — Service decomposition, data ownership boundaries, and the communication patterns that connect event-driven services
Engineering Team
A multidisciplinary team of backend engineers, architects, and DevOps practitioners shipping deep dives into distributed systems and production infrastructure.
Read Next
Kafka Producer Tuning Cheat Sheet: Throughput, Latency & Durability
Kafka producer configuration: acks, idempotence, batching, compression, and the tradeoffs that matter for throughput and durability.
Idempotency Patterns: Building Retry-Safe Distributed Systems
Why exactly-once is a myth, and how idempotency keys, database constraints, and the outbox pattern make retries safe in Go and Java.
Go Context in Depth: Cancellation, Timeouts, and Debugging in Production
Master context.Context in Go: cancellation propagation, deadline inheritance, goroutine leak patterns, and debugging with pprof.