#go #kafka #event-driven #microservices #distributed-systems #saga-pattern

Event-Driven Microservices in Go: Kafka, Sagas, and the Outbox Pattern

BackendBytes Engineering Team

Feb 10, 2026

16 min read

Event-Driven Microservices in Go: Kafka, Sagas, and the Outbox Pattern

Part of Series: Microservices Architecture

Lesson 3 of 5

Prev Next

Key Takeaways

→Dual-write bugs surface as phantom orders — the event publishes to Kafka, the DB write fails, downstream systems (inventory, billing, fulfilment) act on an order that doesn't exist. The outbox pattern is the only durable fix.
→The outbox pattern solves atomicity by writing events to the database in the same transaction as business data; a background relay then polls and publishes to Kafka
→Idempotent consumers are non-negotiable — at-least-once delivery guarantees duplicates, so track processed event IDs with unique constraints to safely replay
→Sagas orchestrate multi-service transactions with compensating transactions that undo partial failures automatically, replacing two-phase commit which doesn't scale across services

A customer gets charged for an order that never made it into the database. An order service publishes a Kafka event, then updates PostgreSQL. The database write fails after the event is already out — and a phantom order cascades downstream: inventory decremented, fulfilment notified, payment charged, all for an order that does not exist. Root-cause tracing is brutal: the event trail looks correct; the database is the one that disagrees. We debugged variants of this on multiple production Kafka-backed services^{[Kreps et al., 2011]}.

TL;DR

The dual-write problem — atomically updating a database and publishing events to separate systems — is solved by the outbox pattern (write events to the database in the same transaction)^{[Transactional outbox]}, idempotent consumers (safe to replay)^{[Akkoyunlu et al., 1975]}, and saga orchestration (distributed transactions with automatic compensation). All three are essential; none are optional for high-value systems.

Write events to the database in a single transaction using the outbox pattern; a background relay publishes them to Kafka
Make consumers idempotent by tracking processed event IDs in the database; at-least-once delivery guarantees you'll see duplicates
Coordinate multi-service transactions with sagas; compensating transactions automatically undo partial failures

Architecture decision: outbox vs CDC vs event sourcing

^{[Transactional outbox]}

Three patterns solve the dual-write problem. Each trades operational complexity against throughput and consistency guarantees.

Pattern	Throughput ceiling	Operational complexity	When to use	When NOT to use
Outbox	~5-10k events/sec per relay	Low — one new table, one polling job	99% of services; you already control the writes	When you cannot change application code
CDC (Debezium)	50k+ events/sec	High — Kafka Connect, connector configs, WAL retention tuning	When outbox throughput is the bottleneck OR you cannot modify the writing application	When the operational team cannot run Kafka Connect
Event Sourcing	Limited by event store, not Kafka	Very high — rebuilds, snapshots, projections, schema evolution	Audit-heavy domains (banking, compliance, regulated finance)	Most CRUD apps; query patterns rule it out
Direct dual-write (anti-pattern)	High but unsafe	Low	Never — this is the bug pattern this article exists to fix	Always

The outbox pattern (this article) writes events to a database table in the same transaction as your business data, then a background polling process publishes them to Kafka. It requires no new infrastructure—just a new table and a simple background job. The throughput ceiling is around 5,000–10,000 events/sec per relay instance due to polling overhead. This is the right choice for most services: simpler to operate, easier to debug, and sufficient for the vast majority of production workloads.

Change Data Capture (CDC) with Debezium streams database changes directly from the PostgreSQL write-ahead log to Kafka, bypassing the application layer entirely. The application never writes to an outbox table; Debezium reads the WAL and streams changes automatically. This eliminates polling overhead and can handle 50,000+ events/sec. The trade-off is operational: you now run Kafka Connect with Debezium connectors, manage connector configurations, and ensure WAL retention is sufficient. Use CDC when you exceed the outbox ceiling and the application is already event-driven enough that "no application code changes" is an acceptable cost.

Event Sourcing makes the database itself the event log. Every state change appends an immutable event record rather than updating rows in place. Querying requires replaying events to rebuild current state. This is powerful in audit-heavy domains (banking, compliance) where the full history must always be queryable, but it's overkill for most systems and complicates query patterns significantly. Only use event sourcing if you have legal or business requirements for complete audit trails. ^{[Transactional outbox]}

The outbox pattern is the most common choice: it's simple, requires no additional infrastructure, handles 99% of production workloads, and keeps debugging and operation straightforward. ^{[Transactional outbox]}

Building the outbox pattern in Go

The outbox pattern solves the dual-write problem by writing both your business data and an event record in the same database transaction. A background relay then polls the outbox table and publishes events to Kafka. This guarantees atomicity: either both the data and event are committed, or neither are.

graph LR
    App["Application"] -->|"1. BEGIN TX"| DB[(PostgreSQL)]
    App -->|"2. INSERT order"| DB
    App -->|"3. INSERT outbox_event"| DB
    App -->|"4. COMMIT"| DB
    Relay["Outbox Relay<br/>(background goroutine)"] -->|"5. Poll pending events"| DB
    Relay -->|"6. Publish"| Kafka["Kafka Topic"]
    Relay -->|"7. Mark published"| DB

Outbox schema:

CREATE TABLE outbox_events (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(255) NOT NULL,
    aggregate_id   VARCHAR(255) NOT NULL,
    event_type     VARCHAR(255) NOT NULL,
    payload        JSONB        NOT NULL,
    status         VARCHAR(50)  DEFAULT 'PENDING',
    created_at     TIMESTAMPTZ  DEFAULT NOW(),
    published_at   TIMESTAMPTZ,
    retry_count    INT          DEFAULT 0
);
CREATE INDEX idx_outbox_pending ON outbox_events(status, created_at) WHERE status = 'PENDING';

Application writes events and data in one transaction:

// OrderRepository handles both order creation and outbox writing
// in a single database transaction.
type OrderRepository struct {
    db *pgx.Conn
}
 
type OutboxEvent struct {
    AggregateType string
    AggregateID   string
    EventType     string
    Payload       any
}
 
func (r *OrderRepository) CreateOrder(
    ctx context.Context,
    order *Order,
) error {
    tx, err := r.db.Begin(ctx)
    if err != nil {
        return fmt.Errorf("begin transaction: %w", err)
    }
    defer tx.Rollback(ctx) // no-op if already committed
 
    // 1. Insert the order
    _, err = tx.Exec(ctx, `
        INSERT INTO orders (id, customer_id, status, total_amount, created_at)
        VALUES ($1, $2, $3, $4, $5)
    `, order.ID, order.CustomerID, order.Status, order.TotalAmount, order.CreatedAt)
    if err != nil {
        return fmt.Errorf("insert order: %w", err)
    }
 
    // 2. Write the event to the outbox — same transaction
    payload, err := json.Marshal(OrderCreatedEvent{
        OrderID:    order.ID,
        CustomerID: order.CustomerID,
        Items:      order.Items,
        TotalAmount: order.TotalAmount,
        CreatedAt:  order.CreatedAt,
    })
    if err != nil {
        return fmt.Errorf("marshal event payload: %w", err)
    }
 
    _, err = tx.Exec(ctx, `
        INSERT INTO outbox_events
            (aggregate_type, aggregate_id, event_type, payload)
        VALUES ($1, $2, $3, $4)
    `, "order", order.ID.String(), "order.created", payload)
    if err != nil {
        return fmt.Errorf("insert outbox event: %w", err)
    }
 
    // 3. Commit both together — or neither
    if err := tx.Commit(ctx); err != nil {
        return fmt.Errorf("commit transaction: %w", err)
    }
 
    return nil
}

Background relay publishes pending events:

type OutboxRelay struct {
    db       *pgxpool.Pool
    producer *kafka.Writer
    logger   *slog.Logger
}
 
func (r *OutboxRelay) Run(ctx context.Context) error {
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
 
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            if err := r.processBatch(ctx); err != nil {
                r.logger.Error("outbox relay batch failed", "error", err)
                // Continue — don't exit the relay on transient errors
            }
        }
    }
}
 
func (r *OutboxRelay) processBatch(ctx context.Context) error {
    // Lock rows we're about to process (SKIP LOCKED is important for
    // horizontal scaling — multiple relay instances won't fight over rows)
    rows, err := r.db.Query(ctx, `
        SELECT id, aggregate_type, aggregate_id, event_type, payload
        FROM outbox_events
        WHERE status = 'PENDING'
          AND retry_count < 5
        ORDER BY created_at
        LIMIT 100
        FOR UPDATE SKIP LOCKED
    `)
    if err != nil {
        return fmt.Errorf("query outbox: %w", err)
    }
    defer rows.Close()
 
    type pendingEvent struct {
        ID            uuid.UUID
        AggregateType string
        AggregateID   string
        EventType     string
        Payload       json.RawMessage
    }
 
    var events []pendingEvent
    for rows.Next() {
        var e pendingEvent
        if err := rows.Scan(&e.ID, &e.AggregateType, &e.AggregateID, &e.EventType, &e.Payload); err != nil {
            return fmt.Errorf("scan row: %w", err)
        }
        events = append(events, e)
    }
 
    if len(events) == 0 {
        return nil
    }
 
    messages := make([]kafka.Message, len(events))
    for i, e := range events {
        messages[i] = kafka.Message{
            Topic: topicForEventType(e.EventType),
            Key:   []byte(e.AggregateID), // Same key → same partition → ordered delivery
            Value: e.Payload,
            // Carry the unique outbox row ID so consumers can dedup on it.
            // This is the event's own ID, NOT the aggregate ID (partition key).
            Headers: []kafka.Header{
                {Key: "outbox-id", Value: []byte(e.ID.String())},
            },
        }
    }
 
    if err := r.producer.WriteMessages(ctx, messages...); err != nil {
        // Mark for retry; the relay will pick up this batch again on the next cycle
        return fmt.Errorf("write to kafka: %w", err)
    }
 
    _, err = r.db.Exec(ctx, `
        UPDATE outbox_events SET status = 'PUBLISHED', published_at = NOW() WHERE id = ANY($1)
    `, ids(events))
    return err
}

Partition key and consumer scaling

Use aggregate ID (e.g., order_id) as the message key to ensure all events for one entity land on the same partition and stay ordered. Use cooperative rebalancing so scaling consumers doesn't pause all processing. Use static membership with InstanceID in Kubernetes to avoid rebalances during rolling restarts.

Consuming events reliably: idempotency and resilience

^{[Stripe idempotency]}

With at-least-once delivery, you will see duplicate events. Your consumers must be idempotent: processing the same event twice must produce the same result as processing it once.

The consumer-side flow makes the discipline explicit:

graph TD
    Msg[Kafka message arrives] --> Extract[Extract outbox-id<br/>from headers]
    Extract --> Check{processed_events<br/>has this id?}
    Check -->|Yes — duplicate| Skip[Commit offset<br/>do nothing]
    Check -->|No — first time| TX[BEGIN TX]
    TX --> Apply[Apply business effect<br/>e.g. decrement inventory]
    Apply --> Mark[INSERT into<br/>processed_events]
    Mark --> Commit[COMMIT TX]
    Commit --> Offset[Commit Kafka offset]
    Apply -.->|Failure mid-way| Rollback[ROLLBACK<br/>do not commit offset]
    Rollback --> Retry[Kafka redelivers<br/>same outbox-id<br/>safe to retry]
    style Skip fill:#dfd
    style Commit fill:#dfd
    style Rollback fill:#fdd
    style Retry fill:#ffd

Extract an idempotency key from the message (the outbox ID is perfect), check if you have already processed it in a processed_events table, and skip if you have.

type OrderEventConsumer struct {
    db             *pgxpool.Pool
    inventoryClient InventoryClient
    logger         *slog.Logger
}
 
func (c *OrderEventConsumer) HandleOrderCreated(
    ctx context.Context,
    msg kafka.Message,
) error {
    // Extract the outbox event ID from headers
    // This is our idempotency key
    var outboxID string
    for _, h := range msg.Headers {
        if h.Key == "outbox-id" {
            outboxID = string(h.Value)
            break
        }
    }
    if outboxID == "" {
        // Fallback: use partition:offset only when the header is missing.
        // WARNING: partition:offset is a log position, not a logical event — a
        // republished event (relay retry, or after MirrorMaker failover) lands at a
        // new offset and slips past this check. Prefer outbox IDs for reliable dedup.
        outboxID = fmt.Sprintf("kafka:%d:%d", msg.Partition, msg.Offset)
    }
 
    // Check if we've already processed this event
    var alreadyProcessed bool
    err := c.db.QueryRow(ctx, `
        SELECT EXISTS(
            SELECT 1 FROM processed_events WHERE event_id = $1
        )
    `, outboxID).Scan(&alreadyProcessed)
    if err != nil {
        return fmt.Errorf("check idempotency: %w", err)
    }
    if alreadyProcessed {
        c.logger.Info("skipping duplicate event", "event_id", outboxID)
        return nil // Commit the offset — we've handled this
    }
 
    var event OrderCreatedEvent
    if err := json.Unmarshal(msg.Value, &event); err != nil {
        // Bad payload — dead letter it, don't retry
        c.logger.Error("unmarshal failed", "error", err, "event_id", outboxID)
        return c.sendToDeadLetter(ctx, msg, err)
    }
 
    // Do the actual work — call the downstream service OUTSIDE the transaction
    // to avoid holding a DB connection during network I/O (pool exhaustion risk).
    if err := c.inventoryClient.Reserve(ctx, event.OrderID, event.Items); err != nil {
        return fmt.Errorf("reserve inventory: %w", err)
    }
 
    // Record that we processed this event — short-lived transaction for DB only
    tx, err := c.db.Begin(ctx)
    if err != nil {
        return fmt.Errorf("begin tx: %w", err)
    }
    defer tx.Rollback(ctx)
 
    _, err = tx.Exec(ctx, `
        INSERT INTO processed_events (event_id, processed_at)
        VALUES ($1, NOW())
        ON CONFLICT (event_id) DO NOTHING
    `, outboxID)
    if err != nil {
        return fmt.Errorf("record processed event: %w", err)
    }
 
    return tx.Commit(ctx)
}

For resilience patterns including circuit breakers, see Building Resilient Distributed Systems with Go.

Saga orchestration: coordinated multi-service transactions

^{[Transactional outbox]}

When a single operation spans multiple services—order placement reserves inventory, charges payment, and notifies fulfillment—you need a distributed transaction mechanism. Traditional two-phase commit (2PC) doesn't work across microservices because each service owns its own database and cannot participate in a global transaction.

Sagas replace 2PC. A saga is a sequence of local transactions (each in a single service) coordinated by an orchestrator. The orchestrator calls inventory.Reserve(), then payment.Charge(), then fulfillment.CreateShipment(). If any step fails, the orchestrator runs compensating transactions in reverse: if fulfillment fails, it calls payment.Refund(), then inventory.Release(). This provides application-level ACID semantics without relying on transactional databases.

We use orchestration-based sagas (one central coordinator) rather than choreography (services publish events that trigger the next step). Orchestration is easier to understand, debug, and test because the entire flow lives in one place:

Saga Orchestration vs Choreography Sequence Diagram

type OrderSagaOrchestrator struct {
    inventoryClient  InventoryClient
    paymentClient    PaymentClient
    fulfillmentClient FulfillmentClient
    db               *pgxpool.Pool
    eventPublisher   EventPublisher
    logger           *slog.Logger
}
 
type SagaStep struct {
    Name        string
    Execute     func(ctx context.Context, saga *OrderSaga) error
    Compensate  func(ctx context.Context, saga *OrderSaga) error
}
 
func (o *OrderSagaOrchestrator) ExecuteOrderSaga(
    ctx context.Context,
    orderID uuid.UUID,
) error {
    saga, err := o.loadSaga(ctx, orderID)
    if err != nil {
        return fmt.Errorf("load saga: %w", err)
    }
 
    steps := []SagaStep{
        {
            Name: "reserve-inventory",
            Execute: func(ctx context.Context, s *OrderSaga) error {
                return o.inventoryClient.Reserve(ctx, s.OrderID, s.Items)
            },
            Compensate: func(ctx context.Context, s *OrderSaga) error {
                return o.inventoryClient.Release(ctx, s.OrderID)
            },
        },
        {
            Name: "charge-payment",
            Execute: func(ctx context.Context, s *OrderSaga) error {
                chargeID, err := o.paymentClient.Charge(
                    ctx,
                    s.CustomerID,
                    s.TotalAmount,
                    s.OrderID.String(), // Idempotency key
                )
                if err != nil {
                    return err
                }
                s.ChargeID = chargeID
                return nil
            },
            Compensate: func(ctx context.Context, s *OrderSaga) error {
                if s.ChargeID == "" {
                    return nil // Never charged
                }
                return o.paymentClient.Refund(ctx, s.ChargeID)
            },
        },
        {
            Name: "notify-fulfillment",
            Execute: func(ctx context.Context, s *OrderSaga) error {
                return o.fulfillmentClient.CreateShipment(ctx, s.OrderID)
            },
            Compensate: func(ctx context.Context, s *OrderSaga) error {
                return o.fulfillmentClient.CancelShipment(ctx, s.OrderID)
            },
        },
    }
 
    // Execute steps forward
    var completedSteps []SagaStep
    for _, step := range steps {
        if err := step.Execute(ctx, saga); err != nil {
            o.logger.Error("saga step failed, compensating",
                "step", step.Name,
                "order_id", orderID,
                "error", err,
            )
 
            // Run compensating transactions in reverse
            for i := len(completedSteps) - 1; i >= 0; i-- {
                compensateCtx, cancel := context.WithTimeout(
                    context.Background(), // Fresh context — don't use cancelled one
                    30*time.Second,
                )
                if compErr := completedSteps[i].Compensate(compensateCtx, saga); compErr != nil {
                    o.logger.Error("compensation failed — manual intervention required",
                        "step", completedSteps[i].Name,
                        "order_id", orderID,
                        "error", compErr,
                    )
                }
                cancel()
            }
 
            return fmt.Errorf("saga failed at step %s: %w", step.Name, err)
        }
        completedSteps = append(completedSteps, step)
    }
 
    return o.eventPublisher.Publish(ctx, "order.completed", saga.OrderID.String(), saga)
}

When running compensating transactions, use a fresh context.Background() — the original might be cancelled. Compensating transactions must complete regardless.

Detecting Stuck Sagas

Sagas can get stuck: a step times out, compensation fails, and the saga remains in a partial state. Without detection, these silently accumulate.

// saga_watchdog.go — runs on a periodic schedule (e.g., every 5 minutes)
 
func (w *SagaWatchdog) DetectStuckSagas(ctx context.Context) error {
    rows, err := w.db.Query(ctx, `
        SELECT id, order_id, current_step, status, updated_at
        FROM order_sagas
        WHERE status = 'IN_PROGRESS'
          AND updated_at < NOW() - INTERVAL '10 minutes'
    `)
    if err != nil {
        return err
    }
    defer rows.Close()
 
    for rows.Next() {
        var saga StuckSaga
        if err := rows.Scan(&saga.ID, &saga.OrderID, &saga.CurrentStep, &saga.Status, &saga.UpdatedAt); err != nil {
            return err
        }
 
        w.logger.Error("stuck saga detected",
            "saga_id", saga.ID,
            "order_id", saga.OrderID,
            "stuck_at_step", saga.CurrentStep,
            "stuck_since", saga.UpdatedAt,
        )
 
        // Option 1: Attempt automatic compensation
        if err := w.orchestrator.CompensateFrom(ctx, saga.ID, saga.CurrentStep); err != nil {
            // Option 2: Escalate for manual intervention
            w.alerter.Send(ctx, Alert{
                Severity: "critical",
                Summary:  fmt.Sprintf("Saga %s stuck at step %s for order %s — compensation failed",
                    saga.ID, saga.CurrentStep, saga.OrderID),
            })
        }
    }
    return nil
}

Persist saga state to the database and maintain an updated_at timestamp. A periodic watchdog queries for sagas in IN_PROGRESS status where updated_at is stale (e.g., >10 minutes old). For transient failures, automatically reschedule the stuck step for retry. For permanent failures (circuit breaker repeatedly open, validation error), escalate to manual intervention. This pattern prevents sagas from silently accumulating in partial states.

Dead-letter queues and exponential backoff

When a consumer fails to process an event, use a dead-letter queue (DLQ) with exponential backoff. On failure, republish to a retry topic with exponential delays (10s, then 60s, then 5min). After exhausting all retries, send to a DLQ for manual investigation.

This prevents two failure modes: the thundering herd (a service recovering from an outage is immediately flooded with millions of retries and crashes again), and poison messages (invalid data that crashes all consumers forever). Exponential backoff gives the upstream service time to recover; the DLQ surfaces toxic events that need human analysis.

Set up separate retry topics for each delay level (orders.created.retry-1, .retry-2, etc.), and have a scheduler republish from retry topics back to the main topic after the delay expires. Add headers tracking original topic, retry count, and the error that triggered the retry. Alert when any DLQ topic accumulates messages older than 15 minutes — it indicates either a systemic problem or a schema that needs updating.

Production patterns: scaling and throughput

Outbox relay throughput ceiling: The polling outbox pattern hits ~5,000–10,000 events/sec per relay instance. The bottleneck is polling overhead (1–5ms per SELECT query) and WAL write amplification (each outbox row is written twice—once on INSERT from business transaction, again on UPDATE when published). If you consistently exceed 10K events/sec, migrate to CDC (Debezium), which reads the PostgreSQL write-ahead log directly and can sustain 50K+ events/sec. CDC trades operational simplicity (no polling code to maintain) for infrastructure complexity (Kafka Connect cluster to operate).

Scaling consumers: Use cooperative rebalancing (the default in franz-go) so only moving partitions pause during rebalancing—others keep processing. Use static membership with InstanceID in Kubernetes so pod restarts within the SessionTimeout rejoin without rebalancing. Scale consumer count gradually (2x increments, not 10x jumps) to minimize rebalance duration and impact. ^{[Transactional outbox]}

Tuning: Enable LZ4 or zstd compression (60–80% savings on typical JSON payloads), set producer acks=all for durability, and tune batch.size (256KB for throughput) and consumer fetch.max.wait.ms (100ms for low p50 latency). Monitor lag and be aggressive about tuning—even small parameter changes often yield 30–50% throughput improvements. ^{[Kafka producer config]}

Production checklist

Partitioning: Set partition count to at least 2x your consumer count, targeting 10MB/s per partition for throughput. Repartitioning is expensive and disruptive.
Routing: Use aggregate ID as the message key; never publish with null keys. Same key guarantees same partition and preserves ordering within an entity.
Compression: Enable LZ4 or zstd to reduce network and storage costs 60–80% for JSON payloads. The CPU cost of compression is negligible on modern hardware.
Durability: Set producer acks=all and enable.idempotence=true for the outbox relay. For higher throughput, tune BatchSize to 256KB and BatchTimeout to 10ms.
Monitoring: Monitor consumer lag; alert when any group falls >5 minutes behind. Monitor outbox backpressure (pending rows); alert when pending rows exceed your relay's throughput. Monitor DLQ accumulation; alert when any DLQ topic has messages >15 minutes old.
Testing: End-to-end test with testcontainers-go. Use FOR UPDATE SKIP LOCKED in the relay to safely run multiple instances horizontally.
Sagas: Establish a runbook for stuck sagas. A saga stuck for >10 minutes should trigger an alert for human investigation.
Scaling: If you consistently exceed 10K events/sec, migrate to CDC (Debezium) to eliminate polling overhead and achieve 50K+ events/sec throughput. ^{[Transactional outbox]}

Event schema evolution

Add a schema_version field to event payloads. When changing schemas, deploy consumers with support for both old and new versions first, then roll out producers with the new version. This prevents deserialization failures from version mismatches.

Safe schema changes: adding optional fields, removing unused fields. Unsafe changes: removing required fields, renaming fields, changing types without migration. At scale (>100K events/day), a schema registry (Confluent, AWS Glue, or Protobuf) enforces compatibility rules at publish time, preventing incompatible producers from pushing events to Kafka. Without a registry, schema evolution mistakes can silently corrupt your event stream and break all downstream consumers.

Disaster recovery and replay

Monitoring for failures: Three metrics reveal problems. Consumer lag: if a consumer group falls 50K+ messages behind for >5 minutes, something is stuck. Outbox backpressure: if status='PENDING' rows exceed your relay's throughput, the relay or broker is down. DLQ accumulation: messages in DLQ >15 minutes indicate poison data.

Event replay: When you discover a bug, reset your consumer group's offset to a specific timestamp using kafka-consumer-groups --reset-offsets --to-datetime, redeploy with the fix, and idempotent consumers will safely reprocess. The outbox table serves as an audit log for reconstructing events older than Kafka's retention window.

Cross-region failover: Replicate topics using MirrorMaker 2. On failover, point consumers to the secondary and reset to the last committed offset; idempotent consumers tolerate duplicates. Bound sync lag (RPO) based on your business requirements.

Real-world gotchas

Connection pool exhaustion: The outbox relay and API share a database pool. If the relay holds connections during large batches, the API starves. Give the relay its own small, dedicated pool (2–3 connections). Use SKIP LOCKED so the relay skips rows held by the API, preventing blocking.

Saga compensation: When a saga step fails, distinguish between transient failures (rate limiting, network timeout) and permanent ones (validation error). For transient failures, park the saga (set status = PARKED, retry_after = NOW() + 60 seconds) rather than compensating immediately. A periodic watchdog job resumes parked sagas when retry_after expires. This prevents false rollbacks while keeping the pipeline moving.

Conclusion: when to build this, when to skip it

Event-driven systems are differently reliable — they shift failures from synchronous (fail immediately) to asynchronous (fail silently for days or weeks). Implement the outbox pattern, idempotent consumers, saga compensation, and DLQ monitoring from the start. Learning these patterns through production outages is expensive and painful.

For analytical events (logging, metrics), simple async publish to Kafka is fine. For transactional data (financial orders, inventory, payments): build the outbox pattern and idempotent consumers from day one. Don't prematurely optimize simple CRUD services with event-driven complexity; scale to these patterns once you have >500 requests/sec and true multi-service dependencies. The complexity is real, but so is the reliability payoff.

Frequently Asked Questions

What is the dual-write problem in event-driven systems?

The dual-write problem occurs when you need to atomically update a database and publish an event to a message broker, but they are separate systems with no shared transaction. If either operation fails independently, your database state and published events become inconsistent.

How does the outbox pattern solve dual-write consistency?

The outbox pattern writes events to an outbox table in the same database transaction as the business data change. A separate poller or CDC process reads the outbox and publishes to Kafka, guaranteeing events are published if and only if the transaction committed.

When should I use Kafka vs RabbitMQ vs SQS?

Use Kafka for event sourcing, stream processing, and high-throughput workloads requiring message retention and replay. Use RabbitMQ for complex routing, task queues, and priority queues. Use SQS for simple pub/sub in serverless AWS environments where operational simplicity matters more than replay capability.

What is the saga pattern for distributed transactions?

A saga coordinates a multi-service transaction as a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the previous steps. This replaces two-phase commit, which doesn't scale across microservices.

Keep Reading

Message Queue Comparison — Kafka vs RabbitMQ vs SQS vs Redis Streams, with throughput benchmarks and decision criteria
Idempotency Patterns for Distributed Systems — The idempotency key pattern, dedup tables, and the database constraints that make at-least-once delivery safe
Microservices Architecture Patterns — Service decomposition, data ownership boundaries, and the communication patterns that connect event-driven services

Was this article helpful?

Your feedback directly shapes our editorial depth and technical accuracy.