3. Outbox pattern

Retrospective

We’re quite far in this series. By now, you should have a good understanding of how Change Data Capture (CDC) operates internally, along with its limitations and strengths. We’ve also explored database-level support for CDC, using PostgreSQL as a key example.

Now, we’ll turn our attention to a more specific application of CDC: microservices. For a motivating illustration, I highly recommend reading this Debezium article before continuing.

A fundamental challenge in distributed systems is the need for reliable information propagation across all participating services, a task that inherently involves trade-offs. If you haven’t heard of the CAP theorem, there are many excellent resources online which explain it.

Let’s revisit our fictitious beehive management system, featuring two key services: beehive-control, responsible for managing user data, especially their hives, and beehive-external, which gathers data from external sources.

beehive-external only requires a chosen subset of user data, which we can imagine as a ‘remote materialized view’ of the source data owned by beehive-control. These change data events are transferred via a message broker; in the example concluding our series, we’ll be using Redis Streams.

Hive table change information propagation

Our objective is to ensure that beehive-external is informed of any changes to the source hive table, including what exactly happened.

Simple solutions are yet again not enough!

Initially, one might consider simply capturing each change and publishing it to the message broker directly from the application. However, this ‘dual write’ works reliably only in ideal circumstances: if the service crashes between updating its database and publishing the message, or if the broker is temporarily unreachable, the two systems silently diverge. We need a mechanism to ensure that changes eventually lead to a consistent state: either both beehive-control and beehive-external update their respective data stores, or neither does.

This characteristic is what distributed systems theory terms eventual consistency. It implies that while changes aren’t guaranteed to be instantaneous or atomic, the system is designed to avoid permanent data inconsistencies.

The Outbox Pattern in Domain-Driven Design (DDD)

The core principle of this approach involves having an outbox table within the data owner’s database. For our purposes, when a modification occurs in the hive table, a corresponding record is also inserted into the outbox table – crucially, as part of the same database transaction.

Important terminology note: the Outbox pattern is firmly rooted in the principles of Domain-Driven Design (DDD), which itself operates independently of relational table concepts. DDD places significant emphasis on the ubiquitous language – a shared vocabulary specific to the business domain – which defines fundamental units of information known as aggregates. Changes to these aggregates are then persisted as an event stream, a concept closely associated with event sourcing and event-driven architecture in the DDD context.

Here is what an outbox table typically looks like:

CREATE TABLE outbox (
    id uuid NOT NULL PRIMARY KEY,
    aggregate_type varchar(255) NOT NULL,
    aggregate_id varchar(255) NOT NULL,
    type varchar(255) NOT NULL,
    payload jsonb NOT NULL
);
id – Unique identifier visible to consumers for managing duplicate messages.
aggregate_type – Root aggregate type in DDD (e.g. hive). May vary based on domain complexity.
aggregate_id – Reference to the specific modified aggregate instance.
type – Event type (e.g. HiveEstablished). Avoid CRUD-like types per DDD principles.
payload – Type-agnostic field for event data. Eliminates the need for multiple outbox tables per aggregate.

For every change we want to propagate, we will also write a corresponding new row to the outbox table within the same transaction. This outbox table is then monitored – using CDC in our scenario – and its contents are published to downstream consumers.
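As a concrete sketch, here is how a single transaction might couple a change to the hive table with its outbox event. The hive columns, the UUIDs, and the payload shape are made up for illustration; gen_random_uuid() requires PostgreSQL 13+ (or the pgcrypto extension on older versions):

BEGIN;

-- The actual business change (hypothetical hive columns).
INSERT INTO hive (id, owner_id, name)
VALUES ('b7f9c2d4-9f31-4e5a-9c3b-2f8d6a1e0c77',
        '3a9e8b12-54d0-4f6c-8d1a-5e7b9c0d4f21',
        'Garden hive');

-- The corresponding event, committed atomically with the change:
-- either both rows become visible, or neither does.
INSERT INTO outbox (id, aggregate_type, aggregate_id, type, payload)
VALUES (gen_random_uuid(),
        'hive',
        'b7f9c2d4-9f31-4e5a-9c3b-2f8d6a1e0c77',
        'HiveEstablished',
        '{"name": "Garden hive", "ownerId": "3a9e8b12-54d0-4f6c-8d1a-5e7b9c0d4f21"}');

COMMIT;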

We will define our beehive system aggregates in the concluding article.

Outbox pattern downsides

Now, let’s turn our attention to the most common concerns.

Unlimited outbox table size

The mechanism described so far overlooks a crucial detail: the outbox table could potentially grow indefinitely with each new event.

We have two primary strategies that address this issue:

  1. Immediate deletion: The entry in the outbox can be deleted immediately as part of the same transaction that inserted it. (The rationale for this approach is detailed below.)
  2. Delayed deletion: The entry can be manually deleted only after the consuming service acknowledges the change, or alternatively, the row can be deleted after a fixed time period, whichever happens first.

Option 1 can be effective because, even though no rows ever remain visible in the outbox table, PostgreSQL always logs all changes to the WAL before committing any modifications to user data. Consequently, Debezium can still capture these changes even if no physical record persists in the outbox.
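A minimal sketch of this trick, reusing the outbox table above (the event values and the fixed event id are, again, made up for illustration):

BEGIN;

UPDATE hive
SET name = 'Orchard hive'
WHERE id = 'b7f9c2d4-9f31-4e5a-9c3b-2f8d6a1e0c77';

-- The INSERT is written to the WAL, so logical decoding
-- (and therefore Debezium) still observes the event ...
INSERT INTO outbox (id, aggregate_type, aggregate_id, type, payload)
VALUES ('0e2c7d8a-6b41-4c9f-9a2d-8f3e5b1c6d90',
        'hive',
        'b7f9c2d4-9f31-4e5a-9c3b-2f8d6a1e0c77',
        'HiveRenamed',
        '{"name": "Orchard hive"}');

-- ... even though the row is gone by the time the transaction commits.
DELETE FROM outbox
WHERE id = '0e2c7d8a-6b41-4c9f-9a2d-8f3e5b1c6d90';

COMMIT;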

However, I personally think that this approach does more harm than good because:

  • It is inherently awkward to debug; remember, there would be nothing in the outbox table.
  • There’s no assurance that Debezium will capture all changes, particularly in high-throughput environments where the Write-Ahead Log (WAL) might be recycled rapidly. Admittedly, this points more to an unsuitable PostgreSQL or Debezium configuration, but it nevertheless underlines the debugging problem above.
  • We have no mechanisms to verify if a client has processed a message.
    • This may indeed be desirable, as we want to have as decoupled an architecture as possible. My criticism is aimed more at the fact that this implementation doesn’t even let us consider such a possibility.
  • Replay of events can be a relatively tedious operation.

The second strategy addresses these concerns, albeit potentially introducing more architectural coupling, especially if we choose to and are able to implement client acknowledgements.
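As a sketch of the delayed variant, a periodic cleanup job could look like the following. The created_at and acked_at columns and the seven-day retention window are assumptions layered on top of the outbox table shown earlier, not part of the pattern itself:

-- Assumes two extra columns on the outbox table:
--   created_at timestamptz NOT NULL DEFAULT now()
--   acked_at   timestamptz            -- set once a consumer confirms
DELETE FROM outbox
WHERE acked_at IS NOT NULL
   OR created_at < now() - interval '7 days';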

Handling Potential Debezium Failures

In an ideal scenario, upon restarting after a failure, Debezium will resume processing from its last known position (typically by referencing the replication slot) and catch up on any missed changes.

What if Debezium lags too far behind?

This becomes a significant concern primarily if we opt for immediate deletion of entries from the outbox table; the rationale for this should be clear by now. In such cases, reinitializing Debezium might be the only viable solution. With delayed deletion, we can add a mechanism similar to a dead-letter queue redrive, though explaining it in detail is beyond this article’s scope.
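One way to keep an eye on this in PostgreSQL is to check how far each replication slot trails the current WAL position:

-- How much WAL each slot still has to consume (PostgreSQL 10+).
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS replication_lag
FROM pg_replication_slots;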

Handling Message Broker Failures

The direct answer is: handling this is your responsibility. This is where complexities can arise, particularly if you choose not to use Kafka Connect, which has built-in mechanisms for handling outages. You would then need to design a solution, potentially a custom one, tailored to your specific sink (the destination to which Debezium sends CDC events).

Reliable information propagation

How do we know the message was really processed? This is an aspect where the Outbox pattern, in its standard form, doesn’t provide a direct solution, as it wasn’t originally intended for this purpose. It offers the fundamental structure, but the responsibility of implementing additional tracking or bookkeeping mechanisms on the outbox owner’s side falls upon us.

One idea might be to have an async command channel through which subscribing services can send confirmation messages to the outbox owner once they’ve successfully processed an event, indicating that it is safe to delete the corresponding outbox entry.
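Sketching the owner’s side of such a channel, and reusing the hypothetical acked_at column from the cleanup example above, processing a confirmation could be as simple as:

-- Executed by the outbox owner upon receiving a confirmation;
-- :event_id is the outbox entry id carried by the acknowledgement.
UPDATE outbox
SET acked_at = now()
WHERE id = :event_id;

The cleanup job shown earlier then deletes the acknowledged row, or falls back to the retention window if the confirmation never arrives.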

To conclude our series, let’s proceed to putting everything into practice.


Summary

1. What is the Outbox pattern?
A design pattern for reliable event propagation in distributed systems. Domain events are not only persisted to user tables but also to a dedicated outbox table within the same transaction. The contents of this outbox table are then monitored by CDC and reliably delivered to other services, helping achieve eventual consistency.
2. How to manage outbox table growth?
Either using immediate deletion or delayed deletion. Immediate deletion saves space but complicates debugging and risks missing events, while delayed deletion offers better reliability.
3. What about system failures?
Debezium normally resumes from its last position using replication slots. If using immediate deletion and Debezium lags too far behind, you might need complete reinitialization. Message broker failure handling depends on your architecture (Kafka Connect has built-in resilience).
This post is licensed under CC BY 4.0 by the author.
