We’ve come quite far in this series. By now, you should have a good understanding of how Change Data Capture (CDC) operates internally, along with its limitations and strengths. We’ve also explored database-level support for CDC, using PostgreSQL as a key example.

Now, we’ll turn our attention to a more specific application of CDC: microservices. For a motivating illustration, I highly recommend reading this Debezium article before continuing.

A fundamental challenge in distributed systems is the need for reliable information propagation across all participating services, a task that inherently involves trade-offs. If you haven’t heard of the CAP theorem, there are many excellent resources online which explain it.

Let’s revisit our fictitious beehive management system, featuring two key services: beehive-control, responsible for managing user data, especially their hives, and beehive-external, which gathers data from external sources.

beehive-external only requires a chosen subset of user data, which we can imagine as a ‘remote materialized view’ of the source data owned by beehive-control. The change events that keep this view up to date are transferred via a message broker; in the example concluding our series, we’ll be using Redis Streams.

Hive table change information propagation

Our objective is to ensure that beehive-external is informed of any changes to the source hive table, including what exactly happened.

Simple solutions are yet again not enough!

Initially, one might consider simply capturing and publishing each change to the message broker as a complete solution. However, this approach works reliably only in ideal circumstances: any failure between making the change and delivering the corresponding message can leave the two services out of sync. We need a mechanism to ensure that changes eventually lead to a consistent state: either both beehive-control and beehive-external update their respective data stores, or neither does.

This characteristic is what distributed systems theory terms eventual consistency. It implies that while changes aren’t guaranteed to be instantaneous or atomic, the system is designed to avoid permanent data inconsistencies.

The Outbox Pattern in Domain-Driven Design (DDD)

The core principle of this approach involves having an outbox table within the data owner’s database. For our purposes, when a modification occurs in the hive table, a corresponding record is also inserted into the outbox table – crucially, as part of the same database transaction.

Important terminology note: the Outbox pattern is firmly rooted in the principles of Domain-Driven Design (DDD), which itself operates independently of relational table concepts. DDD places significant emphasis on the ubiquitous language – a shared vocabulary specific to the business domain – which defines fundamental units of information known as aggregates. Changes to these aggregates are then persisted as an event stream, a technique known as event sourcing, which frequently goes hand in hand with event-driven architecture in the DDD context.

Here is what a DDD outbox table typically looks like:

CREATE TABLE outbox (
    id uuid NOT NULL PRIMARY KEY,
    aggregate_type varchar(255) NOT NULL,
    aggregate_id varchar(255) NOT NULL,
    type varchar(255) NOT NULL,
    payload jsonb NOT NULL
);

Let’s dissect the table column by column:

  • id: A unique identifier that will always be visible to consumers, enabling them to manage potential duplicate messages (a minimal de-duplication sketch follows this list).
  • aggregate_type: In DDD terminology, this indicates the root aggregate type. In our examples, this will always be the hive entity, though aggregates can be (and probably will be) more complex depending on your domain.
  • aggregate_id: A reference to the specific aggregate instance that has been modified.
  • type: This field specifies the type of event that occurred, for instance, HiveEstablished, HiveRelocated, HiveDisposed, etc. Notably, DDD principles advise against using generic CRUD-like event types such as HiveNameEdited or HiveDeleted, which naturally follow from PostgreSQL CDC.
  • payload: This column is intentionally type-agnostic, designed to hold the actual event data, which can vary depending on the values of other columns. This pragmatic design avoids the need to create a dedicated outbox table for each distinct root aggregate.
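
Since the same event can be delivered more than once, the id gives consumers a handle for de-duplication. A minimal sketch of consumer-side bookkeeping might look as follows; the processed_events table is a hypothetical construct in the consumer’s database, not part of the Outbox pattern itself:

-- Hypothetical bookkeeping table in the consumer's (beehive-external) database.
CREATE TABLE processed_events (
    id uuid NOT NULL PRIMARY KEY
);

-- On receiving an event, try to record its id; if the insert is skipped
-- because the id already exists, the event is a duplicate and can be ignored.
INSERT INTO processed_events (id)
VALUES ('6d1f0f6e-9c2a-4b8e-8f3a-2b7c9d4e1a55')  -- id taken from the incoming message
ON CONFLICT (id) DO NOTHING;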

For every change we want to propagate, we will also write a corresponding new row to the outbox table within the same transaction. This outbox table is then monitored – using CDC in our scenario – and its contents are published to downstream consumers.
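
To make this concrete, here is a sketch of such a transaction. The hive table columns, the identifiers, and the payload shape are illustrative assumptions; only the outbox columns come from the definition above (gen_random_uuid() is built in since PostgreSQL 13):

BEGIN;

-- The business change itself (hypothetical hive table and columns).
UPDATE hive
SET location = 'south orchard'
WHERE id = '3f2b8c1d-6a5e-4d7f-9b0a-8c4e2d1f6a3b';

-- The corresponding outbox entry, written in the very same transaction.
INSERT INTO outbox (id, aggregate_type, aggregate_id, type, payload)
VALUES (
    gen_random_uuid(),
    'hive',
    '3f2b8c1d-6a5e-4d7f-9b0a-8c4e2d1f6a3b',
    'HiveRelocated',
    '{"hiveId": "3f2b8c1d-6a5e-4d7f-9b0a-8c4e2d1f6a3b", "newLocation": "south orchard"}'
);

COMMIT;

Because both writes share one transaction, either the hive row and the event are persisted together or neither is.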

We will define our beehive system aggregates in the concluding article.

Outbox pattern downsides

Now, let’s turn our attention to the most common concerns:

The outbox table will grow without bound.

Indeed, the mechanism described so far overlooks a crucial detail: the outbox table could potentially grow indefinitely with each new event.

We have two primary strategies that address this issue:

  1. Immediate Deletion: The entry in the outbox can be deleted immediately as part of the same transaction that inserted it. (The rationale for this approach is detailed below.)
  2. Delayed Deletion: The entry can be manually deleted only after the consuming service acknowledges the change, or alternatively, the row can be deleted after a fixed time period, whichever happens first.

The reason option 1 can work is that PostgreSQL writes every change, including both the insert and the immediate delete, to its Write-Ahead Log (WAL) before the transaction commits. Debezium reads changes from the WAL rather than from the table itself, so it can still capture the event even though no physical record remains in the outbox.
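
A sketch of what the immediate-deletion variant could look like, reusing the hypothetical transaction from above; the INSERT and the DELETE both end up in the WAL, so Debezium can still decode the event even though no row survives the commit:

BEGIN;

UPDATE hive
SET location = 'south orchard'
WHERE id = '3f2b8c1d-6a5e-4d7f-9b0a-8c4e2d1f6a3b';

-- Write the event so that it is recorded in the WAL...
INSERT INTO outbox (id, aggregate_type, aggregate_id, type, payload)
VALUES (
    'a0e1b2c3-d4e5-4f60-8a7b-9c0d1e2f3a4b',
    'hive',
    '3f2b8c1d-6a5e-4d7f-9b0a-8c4e2d1f6a3b',
    'HiveRelocated',
    '{"hiveId": "3f2b8c1d-6a5e-4d7f-9b0a-8c4e2d1f6a3b", "newLocation": "south orchard"}'
);

-- ...and remove it again before committing, so the table stays empty.
DELETE FROM outbox
WHERE id = 'a0e1b2c3-d4e5-4f60-8a7b-9c0d1e2f3a4b';

COMMIT;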

However, I personally think that this approach does more harm than good because:

  • It is inherently awkward to debug. (Remember, there is nothing in the outbox table.)
  • There’s no assurance that Debezium will capture all changes: if the replication slot is lost or invalidated (for example because WAL retention limits were exceeded in a high-throughput environment), the deleted outbox rows are gone and cannot be recovered by taking a fresh snapshot.
  • We have no mechanisms to verify if a client has processed a message.
    • This may indeed be desirable, as we want the architecture to be as decoupled as possible. My criticism is more that this implementation does not even leave room to consider such a possibility.
  • Replay of events can be a relatively tedious operation.

The second strategy addresses these concerns, albeit at the cost of potentially more architectural coupling, especially if we choose to, and are able to, implement client acknowledgements.
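
For the time-based half of the second strategy, one possible sketch is to extend the outbox with a creation timestamp (not part of the schema shown earlier) and run a periodic cleanup job:

-- Hypothetical extension of the outbox for time-based cleanup.
ALTER TABLE outbox ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

-- Periodic cleanup job (e.g. cron or pg_cron): drop entries older than a
-- chosen retention window, whether or not an acknowledgement has arrived.
DELETE FROM outbox
WHERE created_at < now() - interval '7 days';

The acknowledgement-driven half is sketched further below, where we discuss verifying that a message was really processed.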

Handling Potential Debezium Failures

In an ideal scenario, upon restarting after a failure, Debezium will resume processing from its last known position (typically by referencing the replication slot) and catch up on any missed changes.

What if Debezium lags too far behind?

This becomes a significant concern primarily if we opt for immediate deletion of entries from the outbox table; the rationale for this should be clear by now. In such cases, reinitializing Debezium might be the only viable solution. With the delayed deletion approach, however, we can implement a mechanism analogous to a dead-letter queue redrive, though a detailed explanation of that is beyond the scope of this article.
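
To detect this situation before it becomes a problem, you can monitor how much WAL the Debezium replication slot is holding back; on PostgreSQL 10 or newer, a query along these lines does the job:

-- Retained WAL per replication slot; a steadily growing value means the
-- consumer (Debezium) is falling behind.
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS replication_lag
FROM pg_replication_slots;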

Handling Message Broker Failures

The direct answer is: handling this is your responsibility. This is where complexities can arise, particularly if you choose not to use Kafka Connect, which has mechanisms to handle outages. You would then need to design a solution, potentially a custom one, tailored to your specific sink (the destination to which Debezium sends CDC events).

How do we know the message was really processed?

This is an aspect where the Outbox pattern, in its standard form, doesn’t provide a direct solution, as it wasn’t originally intended for this purpose. It offers the fundamental structure, but the responsibility of implementing additional tracking or bookkeeping mechanisms on the outbox owner’s side falls upon us.

One idea might be to have an async command channel through which subscribing services can send confirmation messages to the outbox owner once they’ve successfully processed an event, indicating that it is safe to delete the corresponding outbox entry.
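
A rough sketch of what that bookkeeping could look like on the outbox owner’s side; the outbox_ack table and the assumption that each event has a known set of expected consumers are illustrative, not part of the Outbox pattern:

-- Hypothetical acknowledgement table in beehive-control's database.
CREATE TABLE outbox_ack (
    outbox_id uuid NOT NULL,
    consumer varchar(255) NOT NULL,
    acked_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (outbox_id, consumer)
);

-- Recorded whenever a confirmation arrives on the command channel.
INSERT INTO outbox_ack (outbox_id, consumer)
VALUES ('a0e1b2c3-d4e5-4f60-8a7b-9c0d1e2f3a4b', 'beehive-external')
ON CONFLICT DO NOTHING;

-- Outbox entries acknowledged by every expected consumer (here just one)
-- can then be deleted safely.
DELETE FROM outbox o
WHERE EXISTS (
    SELECT 1
    FROM outbox_ack a
    WHERE a.outbox_id = o.id
      AND a.consumer = 'beehive-external'
);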

To conclude this part of our discussion, let’s proceed to putting everything into practice.