Exactly-Once Upon a Time

Exactly-Once Upon a Time

Implementing exactly-once processing in a distributed system

Exactly-once delivery is impossible in a distributed system. This unfortunate reality is shown by a variety of mind experiments, among which the two generals' problem is one of the most famous.

However, this is where we engineers had to set aside our differences and cooperate for the greater good. Namely, we solved the exactly-once delivery problem in a brilliant way: by avoiding caring about message delivery altogether! First of all, message delivery it's not a well-defined concept in a distributed system that operates across a network. Is the message delivered when the TCP connection is opened? Or when all the bytes were transmitted to the recipient? Or even when the connection is gracefully closed and all the network resources freed?

It is clearly troublesome to solve a problem if we are unable to even define what the ideal outcome is: luckily, in IT we are usually not concerned about message delivery per se, but rather about message processing and its side effects. This is a brilliant realization because it means that we can swiftly substitute our original problem with another, more approachable one: exactly-once processing.

Divide and Conquer

Borrowing the terms and tools that were taught to me during my Calculus class, I could say that exactly-once processing is verified if and only if the message is delivered once.  A rather difficult problem on its own. Still, it is possible to split it into two more manageable subproblems:

  1. process the message at least once
  2. process the message at most once

If both conditions are satisfied at the same time, then it is apparent that we reached our goal of processing the message only one time! Let's face them individually now.

At Least Once Processing

In practical real-world situations, when delivering a message to a certain endpoint, we can rely on the fact that either the recipient will acknowledge the message returning an OK or KO status, or at the very least we established a proper timeout after which we will assume that the message was never delivered and processed. Excluding the possibility of  fatal errors on the recipient side, we can always retry to send the message until we receive some confirmation from our counterpart:

A payment request is retried until a confirmation response is sent back.

et voilà, we already satisfied the first half of the condition.

At Most Once Processing

Of course, it's possible that our recipient received some of our messages, but we failed to acknowledge them in time. With no additional context, the receiver may process some of the messages multiple times. This is not a problem if the operation to be executed is idempotent (e.g. updating the user's address), but we certainly do not want to process the same payment multiple times. If only there was a way to handle non-idempotent operations as if they were idempotent!

Well, apparently there is a trick to reproduce the idempotency attribute. For example, we could attach a unique piece of information to our payment request, like a UUID, allowing the consumer to compare different messages and filter out duplicate retries.

Diagram adapted from Gergely Orosz: Designing a Payment System

As you can see from the diagram, even if the caller sends multiple identical requests as part of its retry strategy, the recipient is able to detect when an idempotency key was already seen and politely reject the duplicate request.

References