# Article

## Brief introduction to the Topos Reliable Broadcast protocol

In this post, we will give a brief introduction to reliable broadcast and how we use it in Topos.

The first thing to mention about reliable broadcast is that it can be implemented in the asynchronous model, meaning that the correctness of the algorithm is independent of any timing assumptions. Under asynchrony, given that the adversary is bounded by a threshold (typically, in a system of $n$ processes, we assume an adversary who can only control $f$ processes such that $3f \leq n-1$), the protocol is not vulnerable to the adversary delaying messages.

The properties of a reliable broadcast are the following:

• Validity: If a correct process $p$ broadcasts a message $m$, then every correct process eventually delivers $m$.
• Consistency: If a correct process $p$ delivers a message $m$ and another correct process $q$ delivers a message $m'$, then $m = m'$.
• Totality: If a correct process delivers a message, then every correct process eventually delivers a message.

The combination of the Consistency and Totality properties makes up for the agreement property.

Given these properties, let’s build a reliable broadcast!

We will show a simple and efficient reliable broadcast which tolerates a $0 \leq f < \frac{n}{3}$ fraction of Byzantine processes in a system of a total of $n$ processes.

The broadcast primitive is composed of steps during which processes wait until they received enough messages before sending certain messages themselves:

• Step 1 (Broadcast): a process broadcasts a message $m$ to the rest of the system.
• Step 2 (Echo): upon reception of the message $m$ for the first time, a process echoes the message to all the processes with an $(Echo,m)$ message.
• Step 3 (Ready): upon reception of more than $\frac{n+f}{2}$ $(Echo,m)$ messages or $f+1$ $(Ready, m)$ messages, a correct process sends a $(Ready,m)$ message to everyone.
• Step 4 (Delivery): upon receiving $2f+1$ $(Ready,m)$ messages, a correct process delivers the message $m$.

Performance wise, this primitive has quadratic message complexity so it does not scale well.

But more on this later!

Now that we have our reliable broadcast, let’s prove its correctness:

Proof (Validity): If the sender of message $m$ is correct, then all correct processes will send an $(Echo,m)$ message. All correct processes receive at least $n-f$ $(Echo,m)$ messages and since $n-f > \frac{n+f}{2}$, it triggers the emission of $n-f$ $(Ready,m)$ messages, which in turn will trigger delivery of the message $m$ by those processes.

Proof (Consistency): For a correct process $p$ to deliver message $m$, it needs to receive $2f+1$ $(Ready,m)$ messages. A set $Q_1$ of cardinality $\vert Q_1\vert = 2f+1$ is a Byzantine quorum and two such quorums intersect in at least one correct process because from $\vert Q_1 \vert = \vert Q_2 \vert =2f+1$, it follows that $\vert Q_1 \cap Q_2 \vert \geq f+1$. Now consider another correct process $q$ which similarly received $2f+1$ $(Ready,m')$ messages for message $m'$. Since the quorums of $p$ and $q$ intersect in a correct process, this correct process sent the same $Ready$ message to both $p$ and $q$, hence $m = m'$.

Proof (Totality): Again, to deliver a message $m$, a correct process $p$ must have received $2f+1$ $(Ready,m)$ messages. This means that at least $f+1$ of such $Ready$ messages came from correct processes meaning that these processes have already received more than $\frac{n+f}{2}$ $(Echo,m)$ messages or $f+1$ $(Ready, m)$ messages (as per Step 3 above). By the Consistency property, these processes did not send $Ready$ messages for a different message $m’$ so it is not possible that message $m’$ could have $\frac{n+f}{2}$ $(Echo,m')$ messages nor $f+1$ $(Ready,m’)$ messages.

We have shown that this simple reliable broadcast is correct!

# Can we do better?

As we have seen throughout this post, the reliable broadcast primitive has quadratic message complexity. Thus, if the number of processes increases, the number of messages explodes and processes become overwhelmed. A process can only handle so much, so these algorithms are not scalable. Direct communication between participants incur very large communication cost. A natural way to reduce communication cost between processes is to have hierarchical communication, such that each process only communicates with a relatively small number of other processes (compared to the size of the system). The per-process load is reduced, at the cost of communication latency. For instance, if the processes are organized in a tree structure, the communication latency is logarithmic in the size of the system.

In Topos, the Transmission Control Engine (TCE) uses gossip to propagate messages, hence processes are organized in a tree to minimize communication overhead. Furthermore, the Topos Reliable Broadcast trades deterministic guarantees for probabilistic ones, while preserving good reliability guarantees, as they allow to further reduce the overhead which makes it particularly well suited for large systems. By making the protocol probabilistic, we need to revise the properties of the reliable broadcast.

We need weaker properties to account for this transition from a deterministic to a probabilistic algorithm. Given a failure probability $\epsilon$ (e.g., $\epsilon < 10^{-12}$), our three properties become:

• Validity: If a correct process $p$ broadcasts a message $m$, then every correct process eventually delivers $m$ with probability at least $(1 − \epsilon)$.
• Consistency: If a correct process delivers a message $m$ and another correct process delivers a message $m’$, then $m = m'$ with probability at least $(1 − \epsilon)$.
• Totality: If a correct process delivers a message, then every correct process eventually delivers a message with probability at least $(1 − \epsilon)$.

Probabilistic broadcast has weaker guarantees than its deterministic counterparts but the failure probability $\epsilon$ can be made arbitrarily small.

Furthermore, it allows the TCE to scale very easily as the per-process communication is only logarithmic and the total communication complexity is quasilinear in the size of the system!

In a future post, we will dive in the core details of the Topos Reliable Broadcast.

We highly recommend checking out Introduction to Reliable and Secure Distributed Programming!