https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works

2021 UPDATE: I have revamped this post having now become a BookKeeper committer and a Pulsar contributor. I have also formally verified the BookKeeper protocol in TLA+ and understand how everything works at a deeper level.

The aim of this post is to provide a high-level description of how Apache Pulsar works internally. It should give you a decent mental model of its architecture and how it offers its guarantees. This post is not for people who want to understand how to use Apache Pulsar.

Claims

The main claims that I am interested in are:

Apache Pulsar chooses consistency over availability, as do its sister projects BookKeeper and ZooKeeper. Every effort is made to provide strong consistency.

We'll be taking a look at Pulsar's design to see if those claims are valid. In the next post we'll put the implementation of that design to the test. I won’t cover geo-replication in this post; we’ll look at that another day. Here we’ll just focus on a single cluster.

Multiple layers of abstraction

Apache Pulsar has the high-level concepts of topics and subscriptions, while at its lowest level data is stored in binary files, distributed across multiple servers, which interleave data from multiple topics. In between are a myriad of details and moving parts. I personally find it easier to understand the Pulsar architecture if I separate it out into different layers of abstraction, so that’s what I’ll do in this post.

Let's take a journey down the layers.

Fig 1. Layers of abstraction

Layer 1 - Topics, Subscriptions and Cursors

This is not a post about messaging architectures that you can build with Apache Pulsar. We’ll just cover the basics of what topics, subscriptions and cursors are, without going into any depth on the wider messaging patterns that Pulsar enables.