https://www.mparticle.com/blog/apache-pulsar-migration/
Behind every enterprise software application sits a complex and dynamic service architecture that, by and large, the people using this software don’t care about. Ideally, they shouldn’t have to. End users are typically only concerned with whether the software does what it purports to do. The best software backends behave like the hulking mass of an iceberg that rests underwater: unseen, undetected, yet quietly keeping the visible portion afloat.
Every month, mParticle processes 108 billion API calls and receives and forwards more than 686 billion events to help companies around the world like CKE Restaurants, Airbnb, and Venmo manage their customer data at scale. Executing this consistently, with razor-thin margins for error, means that every component within our backend has to be fully optimized for the process it executes. So back in 2020, when we noticed that our Audiences service was experiencing data degradation as a result of the message queue that powered it (Amazon SQS at the time), our engineering team knew they had to search for another solution.
Within mParticle’s backend, streaming engines move data from the points where it is ingested, like client SDKs and server-to-server feeds, to the services that consume it, such as Audiences. Since mParticle ingests data from multiple sources and delivers it to multiple services in real time, multiple queues are needed to handle message delivery between clients and services within our architecture.
Amazon SQS and concurrency issues
Prior to adopting Apache Pulsar, Amazon SQS powered data transfers between the components within the mParticle Audience service. While SQS met the needs of most of our use cases at the time (and continues to do so on other services within our architecture), it lacked the ability to serialize incoming messages in a way that would prevent overwrites in the read/modify/write cycles that power Audiences.
Using Amazon SQS, when multiple services (client SDKs and S2S feeds, for example) performed read/modify/write tasks against a common database at the same time, those tasks could overlap, resulting in a lost-update race condition like the one sketched below.
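At a high level, this is the classic lost-update problem. The following sketch (illustrative Java only, not mParticle code) shows two workers performing an unguarded read/modify/write against the same user record in a shared store; because nothing orders their work per user, one of the two updates is silently overwritten:

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal, hypothetical sketch: two workers consume events for the same user
// from a queue that gives no per-user ordering, and each performs an
// unguarded read/modify/write against a shared profile store.
public class LostUpdateDemo {
    // Stands in for the profile database; key = user ID, value = an attribute.
    static final ConcurrentHashMap<String, Integer> profileStore = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        profileStore.put("user-123", 0);

        Runnable worker = () -> {
            // READ: fetch the current value of the attribute.
            int current = profileStore.get("user-123");
            // MODIFY: apply this event's contribution.
            int updated = current + 1;
            // Simulate processing latency so the two workers interleave.
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            // WRITE: persist the new value, overwriting whatever is there.
            profileStore.put("user-123", updated);
        };

        Thread a = new Thread(worker);
        Thread b = new Thread(worker);
        a.start(); b.start();
        a.join(); b.join();

        // Two events were processed, but the stored value is typically 1, not 2:
        // the second writer overwrote the first writer's update.
        System.out.println("attribute value: " + profileStore.get("user-123"));
    }
}
```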
These conditions were not occurring at a very high rate overall. For a widely used core service like Audiences, however, this concurrency issue could result in an unacceptable number of inaccurate entries over time. This became especially apparent during the development of our Calculated Attributes feature, which allows customers to run a calculation on a particular user or event attribute. When a Calculated Attribute is created, mParticle continuously and automatically updates its value from raw streams of user and event data. Given the demands of this feature, the unwanted data overwrites we were seeing with SQS would not be acceptable.
To get around this concurrency challenge, our engineers looked for a solution that would allow us to serialize incoming messages in Audiences (as well as other services) and ensure that only one read/modify/write task could run against a given user profile at any time. Our team explored a host of data streaming services that delivered this capability and identified two frontrunners, both from Apache: Kafka and Pulsar.
These solutions stood out from the pack because both allow users to supply a key by which incoming data streams are partitioned. By using the mParticle ID (MPID) as this key, we can ensure that events from the same user are processed sequentially rather than concurrently, preventing the errors and data consistency issues we were experiencing when processing simultaneous requests from the same user with SQS.
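To make the keying idea concrete, here is a minimal sketch using the Apache Pulsar Java client. The broker URL, topic, and subscription names are placeholders rather than mParticle’s actual configuration; the point is that each message carries the MPID as its key, and a Key_Shared subscription delivers all messages with the same key to a single consumer, in order.

```java
import org.apache.pulsar.client.api.*;

// Illustrative sketch of key-based ordering with the Apache Pulsar Java client.
// Service URL, topic, and subscription names are placeholders, not mParticle's
// real configuration; keying messages on the MPID is the relevant idea.
public class KeyedAudienceStream {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();

        // Producer side: every message carries the user's MPID as its key,
        // so all events for one user land in the same ordered stream.
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("audience-events") // placeholder topic name
                .create();

        producer.newMessage()
                .key("mpid-123")                    // mParticle ID as the partition key
                .value("{\"event\":\"purchase\"}")  // illustrative event payload
                .send();

        // Consumer side: a Key_Shared subscription lets many consumers share the
        // topic while guaranteeing that all messages with the same key (i.e. the
        // same user) go to a single consumer, preserving per-user ordering.
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("audience-events")
                .subscriptionName("audience-calculator") // placeholder name
                .subscriptionType(SubscriptionType.Key_Shared)
                .subscribe();

        Message<String> msg = consumer.receive();
        // Perform the read/modify/write for this user, then acknowledge.
        consumer.acknowledge(msg);

        consumer.close();
        producer.close();
        client.close();
    }
}
```

With this shape, concurrent read/modify/write cycles for the same user can no longer interleave, because only one consumer ever holds that user's key at a time.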