Crunch, Data Engineering and Analytics Conference, October 29-31, 2018 Budapest
Ivan Kelly

Ivan Kelly

Software Engineer at Streamlio

Bio:

Ivan Kelly is a software engineer at Streamlio, a startup dedicated to providing a next-generation integrated real-time stream processing solution, based on Heron, Apache Pulsar (incubating), and Apache BookKeeper. Ivan has been active in Apache BookKeeper since its very early days as a project in Yahoo! Research Barcelona. Specializing in replicated logging and transaction processing, he is currently focused on Streamlio’s storage layer.

Talk:

Infinite* Topic Backlogs with Apache Pulsar

Topics:
streaming
logs
messaging
pulsar
real-time
Level:
Intermediate

Messaging systems are an essential part of any real time analytics engine. A common pattern is to feed a user event stream into a processing engine, show the result to the user, capture feedback from the user, push the feedback back into the event stream, and so on. The quality of the result shown to the user is often a function of the amount of data in the event stream, so the more your event stream scales, the better you can serve your users.

Messaging systems have recently started to push into the field of long term data storage and event stores, where you cannot compromise on retention. If data is written to the system, it must stay there.Infinite retention can be challenging for a messaging system. As data grows for a single topic, you need to start storing different parts of the backlog on different sets of machines without losing consistency.

In this talk I will describe how Pulsar uses segment oriented architecture. It provides a unit of consensus called a ledger. Pulsar strings together a number of ledgers to build the complete topic backlog. Each ledger in the topic backlog is independent of all previous ledgers with regards to location. This allows us to scale the size of the topic retention simply by adding more machines. When the storage node is added to a Pulsar cluster, the brokers will detect it, and gradually start writing new data to the new node. There’s no disruptive rebalancing operation necessary.

Of course adding more machines will eventually get very expensive. This is where tiered storage comes in. With tiered storage, parts of the topic backlog can be moved to cheaper storage such as Amazon S3 or Google Cloud Storage. We will discuss the architecture of tiered storage, and how it is a natural continuation of Pulsar’s segment oriented architecture.