Apache Pulsar Architecture
Apache Pulsar is a publish-subscribe messaging system consisting of brokers, Apache BookKeeper, producers, consumers, and other components.
Producer: message producer, responsible for publishing messages to topics.
Consumer: message consumer, responsible for subscribing to topics and receiving their messages.
Broker: a stateless service layer, responsible for receiving and delivering messages, cluster load balancing, and other operations. The broker itself persists neither message data nor metadata, so it can be brought online and taken offline quickly.
Apache BookKeeper: a stateful persistence layer, consisting of a set of bookie storage nodes that can store messages persistently.
Apache Pulsar separates compute from storage in its architecture. The computing logic for message publishing and subscription runs in the broker, and the data is stored on the bookie nodes in the Apache BookKeeper cluster.
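The separation described above can be illustrated with a small in-memory model. This is only a sketch: the `BookieStore` and `Broker` classes are hypothetical stand-ins written for this example and are not part of any Pulsar or BookKeeper API. It shows that a broker holding no message data can be replaced by a fresh instance without losing anything.

```python
from collections import defaultdict


class BookieStore:
    """Hypothetical stand-in for the stateful persistence layer."""

    def __init__(self):
        # topic name -> ordered list of stored messages
        self._log = defaultdict(list)

    def append(self, topic, message):
        self._log[topic].append(message)
        return len(self._log[topic]) - 1  # position of the stored message

    def read(self, topic, start=0):
        return list(self._log[topic][start:])


class Broker:
    """Hypothetical stateless service layer: it keeps no message data itself."""

    def __init__(self, store):
        self._store = store

    def publish(self, topic, message):
        return self._store.append(topic, message)

    def deliver(self, topic, start=0):
        return self._store.read(topic, start)


store = BookieStore()
broker_a = Broker(store)
broker_a.publish("my-topic", "hello")
broker_a.publish("my-topic", "world")

# Take broker_a offline and bring a fresh broker online against the same storage.
del broker_a
broker_b = Broker(store)
messages = broker_b.deliver("my-topic")
print(messages)
```

Because all state lives in the storage object, replacing the broker requires no data copy, which is the property the architecture relies on.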
Topics and Partitions
A topic is a named category to which messages are published and in which they are stored. Producers write messages to topics, and consumers read messages from topics.
In Apache Pulsar, topics are classified into partitioned topics and non-partitioned topics. A non-partitioned topic can be understood as a topic with one partition. A partitioned topic is a virtual concept: creating a topic with 3 partitions actually creates 3 internal topics, one per partition. Messages sent to the partitioned topic are distributed across these internal topics.
For example, a producer sends messages to a topic named my-topic with three partitions. In the data stream, the messages are sent to three internal topics, my-topic-partition-0, my-topic-partition-1, and my-topic-partition-2, either evenly (if no key is specified) or according to a rule based on the message key (if a key is specified).
When a partitioned topic is persisted, partitions are a logical concept, and the actual storage units are segments.
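The routing behavior in this example can be sketched as follows. This is a simplified illustration, not the Pulsar client's implementation: the `Router` class is hypothetical, plain round-robin stands in for "evenly", and CRC32 is an assumed stand-in for whatever key-hashing scheme the real client is configured with.

```python
import itertools
import zlib


def partition_names(topic, num_partitions):
    """Names of the internal topics behind a partitioned topic."""
    return [f"{topic}-partition-{i}" for i in range(num_partitions)]


class Router:
    """Hypothetical router: round-robin without a key, hash-based with a key."""

    def __init__(self, topic, num_partitions):
        self.partitions = partition_names(topic, num_partitions)
        self._next = itertools.cycle(range(num_partitions))

    def route(self, key=None):
        if key is None:
            # No key: spread messages evenly across partitions.
            index = next(self._next)
        else:
            # Key given: the same key always maps to the same partition.
            # CRC32 is an assumption made for this sketch.
            index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]


router = Router("my-topic", 3)
keyless = [router.route() for _ in range(6)]
keyed = [router.route("order-42") for _ in range(3)]
print(keyless)
print(keyed)
```

Keyless messages cycle through the three internal topics, while every message with the key `order-42` lands on a single partition, which preserves per-key ordering.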
As shown in the following figure, the data of partition Topic1-Part2 consists of N segments. The segments are evenly distributed and stored on multiple bookie nodes in the Apache BookKeeper cluster. Each segment has 3 replicas.
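A minimal sketch of how N segments with 3 replicas each might be spread over a set of bookies is shown below. The rotation scheme and the `place_segments` function are assumptions chosen for illustration; BookKeeper's actual placement policy is not described in this document and may differ.

```python
def place_segments(num_segments, bookies, replicas=3):
    """Assign each segment to `replicas` distinct bookies by simple rotation."""
    if replicas > len(bookies):
        raise ValueError("need at least as many bookies as replicas")
    placement = {}
    for seg in range(num_segments):
        # Start at a different bookie for each segment so load spreads out.
        placement[seg] = [bookies[(seg + r) % len(bookies)] for r in range(replicas)]
    return placement


bookies = ["bookie-1", "bookie-2", "bookie-3", "bookie-4"]
placement = place_segments(8, bookies)

# Count how many segment replicas each bookie ends up holding.
load = {b: sum(b in nodes for nodes in placement.values()) for b in bookies}
print(placement[0])
print(load)
```

With 8 segments, 3 replicas, and 4 bookies, the 24 replicas divide evenly, so no single node holds the whole partition.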
Physical Partitions and Logical Partitions
The comparison between logical partitions and physical partitions is as follows:
Physical partitions: Computing and storage are coupled. For fault tolerance, the physical partitions need to be replicated. In the case of scale-out, physical partitions need to be moved to new nodes to achieve load balancing.
Logical partitions: The physical storage units are segments (shards) rather than whole partitions, and the computing layer and the storage layer are isolated. This structure gives Apache Pulsar the following advantages:
Brokers and bookies are independent of each other, so each layer can scale out and tolerate faults independently.
The broker is stateless, so it can be brought online and taken offline quickly, which makes it well suited to cloud-native scenarios.
Partition storage is not limited to the storage capacity of a single node.
Partition data is evenly distributed, so a single partition with a large amount of data does not become a bottleneck for the entire cluster.
When storage capacity runs short and the cluster is scaled out, the new nodes can quickly take on storage load and rebalance the cluster.
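The last point can be illustrated with a small simulation. It assumes a hypothetical "least-loaded first" policy for placing new segments; the function names and the policy itself are illustrative and not taken from BookKeeper. The idea it shows is that a newly added bookie starts receiving new segments immediately, without existing data having to be moved first.

```python
def pick_bookies(usage, replicas=3):
    """Pick the `replicas` least-loaded bookies (ties broken by name)."""
    return sorted(usage, key=lambda b: (usage[b], b))[:replicas]


def write_segments(usage, count, replicas=3):
    """Write `count` new segments, each replicated on the least-loaded bookies."""
    for _ in range(count):
        for bookie in pick_bookies(usage, replicas):
            usage[bookie] += 1
    return usage


# Three bookies, each ending up with 4 segment replicas.
usage = {"bookie-1": 0, "bookie-2": 0, "bookie-3": 0}
write_segments(usage, 4)
before = dict(usage)

# Scale out: add an empty bookie, then keep writing.
usage["bookie-4"] = 0
write_segments(usage, 2)
print(before)
print(usage)
```

Every segment written after the scale-out places one replica on the new node until its load catches up with the others.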