Scenarios
Data compression can reduce network I/O traffic and disk usage. This document describes the message formats that data compression supports and explains how to configure data compression as needed.
Message Format
Currently, TDMQ for CKafka (CKafka) supports two message formats: V1 and V2 (V2 was introduced in version 0.11.0.0). CKafka is compatible with the message formats of versions 0.9, 0.10, 1.1, 2.4, 2.8, and 3.2.
Different versions correspond to different configurations. The details are as follows:
The purpose of message format conversion is primarily to ensure compatibility with early versions of consumer programs. In a CKafka cluster, multiple versions of message formats (V1/V2) are typically stored simultaneously.
To serve these consumers, the broker converts messages in the newer format to the earlier format, which requires decompressing and recompressing the messages.
Message format conversion significantly impacts performance. Besides adding extra compression and decompression operations, it causes CKafka to lose its excellent zero-copy feature. Therefore, it is essential to ensure the uniformity of message formats.
Zero-copy: When data is transferred between the disk and the network, zero-copy avoids expensive data copies between kernel mode and user mode, which makes transmission faster.
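As an illustration only, the following Java sketch shows the general zero-copy technique using the JDK's FileChannel.transferTo, which lets the operating system move bytes from a file to another channel without passing them through an application buffer, where the platform supports it. The class and method names (ZeroCopyTransfer, transfer, roundTrip) are hypothetical, and the sketch is not CKafka broker code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper that illustrates zero-copy transfer; not part of any CKafka SDK.
final class ZeroCopyTransfer {

    /** Sends the whole file to the target channel and returns the number of bytes sent. */
    static long transfer(Path sourceFile, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(sourceFile, StandardOpenOption.READ)) {
            long position = 0;
            long size = source.size();
            // transferTo may send fewer bytes than requested, so loop until done.
            while (position < size) {
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    /** Demo: writes data to a temporary file, transfers it to a second file, and cleans up. */
    static long roundTrip(byte[] data) {
        try {
            Path src = Files.createTempFile("zc-src", ".bin");
            Path dst = Files.createTempFile("zc-dst", ".bin");
            try {
                Files.write(src, data);
                try (FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
                    return transfer(src, out);
                }
            } finally {
                Files.deleteIfExists(src);
                Files.deleteIfExists(dst);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes transferred: " + roundTrip(new byte[4096]));
    }
}
```

When the broker has to decompress and recompress messages for format conversion, the data must pass through application memory, so a direct transfer of this kind is no longer possible.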
Compression Algorithm Comparison
The Snappy algorithm is officially recommended because it has a smaller and more stable impact on CPU usage.
The analysis process is as follows:
A compression algorithm is evaluated based on two major metrics: compression ratio and compression/decompression throughput.
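The two metrics can be measured on your own data. The following is a minimal sketch that uses only the JDK's built-in GZIP implementation; Snappy and LZ4 require third-party libraries and are not shown. The class name CompressionMetrics, its methods, and the sample payload are hypothetical, and compression ratio is assumed here to mean original size divided by compressed size.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for measuring compression ratio and throughput.
final class CompressionMetrics {

    /** Compresses the input with GZIP and returns the compressed bytes. */
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Compression ratio: original size divided by compressed size (higher is better). */
    static double compressionRatio(long originalBytes, long compressedBytes) {
        return (double) originalBytes / compressedBytes;
    }

    /** Throughput in MB/s for the given number of bytes and elapsed time. */
    static double throughputMBps(long bytes, long elapsedNanos) {
        return (bytes / 1_000_000.0) / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        // Sample payload (an assumption): repetitive JSON-like records.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("{\"user\":\"demo\",\"event\":\"click\"}\n");
        }
        byte[] payload = sb.toString().getBytes(StandardCharsets.UTF_8);

        long start = System.nanoTime();
        byte[] compressed = gzip(payload);
        long elapsed = Math.max(1, System.nanoTime() - start);

        System.out.printf("ratio=%.1f, compression throughput=%.1f MB/s%n",
                compressionRatio(payload.length, compressed.length),
                throughputMBps(payload.length, elapsed));
    }
}
```

A single timed run like this is only indicative; results depend heavily on the payload and on JVM warm-up, so measure with data that resembles your real messages.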
CKafka versions earlier than 2.1.0 support three compression algorithms: GZIP, Snappy, and LZ4.
In actual usage of CKafka, the performance metrics of the three algorithms are compared as follows:
Compression ratio: LZ4 > GZIP > Snappy
Throughput: LZ4 > Snappy > GZIP
The physical resource usage is as follows:
Bandwidth: Snappy occupies the most network bandwidth as it has the lowest compression ratio.
CPU: During compression, Snappy uses more CPU; during decompression, GZIP uses more CPU.
Under normal circumstances, the recommended order of the three compression algorithms is: LZ4 > GZIP > Snappy.
Long-term testing in the live network shows that this conclusion holds in most cases. However, in certain extreme scenarios, the LZ4 compression algorithm may increase CPU load.
Analysis shows that compression algorithms perform differently depending on the source data of each service. Therefore, users who are sensitive to CPU metrics are advised to use the more stable Snappy algorithm.
Note:
The GZIP algorithm is not recommended for CKafka. Enabling GZIP compression consumes additional CPU resources on the CKafka server. According to load testing data, if GZIP compression is enabled, you are advised to reserve approximately 75% of the bandwidth as a buffer. (The reserved ratio is for reference only. Determine the actual ratio based on monitoring data.)
For example, for an instance with a bandwidth of 40 MB/s, after enabling GZIP compression, you are advised to increase the bandwidth to 160 MB/s (40/(1 - 75%) = 160).
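The calculation in the example above can be sketched as a small helper. The class and method names (BandwidthPlanner, requiredBandwidth) are hypothetical and only restate the formula current bandwidth / (1 - reserved ratio).

```java
// Hypothetical helper that restates the bandwidth reservation formula.
final class BandwidthPlanner {

    /**
     * @param currentMBps   bandwidth the instance needs today, in MB/s
     * @param reservedRatio fraction of the total bandwidth kept as a buffer, in [0, 1)
     * @return total bandwidth to provision, in MB/s
     */
    static double requiredBandwidth(double currentMBps, double reservedRatio) {
        if (reservedRatio < 0.0 || reservedRatio >= 1.0) {
            throw new IllegalArgumentException("reservedRatio must be in [0, 1)");
        }
        return currentMBps / (1.0 - reservedRatio);
    }

    public static void main(String[] args) {
        // 40 MB/s with a 75% reserved buffer, as in the example above.
        System.out.println(requiredBandwidth(40.0, 0.75)); // prints 160.0
    }
}
```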
Data Compression Configuration
Producers can configure data compression as instructed below:
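import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

Properties props = new Properties();
// Replace with the access address of your instance.
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// Enable producer-side compression; this example uses Snappy.
props.put("compression.type", "snappy");
Producer<String, String> producer = new KafkaProducer<>(props);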
In most cases, the broker only stores messages received from the producer without any modifications.
Notes
When data is sent to CKafka, compression.codec cannot be set.
Version 1.1 and earlier do not support the GZIP compression format by default.
GZIP compression involves high CPU consumption. The use of GZIP will cause all messages to be invalid. GZIP compression is not recommended for CKafka.
Once enabled, GZIP consumes significant CPU resources, which can make CPU the bottleneck that limits bandwidth. If GZIP is enabled, it is recommended to increase the values of linger.ms and batch.size for the producer.
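If GZIP has to be used, the producer settings mentioned above might look like the following sketch. It only builds the configuration object, so it runs without the Kafka client library. The class name GzipProducerSettings and the specific values for linger.ms and batch.size are illustrative assumptions, not tested recommendations; tune them based on your own monitoring data.

```java
import java.util.Properties;

// Hypothetical helper that assembles producer settings for GZIP compression.
final class GzipProducerSettings {

    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "gzip");
        // Illustrative values (assumptions): wait longer and batch more per request,
        // so that each compression call covers more messages.
        props.put("linger.ms", "100");
        props.put("batch.size", "262144");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The resulting Properties object can be passed to the KafkaProducer constructor in the same way as in the configuration example in this document.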
If programs cannot run normally after LZ4 is enabled, a possible cause is an incorrect message format. Check the CKafka version and whether the message format in use is correct.
The SDK settings vary with the CKafka client. To set the message format version, refer to the open-source community documentation for your client (such as Instructions for the C/C++ Client).