CKafka Data Reliability Description

Last updated: 2024-01-09 15:02:47
    This document describes the factors that affect the reliability of CKafka from the perspectives of the producer, the server (CKafka), and the consumer, and provides corresponding solutions.

    What should I do if data gets lost on the producer?

    Causes of data loss

    When the producer sends data to CKafka, the data may get lost due to network jitter, in which case CKafka never receives it. Other possible causes are as follows:
    The network load is high or the disk is busy, and the producer does not have a retry mechanism.
    The purchased disk capacity is exceeded. For example, if the disk capacity of an instance is 9,000 GB and it is not expanded promptly after being used up, data cannot be written to CKafka.
    Sudden or sustained traffic peaks exceed the purchased peak throughput. For example, if the instance's peak throughput is 100 MB/sec and it is not scaled up promptly after the limit has been exceeded for an extended period, writes to CKafka slow down. If the producer also has a queuing timeout mechanism in place, data cannot be written to CKafka.

    Solutions

    Enable the retry mechanism on the producer for important data.
    To avoid data loss caused by improper disk usage, set monitoring and alarm policies as preventive measures when configuring the instance.
    When the disk capacity is used up, upgrade the instance promptly in the console. Upgrading CKafka instances of the Standard Edition does not interrupt the service, and the disk capacity can be expanded separately. You can also shorten the message retention period to reduce disk usage.
    To minimize message loss on the producer, you can fine-tune the buffer size with buffer.memory and batch.size (in bytes). A larger buffer is not necessarily better: if the producer fails for any reason, more data in the buffer means more garbage to be collected, which slows down recovery. Monitor the number of messages produced and the average message size closely (CKafka provides a rich set of monitoring metrics for this).
    Configure acknowledgment (ACK) for the producer.
    When the producer sends data to the leader, it can set the data reliability level by using the request.required.acks and min.insync.replicas parameters.
    When acks = 1 (default value), the producer sends the next message once the leader in the ISR has successfully received the current one. If the leader goes down, data not yet synced to its followers gets lost.
    When acks = 0, the producer sends the next message without waiting for acknowledgment from the broker. In this case, data transfer efficiency is the highest, but data reliability is the lowest.
    Note:
    When the producer is configured with acks = 0, if the instance is throttled, the server actively closes the connection to the client so that it can continue to provide services normally.
    When acks = -1 or acks = all, the producer needs to wait for the acknowledgment of message receipt from all the followers in the ISR before sending the next message, which ensures the highest reliability.
    Even with the acks settings above, there is no guarantee that data will never get lost. For example, when only one leader is left in the ISR (ISR membership can grow or shrink, and in some cases only the leader remains), acks = all effectively degenerates to acks = 1. Therefore, you also need to configure the min.insync.replicas parameter in the CKafka console by enabling the advanced configuration in Topic Management > Edit Topic. This parameter specifies the minimum number of replicas in the ISR; its default value is 1, and it takes effect only when acks = -1 or acks = all.

    Recommended parameter values

    These parameter values are for reference only; the appropriate values depend on the conditions of your business.
    Retry mechanism: message.send.max.retries=3;retry.backoff.ms=10000;
    Guarantee of high reliability: request.required.acks=-1;min.insync.replicas=2;
    Guarantee of high performance: request.required.acks=0;
    Reliability + performance: request.required.acks=1;
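
    Taking the high-reliability profile above as an example, here is a minimal sketch of how these values map onto the modern Java producer client, where the legacy names message.send.max.retries and request.required.acks correspond to retries and acks. The bootstrap address is a placeholder, and min.insync.replicas is a topic-level setting made in the CKafka console rather than on the producer:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ReliableProducerFactory {
        public static KafkaProducer<String, String> create() {
            Properties props = new Properties();
            // Placeholder address: use your CKafka instance's access point.
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-host:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Retry mechanism (legacy message.send.max.retries / retry.backoff.ms).
            props.put(ProducerConfig.RETRIES_CONFIG, 3);
            props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 10000);
            // High reliability (legacy request.required.acks=-1): wait for all ISR replicas.
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            // Buffer tuning, in bytes; the values shown are the client defaults.
            props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432L);
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);
            return new KafkaProducer<>(props);
        }
    }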

    What should I do if data gets lost on the broker (CKafka)?

    Causes of data loss

    The partition's leader goes down before the followers complete the data backup. Even after a new leader is elected, the data is lost because it was never replicated.
    Open-source Kafka flushes data to disk asynchronously: data first goes into the PageCache before being persisted. If the broker disconnects, restarts, or fails, data still in the PageCache is lost because it has not yet been persisted to disk.
    Stored data may get lost due to disk failures.

    Solutions

    Open-source Kafka uses multiple replicas to ensure data integrity: data gets lost only if multiple replicas (and the brokers hosting them) fail at the same time, so reliability is higher than in the single-replica case. Therefore, CKafka requires at least two replicas for a topic and supports configuring three replicas.
    CKafka flushes data to disk with more reasonable parameter settings, such as log.flush.interval.messages and log.flush.interval.ms.
    In CKafka, the disk is specially designed to ensure that data reliability will not be compromised even if the disk is partially damaged.

    Recommended parameter values

    Whether a replica that is not in sync status can be elected as a leader: unclean.leader.election.enable=false // Disabled
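
    For illustration only, the sketch below shows how these topic-level settings could be applied with the Kafka Java Admin API on a self-managed cluster that permits it; in CKafka, such settings are normally configured in the console. The bootstrap address and topic name are placeholders:

    import java.util.Arrays;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class TopicReliabilitySettings {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder address: use your cluster's access point.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-host:9092");
            try (Admin admin = Admin.create(props)) {
                // "my-topic" is a hypothetical topic name.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                admin.incrementalAlterConfigs(Map.of(topic, Arrays.asList(
                        // Never elect an out-of-sync replica as leader.
                        new AlterConfigOp(new ConfigEntry("unclean.leader.election.enable", "false"),
                                AlterConfigOp.OpType.SET),
                        // Require at least 2 in-sync replicas when acks = all.
                        new AlterConfigOp(new ConfigEntry("min.insync.replicas", "2"),
                                AlterConfigOp.OpType.SET)
                ))).all().get();
            }
        }
    }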

    What should I do if data gets lost on the consumer?

    Causes of data loss

    The offset is committed before the data is consumed. If the consumer goes down in between, the offset has already advanced past the unprocessed message, and the consumer group has to reset the offset to retrieve it.
    Consumption is much slower than production while the message retention period is too short, so messages are deleted upon expiration before they are consumed.

    Solutions

    Configure the auto.commit.enable parameter appropriately. When it is set to true, offsets are committed automatically on a schedule; we recommend relying on this scheduled commit rather than committing offsets frequently by hand. If committing before consumption is the concern, you can instead disable auto commit and commit manually after processing, as sketched after this list.
    Monitor the consumer and correctly adjust the data retention period. Monitor the consumption offset and the number of unconsumed messages, and configure an alarm to prevent messages from being deleted upon expiration due to slow consumption.
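
    To avoid the commit-before-consume problem described above, one option is to disable auto commit and commit offsets only after processing succeeds, which yields at-least-once delivery. Below is a minimal sketch assuming the Java consumer client; the bootstrap address, group ID, and topic name are placeholders:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "ckafka-host:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                  // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // Disable auto commit so the offset never runs ahead of processing.
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));          // placeholder
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // your business logic
                    }
                    // Commit only after the batch has been processed: if the consumer
                    // crashes mid-batch, records are re-delivered (at-least-once)
                    // instead of being skipped.
                    consumer.commitSync();
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) { /* ... */ }
    }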

    Troubleshooting data loss

    Printing partition and offset information locally for troubleshooting

    Below is the code for printing information:
    // Send the message and block until the broker responds;
    // future.get() throws an exception if the send failed.
    Future<RecordMetadata> future = producer.send(new ProducerRecord<>(topic, messageKey, messageStr));
    RecordMetadata recordMetadata = future.get();
    // If these lines print, the broker has persisted the message at this partition and offset.
    log.info("partition: {}", recordMetadata.partition());
    log.info("offset: {}", recordMetadata.offset());
    If the partition and offset can be printed, the message has been correctly saved on the server, and you can use the message query tool to look up the message at that offset.
    If the partition and offset cannot be printed, the message has not been saved on the server, and the client needs to retry.