Abnormal Consumer Isolation
Last updated: 2025-12-24 15:03:03
In certain scenarios, a consumer fails to process messages properly, but because its network connection remains stable, the server does not detect the anomaly and continues to push messages to it. To mitigate the message backlogs caused by such situations, TDMQ for Apache Pulsar provides an abnormal consumer isolation mechanism. This document describes the applicable scenarios and the implementation scheme of this mechanism.

Scenarios and Use Cases

Scenarios

When you use the Shared mode, the consumer may experience blocked consumption logic due to machine or system-level issues, such as file system or disk I/O bottlenecks. Although the consumption logic is blocked, the network connection between the client and the server remains intact. In this scenario, the server does not consider the consumer offline and continues to push messages. However, since the consumer's consumption logic is blocked, all pushed messages remain unacknowledged, causing a message backlog that cannot be automatically resolved.
When this happens, you usually need to manually restart the client or shut down the problematic consumer to restore normal operation.
To resolve such issues, the abnormal consumer isolation feature enables the server to proactively detect and isolate abnormal consumers.

Inapplicable Scenarios

1. Message blocking is caused by issues with the messages themselves. For example, a message format error may cause parsing failures during consumption, which in turn blocks the consumption logic. In such cases, when the feature is enabled, the server isolates the blocked consumers and redelivers the abnormal messages. This may result in problematic messages being repeatedly delivered to other consumers, potentially triggering a cascading blocking effect among more consumers.
2. The message processing time cannot be determined. In this case, it is impossible to set an accurate acknowledgment timeout period, which may result in consumers being incorrectly isolated.
3. Consumers fail to send acknowledgments after completing message consumption. This issue must be resolved on the client side; do not rely on this feature as a fallback mechanism for it.

Use Cases

1. The abnormal consumer isolation feature only takes effect in the Shared mode.
2. Set the acknowledgment timeout period based on the actual business consumption time. It should generally be significantly longer than the typical consumption time to avoid isolating consumers incorrectly. A minimum of 5 minutes is recommended. If the typical consumption time exceeds 5 minutes, set the timeout to a larger value.
3. For the inapplicable scenarios mentioned above, implement exception handling in consumption logic to prevent consumption blocking caused by issues such as message parsing failures. Consider using retry and dead letter mechanisms. For details, see Retry and Dead Letter Mechanisms.
4. The server checks for acknowledgment timeouts through a scheduled task that runs every 30 seconds. If the acknowledgment timeout period is set to 5 minutes, it may take up to 5 minutes and 30 seconds to mark the consumer as isolated.
5. The core purpose of this feature is to serve as a fallback for scenarios where consumption is completely blocked. Do not use it as a server-side active redelivery mechanism for unacknowledged messages. Currently, TDMQ for Apache Pulsar does not support server-side message redelivery upon timeout.
6. The client must not miss acknowledgments (for example, by failing to acknowledge a message after consuming it). If a consumer does not send acknowledgments after consumption, the server may mistakenly identify it as an abnormal consumer and incorrectly add it to the blocklist. Additionally, this can cause already consumed messages to be repushed to other consumers, resulting in duplicate consumption.
7. This feature does not take effect in the Key_Shared mode. The Key_Shared mode has strict requirements for business scenarios. If you cannot ensure compliance with its applicable scenarios, avoid using it in production environments. For details, see Subscription Modes.
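The advice in items 3 and 6 (handle exceptions in the consumption logic, and never leave a consumed message without a response) can be sketched as follows. This is a minimal illustration, not TDMQ or Pulsar client code: FakeConsumer and process are hypothetical stand-ins, and the method names acknowledge and negative_acknowledge are only modeled on the calls that Pulsar clients typically expose.

```python
from dataclasses import dataclass, field


@dataclass
class FakeConsumer:
    """Hypothetical stand-in for a consumer; records what was acked or nacked."""
    acked: list = field(default_factory=list)
    nacked: list = field(default_factory=list)

    def acknowledge(self, msg):
        self.acked.append(msg)

    def negative_acknowledge(self, msg):
        self.nacked.append(msg)


def process(payload: bytes) -> int:
    """Hypothetical business logic: parse the payload as an integer."""
    return int(payload.decode("utf-8"))


def handle(consumer, msg: bytes) -> bool:
    """Process one message and always respond to the server, success or failure."""
    try:
        process(msg)
    except Exception:
        # A malformed message must not block the loop or stay silently unacked:
        # hand it back for retry (or route it to a dead letter topic).
        consumer.negative_acknowledge(msg)
        return False
    # Acknowledge only after processing has completed successfully.
    consumer.acknowledge(msg)
    return True


consumer = FakeConsumer()
results = [handle(consumer, m) for m in (b"1", b"not-a-number", b"3")]
print(results)  # [True, False, True]
```

With this shape, a parsing failure leads to an explicit response instead of a blocked consumer, so the isolation feature is left to cover only the case where consumption is completely blocked.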

Implementation Schemes



Scheme Description

1. When the server pushes messages to a consumer but does not receive an acknowledgment request within a specified duration, it adds the consumer to the blocklist and repushes all unacknowledged messages held by that consumer to other online consumer instances.
2. After a consumer is added to the blocklist, the server will no longer push messages to that consumer.
3. When a consumer in the blocklist acknowledges a message or is taken offline, the consumer is removed from the blocklist.
4. No more than 60% of consumers can be added to the blocklist. This limit is designed to prevent avalanche effects in abnormal scenarios, where all consumers would be blocked from normal message consumption.
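The four rules above can be modeled with a small simulation. This is a toy sketch under stated assumptions, not the server implementation: the Broker class and its method names are hypothetical, the 60% cap is assumed to round down, a consumer with no pending messages is assumed never to time out, and repushed messages are simply handed to the first consumer that is not in the blocklist.

```python
class Broker:
    """Toy model of the isolation rules; not the actual server implementation."""

    MAX_BLOCKED_PERCENT = 60  # rule 4: at most 60% of consumers may be blocklisted

    def __init__(self, consumers, ack_timeout_s):
        self.ack_timeout_s = ack_timeout_s
        self.last_ack = {c: 0 for c in consumers}   # consumer -> last ack time
        self.unacked = {c: [] for c in consumers}   # consumer -> pushed, unacked messages
        self.blocklist = set()

    def push(self, consumer, msg):
        # Rule 2: blocklisted consumers receive no further messages.
        if consumer in self.blocklist:
            raise ValueError(f"{consumer} is isolated")
        self.unacked[consumer].append(msg)

    def ack(self, consumer, msg, now):
        if msg in self.unacked[consumer]:
            self.unacked[consumer].remove(msg)
        self.last_ack[consumer] = now
        # Rule 3: an acknowledgment removes the consumer from the blocklist.
        self.blocklist.discard(consumer)

    def check(self, now):
        """Rules 1 and 4: isolate timed-out consumers, up to the cap."""
        limit = len(self.last_ack) * self.MAX_BLOCKED_PERCENT // 100  # assumed rounding
        for c in sorted(self.last_ack):
            timed_out = bool(self.unacked[c]) and now - self.last_ack[c] > self.ack_timeout_s
            if not timed_out or c in self.blocklist or len(self.blocklist) >= limit:
                continue
            self.blocklist.add(c)
            healthy = [h for h in sorted(self.last_ack) if h not in self.blocklist]
            # Repush everything the isolated consumer held (assumed dispatch policy).
            self.unacked[healthy[0]].extend(self.unacked[c])
            self.unacked[c] = []


broker = Broker(["c1", "c2", "c3", "c4", "c5"], ack_timeout_s=300)
for i in range(1, 6):
    broker.push(f"c{i}", f"m{i}")
broker.ack("c5", "m5", now=100)   # only c5 acknowledges its message
broker.check(now=330)             # c1..c4 have timed out, but the cap allows 3 of 5
print(sorted(broker.blocklist))   # ['c1', 'c2', 'c3']
broker.ack("c1", "m1", now=340)   # a late acknowledgment releases c1
print(sorted(broker.blocklist))   # ['c2', 'c3']
```

In this run, c4 is also stuck but stays outside the blocklist because of the cap, and m1 ends up held by c4 even though c1 eventually acknowledged it, which illustrates how isolation can lead to duplicate consumption.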

Must-Knows

The server maintains the last acknowledgment time for each consumer but does not record acknowledgment timestamps for individual messages. Each time the server receives an acknowledgment (or unacknowledgment) request from a consumer, it updates that consumer's last acknowledgment time. The server then compares the time elapsed since the last acknowledgment with the acknowledgment timeout period configured in the console. If the period is exceeded, the server marks the consumer as isolated.
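The timing implied by this per-consumer timestamp, combined with the 30-second scheduled check mentioned earlier, can be worked through with a short sketch. The function name and parameters below are hypothetical, and the strict "greater than" comparison is an assumption based on the wording "if the period is exceeded".

```python
def first_isolating_check(last_ack_s, ack_timeout_s, first_check_s, check_interval_s=30):
    """Return the time of the first scheduled check that finds the timeout exceeded."""
    check = first_check_s
    # Advance from one scheduled check to the next until the elapsed time
    # since the last acknowledgment is strictly greater than the timeout.
    while check - last_ack_s <= ack_timeout_s:
        check += check_interval_s
    return check


# Last acknowledgment at t=0, timeout of 5 minutes (300 s).
# Worst case: a check lands exactly on the timeout, so isolation waits one more interval.
print(first_isolating_check(0, 300, first_check_s=0))    # 330
# Best case: a check lands just after the timeout.
print(first_isolating_check(0, 300, first_check_s=1))    # 301

# The timestamp is per consumer, not per message: acknowledging any message
# at t=200 restarts the clock, even if an older message is still unacknowledged.
print(first_isolating_check(200, 300, first_check_s=0))  # 510
```

One consequence of this model is that a consumer that keeps acknowledging some messages is not isolated, even if it holds one message for longer than the timeout.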

Introduction to Related Concepts

What is an unacknowledged message?
An unacknowledged message is a concept in Shared and Key_Shared modes, referring to messages that the server has pushed to a consumer but for which it has not yet received acknowledgment requests.
Why is the number of unacknowledged messages for a subscription limited?
Unacknowledged message IDs are stored in the server's memory. An excessive number of unacknowledged messages increases memory usage and GC pressure. In extreme cases, it may even cause an out-of-memory error that makes the service unavailable, posing a stability risk to the TDMQ for Apache Pulsar instance.


