TaskManager Full GC Too Long

Last updated: 2023-11-07 16:32:30

    Overview

    Each TaskManager of a Flink job is a JVM process with its own heap memory. Storing the runtime state of Flink operators, as well as other operations, can consume a large amount of this heap memory.
    When the JVM heap is nearly exhausted, a full GC (a memory reclamation mechanism) is triggered to free space. If each full GC reclaims only a small amount of memory and the heap cannot be freed in time, the JVM triggers full GCs frequently and continuously. These collections consume a large amount of CPU time and can cause the job's execution threads to fail, at which point this event is triggered.
    Note
    This feature is in beta testing, so custom rules are not supported. This capability will be available in the future.

    Trigger conditions

    Every 5 minutes, the system checks the full GC time of all TaskManagers of a Flink job.
    If the full GC time accumulated by a TaskManager within a detection period exceeds 30% of that period (that is, more than 1.5 minutes of full GC within 5 minutes), the job is considered to have a severe full GC problem, and this event is triggered.
    Note
    To avoid frequent alarms, this event is pushed at most once per hour for each running instance ID of each job.
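
    For reference, the following is a minimal Java sketch that approximates this rule with the standard JVM management beans: it samples the accumulated GC time at the start and end of a 5-minute window and flags the window if more than 30% of it (1.5 minutes) was spent in GC. It is only an illustration of the arithmetic, not the platform's actual detection mechanism; a real check would look only at the old-generation (full GC) collector of the JVM's garbage collector.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class FullGcRatioSketch {
        // Mirror the event's trigger rule: full GC time accounting for more than
        // 30% of a 5-minute detection window (i.e., more than 90 seconds).
        private static final long WINDOW_MILLIS = 5 * 60 * 1000L;
        private static final double THRESHOLD = 0.30;

        public static void main(String[] args) throws InterruptedException {
            long before = totalGcTimeMillis();
            Thread.sleep(WINDOW_MILLIS);              // wait one detection window
            long spentInGc = totalGcTimeMillis() - before;

            double ratio = spentInGc / (double) WINDOW_MILLIS;
            if (ratio > THRESHOLD) {
                System.out.printf("Severe GC pressure: %.0f%% of the window spent in GC%n",
                        ratio * 100);
            }
        }

        // Sum of accumulated collection time across all collectors; getCollectionTime()
        // returns -1 when the value is undefined for a collector.
        private static long totalGcTimeMillis() {
            long total = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long time = gc.getCollectionTime();
                if (time > 0) {
                    total += time;
                }
            }
            return total;
        }
    }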

    Alarms

    You can configure an alarm policy for this event as instructed in Configuring Event Alarms (Events) to receive trigger and recovery notifications in real time.

    Suggestions

    If you receive a push notification of this event, we recommend allocating more resources to the job as instructed in Configuring Job Resources. For example, you can increase the TaskManager spec (enlarging the maximum available TaskManager heap memory so that it can hold more state data) or increase the operator parallelism (reducing the amount of data each TaskManager processes and therefore the memory it uses) so that data is processed more efficiently (see the sketches after these suggestions).
    You can also adjust advanced Flink parameters as instructed in Advanced Job Parameters. For example, you can set taskmanager.memory.managed.size to a smaller value to increase the available heap memory. However, make such adjustments only under the guidance of an expert who fully understands Flink's memory allocation mechanisms; otherwise, this change may cause other issues.
    If OutOfMemoryError: Java heap space or similar keywords appear in the job crash logs, you can enable the Collecting Pod Crash Events feature and set -XX:+HeapDumpOnOutOfMemoryError as described in that document, so that a local heap dump is captured in time for analysis if the job crashes with an OOM error (see the sketch after these suggestions).
    If OutOfMemoryError: Java heap space does not appear in the logs and the job is running properly, we recommend configuring alarms for the job and adding the job failure event to the Stream Compute Service alarm rules so that you receive job failure event pushes promptly.
    If the problem persists after trying all the methods above, submit a ticket to contact technical support for help.
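
    As a rough illustration of the resource-related suggestions above, the sketch below uses Flink's Java Configuration API to shrink managed memory (leaving more of the TaskManager process memory for the heap) and to raise parallelism. On Stream Compute Service these values are normally set in the console as job resources and advanced parameters rather than in code, and the size and parallelism shown here are placeholders, not recommendations.

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.MemorySize;
    import org.apache.flink.configuration.TaskManagerOptions;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class HeapTuningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Shrink managed (off-heap) memory, i.e., taskmanager.memory.managed.size,
            // so that more of the TaskManager process memory is left for the JVM heap.
            // The 256m value is a placeholder; adjust it only with a clear picture of
            // Flink's memory model.
            conf.set(TaskManagerOptions.MANAGED_MEMORY_SIZE, MemorySize.parse("256m"));

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);

            // Raising parallelism spreads state and records over more subtasks,
            // lowering the heap pressure on any single TaskManager. 4 is a placeholder.
            env.setParallelism(4);

            // ... build and execute the job as usual ...
        }
    }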
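
    For the heap dump suggestion, the JVM option is typically passed to TaskManagers in open-source Flink through the env.java.opts.taskmanager setting, which must be in place before the TaskManager JVM starts (for example in flink-conf.yaml or the job's advanced parameters); on Stream Compute Service, enable it as described in Collecting Pod Crash Events. The small sketch below only assembles and prints the corresponding configuration line; the dump path is a placeholder.

    public class HeapDumpOptionSketch {
        public static void main(String[] args) {
            // JVM options for the TaskManager process: write a heap dump when an
            // OutOfMemoryError occurs so it can be pulled and analyzed afterwards.
            // The dump path is a placeholder; choose a location with enough disk space.
            String jvmOpts = "-XX:+HeapDumpOnOutOfMemoryError"
                    + " -XX:HeapDumpPath=/tmp/taskmanager-oom.hprof";

            // The equivalent flink-conf.yaml / advanced-parameter entry:
            System.out.println("env.java.opts.taskmanager: " + jvmOpts);
        }
    }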