JobManager Full GC Too Long

Last updated: 2023-11-07 15:43:23

    Overview

    The JobManager of a Flink job manages and schedules the entire job. It runs as a JVM process with its own heap memory. For a source connector built on the FLIP-27 interface, the enumerator records shard (split) information in this heap, so too many shards can consume excessive heap memory and affect the stability of the job as a whole.
    When the JVM heap is nearly exhausted, full GC (a memory reclamation mechanism) is triggered to free space. If each full GC reclaims only a small amount of memory and the heap cannot be released in time, the JVM triggers full GC frequently and continuously. This consumes a large amount of CPU time and prevents the job's execution threads from making progress, and this event is triggered.
    Note
    This feature is in beta testing, so custom rules are not supported yet. They will be available in a future release.

    Trigger conditions

    The system checks the full GC time of the JobManager of a Flink job every 5 minutes.
    If the full GC time added during a detection period exceeds 30% of that period (that is, more than 1.5 minutes of full GC within 5 minutes), the job has a severe full GC problem, and this event is triggered.
    Note
    To avoid frequent alarms, this event is pushed at most once per hour for each running instance ID of each job.

    Alarm configuration

    You can configure an alarm policy for this event as instructed in Configuring Event Alarms (Events) to receive trigger and clearing notifications in real time.

    Suggestions

    If you receive a push notification of this event, we recommend you allocate more resources to the job as instructed in Configuring Job Resources. For example, you can increase the JobManager spec, which raises the maximum available JobManager heap memory so that it can hold more state data.
    If you use MySQL CDC, we recommend you increase the size of each shard via the WITH parameters (set scan.incremental.snapshot.chunk.size to a larger value) so that fewer shards are created and the JobManager heap memory is not exhausted by shard metadata; see the sketch after this list.
    If OutOfMemoryError: Java heap space does not appear in the logs and the job is running properly, we recommend you configure alarms for the job and add the job failure event to the alarm rules of Stream Compute Service so that you receive job failure event pushes in time.
    If the problem persists after all of the above methods have been tried, submit a ticket to contact the technical support team for help.
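
    Below is a minimal sketch of such a WITH configuration, assuming a hypothetical orders table and hypothetical connection settings; the chunk size shown is only an illustration (the connector's default is 8096 rows per chunk), and you should tune it to your own data volume.

    -- Minimal sketch (hypothetical table and connection settings): a MySQL CDC source
    -- with a larger scan.incremental.snapshot.chunk.size, so the table is split into
    -- fewer, larger shards and less shard metadata is kept in the JobManager heap.
    CREATE TABLE orders_cdc (
      order_id BIGINT,
      customer_id BIGINT,
      order_status STRING,
      PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
      'connector' = 'mysql-cdc',
      'hostname' = 'mysql-host',
      'port' = '3306',
      'username' = 'flink_user',
      'password' = 'flink_password',
      'database-name' = 'shop',
      'table-name' = 'orders',
      -- Assumed value: larger than the 8096-row default, so fewer chunks are created.
      'scan.incremental.snapshot.chunk.size' = '32768'
    );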
    