Job Failure

Last updated: 2023-11-07 16:48:29

    Overview

    A job failure event in Stream Compute Service indicates that the status of a Flink job has changed from RUNNING to FAILED or RESTARTING. This may interrupt data processing, delay downstream output, and cause other issues.

    Conditions

    Trigger

    1. The status of a Flink job changes from RUNNING to FAILED or RESTARTING. The Flink JobManager then recovers the job in about 10 seconds, and the running instance ID remains unchanged after recovery.
    2. A Flink job restarts too many times or too frequently, exceeding the limit set in the Restart Policies. The threshold is generally controlled by restart-strategy.fixed-delay.attempts, which defaults to 5; we recommend increasing it in a production environment. When the limit is exceeded, both the JobManager and the TaskManagers exit, and the system tries to recover the job from the last successful checkpoint within about 2 minutes, with the running instance ID increased by 1 after recovery.
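
    The difference between the two trigger conditions can be illustrated with a small toy model. This is only a sketch of the behavior described above, not a Flink or Stream Compute Service API: the names JobState and on_failure are hypothetical, and resetting the restart counter after a checkpoint recovery is an assumption made for the illustration.

```python
from dataclasses import dataclass

# Default named in the text for restart-strategy.fixed-delay.attempts.
DEFAULT_MAX_ATTEMPTS = 5


@dataclass
class JobState:
    """Toy model of a job (hypothetical; not a real Flink or service type)."""
    instance_id: int = 1
    restarts_used: int = 0


def on_failure(job: JobState, max_attempts: int = DEFAULT_MAX_ATTEMPTS) -> str:
    """Classify a new failure as trigger condition 1 or 2 and update the model."""
    if job.restarts_used < max_attempts:
        # Condition 1: the JobManager restarts the job in place,
        # so the running instance ID stays the same.
        job.restarts_used += 1
        return "in-place restart"
    # Condition 2: the restart limit is exceeded, so the job is recovered
    # from the last successful checkpoint and the instance ID increases by 1.
    job.instance_id += 1
    job.restarts_used = 0  # assumption: the new instance starts with a fresh count
    return "checkpoint recovery"
```

    With the default limit of 5, the sixth consecutive failure in this model is the first one that increases the instance ID, which is why a higher limit may be preferable in production.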

    Clearing

    After Flink or the Stream Compute Service system recovers the job to RUNNING, a failed job recovery event is generated, indicating the end of this event.

    Alarms

    You can configure an alarm policy for this event to receive trigger and clearing notifications in real time.

    Suggestions

    You can search for exception logs under the instance ID of the job that generated the event, as instructed in Diagnosis with Logs. Generally, the error messages just before and after the keywords "from RUNNING to FAILED" contain the direct cause of the job failure. We recommend analyzing the issue based on these error messages together with the JobManager and TaskManager logs.
    If the cause is still not found, check whether resources are overused, as instructed in Viewing Monitoring Information. Focus on critical metrics such as TaskManager CPU usage, heap memory usage, full GC count, and full GC time.
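
    For logs that have been exported as plain text, the keyword search from the first suggestion could be sketched as follows. This is a minimal illustration, not a Stream Compute Service tool: failure_context is a hypothetical helper, and the sample log lines are made up, so real log formats will differ.

```python
from typing import List

KEYWORD = "from RUNNING to FAILED"


def failure_context(lines: List[str], keyword: str = KEYWORD,
                    radius: int = 2) -> List[List[str]]:
    """For each line containing the keyword, return it with `radius` lines on each side."""
    windows = []
    for i, line in enumerate(lines):
        if keyword in line:
            start = max(0, i - radius)  # avoid a negative slice index near the top
            windows.append(lines[start:i + radius + 1])
    return windows


# Made-up log lines for illustration only.
sample = [
    "INFO  checkpoint 41 completed",
    "WARN  sink write timed out",
    "INFO  task switched from RUNNING to FAILED",
    "ERROR java.io.IOException: connection reset",
    "INFO  job is restarting",
]

for window in failure_context(sample, radius=1):
    print("\n".join(window))
```

    A larger radius shows more of the surrounding messages, which can help when the direct cause is logged several lines away from the state change.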