Viewing Critical Events

Last updated: 2023-11-07 16:38:39

    Overview

    Various events may occur while a job is running, such as job start, job running failure, checkpointing failure, and other exceptions. The Stream Compute Service console provides an events page where you can view and subscribe to critical events.
    On the events page, you can select an event type and further filter events by running instance ID and time range. Click Reset filter to clear all filters, restore the defaults, and pull the latest events.
    Note
    To avoid returning too many events, the time range of a single filter is limited to 7 days, and it must fall within the past 90 days.
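
    The time range limits in the note above can be sketched as a small check. This is an illustrative sketch only, not the console's actual validation logic: the function name, the constants, and the returned messages are assumptions made for this example.

```python
from datetime import datetime, timedelta

# Limits taken from the note above; the names are hypothetical.
MAX_SPAN = timedelta(days=7)        # longest range a single filter may cover
MAX_LOOKBACK = timedelta(days=90)   # how far back events can be queried


def validate_filter_range(start, end, now):
    """Return None if the range is acceptable, otherwise a reason string."""
    if end < start:
        return "end is earlier than start"
    if end - start > MAX_SPAN:
        return "range is longer than 7 days"
    if start < now - MAX_LOOKBACK:
        return "range starts more than 90 days ago"
    return None


if __name__ == "__main__":
    now = datetime(2023, 11, 7, 12, 0, 0)
    print(validate_filter_range(now - timedelta(days=3), now, now))
    print(validate_filter_range(now - timedelta(days=8), now, now))
```

    A check like this could run on the client side before a query is submitted, so that an out-of-range filter is rejected early.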

    Event types

    Job start and stop

    When you click Publish draft on the Development & Testing page of a job, or when the system detects that the job has exited due to a crash, the system tries to start the job and automatically creates a new instance ID for this run. A new start event then appears on the events page. When you stop or restart the job, or when it crashes and exits, a stop event with the same instance ID is generated. The job start time and end time refer to when the job's internal start or stop process completes, not when you perform the operation in the UI.
    For example, the information in the figure below indicates that the instance was started on 2021-11-10 16:49:30 and stopped on 2021-11-10 16:55:52 by you or the system.

    Job running failure and recovery

    When a running job is restarted (its status changes from RUNNING to RESTARTING or FAILED), a job running failure event is generated. When the job returns to RUNNING, a failed job recovery event is generated.
    You can select Operation > Solution to view the causes of and solutions to the event. You can also configure alarms for job running failure events.
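
    The relationship between status changes and these two events can be sketched as follows. This is a simplified illustration under stated assumptions, not the service's implementation: the function name and the event labels are hypothetical, and it assumes a recovery event is only emitted after a preceding failure event.

```python
def events_from_statuses(statuses):
    """Derive failure and recovery events from an ordered list of job statuses."""
    events = []
    failing = False  # True while a failure event is waiting for its recovery
    for previous, current in zip(statuses, statuses[1:]):
        if previous == "RUNNING" and current in ("RESTARTING", "FAILED"):
            events.append("job running failure")
            failing = True
        elif failing and current == "RUNNING":
            events.append("failed job recovery")
            failing = False
    return events


if __name__ == "__main__":
    # A job that restarts once and then recovers.
    print(events_from_statuses(["RUNNING", "RESTARTING", "RUNNING"]))
```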

    Checkpointing failure and recovery

    If checkpointing is enabled for a job and a checkpoint fails, a checkpoint failure event is generated. If a later checkpoint succeeds, a failed checkpoint recovery event is generated.
    You can select Operation > Solution to view the causes of and solutions to the event. You can also configure alarms for checkpoint failure events.

    Job exception events (in beta)

    The Stream Compute Service backend continuously monitors and analyzes running jobs. When a job encounters a severe exception (such as an overly long TaskManager full GC, excessive CPU load, or an abnormal Pod exit), the corresponding event is pushed to you for reference, so that you can determine whether the job is running properly.
    Note
    To avoid unnecessary notifications, at most one job exception event is pushed per hour. Abnormal Pod exit events are not subject to this limit.
    This feature is in beta testing. It detects severe problems only, and its thresholds cannot be adjusted. It will be further improved and upgraded.
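
    The push limit described in the note can be illustrated with a small throttling sketch. The service's actual behavior is not documented here, so the following are assumptions: the limit is applied per job, the hour is a rolling window measured from the last pushed event, and the function name and event type strings are hypothetical.

```python
from datetime import datetime, timedelta

THROTTLE_WINDOW = timedelta(hours=1)
UNTHROTTLED_TYPE = "abnormal Pod exit"  # illustrative label, always pushed


def select_pushed_events(events):
    """Filter (timestamp, event_type) pairs, sorted by time, for one job."""
    pushed = []
    last_throttled_push = None  # time of the last pushed rate-limited event
    for timestamp, event_type in events:
        if event_type == UNTHROTTLED_TYPE:
            pushed.append((timestamp, event_type))
            continue
        if (last_throttled_push is None
                or timestamp - last_throttled_push >= THROTTLE_WINDOW):
            pushed.append((timestamp, event_type))
            last_throttled_push = timestamp
    return pushed


if __name__ == "__main__":
    t0 = datetime(2023, 11, 7, 10, 0)
    sample = [
        (t0, "CPU load too high"),
        (t0 + timedelta(minutes=10), "TaskManager full GC too long"),
        (t0 + timedelta(minutes=20), "abnormal Pod exit"),
    ]
    for timestamp, event_type in select_pushed_events(sample):
        print(timestamp, event_type)
```

    In this sketch, the second event is dropped because it arrives within an hour of the first, while the abnormal Pod exit is pushed regardless.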