Diagnosis with Logs

Last updated: 2023-11-07 17:52:44

    Overview

    In the Stream Compute Service console, two categories of logs are available: start logs and running logs.
    Start logs: When an SQL, JAR, or other type of job is submitted to a cluster, a startup process first generates the Flink execution graph. Logs generated during this process are called start logs. When a job fails to start, a yellow triangle with an exclamation mark (⚠️) appears next to its name in the console; move the pointer over it to view details. You can also read the log context of the errors on the logs page.
    Running logs: After the execution graph of a job is generated, its JobManager and TaskManagers will be started, and the execution graph will be submitted to the cluster for execution. From this point, the job status becomes "running", and logs printed by the JobManager and TaskManagers are called running logs.

    Keywords of common exceptions

    Job failure causes

    Search for the phrase from RUNNING to FAILED to identify the direct cause of a job crash. The information following Caused by in the stack trace gives the failure details.
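
    As a rough illustration of this search, the Python sketch below scans exported log text for the phrase from RUNNING to FAILED and collects the Caused by lines that follow it. The function name failure_causes and the sample log excerpt are hypothetical and made up for illustration; real log lines may be formatted differently.

```python
from typing import List

MARKER = "from RUNNING to FAILED"


def failure_causes(log_text: str) -> List[str]:
    """Return every 'Caused by' line found after the first failure marker."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if MARKER in line:
            # Everything after the marker is scanned; this sketch does not try
            # to detect where the stack trace ends.
            return [
                later.strip()
                for later in lines[index + 1:]
                if later.strip().startswith("Caused by")
            ]
    return []


# Hypothetical log excerpt, invented for this example only.
SAMPLE_LOG = """\
INFO  Job demo switched from RUNNING to FAILED.
java.lang.RuntimeException: job failed
    at com.example.Demo.run(Demo.java:10)
Caused by: java.io.IOException: connection reset
    at com.example.Demo.read(Demo.java:42)
"""

if __name__ == "__main__":
    for cause in failure_causes(SAMPLE_LOG):
        print(cause)
```

    The last Caused by line in a trace is usually the innermost cause, so it is often the most useful one to read first.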

    OOM

    If java.lang.OutOfMemoryError appears, an OOM has probably occurred in the heap memory. In this case, increase the operator parallelism (CUs) of the job and optimize its memory usage to avoid OOM.

    JVM exit and other fatal errors

    The following keywords are generally followed by a process exit code and can help identify fatal JVM or Akka errors that cause a JVM to be forcibly shut down.
    exit code OR shutting down JVM OR fatal OR kill OR killing
    For example, the fatal error of ZooKeeper connection loss shown in the figure below matches the keyword fatal.
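
    If the logs have been exported as text, the same OR query can be approximated offline. The sketch below is a minimal example under stated assumptions: it joins the keywords above into one regular expression, matches them as case-insensitive substrings (the console's own tokenized search may behave differently), and uses hypothetical sample lines.

```python
import re
from typing import Iterable, List

# Keywords copied from the query above. As substrings, "kill" also covers
# "killing" and would match unrelated words such as "skill".
FATAL_KEYWORDS = ["exit code", "shutting down JVM", "fatal", "kill", "killing"]
FATAL_PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in FATAL_KEYWORDS),
    re.IGNORECASE,  # assumption: letter case should not matter
)


def fatal_lines(lines: Iterable[str]) -> List[str]:
    """Keep only the lines that contain at least one fatal-error keyword."""
    return [line for line in lines if FATAL_PATTERN.search(line)]


# Hypothetical log lines, invented for this example only.
SAMPLE_LINES = [
    "ERROR demo - Fatal error: ZooKeeper connection lost",
    "INFO demo - Checkpoint 12 completed",
    "INFO demo - Shutting down JVM with exit code 1",
]

if __name__ == "__main__":
    for line in fatal_lines(SAMPLE_LINES):
        print(line)
```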

    Checkpoint failure (timeout)

    The following keywords indicate that a checkpoint has failed. Analyze the issue based on the specific cause. For example, declined means that the checkpoint failed because resources were unavailable (the job is not running), FINISHED operators exist, the checkpoint timed out, checkpoint files are incomplete, or for other reasons.
    Checkpoint was declined
    Checkpoint was canceled
    Checkpoint expired
    job has failed
    Task has failed
    Failure to finalize
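
    To see which kind of checkpoint failure dominates, the keywords above can be tallied over exported log lines. The sketch below assumes a word-based match (every word of a keyword must appear somewhere in the line, ignoring case), which is only an approximation of how the console search segments keywords. The function names and sample lines are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Set

CHECKPOINT_KEYWORDS = [
    "Checkpoint was declined",
    "Checkpoint was canceled",
    "Checkpoint expired",
    "job has failed",
    "Task has failed",
    "Failure to finalize",
]


def word_set(text: str) -> Set[str]:
    """Split text into lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def tally_checkpoint_failures(lines: Iterable[str]) -> Counter:
    """Count, per keyword, the lines that contain all of the keyword's words."""
    keyword_words = {keyword: word_set(keyword) for keyword in CHECKPOINT_KEYWORDS}
    counts: Counter = Counter()
    for line in lines:
        line_words = word_set(line)
        for keyword, needed in keyword_words.items():
            if needed <= line_words:  # all keyword words are present
                counts[keyword] += 1
    return counts


# Hypothetical log lines, invented for this example only.
SAMPLE_LINES = [
    "INFO Checkpoint 7 of job demo expired before completing.",
    "WARN Checkpoint was declined because a task was not ready.",
    "INFO Checkpoint 8 completed.",
    "INFO Checkpoint 9 of job demo expired before completing.",
]

if __name__ == "__main__":
    for keyword, count in tally_checkpoint_failures(SAMPLE_LINES).most_common():
        print(keyword, count)
```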

    Timeout/Failure

    The following keywords indicate that access to an external system (such as MySQL or Kafka) may have timed out or failed due to a network failure or other reasons. The search results may include a lot of configuration content, so check whether each hit actually represents an error. For example, Timeout expired while fetching topic metadata for Kafka indicates an initialization timeout, and Communications link failure for MySQL indicates a disconnection (which may be a client timeout caused by no data flowing in for a long period).
    java.util.concurrent.TimeoutException
    timeout
    failure
    timed out
    failed
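
    Because these keywords also appear in printed configuration, it can help to drop configuration-style lines before reading the hits. The sketch below is one possible approach, not a feature of the console: it assumes configuration content is printed as key = value lines, and the names TIMEOUT_PATTERN, CONFIG_LINE, and timeout_candidates are hypothetical.

```python
import re
from typing import Iterable, List

# Keywords from the list above, matched as case-insensitive substrings.
TIMEOUT_PATTERN = re.compile(
    r"java\.util\.concurrent\.TimeoutException|timeout|failure|timed out|failed",
    re.IGNORECASE,
)

# Assumption: a configuration line consists only of "some.key = value".
CONFIG_LINE = re.compile(r"^\s*[\w.\-]+\s*=\s*\S*\s*$")


def timeout_candidates(lines: Iterable[str]) -> List[str]:
    """Return keyword hits that do not look like plain configuration output."""
    return [
        line
        for line in lines
        if TIMEOUT_PATTERN.search(line) and not CONFIG_LINE.match(line)
    ]


# The second and third lines quote messages mentioned in the text above;
# the other two are hypothetical.
SAMPLE_LINES = [
    "\trequest.timeout.ms = 30000",
    "Timeout expired while fetching topic metadata",
    "Communications link failure",
    "INFO Job is running",
]

if __name__ == "__main__":
    for line in timeout_candidates(SAMPLE_LINES):
        print(line)
```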

    Exception

    The keyword Exception indicates that an exception may have occurred. For example, the start logs of a Flink job in the following figure indicate that the job failed to be submitted due to an exception. Searching for Exception displays the specific exceptions following Caused by in the stack traces at all levels.
    Note
    Due to keyword segmentation rules, a search may not find every log containing Exception.

    WARN and ERROR logs

    In general, you can search for all logs containing WARN or ERROR. This may return many results, so filter them as needed. For example, some logs contain the words WARN or ERROR in their content but do not represent errors.

    Ignorable errors

    The following common warnings and errors in the Stream Compute Service logs do not affect running jobs and can be skipped during troubleshooting:
    WARN org.apache.flink.core.plugin.PluginConfig - The plugins directory [plugins] does not exist.
    
    WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-00000000.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
    
    ERROR org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState - Authentication failed
    
    WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    
    WARN org.apache.flink.kubernetes.utils.KubernetesInitializerUtils - Ship directory /data/workspace/.../shipFiles is not exists. Ignoring it.
    
    WARN org.apache.flink.configuration.GlobalConfiguration - Error while trying to split key and value in configuration file /opt/flink-1.11.0/conf/flink-conf.yaml
    
    WARN org.apache.flink.shaded.curator4.org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
    
    WARNING: Unable to load JDK7 types (annotations, java.nio.file.Path): no Java7 support added
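
    When reviewing exported WARN and ERROR lines, the messages listed above can be filtered out first so that only unexplained entries remain. The sketch below is a minimal example: each fragment is a substring taken from one of the messages above, while the function name actionable_lines and the two com.example sample lines are hypothetical.

```python
from typing import Iterable, List

# One distinguishing substring per ignorable message listed above.
IGNORABLE_FRAGMENTS = [
    "The plugins directory [plugins] does not exist.",
    "SASL configuration failed",
    "ConnectionState - Authentication failed",
    "Unable to load native-hadoop library",
    "is not exists. Ignoring it.",
    "Error while trying to split key and value in configuration file",
    "doesn't support Container nodes",
    "Unable to load JDK7 types",
]


def actionable_lines(lines: Iterable[str]) -> List[str]:
    """Keep WARN/ERROR lines that match none of the ignorable fragments."""
    return [
        line
        for line in lines
        if ("WARN" in line or "ERROR" in line)
        and not any(fragment in line for fragment in IGNORABLE_FRAGMENTS)
    ]


SAMPLE_LINES = [
    "WARN org.apache.flink.core.plugin.PluginConfig - The plugins directory [plugins] does not exist.",
    "ERROR org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState - Authentication failed",
    "ERROR com.example.Demo - Sink write failed",  # hypothetical line
    "INFO com.example.Demo - started",  # hypothetical line
]

if __name__ == "__main__":
    for line in actionable_lines(SAMPLE_LINES):
        print(line)
```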
    