Crash-Scene Data Collection Feature Overview
On the Stream Compute Service (SCS) platform, the Flink JobManager and TaskManager run in standalone containers (pods). When a TaskManager or JobManager pod encounters a problem and exits, the crash scene is cleaned up immediately, which makes fault localization difficult.
All JobManager and TaskManager logs generated during the current job run are collected to Cloud Log Service (CLS), and you can view and search them in the console (for detailed operations, see View Job Log Information). Besides logs, crash-scene data also includes OOM dump files, JVM crash logs, and other files written by the program at runtime, all of which are useful for locating problems.
Therefore, we provide the pod crash-scene data collection feature. When you enable this feature for a job, all files in the log directory (/opt/flink/log) are packaged and uploaded to the cluster-bound COS bucket for analysis whenever a Flink TaskManager or JobManager of the job terminates, whether normally or with an exception.
Note:
This feature is not currently supported in some old clusters. If you need this feature but your cluster does not support it, submit a ticket to upgrade the cluster.
Enabling Methods
Pod crash-scene data collection uploads crash-scene data to the cluster-bound COS bucket each time a TaskManager or JobManager exits. To avoid excessive storage overhead, this feature is disabled by default.
You can add the following content in the Advanced Parameters for the job to enable the pod crash-scene data collection feature:
flink.kubernetes.diagnosis-collection-enabled: true
Note:
After this feature is enabled, any files written to the /opt/flink/log directory will be collected and uploaded.
If you need to collect a heap dump for later analysis when an OOM (out-of-memory) error occurs in the Flink TaskManager, you can add the following content to the advanced parameters:
env.java.opts.taskmanager: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log/taskmanager.hprof -XX:ErrorFile=/opt/flink/log/taskmanager.err
If you also need to use Java Flight Recorder to record the JVM runtime status for a period after startup, use the following value instead, which extends the heap dump options with Flight Recorder options (change the duration parameter to the collection time you need):
env.java.opts.taskmanager: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log/taskmanager.hprof -XX:ErrorFile=/opt/flink/log/taskmanager.err -XX:+FlightRecorder -XX:StartFlightRecording=duration=400s,filename=/opt/flink/log/taskmanager.jfr
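If the recording duration is changed often, the line above can be generated rather than edited by hand. The following is a minimal sketch only: the function name taskmanager_jvm_opts and its parameters are hypothetical and not part of the platform, and the flags are copied unchanged from the example above.

```python
def taskmanager_jvm_opts(duration_seconds: int, log_dir: str = "/opt/flink/log") -> str:
    """Compose the advanced-parameter line for a given Flight Recorder duration.

    Hypothetical helper for illustration; it only rearranges the flags shown
    in this document and writes every output file into the collected log directory.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    flags = [
        "-XX:+HeapDumpOnOutOfMemoryError",
        f"-XX:HeapDumpPath={log_dir}/taskmanager.hprof",
        f"-XX:ErrorFile={log_dir}/taskmanager.err",
        "-XX:+FlightRecorder",
        f"-XX:StartFlightRecording=duration={duration_seconds}s,"
        f"filename={log_dir}/taskmanager.jfr",
    ]
    return "env.java.opts.taskmanager: " + " ".join(flags)


if __name__ == "__main__":
    # Print the line for a 10-minute recording, ready to paste into Advanced Parameters.
    print(taskmanager_jvm_opts(600))
```

Keeping all output paths under /opt/flink/log matters because only files in that directory are collected and uploaded.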
Viewing a Collected File
All collected pod crash-scene files will be automatically packaged and uploaded to the cluster-bound COS bucket under the /oceanus-diagnosis/ directory. The directory structure is:
JobManager: /oceanus-diagnosis/cluster-id/job-id/run-id/jobmanager-timestamp.tgz
TaskManager: /oceanus-diagnosis/cluster-id/job-id/run-id/taskmanager-1-taskmanagerid.tgz
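After an archive has been downloaded from the COS bucket, its contents can be inspected locally. The sketch below assumes the archive is already on local disk and uses only the Python standard library. The function names (diagnosis_prefix, classify, list_archive) and the example IDs and file name are hypothetical; the extension labels simply follow the file names used in the parameter examples in this document.

```python
import tarfile
from pathlib import Path
from typing import List, Tuple


def diagnosis_prefix(cluster_id: str, job_id: str, run_id: str) -> str:
    """Build the COS key prefix under which the archives of one job run are stored."""
    return f"/oceanus-diagnosis/{cluster_id}/{job_id}/{run_id}/"


def classify(name: str) -> str:
    """Label a collected file by its extension (labels are illustrative)."""
    labels = {
        ".hprof": "heap dump",
        ".err": "JVM crash log",
        ".jfr": "Flight Recorder recording",
    }
    return labels.get(Path(name).suffix.lower(), "log or other file")


def list_archive(archive_path: str) -> List[Tuple[str, str]]:
    """Return (member name, label) for every regular file in a .tgz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [(m.name, classify(m.name)) for m in tar.getmembers() if m.isfile()]


if __name__ == "__main__":
    # Placeholder IDs and a placeholder local file name; substitute your own.
    print(diagnosis_prefix("my-cluster-id", "my-job-id", "my-run-id"))
    for name, label in list_archive("taskmanager-archive.tgz"):
        print(f"{label:28} {name}")
```

Listing the members first, rather than extracting everything, is a deliberate choice here: heap dumps can be large, so it is usually worth checking what an archive contains before unpacking it.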