Pod Crash-Scene Data Collection

Last updated: 2026-06-05 14:35:48

Crash-Scene Data Collection Feature Overview
On the Stream Compute Service (SCS) platform, each Flink JobManager and TaskManager runs in its own container (pod). When a JobManager or TaskManager pod runs into a problem and exits, its crash scene is cleaned up immediately, which makes faults difficult to locate.
All JobManager and TaskManager logs of the current job run are collected to Cloud Log Service (CLS), where you can view and search them in the console (for detailed steps, see View Job Log Information).
Besides logs, the crash scene also includes OOM dump files, JVM crash logs, and other files the program writes while running, all of which are very useful for locating problems.
The pod crash-scene data collection feature preserves these files. When you enable it for a job, all files in the log directory (/opt/flink/log) are packaged and uploaded to the COS bucket bound to the cluster whenever a Flink JobManager or TaskManager of the job exits, whether normally or abnormally.
Note:
Some older clusters do not support this feature yet. If you need it and your cluster does not support it, submit a ticket to have the cluster upgraded.
Enabling the Feature
Pod crash-scene data collection uploads the crash-scene data to the cluster-bound COS bucket every time a JobManager or TaskManager exits. To avoid excessive storage overhead, the feature is disabled by default.
To enable it, add the following line to the job's Advanced Parameters:
flink.kubernetes.diagnosis-collection-enabled: true
Note:
After this feature is enabled, every file written to the /opt/flink/log directory is collected and uploaded.
To capture a heap dump for later analysis when a Flink TaskManager hits an OOM (out-of-memory) error, add the following line to the Advanced Parameters:
env.java.opts.taskmanager: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log/taskmanager.hprof -XX:ErrorFile=/opt/flink/log/taskmanager.err
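To also use Java Flight Recorder (JFR) to record the JVM's runtime state for a period after startup, use the following line instead. It sets the same env.java.opts.taskmanager key and keeps the heap dump options above, so it replaces the previous line rather than being added next to it. Change the duration value to the desired recording time: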
env.java.opts.taskmanager: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log/taskmanager.hprof -XX:ErrorFile=/opt/flink/log/taskmanager.err -XX:+FlightRecorder -XX:StartFlightRecording=duration=400s,filename=/opt/flink/log/taskmanager.jfr
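The two lines above differ only in the Flight Recorder options, so both can be generated from one list of JVM options. The Python sketch below does that; `taskmanager_java_opts` is a hypothetical helper, not part of SCS, and it only formats the text to paste into Advanced Parameters. The option strings are copied from the lines above.

```python
# Directory collected by pod crash-scene data collection.
LOG_DIR = "/opt/flink/log"


def taskmanager_java_opts(jfr_duration_s=None):
    """Return an env.java.opts.taskmanager line for the job's Advanced Parameters.

    Hypothetical helper: it only formats the JVM options shown in this document.
    Pass jfr_duration_s to also enable Java Flight Recorder for that many seconds.
    """
    # Heap dump on OOM and JVM fatal error log, both written under LOG_DIR.
    opts = [
        "-XX:+HeapDumpOnOutOfMemoryError",
        f"-XX:HeapDumpPath={LOG_DIR}/taskmanager.hprof",
        f"-XX:ErrorFile={LOG_DIR}/taskmanager.err",
    ]
    if jfr_duration_s is not None:
        if jfr_duration_s <= 0:
            raise ValueError("jfr_duration_s must be positive")
        # Flight Recorder options; the recording file also goes under LOG_DIR.
        opts += [
            "-XX:+FlightRecorder",
            f"-XX:StartFlightRecording=duration={jfr_duration_s}s,"
            f"filename={LOG_DIR}/taskmanager.jfr",
        ]
    return "env.java.opts.taskmanager: " + " ".join(opts)


if __name__ == "__main__":
    print(taskmanager_java_opts())
    print(taskmanager_java_opts(jfr_duration_s=400))
```

`taskmanager_java_opts()` reproduces the heap-dump-only line, and `taskmanager_java_opts(400)` reproduces the Flight Recorder line. All output paths stay under /opt/flink/log, since only that directory is collected.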
Viewing Collected Files
All collected pod crash-scene files are automatically packaged and uploaded to the /oceanus-diagnosis/ directory of the cluster-bound COS bucket, with the following structure:
JobManager: /oceanus-diagnosis/cluster-id/job-id/run-id/jobmanager-timestamp.tgz
TaskManager: /oceanus-diagnosis/cluster-id/job-id/run-id/taskmanager-1-taskmanagerid.tgz
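Once the archives are in the bucket, a few lines of Python can help locate and inspect them. This is a sketch that assumes only the layout listed above: `run_prefix`, `parse_key`, and `diagnostic_members` are hypothetical helpers, the role is read from the part of the file name before the first hyphen, keys are accepted with or without the leading slash, and because the archive's internal layout is not documented here, files are matched by suffix (.hprof, .err, .jfr, from the JVM options shown earlier). Downloading from COS is left to the COS console or SDK and is not shown.

```python
import tarfile
from typing import NamedTuple

# Root directory of the documented layout, handled without the leading "/".
ROOT = "oceanus-diagnosis/"
# Suffixes of files produced by the JVM options shown earlier in this document.
DIAG_SUFFIXES = (".hprof", ".err", ".jfr")


class DiagnosisKey(NamedTuple):
    cluster_id: str
    job_id: str
    run_id: str
    role: str       # "jobmanager" or "taskmanager", taken from the file name
    file_name: str


def run_prefix(cluster_id: str, job_id: str, run_id: str) -> str:
    """Key prefix under which all archives of one job run are stored."""
    return f"{ROOT}{cluster_id}/{job_id}/{run_id}/"


def parse_key(key: str) -> DiagnosisKey:
    """Split an object key of the documented layout into its components."""
    key = key.lstrip("/")
    if not key.startswith(ROOT):
        raise ValueError(f"not a diagnosis key: {key}")
    parts = key[len(ROOT):].split("/")
    if len(parts) != 4 or not parts[3].endswith(".tgz"):
        raise ValueError(f"unexpected key layout: {key}")
    cluster_id, job_id, run_id, file_name = parts
    # Assumption: the role is the file-name part before the first hyphen.
    role = file_name.split("-", 1)[0]
    return DiagnosisKey(cluster_id, job_id, run_id, role, file_name)


def diagnostic_members(archive_path: str) -> list:
    """Names of heap dumps, JVM error files and JFR recordings in a downloaded .tgz."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [m.name for m in tar.getmembers()
                if m.isfile() and m.name.endswith(DIAG_SUFFIXES)]
```

For example, `parse_key("/oceanus-diagnosis/cluster-abc/job-123/run-7/taskmanager-1-tm01.tgz").role` returns `"taskmanager"`; the IDs in that key are made-up placeholders.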