Tencent Cloud

Elastic MapReduce


Flink Overview

Last updated: 2025-01-03 15:02:25
Flink is an open-source distributed, high-performance, highly available, and accurate data stream execution engine. It provides diverse features such as data distribution, data communication, and fault tolerance for distributed computations over data streams. Based on the stream execution engine, Flink provides APIs at a higher abstraction layer for you to write distributed jobs.
Distributed: Flink can run on clusters of many machines.
High performance: Flink processes data with high throughput and low latency.
Highly available: Flink automatically restarts applications after failures.
Accurate: Flink guarantees the correctness of processing results.

As shown in the figure, the data source on the left provides real-time logs or data from databases, file systems, or key-value stores. Flink, in the middle, processes the data and writes the computed results to the destination on the right, which can be an application or storage system. In summary, a Flink job has three core components:
Data source: The input component on the left.
Transformations: The operators, which process the data.
Data sink: The output component, which writes computed results to other application or storage systems.
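The source → transformation → sink pattern above can be sketched in a few lines. This is plain Python, not the Flink API; the function names are illustrative only (in a real Flink job the source would be Kafka, files, or sockets, and the sink a database or file system):

```python
def source():
    """Pretend event stream (in Flink: Kafka, a file system, a socket...)."""
    yield from ["error: disk full", "info: started", "error: timeout"]

def transform(events):
    """Operator chain: filter error events, then normalize them."""
    for e in events:
        if e.startswith("error"):
            yield e.upper()

def sink(events):
    """Collect results (in Flink: write to a database, file system, etc.)."""
    return list(events)

results = sink(transform(source()))
print(results)  # ['ERROR: DISK FULL', 'ERROR: TIMEOUT']
```

The key point the sketch preserves is that transformations are composed over an unbounded iterator of events rather than over a fully materialized data set.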

Use cases

Flink has the following three use cases:
1. Event-driven applications
An event-driven application is stateful. It ingests data from one or more event streams and triggers computations, state updates, or external actions based on the incoming events.

In a traditional architecture (left), an application reads and writes data from a remote transactional database, such as MySQL. In an event-driven application, data and computation are co-located: the application only needs to access local memory or disk to get data, so it delivers higher throughput and lower latency.
Flink supports event-driven applications by virtue of the following features:
Efficient state management: Flink comes with built-in state backends that store intermediate state.
Rich windows: Flink provides tumbling, sliding, and other window types.
Various notions of time: Flink supports event time, processing time, and ingestion time.
Fault tolerance: Flink supports "at-least-once" and "exactly-once" processing guarantees.
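The difference between tumbling and sliding windows comes down to how a timestamped event is assigned to window start times. A minimal plain-Python sketch (not the Flink API; window sizes and timestamps are arbitrary illustrative values):

```python
def tumbling_window(ts, size):
    # Each event belongs to exactly one window of length `size`.
    return [ts - (ts % size)]

def sliding_window(ts, size, slide):
    # Each event can belong to several overlapping windows
    # that advance by `slide` and span `size`.
    start = ts - (ts % slide)
    starts = []
    while start > ts - size:
        starts.append(start)
        start -= slide
    return sorted(starts)

print(tumbling_window(13, size=10))          # [10]
print(sliding_window(13, size=10, slide=5))  # [5, 10]
```

With a 10-unit window sliding every 5 units, the event at time 13 is counted in both the [5, 15) and [10, 20) windows, whereas a tumbling window counts it exactly once.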
2. Real-time data analytics applications
Analytical jobs extract valuable information and metrics from raw data. Traditionally, this is done with batch queries, or by recording events and running applications over a bounded data set. To analyze the latest data, this mode must first add the new data to the data set, then re-run the query or application, and finally write the result to a storage system or generate a report.
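The contrast between the batch mode described above and streaming analytics can be sketched as follows. This is plain Python, not Flink: the point is only that a streaming aggregation keeps running state and updates it per event instead of re-reading the whole data set for each new result.

```python
events = [3, 1, 4, 1, 5]

# Batch style: every fresh result re-reads the entire data set.
def batch_sum(dataset):
    return sum(dataset)

# Streaming style: keep running state, update it per event --
# conceptually what a stateful Flink aggregation does.
state = 0
incremental = []
for e in events:
    state += e
    incremental.append(state)

assert incremental[-1] == batch_sum(events)  # same final answer
print(incremental)  # [3, 4, 8, 9, 14]
```

Both reach the same final answer, but the streaming version emits an up-to-date result after every event, with O(1) work per event.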

3. Real-time data warehouse and ETL
Extract, Transform, and Load (ETL) is the process of extracting data from a business system, transforming it, and loading it into a data warehouse. In the traditional mode, an offline data warehouse stores business data centrally and, at regular intervals, runs ETL and other models according to fixed computing logic to generate reports. It mainly builds T+1 offline data: periodic jobs pull incremental data every day, build topic-level data for different businesses, and expose a T+1 data query API externally.
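A toy extract-transform-load step, sketched in plain Python (not Flink; the record fields are invented for illustration). Continuous ETL applies the same per-record logic, but as events arrive rather than in a nightly batch:

```python
raw = [
    {"user": "alice", "amount": "12.50"},
    {"user": "bob",   "amount": "bad"},   # dirty record
    {"user": "carol", "amount": "7.25"},
]

def etl(records):
    for r in records:                      # extract
        try:
            amount = float(r["amount"])    # transform: cast and validate
        except ValueError:
            continue                       # drop dirty records
        # emit a load-ready record for the warehouse
        yield {"user": r["user"], "amount_cents": int(amount * 100)}

print(list(etl(raw)))
```

In a real-time data warehouse, this per-record pipeline runs continuously, so transformed data is queryable within seconds instead of T+1.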

The above figure compares offline data warehouse ETL with a real-time data warehouse. As can be seen, the offline mode is inferior in both computation latency and data timeliness. Since data loses value over time and should reach users as soon as possible, real-time data warehouses are in demand. For more information on the API layers, see the following documents:

Table API & SQL: The Table API is tightly integrated with the DataSet and DataStream APIs. You can create a table from a DataSet or DataStream and convert it back through operations such as FILTER, SUM, JOIN, and SELECT. The SQL API is built on Apache Calcite, which implements the SQL standard, so you can use SQL statements directly. The Table API and SQL API can work together, as both return table objects.
DataStream API and DataSet API: They process streaming data and batch data respectively, and wrap the lower-level APIs with higher-order functions such as FILTER, SUM, MAX, and MIN. They are easy to use and popular in practice.
Stateful Stream Processing: The lowest-level abstraction. It provides fine-grained control over time and state, is more complex to use, and mainly applies to complex event processing logic.

Environment information

By default, Flink is deployed on the master and core nodes in a cluster. It is an out-of-the-box service.
After logging in, you can run the su hadoop command to switch to the hadoop user and then perform local tests.
The Flink software path is /usr/local/service/flink.
The log path is /data/emr/flink/logs.
For more information, see the community documentation.
