Data Sync

Last updated: 2019-11-06 17:01:21

Operation Scenarios

CKafka Connector is an efficient data transfer service implemented based on the open-source Kafka Connector.
In ETL workflows, Kafka is often used as the message middleware for both offline and real-time use cases, but there is no built-in pipeline that seamlessly connects Kafka to its upstream and downstream systems. For example, Flume or Logstash is used to collect data into Kafka, and the data is then pulled from or pushed out of Kafka to the target storage by some other means.

Kafka Connector is designed to build a scalable and reliable data flow channel around Kafka. It lets massive amounts of data pass quickly through Kafka to and from other source or target systems, forming a low-latency pipeline for data transfer across instances.
You can enable mutual data transfer between any topics in different CKafka instances in the same region (through the console) or in different regions (through the TencentCloud APIs of CKafka). To implement data sync, you don't need to install or configure any additional hardware devices; instead, simply enter the corresponding CKafka instance ID and topic ID.

  • Currently, the console only supports data sync between topics within or across CKafka instances in the same region, and cross-AZ data sync will experience a delay of over 3 ms (depending on the distance between the AZs).
  • TencentCloud API supports cross-region sync which will experience a delay of over 10 ms (depending on the distance between the regions).

Prerequisites

  • This feature is currently under beta test. To try it out, submit a ticket for application.
  • As a pipeline for data transfer, CKafka Connector is only as reliable as the upstream and downstream data it connects. For more information, see CKafka Data Reliability Description. If you have higher requirements for data reliability, back up your data by other means as well.
  • Configure monitors and alarms to stay informed of the data sync status. The traffic generated by data sync in CKafka Connector counts against the instance's peak throughput quota.
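
Because sync traffic shares the instance's peak throughput quota, an alarm rule can take it into account. The following is a minimal Python sketch, not a CKafka API: the function names, the MB/s units, and the 80% threshold are all assumptions chosen for illustration.

```python
def sync_traffic_share(sync_mb_per_s, peak_quota_mb_per_s):
    """Return the fraction of the peak throughput quota used by data sync."""
    if peak_quota_mb_per_s <= 0:
        raise ValueError("peak quota must be positive")
    return sync_mb_per_s / peak_quota_mb_per_s


def should_alarm(sync_mb_per_s, other_mb_per_s, peak_quota_mb_per_s, threshold=0.8):
    """Return True when total traffic reaches the threshold share of the quota.

    The default threshold of 0.8 is a hypothetical value, not a CKafka default.
    """
    total = sync_mb_per_s + other_mb_per_s
    return total / peak_quota_mb_per_s >= threshold
```

For example, with a 40 MB/s quota, 10 MB/s of sync traffic uses a quarter of the quota, and adding 25 MB/s of regular traffic would cross the assumed 80% threshold.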

Directions

Creating a Data Sync Instance

In the console, you can only create connectors that sync data at the topic level within the same region (if you need cross-region sync, submit a ticket for application). You can select multiple topics. During sync, topic attributes such as the number of partitions and replicas are replicated as well.

To keep data sync flexible, CKafka does not check whether the sync between the source and target instances forms a loop. For example, if you select the same instance as both source and target and keep syncing its topics, an infinite loop will occur and consume the traffic of your CKafka instance. Therefore, when creating a data sync instance, make sure the source and target instances do not form a sync loop.
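
Since CKafka does not perform this check, you can run one yourself before creating sync instances. The following is a minimal Python sketch that looks for a loop among planned sync tasks using depth-first search. The representation of a task as a (source, target) pair of (instance ID, topic) tuples, the function name, and the sample IDs are assumptions for illustration and are not part of any CKafka API.

```python
from collections import defaultdict


def find_sync_loop(tasks):
    """Return one loop as a list of endpoints, or None if there is no loop.

    `tasks` is an iterable of (source, target) pairs. Each endpoint is a
    hypothetical (instance_id, topic) tuple.
    """
    graph = defaultdict(list)
    for source, target in tasks:
        graph[source].append(target)

    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = defaultdict(int)
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, ()):
            if color[nxt] == GRAY:
                # nxt is already on the current path, so the path loops back
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                loop = visit(nxt)
                if loop:
                    return loop
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            loop = visit(node)
            if loop:
                return loop
    return None


# Hypothetical endpoints: instance IDs and topic names are made up.
a = ("ckafka-aaa", "orders")
b = ("ckafka-bbb", "orders")
print(find_sync_loop([(a, b), (b, a)]))
```

A task whose source and target are the same endpoint is reported as a loop of length one, which matches the same-instance example above.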

Viewing Task Configuration

You can view the configuration of a data sync instance in the "Operation" column in the list. A data sync instance can be in one of the following statuses:

  • Unassigned: A connector is in this status when it is not assigned to any worker. This generally happens shortly after the connector is created or while the Connect cluster is rebalancing.
  • Paused: The task is paused, and the connector will not replicate data.
  • Running: The task is in progress, the connector has successfully generated the task configuration, and all relevant tasks are running normally.
  • Failed: The connector cannot generate a new task configuration, or the task failed. If the connector cannot generate a new task configuration, but an old valid task configuration exists, the tasks will continue running according to the old configuration.
  • Terminating: The data sync instance is being terminated.
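
If you track these statuses in your own monitoring scripts, they can be modeled as an enumeration. The following Python sketch is only an illustration: the class and helper names are assumptions, and treating "Unassigned" and "Terminating" as transient is an interpretation of the descriptions above rather than documented behavior.

```python
from enum import Enum


class SyncStatus(Enum):
    """Statuses of a data sync instance as listed above."""
    UNASSIGNED = "Unassigned"
    PAUSED = "Paused"
    RUNNING = "Running"
    FAILED = "Failed"
    TERMINATING = "Terminating"


# Assumption: these statuses are expected to change on their own.
TRANSIENT_STATUSES = {SyncStatus.UNASSIGNED, SyncStatus.TERMINATING}


def is_transient(status):
    """Return True for statuses that normally resolve without user action."""
    return status in TRANSIENT_STATUSES


def should_alert(status):
    """Return True when the status calls for investigation.

    A failed connector may keep running on an old valid configuration,
    so "Failed" does not necessarily mean that replication has stopped.
    """
    return status is SyncStatus.FAILED
```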

Manipulating a Data Sync Instance

Currently, all operations are asynchronous and take some time to complete, so the task status may not update in real time.

  • Start: A paused instance can be restarted to continue syncing data from where the sync paused.
  • Pause: A started instance can be paused. If you find that the data sync service affects normal use of CKafka, you can pause data sync.
    • If data sync is restarted within 24 hours after being paused, the message sync progress at the time of pause can be restored.
    • Otherwise, data sync will start from the starting point you specify. If you don't specify one, sync will start from the latest data. (For more information on how to specify.
  • Delete: This operation stops data sync. It does not affect the data already synced or the relevant CKafka instances.
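
The restart rule described under Pause can be summarized in a short Python sketch. It illustrates the decision only and is not CKafka code: the function name, the return format, and the choice to treat exactly 24 hours as still within the window are assumptions.

```python
from datetime import datetime, timedelta

# The 24-hour window comes from the text above.
RESUME_WINDOW = timedelta(hours=24)


def resume_position(paused_at, restarted_at, paused_offset, specified_start=None):
    """Decide where sync continues after a restart.

    Returns a (mode, value) tuple:
      ("paused_offset", offset)  progress at the time of pause is restored
      ("specified", start)       sync starts from the point you specified
      ("latest", None)           sync starts from the latest data
    """
    # Assumption: a restart exactly 24 hours after the pause still counts.
    if restarted_at - paused_at <= RESUME_WINDOW:
        return ("paused_offset", paused_offset)
    if specified_start is not None:
        return ("specified", specified_start)
    return ("latest", None)


paused = datetime(2019, 11, 6, 12, 0)
print(resume_position(paused, paused + timedelta(hours=2), paused_offset=1500))
print(resume_position(paused, paused + timedelta(hours=30), paused_offset=1500))
```

The first call falls inside the window and returns the progress saved at pause time. The second falls outside it and, with no starting point given, falls back to the latest data.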

Use Cases

  • Ecommerce: When one producer writes the full data set and multiple consumers read it at different rates, CKafka Connector relieves the I/O bottleneck caused by constant data flushing without requiring a second producer, which significantly improves cost-effectiveness.
  • Data sync between two different CKafka instances: CKafka Connector enables smooth data sync in scenarios with one data producer and multiple consumers in different regions or AZs.