Hadoop DistCp (Distributed copy) is a tool used for large inter- and intra-cluster copying. It uses MapReduce to perform distribution, error handling, recovery, and reporting. With the parallel processing capabilities of MapReduce, Hadoop DistCp can quickly perform large-scale data migration through map tasks, each of which copies a portion of the files specified under the source path.
Since Hadoop-COS implements the semantics of the Hadoop Distributed File System (HDFS), Hadoop DistCp can help you easily migrate data between COS and HDFS. This document outlines this process below.
Run the following command. If the files in the COS bucket are listed correctly, Hadoop-COS has been installed and configured correctly, and you can proceed with the steps below.
hadoop fs -ls cosn://examplebucket-1250000000/
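For scripted migrations, the same listing check can gate the later steps. The following is a minimal sketch, not part of Hadoop-COS: the function name check_cosn_access and the HADOOP_CMD override are hypothetical and only serve to illustrate the idea.

```shell
# Hypothetical pre-flight helper (not part of Hadoop-COS).
# HADOOP_CMD is an assumed override so the logic can be tried without a cluster.
check_cosn_access() {
  bucket_uri="$1"
  # Succeeds only if the bucket root can be listed through Hadoop-COS.
  if ${HADOOP_CMD:-hadoop} fs -ls "$bucket_uri" >/dev/null 2>&1; then
    echo "OK: $bucket_uri is listable"
  else
    echo "FAIL: cannot list $bucket_uri; check the Hadoop-COS installation and configuration" >&2
    return 1
  fi
}

# Example call, using the bucket from the command above:
# check_cosn_access cosn://examplebucket-1250000000/ && echo "ready to migrate"
```

A non-zero return value lets a calling script stop before any DistCp job is submitted.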
- You can grant sub-accounts read/write permissions on resources in the COS bucket as needed. We recommend following Notes on Principle of Least Privilege and Setting Sub-user Permissions when doing so. The common preset policies are as follows:
- The custom monitoring feature requires Cloud Monitoring to have permission to report metrics and read API operations. Please grant the QcloudMonitorFullAccess permission with caution.
Use Hadoop DistCp to migrate files in the /test directory of the local HDFS cluster to the COS bucket:
hadoop distcp hdfs://10.0.0.3:9000/test cosn://hdfs-test-1250000000/
Hadoop DistCp starts MapReduce tasks to copy the files and outputs a brief report when the job finishes.
Run the hadoop fs -ls -R cosn://hdfs-test-1250000000/ command to list the directories and files that have just been migrated to the bucket.
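Beyond listing the bucket, a quick sanity check is to compare file counts and total bytes on both sides. The sketch below assumes the output format of hadoop fs -count (DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME) and assumes the copy landed under /test in the bucket; the function name same_files_and_bytes is hypothetical.

```shell
# Hypothetical helper: compares two lines of `hadoop fs -count` output.
# Each line has the columns DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME.
# Only file count and total bytes are compared, on the assumption that
# directory counts may be reported differently by an object store.
same_files_and_bytes() {
  src_summary=$(printf '%s\n' "$1" | awk '{print $2, $3}')
  dst_summary=$(printf '%s\n' "$2" | awk '{print $2, $3}')
  if [ "$src_summary" = "$dst_summary" ]; then
    echo "match: $src_summary (files bytes)"
  else
    echo "mismatch: source=$src_summary destination=$dst_summary" >&2
    return 1
  fi
}

# Example call (assumes the copy landed under /test in the bucket):
# same_files_and_bytes \
#   "$(hadoop fs -count hdfs://10.0.0.3:9000/test)" \
#   "$(hadoop fs -count cosn://hdfs-test-1250000000/test)"
```

Matching counts and sizes do not prove the contents are identical, but a mismatch is a cheap early signal that the copy is incomplete.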
Hadoop DistCp is a tool that supports copying data between different clusters and file systems. To copy COS files to a local HDFS cluster, simply use the object path in the COS bucket as the source path and the HDFS file path as the destination path.
hadoop distcp cosn://hdfs-test-1250000000/test hdfs://10.0.0.3:9000/
The Hadoop-COS configuration can also be specified directly on the Hadoop DistCp command line. With this command line, you can migrate data from HDFS to COS, and vice versa.
Run the following command:
hadoop distcp -Dfs.cosn.impl=org.apache.hadoop.fs.CosFileSystem -Dfs.cosn.bucket.region=ap-XXX -Dfs.cosn.userinfo.secretId=AK**XXX -Dfs.cosn.userinfo.secretKey=XXXX -libjars /home/hadoop/hadoop-cos-2.6.5-shaded.jar cosn://bucketname-appid/test/ hdfs:///test/
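fs.cosn.userinfo.secretId: SecretId of the bucket owner’s account. The value can be obtained at Manage API Key.
fs.cosn.userinfo.secretKey: SecretKey of the bucket owner’s account. The value can be obtained at Manage API Key.
-libjars: path to the Hadoop-COS JAR file, which is available in the dep directory at GitHub repository.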
For other parameters, please see Hadoop-COS.
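To avoid hardcoding the key pair in a script, the -D options above can be filled from environment variables. This is a sketch under stated assumptions: the function name distcp_with_cos_options and the variable names COS_REGION, COS_SECRET_ID, COS_SECRET_KEY, HADOOP_COS_JAR, and HADOOP_CMD are all hypothetical, while the option names are the ones used in the command above.

```shell
# Hypothetical wrapper: reads the region, key pair, and JAR path from
# environment variables (assumed names) and passes them as the options
# used in the command above. The script stops if any of them is unset.
# HADOOP_CMD is an assumed override, e.g. HADOOP_CMD=echo for a dry run.
distcp_with_cos_options() {
  src="$1"
  dst="$2"
  ${HADOOP_CMD:-hadoop} distcp \
    -Dfs.cosn.impl=org.apache.hadoop.fs.CosFileSystem \
    -Dfs.cosn.bucket.region="${COS_REGION:?set COS_REGION}" \
    -Dfs.cosn.userinfo.secretId="${COS_SECRET_ID:?set COS_SECRET_ID}" \
    -Dfs.cosn.userinfo.secretKey="${COS_SECRET_KEY:?set COS_SECRET_KEY}" \
    -libjars "${HADOOP_COS_JAR:?set HADOOP_COS_JAR}" \
    "$src" "$dst"
}

# Example call:
# distcp_with_cos_options cosn://bucketname-appid/test/ hdfs:///test/
```

This keeps the key pair out of the script file, but the values are still visible in the process arguments while the job is being submitted.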
Hadoop DistCp supports a variety of parameters. For example, you can use -m to specify the maximum number of concurrent map tasks, and -bandwidth to limit the maximum bandwidth used by each map. For more information, please see Apache Hadoop DistCp Guide.
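As an illustration of how the two parameters combine, the sketch below computes the rough upper bound on total copy throughput and prints a DistCp command for review. The values 10 and 10 are illustrative only, and the sketch assumes -bandwidth is expressed in MB/s per map, as described in the Apache guide.

```shell
# -m caps the number of simultaneous map tasks; -bandwidth caps each map in MB/s,
# so the job as a whole is bounded by roughly maps * per-map bandwidth.
max_maps=10            # illustrative value, not a recommendation
per_map_mb_per_sec=10  # illustrative value, not a recommendation

aggregate_cap_mb_per_sec=$((max_maps * per_map_mb_per_sec))
echo "Upper bound on copy throughput: ${aggregate_cap_mb_per_sec} MB/s"

# Print the command for review instead of running it.
echo "hadoop distcp -m ${max_maps} -bandwidth ${per_map_mb_per_sec} hdfs://10.0.0.3:9000/test cosn://hdfs-test-1250000000/"
```

Choosing the two values together, rather than independently, makes it easier to keep a migration from saturating a shared network link.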