Hadoop Distcp (Distributed copy) is a tool used for large inter- and intra-cluster copying. It uses MapReduce to perform distribution, error handling, recovery, and reporting. With the parallel processing capabilities of MapReduce, Hadoop Distcp can quickly perform large-scale data migration through map tasks, each of which copies a portion of the files specified under the source path.
Since Hadoop-COS implements the semantics of the Hadoop Distributed File System (HDFS), Hadoop Distcp can help you easily migrate data between COS and HDFS. This document outlines this process below.
Before you begin, run the following command to check that Hadoop-COS is working. If the contents of the COS bucket are listed correctly, there are no issues with Hadoop-COS, and you can begin the steps outlined below.
hadoop fs -ls cosn://examplebucket-1250000000/
Use Hadoop Distcp to migrate files in the /test directory of the local HDFS cluster to the COS bucket hdfs-test-1250000000:
hadoop distcp hdfs://10.0.0.3:9000/test cosn://hdfs-test-1250000000/
Hadoop Distcp starts MapReduce tasks to copy the files and prints a brief report when the job finishes.
Run the hadoop fs -ls -R cosn://hdfs-test-1250000000/ command to list the directories and files that have just been migrated to the bucket.
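Beyond listing the bucket, a quick sanity check is to compare the file count and total size reported by hadoop fs -count for the source and the destination. The following is a minimal sketch, not part of Hadoop-COS: the helper names files_and_bytes and same_files_and_bytes are hypothetical, and the destination path in the usage comment assumes the files landed under /test in the bucket.

```shell
# Extract "FILE_COUNT CONTENT_SIZE" from one line of `hadoop fs -count` output,
# whose columns are: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME.
files_and_bytes() {
  awk '{ print $2, $3 }'
}

# Succeed when two `hadoop fs -count` lines report the same file count and size.
# Directory counts are ignored because they can differ between file systems.
same_files_and_bytes() {
  [ "$(printf '%s\n' "$1" | files_and_bytes)" = "$(printf '%s\n' "$2" | files_and_bytes)" ]
}

# Usage (requires a configured Hadoop client; the paths are assumptions):
#   src="$(hadoop fs -count hdfs://10.0.0.3:9000/test)"
#   dst="$(hadoop fs -count cosn://hdfs-test-1250000000/test)"
#   same_files_and_bytes "$src" "$dst" && echo "file count and size match"
```

Matching counts and sizes do not prove the contents are identical, but a mismatch is a cheap early signal that the copy is incomplete.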
Hadoop Distcp is a tool that supports copying data between different clusters and file systems. To copy COS files to a local HDFS cluster, simply use the object path in the COS bucket as the source path and the HDFS file path as the destination path.
hadoop distcp cosn://hdfs-test-1250000000/test hdfs://10.0.0.3:9000/
You can also pass the Hadoop-COS configuration directly on the command line to migrate data from HDFS to COS, and vice versa.
Run the following command:
hadoop distcp -Dfs.cosn.impl=org.apache.hadoop.fs.CosFileSystem -Dfs.cosn.bucket.region=ap-XXX -Dfs.cosn.userinfo.secretId=AK**XXX -Dfs.cosn.userinfo.secretKey=XXXX -libjars /home/hadoop/hadoop-cos-2.6.5-shaded.jar cosn://bucketname-appid/test/ hdfs:///test/
The JAR file specified with -libjars is located in the dep directory, which you can find on GitHub.
For any other parameters, see Hadoop-COS.
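To avoid typing the credentials into each command, the -D options can be assembled from environment variables. The sketch below is one possible approach under stated assumptions: the function name cosn_opts and the variable names COS_REGION, COS_SECRET_ID, and COS_SECRET_KEY are hypothetical, while the property names are the ones used in the command above.

```shell
# Print the Hadoop-COS -D options, reading values from environment variables.
# COS_REGION, COS_SECRET_ID, and COS_SECRET_KEY are hypothetical names;
# the function fails with a message if any of them is unset or empty.
cosn_opts() {
  : "${COS_REGION:?COS_REGION is not set}"
  : "${COS_SECRET_ID:?COS_SECRET_ID is not set}"
  : "${COS_SECRET_KEY:?COS_SECRET_KEY is not set}"
  printf '%s %s %s %s\n' \
    "-Dfs.cosn.impl=org.apache.hadoop.fs.CosFileSystem" \
    "-Dfs.cosn.bucket.region=${COS_REGION}" \
    "-Dfs.cosn.userinfo.secretId=${COS_SECRET_ID}" \
    "-Dfs.cosn.userinfo.secretKey=${COS_SECRET_KEY}"
}

# Usage (values must not contain whitespace, since the output is word-split):
#   hadoop distcp $(cosn_opts) -libjars /home/hadoop/hadoop-cos-2.6.5-shaded.jar \
#     cosn://bucketname-appid/test/ hdfs:///test/
```

Note that the expanded options are still visible in the process list while the job is being submitted, so this keeps secrets out of shell history and scripts rather than hiding them entirely.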
Hadoop Distcp supports a variety of parameters. For example, you can use -m to specify the maximum number of concurrent map tasks, and -bandwidth to limit the maximum bandwidth (in MB/s) used by each map. For more information, see the Apache Hadoop DistCp Guide.
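As an illustration of these two parameters, the sketch below assembles a DistCp command line with -m and -bandwidth and prints it for review. The function name build_distcp_cmd is hypothetical, and the values 20 and 50 are arbitrary examples rather than recommendations.

```shell
# Print a DistCp command line with tuning options.
# Arguments: source path, destination path, max concurrent maps (default 10),
# per-map bandwidth limit in MB/s (default 50). The defaults are illustrative.
build_distcp_cmd() {
  distcp_src="$1"
  distcp_dst="$2"
  distcp_maps="${3:-10}"
  distcp_bw="${4:-50}"
  printf 'hadoop distcp -m %s -bandwidth %s %s %s\n' \
    "$distcp_maps" "$distcp_bw" "$distcp_src" "$distcp_dst"
}

# Print the command only; review it, then paste it into a shell to run it.
build_distcp_cmd hdfs://10.0.0.3:9000/test cosn://hdfs-test-1250000000/ 20 50
```

With at most 20 maps each limited to 50 MB/s, the job as a whole is capped at roughly 1000 MB/s, which gives a simple way to reason about the load a migration places on the cluster network.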