Migrating Data Between HDFS and COS

Last updated: 2021-02-25 15:30:32

    Overview

    Hadoop DistCp (Distributed copy) is a tool used for large inter- and intra-cluster copying. It uses MapReduce to perform distribution, error handling, recovery, and reporting. With the parallel processing capabilities of MapReduce, Hadoop DistCp can quickly perform large-scale data migration through map tasks, each of which copies a portion of the files specified under the source path.

    Since Hadoop-COS implements the semantics of the Hadoop Distributed File System (HDFS), you can use Hadoop DistCp to migrate data between COS and HDFS. This document describes how to do so.

    Prerequisites

    1. The Hadoop-COS plugin has been installed on the Hadoop cluster, and the access key for COS has been configured correctly. You can use the following Hadoop command to check whether COS can be accessed:
      hadoop fs -ls cosn://examplebucket-1250000000/
      If the files in the COS bucket are listed correctly, Hadoop-COS has been installed and configured correctly, and you can proceed with the steps below.
    2. The COS access account must have read and write permission for the destination path of the COS bucket.
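    The access check in prerequisite 1 can be wrapped in a small function so that a migration script stops early when COS is unreachable. This is a minimal sketch, not part of Hadoop-COS: the function name check_cos_access and the HADOOP_BIN override variable are hypothetical, introduced here only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the given cosn:// path can be listed,
# and a non-zero status otherwise.
# HADOOP_BIN is an assumed override for the hadoop executable (default: hadoop).
check_cos_access() {
  local bucket_uri="$1"
  "${HADOOP_BIN:-hadoop}" fs -ls "$bucket_uri" > /dev/null 2>&1
}
```

    For example, a script could run the following line before starting any copy, using the example bucket from the text above:
      check_cos_access cosn://examplebucket-1250000000/ || echo "COS access failed" >&2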

    Directions

    Copying files from HDFS to a COS bucket

    The following example uses Hadoop DistCp to migrate files in the /test directory of the local HDFS cluster to the COS bucket hdfs-test-1250000000.

    1. Run the following command to start the migration:
    hadoop distcp hdfs://10.0.0.3:9000/test cosn://hdfs-test-1250000000/

    Hadoop DistCp starts MapReduce tasks to copy the files and outputs a brief report when the job completes.

    2. Run the hadoop fs -ls -R cosn://hdfs-test-1250000000/ command to list the directories and files that have just been migrated to the bucket hdfs-test-1250000000.
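    Besides listing the bucket, the result of a migration can be sanity-checked by comparing the file count and total size of the source and destination. The sketch below relies on hadoop fs -count, which prints the columns DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME. The function names and the HADOOP_BIN override are hypothetical. Directory counts are deliberately ignored, on the assumption that an object store may not report directories the same way HDFS does.

```shell
#!/usr/bin/env bash
# Print "FILE_COUNT CONTENT_SIZE" for a path, taken from columns 2 and 3
# of the `hadoop fs -count` output.
# HADOOP_BIN is an assumed override for the hadoop executable (default: hadoop).
count_files_and_bytes() {
  "${HADOOP_BIN:-hadoop}" fs -count "$1" | awk '{ print $2, $3 }'
}

# Hypothetical helper: returns 0 when both paths report the same
# file count and the same total size in bytes.
same_contents() {
  local src dst
  src="$(count_files_and_bytes "$1")"
  dst="$(count_files_and_bytes "$2")"
  [ -n "$src" ] && [ "$src" = "$dst" ]
}
```

    Assuming the files landed under /test in the bucket, the check could be invoked as shown below. Adjust the destination path if the files were written elsewhere. Matching counts and sizes do not prove that the file contents are identical.
      same_contents hdfs://10.0.0.3:9000/test cosn://hdfs-test-1250000000/test && echo "counts match"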

    Copying files from a COS bucket to a local HDFS cluster

    Hadoop DistCp is a tool that supports copying data between different clusters and file systems. To copy COS files to a local HDFS cluster, simply use the object path in the COS bucket as the source path and the HDFS file path as the destination path.

    hadoop distcp cosn://hdfs-test-1250000000/test hdfs://10.0.0.3:9000/

    Using the DistCp command line to migrate data between HDFS and COS

    Note:

    With this command line, you can migrate data from HDFS to COS, and vice versa.

    Run the following command:

    hadoop distcp -Dfs.cosn.impl=org.apache.hadoop.fs.CosFileSystem -Dfs.cosn.bucket.region=ap-XXX  -Dfs.cosn.userinfo.secretId=AK**XXX  -Dfs.cosn.userinfo.secretKey=XXXX  -libjars /home/hadoop/hadoop-cos-2.6.5-shaded.jar  cosn://bucketname-appid/test/ hdfs:///test/

    Parameter description:

    • -Dfs.cosn.impl: Always set this to org.apache.hadoop.fs.CosFileSystem.
    • -Dfs.cosn.bucket.region: The region of the bucket. You can view the region on the bucket list page of the COS console.
    • -Dfs.cosn.userinfo.secretId: The SecretId of the bucket owner’s account. You can obtain the value at Manage API Key.
    • -Dfs.cosn.userinfo.secretKey: The SecretKey of the bucket owner’s account. You can obtain the value at Manage API Key.
    • -libjars: The location of the Hadoop-COS JAR package. You can download the package from the dep directory of the GitHub repository.

    Note:

    For other parameters, please see Hadoop-COS.
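    The command line above can be assembled from environment variables so that the region, keys, and JAR path are not hard-coded in a script. This is a sketch under stated assumptions: the function name run_distcp_with_cos and the variables COS_REGION, COS_SECRET_ID, COS_SECRET_KEY, COS_JAR, and HADOOP_BIN are hypothetical and are not defined by Hadoop or Hadoop-COS. Only the -D properties and -libjars come from the command shown above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the DistCp command line described above.
# Assumed environment variables (names invented for this sketch):
#   COS_REGION, COS_SECRET_ID, COS_SECRET_KEY  values for the fs.cosn.* properties
#   COS_JAR                                    path to the Hadoop-COS JAR package
#   HADOOP_BIN                                 hadoop executable (default: hadoop)
# The ${VAR:?message} form aborts with an error if a variable is unset or empty.
run_distcp_with_cos() {
  local src="$1" dst="$2"
  "${HADOOP_BIN:-hadoop}" distcp \
    -Dfs.cosn.impl=org.apache.hadoop.fs.CosFileSystem \
    "-Dfs.cosn.bucket.region=${COS_REGION:?set COS_REGION}" \
    "-Dfs.cosn.userinfo.secretId=${COS_SECRET_ID:?set COS_SECRET_ID}" \
    "-Dfs.cosn.userinfo.secretKey=${COS_SECRET_KEY:?set COS_SECRET_KEY}" \
    -libjars "${COS_JAR:?set COS_JAR}" \
    "$src" "$dst"
}
```

    Setting HADOOP_BIN=echo prints the assembled arguments instead of running the job, which is a convenient dry run. Note that the keys still appear in the argument list of the running process, so this approach only keeps them out of the script file itself.
      run_distcp_with_cos cosn://bucketname-appid/test/ hdfs:///test/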

    Additional Hadoop DistCp Parameters

    Hadoop DistCp supports a variety of parameters. For example, you can use -m to specify the maximum number of concurrent map tasks, and -bandwidth to limit the maximum bandwidth used by each map task. For more information, please see Apache Hadoop DistCp Guide.
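    To illustrate the -m and -bandwidth parameters mentioned above, the following sketch wraps a throttled copy in a function. The function name distcp_throttled and the HADOOP_BIN override are hypothetical, and the values in the usage line are arbitrary examples rather than recommendations. The -bandwidth value is specified per map, in MB per second.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run DistCp with a cap on the number of concurrent
# map tasks (-m) and on the bandwidth of each map in MB/s (-bandwidth).
# HADOOP_BIN is an assumed override for the hadoop executable (default: hadoop).
distcp_throttled() {
  local max_maps="$1" bandwidth_mb="$2" src="$3" dst="$4"
  "${HADOOP_BIN:-hadoop}" distcp -m "$max_maps" -bandwidth "$bandwidth_mb" "$src" "$dst"
}
```

    For example, the following call copies the /test directory with at most 20 map tasks, each limited to 10 MB per second:
      distcp_throttled 20 10 hdfs://10.0.0.3:9000/test cosn://hdfs-test-1250000000/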