Importing/Exporting COS Using DataX

Last updated: 2022-04-19 14:23:43

    Environmental Dependencies

    Download and Installation

    Downloading HADOOP-COS

    Download HADOOP-COS and the corresponding cos_api-bundle from GitHub.

    Downloading DataX package

    Download DataX from GitHub.
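
    As a sketch, both projects can be cloned from their upstream GitHub repositories; prebuilt jars and packages are published on their release pages, so check there for the versions you need (the repository URLs below are listed as a convenience and are not part of the original steps):

    # Sketch only: clone the sources, or fetch prebuilt artifacts from the
    # projects' GitHub Releases pages instead.
    git clone https://github.com/tencentyun/hadoop-cos.git
    git clone https://github.com/alibaba/DataX.git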

    Installing HADOOP-COS

    After downloading HADOOP-COS, copy hadoop-cos-2.x.x-${version}.jar and cos_api-bundle-${version}.jar into both the plugin/reader/hdfsreader/libs/ and plugin/writer/hdfswriter/libs/ directories under the DataX decompression directory.
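
    A minimal sketch of the copy step, assuming DataX was decompressed to /opt/datax (a placeholder path) and with the jar names standing in for the versions you actually downloaded:

    # Sketch only: DATAX_HOME and the jar names are placeholders; substitute
    # your own decompression path and downloaded versions.
    DATAX_HOME=/opt/datax
    JARS="hadoop-cos-2.x.x-8.x.x.jar cos_api-bundle-5.x.x.jar"
    for d in plugin/reader/hdfsreader/libs plugin/writer/hdfswriter/libs; do
        cp $JARS "$DATAX_HOME/$d/"
    done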

    How to Use

    DataX configuration

    Modifying datax.py script

    Open the bin/datax.py script in the DataX decompression directory, and modify the CLASS_PATH variable in the script as follows:

    CLASS_PATH = ("%s/lib/*:%s/plugin/reader/hdfsreader/libs/*:%s/plugin/writer/hdfswriter/libs/*:.") % (DATAX_HOME, DATAX_HOME, DATAX_HOME)
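
    If you prefer to apply this change non-interactively, the following one-liner is a sketch: it assumes the stock script defines CLASS_PATH on a single unindented line (which may vary across DataX versions), and it keeps a backup of the original script:

    # Sketch only: run from the DataX decompression directory; the original
    # script is preserved as bin/datax.py.bak.
    sed -i.bak 's|^CLASS_PATH = .*|CLASS_PATH = ("%s/lib/*:%s/plugin/reader/hdfsreader/libs/*:%s/plugin/writer/hdfswriter/libs/*:.") % (DATAX_HOME, DATAX_HOME, DATAX_HOME)|' bin/datax.py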
    

    Configuring hdfsreader and hdfswriter in the JSON configuration file

    A sample JSON file is shown below:

    {
    "job": {
        "setting": {
            "speed": {
                "byte": 10485760
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [{
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "path": "testfile",
                    "defaultFS": "cosn://examplebucket-1250000000/",
                    "column": ["*"],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.cosn.impl": "org.apache.hadoop.fs.CosFileSystem",
                        "fs.cosn.userinfo.region": "ap-beijing",
                        "fs.cosn.tmp.dir": "/tmp/hadoop_cos",
                        "fs.cosn.userinfo.secretId": "COS_SECRETID",
                        "fs.cosn.userinfo.secretKey": "COS_SECRETKEY"
                    },
                    "fieldDelimiter": ","
                }
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "path": "/user/hadoop/",
                    "fileName": "testfile1",
                    "defaultFS": "cosn://examplebucket-1250000000/",
                    "column": [{
                            "name": "col",
                            "type": "string"
                        },
                        {
                            "name": "col1",
                            "type": "string"
                        },
                        {
                            "name": "col2",
                            "type": "string"
                        }
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.cosn.impl": "org.apache.hadoop.fs.CosFileSystem",
                        "fs.cosn.userinfo.region": "ap-beijing",
                        "fs.cosn.tmp.dir": "/tmp/hadoop_cos",
                        "fs.cosn.userinfo.secretId": "COS_SECRETID",
                        "fs.cosn.userinfo.secretKey": "COS_SECRETKEY"
                    },
                    "fieldDelimiter": ":",
                    "writeMode": "append"
                }
            }
        }]
    }
    }
    

    Notes:

    • Configure hadoopConfig as required for cosn.
    • Use defaultFS to specify the cosn path, e.g. cosn://examplebucket-1250000000/.
    • In fs.cosn.userinfo.region, enter the region where your bucket resides, such as ap-beijing. For more information, see Regions and Access Endpoints.
    • For COS_SECRETID and COS_SECRETKEY, use your own COS key information.

    The other fields can be configured in the same way as for HDFS. A quick syntax check of the finished file is sketched below.
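
    Before moving on, you can optionally verify that the finished configuration file is well-formed JSON. A minimal sketch, assuming a Python interpreter is available (DataX itself requires one) and the file is saved as job/hdfs_job.json as described in the next step:

    # Parses and pretty-prints the file, or reports the position of a syntax error.
    python -m json.tool job/hdfs_job.json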

    Migrating data

    Save the configuration file as hdfs_job.json in the job directory, then run the following command:

    bin/datax.py job/hdfs_job.json
    

    If the job succeeds, the resulting output is similar to the following:

    2020-03-09 16:49:59.543 [job-0] INFO  JobContainer - 
           [total cpu info] => 
                  averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
                  -1.00%                         | -1.00%                         | -1.00%
                          
            [total gc info] => 
                   NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
                   PS MarkSweep         | 1                  | 1                  | 1                  | 0.024s             | 0.024s             | 0.024s             
                   PS Scavenge          | 1                  | 1                  | 1                  | 0.014s             | 0.014s             | 0.014s             
    2020-03-09 16:49:59.543 [job-0] INFO  JobContainer - PerfTrace not enable!
    2020-03-09 16:49:59.543 [job-0] INFO  StandAloneJobContainerCommunicator - Total 2 records, 33 bytes | Speed 3B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.033s | Percentage 100.00%
    2020-03-09 16:49:59.544 [job-0] INFO  JobContainer - 
    Job start time                    : 2020-03-09 16:49:48
    Job end time                    : 2020-03-09 16:49:48
    Job duration                    :                 11s
    Average job traffic                    :                3B/s
    Recorded write speed                    :              0rec/s
    Recorded read count                    :                   2
    Read/Write failure count                    :                   0
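
    To verify the written files, you can list the destination path with any Hadoop client whose core-site.xml carries the same fs.cosn.* settings as the hadoopConfig above. A sketch, using the example bucket placeholder:

    # List and inspect the files hdfswriter produced; the wildcard allows for
    # any suffix the writer adds to fileName.
    hadoop fs -ls cosn://examplebucket-1250000000/user/hadoop/
    hadoop fs -cat cosn://examplebucket-1250000000/user/hadoop/testfile1*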
    