Hadoop-cos-DistChecker tool

Last updated: 2020-09-25 16:21:27

    Feature description

    After migrating data from HDFS to COS by using the hadoop distcp command, you can use the Hadoop-cos-DistChecker tool to verify the integrity of the migrated directory. Based on the parallel processing capabilities of MapReduce, it can quickly check the source directory against the destination directory.

    Operating Environment

    • Hadoop-cos v5.8.2 or above. For more information, see hadoop-cos release.
    • Runtime environment for Hadoop MapReduce
    • For self-built Hadoop clusters, the Hadoop-cos dependency must be the latest version (GitHub release 5.8.2 or above) so that it can return the CRC64 checksum.
    • If you are using the Tencent Cloud EMR suite, the Hadoop-cos version above is available only for clusters created after May 8, 2020. For earlier clusters, please [submit a ticket](https://console.cloud.tencent.com/workorder/category) for assistance.


    Hadoop-cos-DistChecker needs the CRC64 checksum of each object from Hadoop-COS (the COSN file system). Before running the tool, set fs.cosn.crc64.checksum.enabled to true. Once the tool finishes, set the value back to false to stop returning CRC64 checksums.

    The CRC64 checksum in Hadoop-COS is not compatible with the CRC32C checksum in HDFS, so after using this tool, be sure to set the above parameter to false. Otherwise, Hadoop-COS may fail to run in some scenarios where the file system getFileChecksum API is called.
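
    As a minimal sketch, and assuming the option is set in core-site.xml alongside the other fs.cosn.* options (this document does not say which file holds it), the property could look like this while the check runs:

```xml
<!-- Sketch only: enable while the check runs, then set the value back to false. -->
<property>
  <name>fs.cosn.crc64.checksum.enabled</name>
  <value>true</value>
</property>
```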


    Parameter description

    • Source file list
      The source file path list is a list of subdirectories and files to be checked that you export by running the following command:

      hadoop fs -ls -R hdfs://host:port/{source_dir} | awk '{print $8}' > check_list.txt

      The resulting check_list.txt contains one path per line.

    • Source directory: the directory where the source files are stored; it usually serves as the source path for data migration through the distcp command. For example, hdfs://host:port/source_dir is the source directory in the following sample:

      hadoop distcp hdfs://host:port/source_dir cosn://examplebucket-appid/dest_dir

      This is also the common parent directory of the paths in the source file list, such as /benchmarks.

    • Destination directory: the destination directory to check against.
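
    Because the source directory is expected to be the common parent of every entry in the source file list, it may be worth checking the list before submitting the job. The sketch below assumes bash and awk are available. `check_list_prefix` is a hypothetical helper, not part of Hadoop-cos-DistChecker. It prints each entry that does not start with the given prefix.

```shell
# Hypothetical helper (not part of the tool): print every non-empty line of
# a check list that does not start with the given prefix.
# Returns 0 when all entries are under the prefix, 1 otherwise.
check_list_prefix() {
  local list_file="$1" prefix="$2"
  awk -v p="$prefix" '
    NF == 0 { next }                       # skip blank lines
    index($0, p) != 1 { print; bad = 1 }   # entry is not under the prefix
    END { exit (bad ? 1 : 0) }
  ' "$list_file"
}
```

    For example, `check_list_prefix check_list.txt /benchmarks` prints nothing and returns 0 when every listed path begins with /benchmarks.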

    Command line format

    Hadoop-cos-DistChecker is a MapReduce task-based program, and can be submitted just like a MapReduce task:

    hadoop jar hadoop-cos-distchecker-2.8.5-1.0-SNAPSHOT.jar com.qcloud.cos.hadoop.distchecker.App <Absolute path of the source file list> <Absolute path representation of the source directory> <Absolute path representation of the destination directory> [optional parameters]

    [optional parameters] stands for any optional Hadoop parameters.


    The example below describes how to use this tool by checking hdfs:// against cosn://hdfs-test-1250000000/benchmarks.

    First, run the following command:

    hadoop fs -ls -R hdfs:// | awk '{print $8}' > check_list.txt

    This command exports all the source paths to be checked to check_list.txt, which stores the list of source file paths.

    Then, put check_list.txt into HDFS by running:

    hadoop fs -put check_list.txt hdfs://

    Run the Hadoop-cos-DistChecker to check hdfs:// against cosn://hdfs-test-1250000000/benchmarks, and output the result to the cosn://hdfs-test-1250000000/check_result path by using the following command:

    hadoop jar hadoop-cos-distchecker-2.8.5-1.0-SNAPSHOT.jar com.qcloud.cos.hadoop.distchecker.App hdfs:// hdfs:// cosn://hdfs-test-1250000000/benchmarks cosn://hdfs-test-1250000000/check_result

    Hadoop-cos-DistChecker reads the source file list and the source directory, then runs a MapReduce job to perform a distributed check. The final check result is written to cosn://hdfs-test-1250000000/check_result.
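
    To review the argument order before submitting a job, the command can be assembled and printed first. The sketch below assumes bash. `print_distchecker_cmd` is a hypothetical wrapper, not part of the tool, and it only prints the command. It follows the order used in the example above: file list, source directory, destination directory, result path.

```shell
# Hypothetical wrapper (not part of the tool): prints the submit command
# instead of running it, so the argument order can be reviewed first.
JAR="hadoop-cos-distchecker-2.8.5-1.0-SNAPSHOT.jar"
MAIN_CLASS="com.qcloud.cos.hadoop.distchecker.App"

print_distchecker_cmd() {
  if [ "$#" -lt 4 ]; then
    echo "usage: print_distchecker_cmd <check_list> <source_dir> <dest_dir> <result_dir> [hadoop options]" >&2
    return 2
  fi
  # Order follows the example: list, source, destination, result, then options.
  printf 'hadoop jar %s %s' "$JAR" "$MAIN_CLASS"
  printf ' %s' "$@"
  printf '\n'
}
```

    Once the printed command looks right, it can be copied and run as is.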

    The check report is as follows:

    hdfs://       hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO,None,None,None,SUCCESS,'The source file and the target file are the same.'
    hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control,None,None,None,SUCCESS,'The source file and the target file are the same.'
    hdfs://  hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control/in_file_test_io_0,CRC64,1566310986176587838,1566310986176587838,SUCCESS,'The source file and the target file are the same.'
    hdfs://  hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control/in_file_test_io_1,CRC64,-6584441696534676125,-6584441696534676125,SUCCESS,'The source file and the target file are the same.'
    hdfs://       hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data,None,None,None,SUCCESS,'The source file and the target file are the same.'
    hdfs://     hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data/test_io_0,CRC64,3534425600523290380,3534425600523290380,SUCCESS,'The source file and the target file are the same.'
    hdfs://     hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data/test_io_1,CRC64,3534425600523290380,3534425600523290380,SUCCESS,'The source file and the target file are the same.'
    hdfs://      hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write,None,None,None,SUCCESS,'The source file and the target file are the same.'
    hdfs://     hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write/_SUCCESS,CRC64,0,0,SUCCESS,'The source file and the target file are the same.'
    hdfs://   hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write/part-00000,CRC64,-4804567387993776854,-4804567387993776854,SUCCESS,'The source file and the target file are the same.'

    Check report format

    The check report is in the following format:

    Source file path in `check_list.txt`, absolute path of the source file, absolute path of the destination file, checksum algorithm, checksum of the source file, checksum of the destination file, check result, result description

    There are 7 check results:

    • SUCCESS: the source and destination files exist and are the same.
    • MISMATCH: the source and destination files exist but are different.
    • UNCONFIRM: the system cannot determine whether the source and destination files are the same. This may be because the destination file already existed in COS before the CRC64 feature was launched, and thus its CRC64 checksum cannot be obtained.
    • UNCHECKED: the check is not performed. This is mainly because the source file cannot be read, or its checksum cannot be computed.
    • SOURCE_FILE_MISSING: the source file does not exist.
    • TARGET_FILE_MISSING: the destination file does not exist.
    • TARGET_FILESYSTEM_ERROR: the destination file system is not CosN.
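
    For a large report, counting the lines per check result and listing only the entries that did not succeed can help. The sketch below assumes bash and awk. It also assumes that no path contains a comma, so that the check result is the sixth comma-separated field of each report line. `summarize_results` and `failed_checks` are hypothetical helper names.

```shell
# Count report lines per check result. Reads report lines on stdin.
# Assumption: no path contains a comma, so the result is field 6.
summarize_results() {
  awk -F',' 'NF >= 6 { count[$6]++ } END { for (r in count) print r, count[r] }' | LC_ALL=C sort
}

# Print only the report lines whose check result is not SUCCESS.
failed_checks() {
  awk -F',' 'NF >= 6 && $6 != "SUCCESS"'
}
```

    The report text could be piped in from the result path, for instance with `hadoop fs -cat`. The names of the files under that path are not stated in this document.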


    Why is there a negative CRC64 checksum in the check report?

    A CRC64 checksum is an unsigned 64-bit value and may contain up to 20 decimal digits, which exceeds the range of the signed Java long type. The unsigned value and the long share the same underlying bytes, so a checksum of 2^63 or above is printed as a negative number.
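
    To compare a negative value from the report with an unsigned CRC64 shown elsewhere, the value can be reinterpreted as unsigned. The sketch below assumes bash, whose printf %u wraps a negative argument modulo 2^64. `crc64_to_unsigned` is a hypothetical helper name. As a worked value, -6584441696534676125 corresponds to 2^64 - 6584441696534676125 = 11862302377174875491.

```shell
# Reinterpret a signed 64-bit decimal value as unsigned.
# Relies on bash's printf %u wrapping negative arguments modulo 2^64.
crc64_to_unsigned() {
  printf '%u\n' "$1"
}
```

    Non-negative values pass through unchanged, so the helper can be applied to every checksum in a report.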
