Hadoop-cos-DistChecker tool

Last updated: 2020-03-03 09:51:04


Function description

Hadoop-cos-DistChecker is a tool for verifying the integrity of a directory after it has been migrated from HDFS to COS with the `hadoop distcp` command. Because it is built on the parallelism of MapReduce, it can quickly check and compare the source directory against the destination directory.


Hadoop-cos-DistChecker requires the latest version of hadoop-cos (GitHub tag 5.8.2 or above), which supports retrieving CRC64 checksums.


Parameter descriptions

List of source file paths

The source file list is exported by the user with `hadoop fs -ls -R hdfs://host:port/{source_dir} | awk '{print $8}' > check_list.txt`, which writes the subdirectories and files to be checked into check_list.txt. An example of the format is as follows:
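For illustration, the exported list might look like the following (these paths are reconstructed from the sample check report later in this document, not taken from a real cluster):

```
/benchmarks/TestDFSIO
/benchmarks/TestDFSIO/io_control
/benchmarks/TestDFSIO/io_control/in_file_test_io_0
/benchmarks/TestDFSIO/io_control/in_file_test_io_1
/benchmarks/TestDFSIO/io_data
/benchmarks/TestDFSIO/io_data/test_io_0
/benchmarks/TestDFSIO/io_data/test_io_1
/benchmarks/TestDFSIO/io_write
/benchmarks/TestDFSIO/io_write/_SUCCESS
/benchmarks/TestDFSIO/io_write/part-00000
```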


Source Directory

The directory where the entries in the source file list reside. This is usually the same as the source path of the `distcp` command that migrated the data. For example, in `hadoop distcp hdfs://host:port/source_dir cosn://bucket-appid/dest_dir`, `hdfs://host:port/source_dir` is the source directory.

This path is also the common parent directory of the paths in the source file list. For example, the common parent directory of the source file list above is `/benchmarks`.
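To make the role of the common parent directory concrete, the following sketch (a hypothetical Python helper, not part of the tool) shows how each entry in the list can be resolved against the source directory and mapped onto the destination directory:

```python
import posixpath

def map_to_destination(entry, source_dir, dest_dir):
    """Map a check_list.txt entry onto the destination directory.

    The entry is taken relative to the source directory, and the
    relative part is then appended to the destination directory.
    """
    rel = posixpath.relpath(entry, start=source_dir)
    return posixpath.join(dest_dir, rel)

# The common parent directory of the list above is /benchmarks.
print(map_to_destination("/benchmarks/TestDFSIO/io_data/test_io_0",
                         "/benchmarks",
                         "cosn://bucket-appid/dest_dir"))
# cosn://bucket-appid/dest_dir/TestDFSIO/io_data/test_io_0
```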

Destination directory

The directory to compare the source directory against.

Command line format

Hadoop-cos-DistChecker is a MapReduce program and is submitted like any other MapReduce job:

hadoop jar hadoop-cos-distchecker-2.8.5-1.0-SNAPSHOT.jar com.qcloud.cos.hadoop.distchecker.App <absolute path of the source file list> <absolute path of the source directory> <absolute path of the destination directory> [optional Hadoop parameters]

Usage steps

The following uses the verification of hdfs:// against cosn://hdfs-test-1250000000/benchmarks as an example to walk through the steps of using the tool.

First, execute `hadoop fs -ls -R hdfs:// | awk '{print $8}' > check_list.txt` to export the source paths to be checked into a check_list.txt file, which holds the list of source file paths:

Then, upload check_list.txt to HDFS: `hadoop fs -put check_list.txt hdfs://`;

Finally, execute Hadoop-cos-DistChecker to compare hdfs:// with cosn://hdfs-test-1250000000/benchmarks and write the results under the cosn://hdfs-test-1250000000/check_result path. The command format is as follows:

hadoop jar hadoop-cos-distchecker-2.8.5-1.0-SNAPSHOT.jar com.qcloud.cos.hadoop.distchecker.App hdfs:// hdfs:// cosn://hdfs-test-1250000000/benchmarks cosn://hdfs-test-1250000000/check_result

DistChecker reads the source file list and the source directory, runs a MapReduce job to perform the check in a distributed manner, and writes the final check report under the cosn://bucket-appid/check_result path.

The inspection report is as follows:

hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO,None,None,None,SUCCESS,'The source file and the target file are the same.'
hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control,None,None,None,SUCCESS,'The source file and the target file are the same.'
hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control/in_file_test_io_0,MD5,dee27f089393936ef42dbd3ebd85750b,dee27f089393936ef42dbd3ebd85750b,SUCCESS,'The source file and the target file are the same.'
hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control/in_file_test_io_1,MD5,526560d99bd99476e5a8e68f0ce87326,526560d99bd99476e5a8e68f0ce87326,SUCCESS,'The source file and the target file are the same.'
hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data,None,None,None,SUCCESS,'The source file and the target file are the same.'
hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data/test_io_0,CRC64,-1057373059199797567,-1057373059199797567,SUCCESS,'The source file and the target file are the same.'
hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data/test_io_1,CRC64,-1057373059199797567,-1057373059199797567,SUCCESS,'The source file and the target file are the same.'
hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write,None,None,None,SUCCESS,'The source file and the target file are the same.'
hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write/_SUCCESS,MD5,d41d8cd98f00b204e9800998ecf8427e,d41d8cd98f00b204e9800998ecf8427e,SUCCESS,'The source file and the target file are the same.'
hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write/part-00000,MD5,5f91c70529f8c9974bf7730c024c867f,5f91c70529f8c9974bf7730c024c867f,SUCCESS,'The source file and the target file are the same.'

Check report format

The inspection report is displayed in the following format:

Source file path in check_list.txt [tab] absolute path of the source file, absolute path of the destination file, checksum algorithm, checksum of the source file, checksum of the destination file, check result, check result description
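As a sketch of how these fields line up, the following Python snippet (hypothetical, not part of the tool) splits one report line into named fields, assuming the first field is separated by whitespace and the rest are comma-separated, as in the sample report above:

```python
def parse_report_line(line):
    """Split a DistChecker report line into its named fields.

    The path from check_list.txt is separated from the rest by
    whitespace; the remaining seven fields are comma-separated.
    """
    listed_path, rest = line.split(None, 1)
    fields = rest.split(",", 6)
    keys = ["source_path", "dest_path", "algorithm",
            "source_checksum", "dest_checksum", "result", "description"]
    return {"listed_path": listed_path, **dict(zip(keys, fields))}

line = ("hdfs://\thdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data/test_io_0,"
        "CRC64,-1057373059199797567,-1057373059199797567,SUCCESS,"
        "'The source file and the target file are the same.'")
record = parse_report_line(line)
print(record["result"])  # SUCCESS
```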

The check results fall into the following seven categories:

  • SUCCESS: both the source file and the destination file exist and are identical.
  • MISMATCH: both the source file and the destination file exist, but they are not identical.
  • UNCONFIRM: it cannot be confirmed whether the source and destination files are identical, usually because the file on COS was uploaded before the COS checksum feature was activated, so its CRC64 checksum cannot be obtained.
  • UNCHECKED: the file was not checked, usually because the source file could not be read or its checksum could not be obtained.
  • SOURCE_FILE_MISSING: the source file does not exist.
  • TARGET_FILE_MISSING: the destination file does not exist.
  • TARGET_FILESYSTEM_ERROR: the destination file system is not a CosN file system.


FAQ

1. Why does the check report show a negative CRC64 value?

Because a CRC64 checksum can be a 20-digit number that exceeds the range of Java's signed long type. The underlying bytes are still correct, but the value appears negative when printed as a long.
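For example, the negative CRC64 value from the sample report above corresponds to the following unsigned 64-bit value (a quick check in Python):

```python
# Java prints CRC64 as a signed 64-bit long; the unsigned value is
# recovered by reducing modulo 2**64 (the underlying bytes are identical).
signed = -1057373059199797567          # value from the sample check report
unsigned = signed % (1 << 64)
print(unsigned)  # 17389371014509754049
```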

2. Why are there both MD5 and CRC64 checksums?

At present, COS simple uploads still use MD5 as the checksum, while the CRC64 checksum is used only for files created by multipart upload. Therefore, different checksum algorithms are used to verify these two types of files.
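As a spot check, the MD5 values in the report can be reproduced locally. For instance, the `_SUCCESS` file in the sample report is empty, and its reported MD5 matches the MD5 of zero bytes:

```python
import hashlib

# MD5 of an empty file, matching the _SUCCESS entry in the sample report.
print(hashlib.md5(b"").hexdigest())  # d41d8cd98f00b204e9800998ecf8427e
```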