After migrating data from HDFS to COS by using the hadoop distcp command, you can use the Hadoop-cos-DistChecker tool to verify the integrity of the migrated directory. Based on the parallel processing capabilities of MapReduce, it can quickly check the source directory against the destination directory.
- For self-built Hadoop clusters, the Hadoop-COS dependency must be release 5.8.2 or later (see the GitHub releases) in order to return the CRC64 checksum.
- If you are using the Tencent Cloud EMR suite, the Hadoop-COS version above is available only for clusters created after May 8, 2020. For earlier clusters, please [submit a ticket](https://console.cloud.tencent.com/workorder/category) for assistance.
To run Hadoop-cos-DistChecker, the tool needs to obtain the CRC64 checksum of each object from Hadoop-COS (the COSN file system). Therefore, before running it, set fs.cosn.crc64.checksum.enabled to true so that Hadoop-COS returns the CRC64 checksum. Once the tool finishes, set this parameter back to false to stop returning CRC64 checksums.
The CRC64 checksum in Hadoop-COS is not compatible with the CRC32C checksum in HDFS, so after using this tool, be sure to restore the parameter to false. Otherwise, Hadoop-COS may fail to run in scenarios where the file system's getFileChecksum API is called.
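For example, once the parameter is enabled, a quick way to confirm that COSN returns a CRC64 checksum is to query a single migrated object with hadoop fs -checksum (a sketch for illustration only; the bucket and object path are placeholders, and the -D option overrides the property for this one command rather than for the whole cluster):
hadoop fs -Dfs.cosn.crc64.checksum.enabled=true -checksum cosn://examplebucket-appid/dest_dir/part-00000
For the checker job itself, the parameter is normally set in core-site.xml or passed to the job as an optional Hadoop parameter (see the command syntax below).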
Source file list: the list of subdirectories and files to be checked, which you can export by running the following command:
hadoop fs -ls -R hdfs://host:port/{source_dir} | awk '{print $8}' > check_list.txt
Its format is as follows:
/benchmarks/TestDFSIO
/benchmarks/TestDFSIO/io_control
/benchmarks/TestDFSIO/io_control/in_file_test_io_0
/benchmarks/TestDFSIO/io_control/in_file_test_io_1
/benchmarks/TestDFSIO/io_data
/benchmarks/TestDFSIO/io_data/test_io_0
/benchmarks/TestDFSIO/io_data/test_io_1
/benchmarks/TestDFSIO/io_write
/benchmarks/TestDFSIO/io_write/_SUCCESS
/benchmarks/TestDFSIO/io_write/part-00000
Source directory: the directory where the source files are stored, which usually serves as the source path for data migration through the distcp command. For example, hdfs://host:port/source_dir is the source directory in the following sample:
hadoop distcp hdfs://host:port/source_dir cosn://examplebucket-appid/dest_dir
This is also the common parent directory of the paths in the source file path list, such as /benchmarks in the sample above.
Destination directory: the destination directory to check against.
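To illustrate how these inputs fit together (a sketch based on the sample check report later in this document): for each entry in the source file list, the checker takes the path relative to the source directory and compares the file under the source directory with the file at the same relative path under the destination directory. For example, with the placeholder paths above:
Entry in the source file list:   /source_dir/subdir/file
Source file checked:             hdfs://host:port/source_dir/subdir/file
Destination file checked:        cosn://examplebucket-appid/dest_dir/subdir/file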
Hadoop-cos-DistChecker is a MapReduce-based program and can be submitted in the same way as an ordinary MapReduce job:
hadoop jar hadoop-cos-distchecker-2.8.5-1.0-SNAPSHOT.jar com.qcloud.cos.hadoop.distchecker.App <absolute path of the source file list> <absolute path of the source directory> <absolute path of the destination directory> <absolute path of the check result output directory> [optional parameters]
Here, [optional parameters] stands for optional Hadoop parameters, and the check result output directory is where the final report is written (see the example below).
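For example, assuming the [optional parameters] position accepts standard Hadoop -D property definitions, the CRC64 switch described above could be passed along with the job instead of being set in core-site.xml (a sketch; all paths are placeholders):
hadoop jar hadoop-cos-distchecker-2.8.5-1.0-SNAPSHOT.jar com.qcloud.cos.hadoop.distchecker.App hdfs://host:port/check_list.txt hdfs://host:port/source_dir cosn://examplebucket-appid/dest_dir cosn://examplebucket-appid/check_result -Dfs.cosn.crc64.checksum.enabled=true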
The example below describes how to use this tool by checking hdfs://10.0.0.3:9000/benchmarks against cosn://hdfs-test-1250000000/benchmarks.
First, run the following command:
hadoop fs -ls -R hdfs://10.0.0.3:9000/benchmarks | awk '{print $8}' > check_list.txt
This exports all the source paths to be checked into a check_list.txt file; its content follows the same format as the source file path list shown above.
Then, put check_list.txt into HDFS by running:
hadoop fs -put check_list.txt hdfs://10.0.0.3:9000/
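Optionally, you can confirm that the file was uploaded correctly before starting the check:
hadoop fs -cat hdfs://10.0.0.3:9000/check_list.txt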
Finally, run Hadoop-cos-DistChecker to check hdfs://10.0.0.3:9000/benchmarks against cosn://hdfs-test-1250000000/benchmarks and output the result to the cosn://hdfs-test-1250000000/check_result path, using the following command:
hadoop jar hadoop-cos-distchecker-2.8.5-1.0-SNAPSHOT.jar com.qcloud.cos.hadoop.distchecker.App hdfs://10.0.0.3:9000/check_list.txt hdfs://10.0.0.3:9000/benchmarks cosn://hdfs-test-1250000000/benchmarks cosn://hdfs-test-1250000000/check_result
Hadoop-cos-DistChecker will read the source file list and the source directory, and run a MapReduce job to perform the check in a distributed manner. The final check result is written to cosn://hdfs-test-1250000000/check_result.
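Because the report is produced by a MapReduce job, it is typically written as one or more part-* files under the output directory. Assuming the default MapReduce output file naming, you can view the combined report with:
hadoop fs -cat cosn://hdfs-test-1250000000/check_result/part-*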
The check report is as follows:
hdfs://10.0.0.3:9000/benchmarks/TestDFSIO hdfs://10.0.0.3:9000/benchmarks/TestDFSIO,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO,None,None,None,SUCCESS,'The source file and the target file are the same.'
hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_control hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_control,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control,None,None,None,SUCCESS,'The source file and the target file are the same.'
hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_control/in_file_test_io_0 hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_control/in_file_test_io_0,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control/in_file_test_io_0,CRC64,1566310986176587838,1566310986176587838,SUCCESS,'The source file and the target file are the same.'
hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_control/in_file_test_io_1 hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_control/in_file_test_io_1,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control/in_file_test_io_1,CRC64,-6584441696534676125,-6584441696534676125,SUCCESS,'The source file and the target file are the same.'
hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_data hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_data,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data,None,None,None,SUCCESS,'The source file and the target file are the same.'
hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_data/test_io_0 hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_data/test_io_0,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data/test_io_0,CRC64,3534425600523290380,3534425600523290380,SUCCESS,'The source file and the target file are the same.'
hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_data/test_io_1 hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_data/test_io_1,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data/test_io_1,CRC64,3534425600523290380,3534425600523290380,SUCCESS,'The source file and the target file are the same.'
hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_write hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_write,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write,None,None,None,SUCCESS,'The source file and the target file are the same.'
hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_write/_SUCCESS hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_write/_SUCCESS,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write/_SUCCESS,CRC64,0,0,SUCCESS,'The source file and the target file are the same.'
hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_write/part-00000 hdfs://10.0.0.3:9000/benchmarks/TestDFSIO/io_write/part-00000,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write/part-00000,CRC64,-4804567387993776854,-4804567387993776854,SUCCESS,'The source file and the target file are the same.'
The check report is in the following format:
Source file path in `check_list.txt`, absolute path of the source file, absolute path of the destination file, checksum algorithm, checksum of the source file, checksum of the destination file, check result, result description
The check result field can take 7 possible values; in the report above, SUCCESS indicates that the source file and the destination file are identical.
A CRC64 checksum can be up to 20 decimal digits long, which exceeds the range of the Java long type. However, the signed and unsigned interpretations share the same underlying bytes, so when the checksum is printed as a long value, it may appear negative.
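If you need to compare a negative value printed in the report with the 20-digit unsigned CRC64 produced by another tool, you can reinterpret the signed value as an unsigned 64-bit integer. For example, with Bash's printf builtin and one of the checksums from the report above:
printf '%u\n' -4804567387993776854
This prints 13642176685715774762, the unsigned interpretation of the same 8 bytes.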