Submitting MapReduce Tasks
Last updated: 2019-07-26 17:44:32
This guide describes how to perform basic MapReduce task operations on the command line and how to let MapReduce tasks access data stored in COS. For more information, see the community documentation.
- The task submitted is a wordcount task. To count the words in a file, you need to upload the specified file in advance.
- The path of relevant software such as Hadoop is
- The relevant logs are stored in
1. Preparations for Development
You need to create a bucket in COS for this task.
Confirm that you have activated Tencent Cloud and created an EMR cluster. When creating the EMR cluster, select "Enable COS" on the basic configuration page and enter your SecretId and SecretKey, which can be found on the API Key Management page. If you don’t have a key yet, click Create Key to create one.
2. Logging in to an EMR Server
You need to log in to any server in the EMR cluster first before performing the relevant operations. A master node is recommended for this step.
EMR is built on CVM instances running Linux; therefore, using EMR in command-line mode requires logging in to a CVM instance.
After creating the EMR cluster, select Elastic MapReduce in the console, find the cluster you just created in the cluster list, click a CVM instance ID in Details > Node Info > Master Nodes > Active Master Nodes to enter the CVM Console, and find the instance of the EMR cluster.
For more information about how to log in to a CVM instance, see Logging in to a Linux Instance. Here, you can choose to log in using WebShell. Click Login on the right of the desired CVM instance to enter the login page. The default username is root, and the password is the one you set when creating the EMR cluster.
Once your credentials have been validated, you can access the EMR command-line interface. All Hadoop operations are performed as the hadoop user. Because you are logged in as root by default when you log in to the EMR server, you need to switch to the hadoop user. Run the following commands to switch users and go to the Hadoop folder:
[root@172 ~]# su hadoop
[hadoop@172 root]$ cd /usr/local/service/hadoop
[hadoop@172 hadoop]$
3. Data Preparations
You need to prepare a text file for counting. There are two ways to do so: storing data in an HDFS cluster and storing data in COS.
The first step is to upload a local file to the CVM instance of the EMR cluster using the scp or sftp service. Run the following command on the local command line:
scp $localfile root@public IP address:$remotefolder
Here, $localfile is the path and the name of your local file; root is the CVM instance username. You can look up the public IP address in the node information in the EMR or CVM Console. $remotefolder is the path where you want to store the file in the CVM instance.
After the upload is completed, you can check whether the file is in the corresponding folder on the command line of the EMR cluster.
[hadoop@172 hadoop]$ ls -l
Storing Data in HDFS
After uploading the data to the CVM instance, you can copy it to the HDFS cluster. The README.txt file in the /usr/local/service/hadoop directory is used here as an example. Copy the file to the Hadoop cluster by running the following command:
[hadoop@172 hadoop]$ hadoop fs -put README.txt /user/hadoop/
After the copy is completed, run the following command to view the copied file:
[hadoop@172 hadoop]$ hadoop fs -ls /user/hadoop
Output:
-rw-r--r-- 3 hadoop supergroup 1366 2018-06-28 11:39 /user/hadoop/README.txt
If there is no /user/hadoop folder in HDFS, you can create it by running the following command:
[hadoop@172 hadoop]$ hadoop fs -mkdir -p /user/hadoop
See Common HDFS Operations for more Hadoop commands.
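The HDFS upload steps above can be combined into one helper. This is a minimal sketch, not part of the EMR tooling: the function name put_to_hdfs and the HADOOP_CMD variable are hypothetical names chosen for illustration, and it assumes the hadoop command is on the PATH of the hadoop user.

```shell
# HADOOP_CMD is a hypothetical override for this sketch; it defaults to `hadoop`.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# Upload a local file to an HDFS directory, creating the directory if needed.
put_to_hdfs() {
    local_file="$1"
    hdfs_dir="$2"
    # `fs -test -d` exits with 0 when the path exists and is a directory.
    if ! "$HADOOP_CMD" fs -test -d "$hdfs_dir"; then
        "$HADOOP_CMD" fs -mkdir -p "$hdfs_dir" || return 1
    fi
    "$HADOOP_CMD" fs -put "$local_file" "$hdfs_dir/"
}

# Example (run as the hadoop user on a cluster node):
# put_to_hdfs README.txt /user/hadoop
```

Checking for the directory first keeps the helper safe to rerun; the upload itself still fails if the file already exists in HDFS, which is the default behavior of -put.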
Storing Data in COS
There are two ways to store data in COS: uploading via the COS Console from the local file system and uploading via Hadoop command from the EMR cluster.
If you uploaded the file via the COS Console from the local file system and it is already in COS, you can view it by running the following command:
[hadoop@10 hadoop]$ hadoop fs -ls cosn://$bucketname/README.txt
-rw-rw-rw- 1 hadoop hadoop 1366 2017-03-15 19:09 cosn://$bucketname/README.txt
Replace $bucketname with the name and path of your bucket.
To upload via Hadoop command from the EMR cluster, run the following command:
[hadoop@10 hadoop]$ hadoop fs -put README.txt cosn://$bucketname/
[hadoop@10 hadoop]$ bin/hadoop fs -ls cosn://$bucketname/README.txt
-rw-rw-rw- 1 hadoop hadoop 1366 2017-03-15 19:09 cosn://$bucketname/README.txt
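A cosn:// path must not contain spaces between the scheme, the bucket name, and the object key. The following sketch builds the path from its parts so that such slips are harder to make. The function name cos_uri and the bucket name examplebucket are hypothetical and used only for illustration.

```shell
# Build a cosn:// URI from a bucket name and an object key.
cos_uri() {
    bucket="$1"
    key="${2#/}"    # drop one leading slash from the key, if present
    printf 'cosn://%s/%s\n' "$bucket" "$key"
}

# Example with a hypothetical bucket name:
# hadoop fs -put README.txt "$(cos_uri examplebucket README.txt)"
# hadoop fs -ls "$(cos_uri examplebucket README.txt)"
```

Passing an empty key yields the bucket root, which matches the upload target used in the command above.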
4. Submitting a Task via MapReduce
The task submitted here is the wordcount example that comes with the Hadoop cluster. It is already packaged as a .jar file on the created Hadoop cluster and can be called directly.
Counting a Text File in HDFS
Go to the /usr/local/service/hadoop directory as described in data preparations, and submit the task by running the following command:
[hadoop@10 hadoop]$ bin/yarn jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/hadoop/README.txt /user/hadoop/output
In the complete command above, /user/hadoop/README.txt is the input file to be processed, and /user/hadoop/output is the output folder. Make sure that the output folder does not exist before submitting the command; otherwise, the submission will fail.
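Because the submission fails when the output folder already exists, it can help to check for it first. The following is a minimal sketch under the assumption that it is run from the /usr/local/service/hadoop directory; the function name output_is_free and the HADOOP_CMD variable are hypothetical.

```shell
# HADOOP_CMD is a hypothetical override for this sketch; it defaults to bin/hadoop.
HADOOP_CMD="${HADOOP_CMD:-bin/hadoop}"

# Return 0 if the output path is absent and the job can be submitted,
# or 1 (with a message on stderr) if the path already exists.
output_is_free() {
    # `fs -test -e` exits with 0 when the path exists.
    if "$HADOOP_CMD" fs -test -e "$1"; then
        echo "Output path $1 already exists; remove it or choose another path." >&2
        return 1
    fi
    return 0
}

# Example:
# output_is_free /user/hadoop/output && \
#   bin/yarn jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar \
#     wordcount /user/hadoop/README.txt /user/hadoop/output
# To discard an old result first: bin/hadoop fs -rm -r /user/hadoop/output
```

The sketch only reports the conflict and leaves the deletion to you, since removing an earlier result automatically could discard data you still need.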
After the execution is completed, view the output file by running the following command:
[hadoop@10 hadoop]$ bin/hadoop fs -ls /user/hadoop/output
Found 2 items
-rw-r--r-- 3 hadoop supergroup 0 2017-03-15 19:52 /user/hadoop/output/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 1306 2017-03-15 19:52 /user/hadoop/output/part-r-00000
View the statistics in part-r-00000 by running the following command:
[hadoop@10 hadoop]$ bin/hadoop fs -cat /user/hadoop/output/part-r-00000
(BIS), 1
(ECCN) 1
(TSU) 1
(see 1
5D002.C.1, 1
740.13) 1
<http://www.wassenaar.org/> 1
……
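To sanity-check the statistics for a small file, you can compute a comparable count locally. The sketch below is an approximation built from standard Unix tools, not the Hadoop example itself: it assumes that wordcount splits the text on whitespace only (which is why tokens such as "(BIS)," keep their punctuation) and that the output is sorted by token. The function name local_wordcount is hypothetical.

```shell
# Count whitespace-separated tokens in a local file and print
# "token<TAB>count" lines, sorted by token in byte order.
local_wordcount() {
    tr -s '[:space:]' '\n' < "$1" |      # one token per line
        grep -v '^$' |                   # drop an empty first line, if any
        LC_ALL=C sort |                  # group identical tokens together
        LC_ALL=C uniq -c |               # prefix each token with its count
        awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Example:
# local_wordcount README.txt | head
```

For the small README.txt used here a local count is quick; for large inputs the MapReduce job is the appropriate tool.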
Counting a Text File in COS
Go to the /usr/local/service/hadoop directory and submit the task by running the following command:
[hadoop@10 hadoop]$ bin/yarn jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount cosn://$bucketname/README.txt /user/hadoop/output
The input file for the command is changed to cosn://$bucketname/README.txt, which means that the file in COS is processed, where $bucketname is your bucket name and path. The result is still written to the HDFS cluster, but you can also choose to write it to COS. The way to view the output is the same as above.
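The text mentions that the result can also be written to COS but does not show the command. The sketch below prints a candidate submission command for review, assuming the output argument accepts a cosn:// path in the same way as the input argument. The function name wordcount_cmd, the EXAMPLES_JAR variable, and the bucket name examplebucket are hypothetical.

```shell
# Path of the examples jar, relative to /usr/local/service/hadoop.
EXAMPLES_JAR="./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar"

# Print the wordcount submission command for an input and an output path.
wordcount_cmd() {
    printf 'bin/yarn jar %s wordcount %s %s\n' "$EXAMPLES_JAR" "$1" "$2"
}

# Example with a hypothetical bucket name: read from COS, write back to COS.
# wordcount_cmd cosn://examplebucket/README.txt cosn://examplebucket/output
```

Printing the command instead of running it lets you check both paths before submitting; as with HDFS, the output path should not exist beforehand.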
Viewing Task Logs
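Run the following commands to check the status of a job and to view the logs of an application, where jobid is the job ID and id is the application ID:
bin/mapred job -status jobid
yarn logs -applicationId id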
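A MapReduce job ID (job_...) and the YARN application ID of the same job (application_...) share the same numeric suffix, so one can be derived from the other. The following sketch shows the conversion; the function name job_to_app_id and the job ID in the example are made up for illustration.

```shell
# Convert a MapReduce job ID to the matching YARN application ID.
job_to_app_id() {
    case "$1" in
        job_*) printf 'application_%s\n' "${1#job_}" ;;
        *)     echo "not a job ID: $1" >&2; return 1 ;;
    esac
}

# Example with a made-up job ID:
# bin/mapred job -status job_1489579200000_0001
# yarn logs -applicationId "$(job_to_app_id job_1489579200000_0001)"
```

The job ID is printed in the console output when the task is submitted, so it is usually the easier of the two IDs to find.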