Submitting MapReduce Tasks

Last updated: 2019-07-26 17:44:32

This operation guide describes how to (1) perform basic MapReduce task operations on the command line and (2) allow MapReduce tasks to access data stored in COS. For more information, see the community documentation.

  • The task submitted is a wordcount task. To count the words in a file, you need to upload the specified file in advance.
  • The path of relevant software such as Hadoop is /usr/local/service/.
  • The relevant logs are stored in /data/emr.

1. Preparations for Development

  • You need to create a bucket in COS for this task.

  • Confirm that you have activated Tencent Cloud and created an EMR cluster. When creating the EMR cluster, select "Enable COS" on the basic configuration page and enter your SecretId and SecretKey, which can be found on the API Key Management page. If you don’t have a key yet, click Create Key to create one.

2. Logging in to an EMR Server

You need to log in to any server in the EMR cluster first before performing the relevant operations. A master node is recommended for this step.
EMR is built on CVM instances running Linux; therefore, using EMR in command-line mode requires logging in to a CVM instance.

After creating the EMR cluster, select Elastic MapReduce in the console, find the cluster you just created in the cluster list, click a CVM instance ID in Details > Node Info > Master Nodes > Active Master Nodes to enter the CVM Console, and find the instance of the EMR cluster.

For more information about how to log in to a CVM instance, see Logging in to a Linux Instance. Here, you can choose to log in using WebShell. Click Login on the right of the desired CVM instance to enter the login page. The default username is root, and the password is the one you set when creating the EMR cluster.

Once your credentials have been validated, you can access the EMR command-line interface. All Hadoop operations are performed as the hadoop user. You are logged in as root by default, so you need to switch to the hadoop user. Run the following commands to switch users and go to the Hadoop folder:

[root@172 ~]# su hadoop
[hadoop@172 root]$ cd /usr/local/service/hadoop
[hadoop@172 hadoop]$

3. Data Preparations

You need to prepare a text file for counting. There are two ways to do so: storing data in an HDFS cluster and storing data in COS.

The first step is to upload a local file to the CVM instance of the EMR cluster using the scp or sftp service. Run the following command on the local command line:

scp $localfile root@<public IP address>:$remotefolder

Here, $localfile is the path and the name of your local file; root is the CVM instance username. You can look up the public IP address in the node information in the EMR or CVM Console. $remotefolder is the path where you want to store the file in the CVM instance.
After the upload is completed, you can check whether the file is in the corresponding folder on the command line of the EMR cluster.

[hadoop@172 hadoop]$ ls -l

Storing Data in HDFS

After uploading the data to the CVM instance, you can copy it to the HDFS cluster. The README.txt file in the /usr/local/service/hadoop directory is used here as an example. Copy the file to the Hadoop cluster by running the following command:

[hadoop@172 hadoop]$ hadoop fs -put README.txt /user/hadoop/

After the copy is completed, run the following command to view the copied file:

[hadoop@172 hadoop]$ hadoop fs -ls /user/hadoop
Output:
-rw-r--r-- 3 hadoop supergroup 1366 2018-06-28 11:39 /user/hadoop/README.txt

If there is no /user/hadoop folder in HDFS, you can create it by running the following command:

[hadoop@172 hadoop]$ hadoop fs -mkdir -p /user/hadoop

See Common HDFS Operations for more Hadoop commands.
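
The HDFS steps above can be combined into one small helper. This is a minimal sketch, assuming the hadoop command is on the PATH of the hadoop user; the function name stage_to_hdfs is hypothetical and not part of EMR or Hadoop.

```shell
# Hypothetical helper: copy a local file into an HDFS directory,
# creating the directory first if it does not exist yet.
stage_to_hdfs() {
    local src="$1" dest_dir="$2"
    # "hadoop fs -test -d" returns 0 only if the path exists and is a directory
    if ! hadoop fs -test -d "$dest_dir"; then
        hadoop fs -mkdir -p "$dest_dir"
    fi
    hadoop fs -put "$src" "$dest_dir/"
    # List the directory to confirm that the copy succeeded
    hadoop fs -ls "$dest_dir"
}

# Example (run as the hadoop user from /usr/local/service/hadoop):
# stage_to_hdfs README.txt /user/hadoop
```

Checking for the directory first keeps the helper safe to rerun; note that -put still fails if the target file already exists.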

Storing Data in COS

There are two ways to store data in COS: uploading via the COS Console from the local file system and uploading via Hadoop command from the EMR cluster.

  • If you uploaded the data file via the COS Console from the local file system and it is already in COS, you can view it by running the following command:

    [hadoop@10 hadoop]$ hadoop fs -ls cosn://$bucketname/README.txt
    -rw-rw-rw- 1 hadoop hadoop 1366 2017-03-15 19:09 cosn://$bucketname/README.txt

    Replace $bucketname with the name and path of your bucket.

  • To upload via Hadoop command from the EMR cluster, run the following command:

    [hadoop@10 hadoop]$ hadoop fs -put README.txt cosn://$bucketname/
    [hadoop@10 hadoop]$ bin/hadoop fs -ls cosn://$bucketname/README.txt
    -rw-rw-rw- 1 hadoop hadoop 1366 2017-03-15 19:09 cosn://$bucketname/README.txt
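
A stray space inside a cosn:// path makes the shell split it into separate arguments, so it can help to build the path in one place. The following is a minimal sketch, assuming the hadoop command is on the PATH; cos_uri and upload_to_cos are hypothetical helper names.

```shell
# Hypothetical helper: build a cosn:// URI from a bucket name and an object key.
cos_uri() {
    local bucket="$1" key="${2#/}"   # drop one leading slash from the key, if any
    printf 'cosn://%s/%s\n' "$bucket" "$key"
}

# Hypothetical helper: upload a local file to the bucket root and list it afterwards.
upload_to_cos() {
    local src="$1" bucket="$2"
    local dest
    dest="$(cos_uri "$bucket" "$(basename "$src")")"
    hadoop fs -put "$src" "$dest"
    hadoop fs -ls "$dest"
}

# Example:
# upload_to_cos README.txt "$bucketname"
```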

4. Submitting a Task via MapReduce

The task submitted here is the wordcount example that comes with Hadoop. It is already packaged as a .jar file and available on the EMR cluster, so you can call it directly.

Counting a Text File in HDFS

Go to the /usr/local/service/hadoop directory as described in data preparations, and submit the task by running the following command:

[hadoop@10 hadoop]$ bin/yarn jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/hadoop/README.txt /user/hadoop/output

Note:
In the complete command above, /user/hadoop/README.txt is the input file to be processed, and /user/hadoop/output is the output folder. Make sure that the output folder does not exist before submitting the command; otherwise, the submission will fail.
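
Because the submission fails when the output folder already exists, a rerun usually needs a cleanup step first. The following is a minimal sketch, assuming hadoop and yarn are on the PATH and that it is run from /usr/local/service/hadoop; run_wordcount and EXAMPLES_JAR are hypothetical names. The -rm -r step deletes data, so check the path before using it.

```shell
# The jar path is the one used in this guide; adjust the version to match your cluster.
EXAMPLES_JAR=./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar

# Hypothetical helper: remove a stale output folder, then submit wordcount.
run_wordcount() {
    local input="$1" output="$2"
    # "hadoop fs -test -e" returns 0 if the path exists
    if hadoop fs -test -e "$output"; then
        echo "Removing existing output folder: $output" >&2
        hadoop fs -rm -r "$output"
    fi
    yarn jar "$EXAMPLES_JAR" wordcount "$input" "$output"
}

# Example:
# run_wordcount /user/hadoop/README.txt /user/hadoop/output
```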

After the execution is completed, view the output file by running the following command:

[hadoop@10 hadoop]$ bin/hadoop fs -ls /user/hadoop/output
Found 2 items
-rw-r--r-- 3 hadoop supergroup 0 2017-03-15 19:52 /user/hadoop/output/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 1306 2017-03-15 19:52 /user/hadoop/output/part-r-00000

View the statistics in part-r-00000 by running the following command:

[hadoop@10 hadoop]$ bin/hadoop fs -cat /user/hadoop/output/part-r-00000
(BIS),    1
(ECCN)    1
(TSU)    1
(see    1
5D002.C.1,    1
740.13)    1
<http://www.wassenaar.org/>    1
……
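
The result file is sorted by word, not by count. To see the most frequent words, the output can be piped through sort. This is a minimal sketch, assuming the default tab-separated "word, count" layout of the wordcount output; top_words is a hypothetical helper name.

```shell
# Hypothetical helper: read wordcount output ("word<TAB>count" per line) on stdin
# and print the N most frequent words, highest count first (default N is 10).
top_words() {
    local n="${1:-10}"
    # Sort by the second tab-separated field, numerically and descending;
    # ties are broken by the word itself.
    sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "$n"
}

# Example:
# hadoop fs -cat /user/hadoop/output/part-r-00000 | top_words 5
```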

Counting a Text File in COS

Go to the /usr/local/service/hadoop directory and submit the task by running the following command:

[hadoop@10 hadoop]$ bin/yarn jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount cosn://$bucketname/README.txt /user/hadoop/output

The input file for the command is changed to cosn://$bucketname/README.txt, which tells the task to process the file in COS, where $bucketname is your bucket name and path. The result is still written to the HDFS cluster, but you can also write it to COS. If /user/hadoop/output still exists from the previous run, delete it or specify a different output folder first; otherwise, the submission will fail. You can view the output in the same way as above.
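
Writing the result to COS only requires a cosn:// path as the output argument. The following is a minimal sketch, assuming hadoop and yarn are on the PATH and that it is run from /usr/local/service/hadoop; wordcount_to_cos and the wordcount-output prefix are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: read the input from COS and write the result back to COS.
# "wordcount-output" is a placeholder prefix, not a name defined by this guide.
wordcount_to_cos() {
    local bucket="$1" out_prefix="${2:-wordcount-output}"
    yarn jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount \
        "cosn://$bucket/README.txt" "cosn://$bucket/$out_prefix"
    # List the result files written to COS
    hadoop fs -ls "cosn://$bucket/$out_prefix"
}

# Example:
# wordcount_to_cos "$bucketname"
```

As with HDFS, the output path in COS must not exist before the task is submitted.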

Viewing Task Logs

# View task status
bin/mapred job -status jobid
# View task logs
yarn logs -applicationId id
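
A MapReduce job ID and its YARN application ID share the same numeric suffix and differ only in the prefix (job_ versus application_), so one can be derived from the other. This is a minimal sketch, assuming mapred and yarn are on the PATH; job_to_app_id and show_job are hypothetical helper names, and the ID in the example is a made-up placeholder.

```shell
# Convert a MapReduce job ID (job_<timestamp>_<sequence>) into the matching
# YARN application ID (application_<timestamp>_<sequence>).
job_to_app_id() {
    printf '%s\n' "${1/#job_/application_}"
}

# Hypothetical helper: show the status and the logs for one job.
show_job() {
    local job_id="$1"
    mapred job -status "$job_id"
    yarn logs -applicationId "$(job_to_app_id "$job_id")"
}

# Example (placeholder ID):
# show_job job_1489562400000_0001
```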