You can log in to any EMR server and run the following command to view task logs:
yarn logs -applicationId application_1507732460084_0057
To view the cause of a task exception, run the following command:
yarn logs -applicationId application_1507732460084_0057 | grep -A20 Exception
Cluster computing resources are determined by the following two configuration items in yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>14745</value>
</property>
cpu-vcores is equal to the number of CPU cores of the server, and memory-mb is equal to 91% of the server's memory size. You can adjust both values to suit your actual needs, but setting them too high creates a risk of server failure.
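The rule above can be sketched as a small shell script that derives both values on a Linux server. This is only an illustration under stated assumptions: the helper names yarn_memory_mb and total_memory_mb are hypothetical, memory is read from /proc/meminfo, and the result is rounded down, so it may differ slightly from the value preconfigured on your cluster.

```shell
# Suggested yarn.nodemanager.resource.memory-mb for a server with $1 MB of RAM,
# using the 91% rule described above (integer arithmetic, rounded down).
yarn_memory_mb() {
  echo $(( $1 * 91 / 100 ))
}

# Total physical memory in MB, read from /proc/meminfo (Linux only).
total_memory_mb() {
  awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Print candidate values for the two yarn-site.xml properties on this server.
echo "yarn.nodemanager.resource.cpu-vcores = $(nproc)"
echo "yarn.nodemanager.resource.memory-mb  = $(yarn_memory_mb "$(total_memory_mb)")"
```

Treat the printed numbers as a starting point and lower them if other services on the same server also need memory.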
If an out-of-memory error occurs when you submit a MapReduce task or run an SQL script through Hive, fix it by setting the following parameters:
The memory values can be adjusted based on your actual computation needs. You can also write these settings in the ~/.hiverc file, which the Hive CLI runs automatically at startup.
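The parameter list itself is not reproduced above. As a minimal sketch, assuming the standard Hadoop MapReduce memory properties are the ones intended, the following shell function appends them to a Hive startup file. The function name append_hive_memory_settings is hypothetical, and sizing the JVM heap at 80% of the container is an assumption, not a value taken from this document.

```shell
# append_hive_memory_settings FILE CONTAINER_MB
# Appends Hive "set" statements for map and reduce memory to FILE.
# The JVM heap is sized at 80% of the container (an assumed rule of thumb).
append_hive_memory_settings() {
  local file container_mb heap_mb
  file=$1
  container_mb=$2
  heap_mb=$(( container_mb * 80 / 100 ))
  cat >> "$file" <<EOF
set mapreduce.map.memory.mb=${container_mb};
set mapreduce.map.java.opts=-Xmx${heap_mb}m;
set mapreduce.reduce.memory.mb=${container_mb};
set mapreduce.reduce.java.opts=-Xmx${heap_mb}m;
EOF
}

# Example call (4096 MB is an illustrative value, not a recommendation):
#   append_hive_memory_settings "$HOME/.hiverc" 4096
```

The same four set statements can also be typed directly at the Hive prompt before running the failing script, if you prefer not to change ~/.hiverc.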
Suppose that you need to run an SQL query. If 64 vcores and 128 GB of memory are needed to get the query result within the specified time period, and the business requires 10 concurrent queries, then the required resources are 640 vcores and 1,280 GB of memory. If the server specification you are using is 24 cores and 48 GB of memory, then you need around 1280 / 48 ≈ 27 servers (640 / 24 gives the same count for vcores).
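The estimate above can be reproduced with a short shell sketch that divides the total demand by the per-server capacity, rounds up, and takes the larger of the CPU-based and memory-based counts. The function names ceil_div and servers_needed are hypothetical, and the sketch assumes the full server specification is available to YARN, which slightly understates the real requirement.

```shell
# Ceiling division: smallest integer n such that n * $2 >= $1.
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# servers_needed TOTAL_VCORES TOTAL_MEM_GB VCORES_PER_SERVER MEM_GB_PER_SERVER
# Prints the server count that satisfies both the CPU and the memory demand.
servers_needed() {
  local by_cpu by_mem
  by_cpu=$(ceil_div "$1" "$3")
  by_mem=$(ceil_div "$2" "$4")
  if [ "$by_cpu" -gt "$by_mem" ]; then
    echo "$by_cpu"
  else
    echo "$by_mem"
  fi
}

# 10 concurrent queries, each needing 64 vcores and 128 GB,
# on servers with 24 cores and 48 GB each.
servers_needed $((64 * 10)) $((128 * 10)) 24 48   # prints 27
```

Taking the maximum of the two counts matters when the workload is skewed toward CPU or memory, because the scarcer resource then determines the cluster size.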
By default, Hive handles a simple query such as the following as a fetch task:
select * from tablename where a='1' limit 10;
This kind of query does not start a computation task. To start a distributed query, run the following statement first:
set hive.fetch.task.conversion=none;
An EMR cluster supports the following storage media: HDD local disk, SSD local disk, HDD cloud disk, SSD cloud disk, and COS. You can choose the most appropriate one based on your actual needs: