Failed Operations on EMR Master Node Due to Low Configuration

Last updated: 2020-05-15 17:26:14

    How do I fix failed operations on the EMR master node due to low configuration?


    As the master node's configuration is too low, Hive or Spark jobs submitted to it report errors or are directly killed.

    Cause analysis

    The master node does not have enough memory, so applications running on it are killed due to out-of-memory (OOM) errors.


    1. Too many services are deployed on the EMR master node, which then tends to become the bottleneck of the entire cluster. The master node cannot be scaled out; it can only be upgraded, as described below:
      • First, find the node where the standby NameNode resides in the cluster.
        • Run the following command on the standby NameNode to enter safe mode.
          hdfs dfsadmin -fs <standby node IP>:4007 -safemode enter   # Enter safe mode
        • Run the following command on the standby NameNode to save the metadata.
          hdfs dfsadmin -fs <standby node IP>:4007 -saveNamespace   # Save the metadata
        • Run the following command on the standby NameNode to exit safe mode.
          hdfs dfsadmin -fs <standby node IP>:4007 -safemode leave   # Exit safe mode
      • Then, in the EMR Console (or the CVM Console for a legacy cluster), upgrade the active node.
      • Upgrade the standby node so that the active and standby master nodes end up with the same configuration.

        If your cluster is not a high-availability (HA) cluster, it will be unavailable for a while during the upgrade.

    2. In Spark, jobs are submitted in client mode by default, so the driver runs on the master node. You can switch to cluster mode (--deploy-mode cluster) before submitting jobs so that the driver no longer runs on the master node.
    3. For the Hive component, enable the router node, migrate HiveServer2 to it, and then disable the Hive component on the master node. For detailed directions, please see Migrating HiveServer2 to Router.
    4. Disable components that are not commonly used on the master node or migrate Hue to the router node.
      Directions for migrating Hue to the router node:
      • Go to the EMR Console, add a router node on the Cloud Hardware Management page, and select the Hue component.
      • After the scale-out, disable the original Hue component on the master node, keep the one on the router node, bind a public EIP to the router node, and allow the required sources and ports in the security group.
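
Item 2 above is about keeping the Spark driver off the master node. In Spark this is controlled by the --deploy-mode option of spark-submit. The sketch below only prints such a command rather than running it. It assumes the cluster runs Spark on YARN (--master yarn), and the function name, class name, and JAR path are hypothetical placeholders, not values from this document.

```shell
#!/bin/sh
# Print (do not run) a spark-submit command that uses cluster deploy mode,
# so the driver is started inside the cluster instead of on the master node.
# ASSUMPTION: Spark runs on YARN; the class and JAR arguments are placeholders.
spark_cluster_submit_cmd() (
    main_class="$1"
    app_jar="$2"
    driver_memory="${3:-1g}"   # falls back to 1g when no third argument is given
    printf 'spark-submit --master yarn --deploy-mode cluster --driver-memory %s --class %s %s\n' \
        "$driver_memory" "$main_class" "$app_jar"
)

# Dry run for a hypothetical application.
spark_cluster_submit_cmd com.example.MyJob /path/to/my-job.jar 2g
```

To actually submit a job, run the printed command on a node where the Spark client is installed, replacing the placeholders with your own class and JAR.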

    Preset memory sizes for master node components in an EMR cluster, and recommendations

    1. List of heap memories of common components
      Component  Process          Configuration File   Configuration Item   Default Heap Memory (MB)
      HDFS       NameNode                              NNHeapsize           4,096
      YARN       ResourceManager                       Heapsize             2,000
      Hive       HiveServer2                           HS2Heapsize          4,096
      HBase      HMaster                               Heapsize             1,024
      Presto     Coordinator      jvm.config           Maximum JVM          3,072
      Spark      spark-driver     spark-defaults.conf  spark.driver.memory  1,024
      Oozie      Oozie            -                    -                    1,024
      Storm      Nimbus           -                    -                    1,024
    2. Suggested preset values for components
      Component               Suggested Heap Memory Size
      HDFS (NameNode)         Minimum heap memory = 250 x number of files + 290 x number of directories + 368 x number of blocks
      YARN (ResourceManager)  It can be increased as needed
      Hive (HiveServer2)      It can be increased as needed
      HBase (HMaster)         The master node only receives DDL requests and performs load balancing. The default size of 1 GB is generally sufficient
      Presto (Coordinator)    Use the default value
      Spark (spark-driver)    It can be increased as needed
      Oozie (oozie)           Use the default value
      Storm (Nimbus)          Use the default value
    3. Suggested idle memory size for servers: 10–20% of the total memory size.
    4. You can deploy EMR components in independent mode or hybrid mode as needed.
      • Independent deployment: suitable for HDFS clusters used for storage, HBase clusters used to analyze massive amounts of data, and Spark clusters used for job computation.
      • Hybrid deployment: multiple components are deployed in the same cluster. This is suitable for test clusters or for scenarios where the business volume is low or resource preemption is negligible.
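
The NameNode formula and the 10-20% idle memory suggestion above can be worked through with plain shell arithmetic. This is a rough sketch: it assumes the formula yields bytes (the table does not state a unit), treats 1 MB as 1024 x 1024 bytes, and uses hypothetical function names and a made-up namespace size.

```shell
#!/bin/sh
# Minimum NameNode heap from the formula in the table above:
#   250 x files + 290 x directories + 368 x blocks
# ASSUMPTION: the result is in bytes; the table does not give a unit.
nn_min_heap_bytes() (
    files="$1"; dirs="$2"; blocks="$3"
    echo $(( 250 * $files + 290 * $dirs + 368 * $blocks ))
)

# Convert bytes to MB (1 MB = 1024 * 1024 bytes here), rounding up.
bytes_to_mb_ceil() (
    bytes="$1"
    echo $(( ($bytes + 1048575) / 1048576 ))
)

# Memory to keep idle on a server: 10-20% of the total, in MB (rounded down).
idle_range_mb() (
    total_mb="$1"
    echo "$(( $total_mb * 10 / 100 ))-$(( $total_mb * 20 / 100 ))"
)

# Hypothetical namespace: 1,000,000 files, 100,000 directories, 1,200,000 blocks.
heap_bytes="$(nn_min_heap_bytes 1000000 100000 1200000)"
echo "minimum NameNode heap: ${heap_bytes} bytes (about $(bytes_to_mb_ceil "$heap_bytes") MB)"
echo "idle memory to keep on a 16384 MB server: $(idle_range_mb 16384) MB"
```

For this made-up namespace the estimate is 720,600,000 bytes, or about 688 MB, which is well under the 4,096 MB default listed for NNHeapsize. On a 16,384 MB server, 10-20% idle memory corresponds to roughly 1,638 to 3,276 MB.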
