A YARN RM crash: a case study

Today we received an alert for the production ResourceManager (RM).

The error output was as follows:

2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:xxxx:53356 Timed out after 600 secs
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node xxxx:53356 as it is now LOST
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: xxxx:53356 Node Transitioned from UNHEALTHY to LOST
2014-07-08 13:22:54,118 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:715)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:974)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:108)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:378)
        at java.lang.Thread.run(Thread.java:662)
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1000
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2000

This is a known bug: https://issues.apache.org/jira/browse/YARN-502

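According to the bug description, it can be triggered when the RM removes a NodeManager (NM) that is marked UNHEALTHY: the node has already been removed from the scheduler once, so a later removal of the same node fails. That matches the log above, where the node goes from UNHEALTHY to LOST immediately before the NullPointerException.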
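The failure mode can be reproduced in miniature. The sketch below is not the FairScheduler code: `ToyScheduler`, its `nodeMemoryMb` map, and the memory counter are hypothetical names chosen for illustration. It only shows why removing the same node twice ends in a NullPointerException when the second lookup returns null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a scheduler's node bookkeeping; not Hadoop code.
final class ToyScheduler {
    private final Map<String, Integer> nodeMemoryMb = new HashMap<>();
    private int clusterMemoryMb = 0;

    void addNode(String nodeId, int memoryMb) {
        nodeMemoryMb.put(nodeId, memoryMb);
        clusterMemoryMb += memoryMb;
    }

    // Assumes the node is still tracked. A second call for the same node
    // gets null back from remove() and fails while unboxing it.
    void removeNode(String nodeId) {
        Integer memoryMb = nodeMemoryMb.remove(nodeId);
        clusterMemoryMb -= memoryMb; // NullPointerException when memoryMb is null
    }

    int clusterMemoryMb() {
        return clusterMemoryMb;
    }

    public static void main(String[] args) {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.addNode("xxxx:53356", 8192);
        scheduler.removeNode("xxxx:53356");     // first removal (node became UNHEALTHY)
        try {
            scheduler.removeNode("xxxx:53356"); // second removal (node became LOST)
        } catch (NullPointerException e) {
            System.out.println("second removal failed with NullPointerException");
        }
    }
}
```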

Following the stack trace into the code:

org.apache.hadoop.yarn.server.resourcemanager.ResourceManager (inner class SchedulerEventDispatcher, per the stack trace):

  protected ResourceScheduler scheduler;

    private final class EventProcessor implements Runnable { // the EventProcessor thread handles scheduler events
      @Override
      public void run() {
        SchedulerEvent event;
        while (!stopped && !Thread.currentThread().isInterrupted()) {
          try {
            event = eventQueue.take();  // take the next event from the event queue
          } catch (InterruptedException e) {
            LOG.error("Returning, interrupted : " + e);
            return; // TODO: Kill RM.
          }
          try {
            scheduler.handle(event); // handle the event
          } catch (Throwable t) { // catch anything thrown while handling the event
            // An error occurred, but we are shutting down anyway.
            // If it was an InterruptedException, the very act of
            // shutdown could have caused it and is probably harmless.
            if (stopped) {
              LOG.warn("Exception during shutdown: ", t);
              break;
            }
            LOG.fatal("Error in handling event type " + event.getType() // per the log, event.getType() here is NODE_REMOVED
                + " to the scheduler", t);
            if (shouldExitOnError
                && !ShutdownHookManager.get().isShutdownInProgress()) {
              LOG.info("Exiting, bbye..");
              System.exit(-1);
            }
          }
        }
      }
    }

Here we can see that shouldExitOnError controls whether the RM exits when event handling fails.
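The effect of that flag can be sketched without any Hadoop classes. In the simplified loop below, `ToyEventLoop`, `drain`, and the `onFatal` hook (a stand-in for `System.exit(-1)`) are hypothetical names, and events are plain `Runnable`s. This is an illustration of the control flow only, not the dispatcher's implementation.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified event loop; not Hadoop code.
final class ToyEventLoop {
    // Handles queued events in order and returns how many succeeded.
    // When exitOnError is true, the first failure runs onFatal and stops the loop;
    // when it is false, the failure is skipped and the loop keeps going.
    static int drain(Queue<Runnable> events, boolean exitOnError, Runnable onFatal) {
        int handled = 0;
        while (!events.isEmpty()) {
            Runnable event = events.poll();
            try {
                event.run();
                handled++;
            } catch (RuntimeException e) {
                if (exitOnError) {
                    onFatal.run(); // stand-in for System.exit(-1)
                    return handled;
                }
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        Queue<Runnable> events = new ArrayDeque<>();
        events.add(() -> { });
        events.add(() -> { throw new NullPointerException(); });
        events.add(() -> { });
        int handled = drain(events, true, () -> System.out.println("Exiting"));
        System.out.println("handled before exit: " + handled);
    }
}
```

With the flag on, one bad event stops everything that was queued behind it, which is why a single NODE_REMOVED failure took the whole RM down.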

private boolean shouldExitOnError = false; // initially false

    @Override
    public synchronized void init(Configuration conf) {  // during init, the value is read from the configuration
      this.shouldExitOnError =
          conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
            Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR); // key and default are defined in Dispatcher
      super.init(conf);
    }

The org.apache.hadoop.yarn.event.Dispatcher interface:

public interface Dispatcher {
  // Configuration to make sure dispatcher crashes but doesn't do system-exit in
  // case of errors. By default, it should be false, so that tests are not
  // affected. For all daemons it should be explicitly set to true so that
  // daemons can crash instead of hanging around.
  public static final String DISPATCHER_EXIT_ON_ERROR_KEY =
      "yarn.dispatcher.exit-on-error"; // the controlling key
  public static final boolean DEFAULT_DISPATCHER_EXIT_ON_ERROR = false; // defaults to false

  EventHandler getEventHandler();

  void register(Class<? extends Enum> eventType, EventHandler handler);
}

In the init method of the ResourceManager class:

 @Override
  public synchronized void init(Configuration conf) {
    this.conf = conf;
    this.conf.setBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY, true);  // for the RM the value is effectively true (overriding the DEFAULT in Dispatcher)

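In other words, by default the RM exits when the dispatcher hits an error.
Whether to exit on error is governed by the yarn.dispatcher.exit-on-error parameter. Changing that behavior has a wide impact, though, so it is better to leave it alone and fix the problem with a patch.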
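The precedence at work here (an explicit set wins over the fallback default) can be shown with a tiny key/value store. `ToyConf` below is a hypothetical miniature written for this note, not Hadoop's `Configuration` class; only the key string and the `false` default are taken from the `Dispatcher` code above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of a key/value configuration; not Hadoop's Configuration.
final class ToyConf {
    static final String EXIT_ON_ERROR_KEY = "yarn.dispatcher.exit-on-error";
    static final boolean DEFAULT_EXIT_ON_ERROR = false;

    private final Map<String, String> values = new HashMap<>();

    void setBoolean(String key, boolean value) {
        values.put(key, Boolean.toString(value));
    }

    // Returns the stored value, or the supplied default when the key is unset.
    boolean getBoolean(String key, boolean defaultValue) {
        String raw = values.get(key);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw);
    }

    public static void main(String[] args) {
        ToyConf conf = new ToyConf();
        // Key unset: the lookup falls back to the default.
        System.out.println(conf.getBoolean(EXIT_ON_ERROR_KEY, DEFAULT_EXIT_ON_ERROR));
        // Daemon-style init: set the key explicitly before anything reads it.
        conf.setBoolean(EXIT_ON_ERROR_KEY, true);
        System.out.println(conf.getBoolean(EXIT_ON_ERROR_KEY, DEFAULT_EXIT_ON_ERROR));
    }
}
```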

The official patch is quite simple: it adds a check when the RM removes an NM, so that the removal is not performed a second time:

--- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
+++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
@@ -501,8 +501,13 @@ public DeactivateNodeTransition(NodeState finalState) {
     public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
       // Inform the scheduler
       rmNode.nodeUpdateQueue.clear();
-      rmNode.context.getDispatcher().getEventHandler().handle(
-          new NodeRemovedSchedulerEvent(rmNode));
+      // If the current state is NodeState.UNHEALTHY
+      // Then node is already been removed from the
+      // Scheduler
+      if (!rmNode.getState().equals(NodeState.UNHEALTHY)) {
+        rmNode.context.getDispatcher().getEventHandler()
+          .handle( new NodeRemovedSchedulerEvent(rmNode));
+      }
       rmNode.context.getDispatcher().getEventHandler().handle(
           new NodesListManagerEvent(
               NodesListManagerEventType.NODE_UNUSABLE, rmNode));
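The idea behind the guard can be sketched as a small state machine. `ToyNode`, its three states, and the string event names are hypothetical simplifications. The assumption that becoming UNHEALTHY already notifies the scheduler is taken from the comment in the patch, not verified against the rest of RMNodeImpl.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of a node state machine; not RMNodeImpl.
final class ToyNode {
    enum State { RUNNING, UNHEALTHY, LOST }

    private State state = State.RUNNING;
    private final List<String> schedulerEvents = new ArrayList<>();

    // Assumption (from the patch comment): an unhealthy node is
    // removed from the scheduler at this point.
    void markUnhealthy() {
        schedulerEvents.add("NODE_REMOVED");
        state = State.UNHEALTHY;
    }

    // Guarded deactivation: skip the removal event if the node was
    // already removed when it became UNHEALTHY.
    void deactivate() {
        if (state != State.UNHEALTHY) {
            schedulerEvents.add("NODE_REMOVED");
        }
        state = State.LOST;
    }

    State state() {
        return state;
    }

    List<String> schedulerEvents() {
        return schedulerEvents;
    }

    public static void main(String[] args) {
        ToyNode node = new ToyNode();
        node.markUnhealthy();
        node.deactivate(); // UNHEALTHY -> LOST, no second NODE_REMOVED
        System.out.println(node.state() + " " + node.schedulerEvents());
    }
}
```

A node that goes straight from RUNNING to LOST still produces exactly one removal event, so the scheduler sees each node removed once on either path.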

This article is reposted from 菜菜光's 51CTO blog. Original link: http://blog.51cto.com/caiguangguang/1436087. To repost, please contact the original author.
