A YARN ResourceManager crash: a case study
Today an alert fired for the production ResourceManager (RM).
The error log was as follows:
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:xxxx:53356 Timed out after 600 secs
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node xxxx:53356 as it is now LOST
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: xxxx:53356 Node Transitioned from UNHEALTHY to LOST
2014-07-08 13:22:54,118 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:715)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:974)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:108)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:378)
    at java.lang.Thread.run(Thread.java:662)
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1000
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2000
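This is a known bug, tracked at https://issues.apache.org/jira/browse/YARN-502
According to the bug description, it can be triggered when the RM removes a NodeManager (NM) that was marked UNHEALTHY: the node has already been removed from the scheduler once, so the second removal attempt fails.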
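To make the failure mode concrete, here is a minimal sketch of how a second removal can end in a NullPointerException. It is not the FairScheduler code: the class DoubleRemoveSketch, its NodeInfo type, and the memory bookkeeping are all hypothetical, chosen only to show the pattern of looking up an already-removed entry and then dereferencing the result.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a scheduler's node bookkeeping; not the FairScheduler code.
class DoubleRemoveSketch {
    static final class NodeInfo {
        final int memoryMb;
        NodeInfo(int memoryMb) { this.memoryMb = memoryMb; }
    }

    private final Map<String, NodeInfo> nodes = new HashMap<>();
    private int clusterMemoryMb = 0;

    void addNode(String id, int memoryMb) {
        nodes.put(id, new NodeInfo(memoryMb));
        clusterMemoryMb += memoryMb;
    }

    // Unguarded removal: assumes the node is still tracked.
    void removeNode(String id) {
        NodeInfo node = nodes.remove(id);  // returns null if the node is already gone
        clusterMemoryMb -= node.memoryMb;  // throws NullPointerException when node is null
    }

    int clusterMemoryMb() { return clusterMemoryMb; }

    public static void main(String[] args) {
        DoubleRemoveSketch scheduler = new DoubleRemoveSketch();
        scheduler.addNode("nm1", 8192);
        scheduler.removeNode("nm1");      // first removal succeeds
        try {
            scheduler.removeNode("nm1");  // second removal fails
        } catch (NullPointerException e) {
            System.out.println("second removal failed: " + e);
        }
    }
}
```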
Following the stack trace into the code:
// scheduler is an org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler
protected ResourceScheduler scheduler;

private final class EventProcessor implements Runnable { // a dedicated EventProcessor thread handles the events
  @Override
  public void run() {
    SchedulerEvent event;
    while (!stopped && !Thread.currentThread().isInterrupted()) {
      try {
        event = eventQueue.take(); // take the next event from the event queue
      } catch (InterruptedException e) {
        LOG.error("Returning, interrupted : " + e);
        return; // TODO: Kill RM.
      }
      try {
        scheduler.handle(event); // handle the event
      } catch (Throwable t) { // catch anything thrown while handling the event
        // An error occurred, but we are shutting down anyway.
        // If it was an InterruptedException, the very act of
        // shutdown could have caused it and is probably harmless.
        if (stopped) {
          LOG.warn("Exception during shutdown: ", t);
          break;
        }
        LOG.fatal("Error in handling event type " + event.getType() // per the log, event.getType() is NODE_REMOVED here
            + " to the scheduler", t);
        if (shouldExitOnError
            && !ShutdownHookManager.get().isShutdownInProgress()) {
          LOG.info("Exiting, bbye..");
          System.exit(-1);
        }
      }
    }
  }
}
Here we can see that shouldExitOnError controls whether the RM exits when event handling fails.
private boolean shouldExitOnError = false; // initially false

@Override
public synchronized void init(Configuration conf) { // read from the configuration during initialization
  this.shouldExitOnError =
      conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
          Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR); // both constants are defined in Dispatcher
  super.init(conf);
}
// org.apache.hadoop.yarn.event.Dispatcher
public interface Dispatcher {
  // Configuration to make sure dispatcher crashes but doesn't do system-exit in
  // case of errors. By default, it should be false, so that tests are not
  // affected. For all daemons it should be explicitly set to true so that
  // daemons can crash instead of hanging around.
  public static final String DISPATCHER_EXIT_ON_ERROR_KEY =
      "yarn.dispatcher.exit-on-error"; // the controlling parameter
  public static final boolean DEFAULT_DISPATCHER_EXIT_ON_ERROR = false; // defaults to false
  EventHandler getEventHandler();
  void register(Class<? extends Enum> eventType, EventHandler handler);
}
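The exit-on-error behaviour of the EventProcessor loop can be condensed into a small, self-contained sketch. Everything here is hypothetical and simplified: EventLoopSketch, its string events, and the injected exitAction (which stands in for System.exit(-1) so the example stays testable) are illustration only, and the loop drains the queue on the calling thread rather than running its own thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical, simplified model of an exit-on-error event loop; not the Hadoop class.
class EventLoopSketch {
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    private final Consumer<String> handler;
    private final boolean shouldExitOnError;
    private final Runnable exitAction; // stands in for System.exit(-1)

    EventLoopSketch(Consumer<String> handler, boolean shouldExitOnError, Runnable exitAction) {
        this.handler = handler;
        this.shouldExitOnError = shouldExitOnError;
        this.exitAction = exitAction;
    }

    void post(String event) {
        eventQueue.add(event);
    }

    // Handles queued events on the calling thread and returns how many succeeded.
    int drain() {
        int succeeded = 0;
        String event;
        while ((event = eventQueue.poll()) != null) {
            try {
                handler.accept(event);
                succeeded++;
            } catch (Throwable t) {
                System.err.println("Error in handling event type " + event + ": " + t);
                if (shouldExitOnError) {
                    exitAction.run(); // one failing event ends the whole loop
                    return succeeded;
                }
                // otherwise the failure is only logged and the loop continues
            }
        }
        return succeeded;
    }

    public static void main(String[] args) {
        EventLoopSketch loop = new EventLoopSketch(
            e -> { if (e.equals("NODE_REMOVED")) throw new NullPointerException(); },
            true,
            () -> System.out.println("Exiting, bbye.."));
        loop.post("NODE_ADDED");
        loop.post("NODE_REMOVED");
        loop.post("NODE_UPDATE");
        System.out.println("events handled before exit: " + loop.drain());
    }
}
```

With the flag set, a single handler exception stops all further event processing, which is why one NullPointerException in the scheduler was enough to take the RM down.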
In the init method of the ResourceManager class:
@Override
public synchronized void init(Configuration conf) {
  this.conf = conf;
  this.conf.setBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY, true); // forced to true here, overriding the DEFAULT in Dispatcher
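In other words, the RM exits by default when the dispatcher hits an error.
Whether to exit on error is governed by the parameter yarn.dispatcher.exit-on-error, but because ResourceManager.init sets it to true unconditionally, changing the behaviour takes more than editing a configuration file. Turning it off would also have a fairly broad impact, so it is better to leave it alone and fix the problem with a patch.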
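The interplay between the library default and the value forced by the daemon can be sketched as follows. This is an illustration under stated assumptions: java.util.Properties stands in for Hadoop's Configuration, and the helper methods getBoolean and daemonInit are hypothetical; only the key name and the default value are taken from the Dispatcher excerpt above.

```java
import java.util.Properties;

// Hypothetical sketch: java.util.Properties stands in for Hadoop's Configuration.
class ExitOnErrorConfigSketch {
    static final String DISPATCHER_EXIT_ON_ERROR_KEY = "yarn.dispatcher.exit-on-error";
    static final boolean DEFAULT_DISPATCHER_EXIT_ON_ERROR = false;

    // Same shape as a getBoolean(key, default) lookup: fall back when the key is unset.
    static boolean getBoolean(Properties conf, String key, boolean defaultValue) {
        String value = conf.getProperty(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    // Mimics a daemon init() that sets the key unconditionally before anything reads it.
    static boolean daemonInit(Properties conf) {
        conf.setProperty(DISPATCHER_EXIT_ON_ERROR_KEY, "true");
        return getBoolean(conf, DISPATCHER_EXIT_ON_ERROR_KEY, DEFAULT_DISPATCHER_EXIT_ON_ERROR);
    }

    public static void main(String[] args) {
        Properties empty = new Properties();
        System.out.println("library default: "
            + getBoolean(empty, DISPATCHER_EXIT_ON_ERROR_KEY, DEFAULT_DISPATCHER_EXIT_ON_ERROR));

        Properties userConf = new Properties();
        userConf.setProperty(DISPATCHER_EXIT_ON_ERROR_KEY, "false"); // user tries to disable it
        System.out.println("after daemon init: " + daemonInit(userConf));
    }
}
```

In this model a user-supplied value is overwritten as soon as the daemon initializes, which matches what the setBoolean call in the init excerpt does to the configuration object it receives.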
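The official patch is fairly simple: it adds a state check when the RM removes an NM, so that the node is not removed from the scheduler a second time: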
--- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
+++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
@@ -501,8 +501,13 @@ public DeactivateNodeTransition(NodeState finalState) {
     public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
       // Inform the scheduler
       rmNode.nodeUpdateQueue.clear();
-      rmNode.context.getDispatcher().getEventHandler().handle(
-          new NodeRemovedSchedulerEvent(rmNode));
+      // If the current state is NodeState.UNHEALTHY
+      // Then node is already been removed from the
+      // Scheduler
+      if (!rmNode.getState().equals(NodeState.UNHEALTHY)) {
+        rmNode.context.getDispatcher().getEventHandler()
+          .handle(new NodeRemovedSchedulerEvent(rmNode));
+      }
       rmNode.context.getDispatcher().getEventHandler().handle(
           new NodesListManagerEvent(
               NodesListManagerEventType.NODE_UNUSABLE, rmNode));
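The idea behind the patch can be modelled in a few lines. This is a hypothetical sketch, not RMNodeImpl: the class NodeStateGuardSketch, its three-value NodeState enum, and the string events are invented for illustration, and it assumes (as the patch comment states) that a node entering UNHEALTHY has already been removed from the scheduler.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the guarded deactivation; not the RMNodeImpl state machine.
class NodeStateGuardSketch {
    enum NodeState { RUNNING, UNHEALTHY, LOST }

    private NodeState state = NodeState.RUNNING;
    private final List<String> events = new ArrayList<>();

    // Assumption from the patch comment: going UNHEALTHY removes the node from the scheduler.
    void markUnhealthy() {
        if (state == NodeState.RUNNING) {
            events.add("NODE_REMOVED");
            state = NodeState.UNHEALTHY;
        }
    }

    // Guarded deactivation: skip the scheduler removal if the node was already UNHEALTHY.
    void deactivate(NodeState finalState) {
        if (state != NodeState.UNHEALTHY) {
            events.add("NODE_REMOVED");
        }
        events.add("NODE_UNUSABLE"); // always sent, as in the patched transition
        state = finalState;
    }

    int removalCount() {
        return Collections.frequency(events, "NODE_REMOVED");
    }

    NodeState state() { return state; }

    public static void main(String[] args) {
        NodeStateGuardSketch node = new NodeStateGuardSketch();
        node.markUnhealthy();
        node.deactivate(NodeState.LOST); // UNHEALTHY -> LOST, the transition seen in the log
        System.out.println("NODE_REMOVED events sent: " + node.removalCount());
    }
}
```

Guarding on the node's current state keeps the fix local to the transition that emits the event, so the scheduler never sees a removal for a node it no longer tracks.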
This article is reposted from 菜菜光's 51CTO blog. Original link: http://blog.51cto.com/caiguangguang/1436087. To repost it, please contact the original author.