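For reasons unknown, a datanode in the cluster stopped maintaining a normal heartbeat with the namenode, and the namenode therefore removed it from the cluster. The datanode's log, however, showed that the process was still running. When I restarted the datanode, it got as far as the "MBean for source ugi registered" step and then hung: it printed no further log output and never managed to register with the namenode. The log looks like this: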
STARTUP_MSG: host = test/192.168.1.29
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.203.0
STARTUP_MSG: build = http://svn.apache.org/repos/asf/Hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
************************************************************/
2011-12-27 16:03:13,954 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2011-12-27 16:03:14,006 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2011-12-27 16:03:14,020 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2011-12-27 16:03:14,020 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2011-12-27 16:03:14,173 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
Running strace against the datanode process showed that it was sitting in FUTEX_WAIT the whole time:
[@linuxidc hadoop-0.20.203.0]$ strace -p 15000
Process 15000 attached - interrupt to quit
futex(0x40d3a9d0, FUTEX_WAIT, 15029, NULL <unfinished ...>
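To make the strace line easier to read, here is a minimal sketch in Python that splits it into its fields. The helper name `parse_futex_wait` and the regular expression are illustrative assumptions, not part of Hadoop or strace, and the pattern only targets the simple form shown above. Per futex(2), FUTEX_WAIT blocks while the word at the given address still holds the expected value (the third argument), and a NULL timeout means the wait has no time limit. The value 15029 is close to pid 15000, so it may be the ID of another thread in the same JVM that the traced thread is waiting for, but that is a guess the trace alone does not confirm.

```python
import re

# Matches the futex() call in the form strace printed it here. This pattern is
# an illustrative assumption and does not cover every futex variant.
FUTEX_RE = re.compile(
    r"futex\((?P<addr>0x[0-9a-fA-F]+),\s*"
    r"(?P<op>[A-Z_|]+),\s*"
    r"(?P<val>\d+),\s*"
    r"(?P<timeout>[^)<]+?)\s*"
    r"(?:\)|<unfinished \.\.\.>)"
)


def parse_futex_wait(line):
    """Return the fields of a futex() strace line as a dict, or None."""
    match = FUTEX_RE.search(line)
    if match is None:
        return None
    return {
        "addr": int(match.group("addr"), 16),
        "op": match.group("op"),
        "val": int(match.group("val")),
        # A NULL timeout means the wait has no time limit.
        "waits_forever": match.group("timeout") == "NULL",
        # "<unfinished ...>": the call had not returned when strace printed it.
        "unfinished": "<unfinished" in line,
    }


if __name__ == "__main__":
    sample = "futex(0x40d3a9d0, FUTEX_WAIT, 15029, NULL <unfinished ...>"
    print(parse_futex_wait(sample))
```

Note that `strace -p` without `-f` attaches only to the single thread whose ID is given, so this line describes one thread of the JVM. A thread blocked like this does not by itself show where the datanode is stuck; tracing all threads with `strace -f -p`, or taking a Java thread dump (for example with `jstack`), would be needed for that.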
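A colleague suggested that the cause might be a bug in Java itself: the datanode failed to clean up some of its resources when it exited, and that in turn prevented it from starting up normally.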
After rebooting the server, everything went back to normal.