Note: the oom-killer is a Linux kernel mechanism. When the system runs out of memory, the kernel invokes the OOM killer to kill the offending process; this behaviour can be tuned, or effectively disabled, through sysctl vm.* settings. My preliminary read: the kernel kept killing the runaway process, Hadoop kept relaunching it, and the node fell into a loop. Eventually the node stopped responding, its work failed over to other servers, those servers went through the same cycle, and the chain reaction took the whole cluster down.
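For reference, a minimal sketch of the standard knobs involved; the values are illustrative, the pid-file path is a made-up placeholder, and on kernels older than 2.6.36 the per-process file is oom_adj rather than oom_score_adj:

# Hedged sketch: common Linux settings for tuning OOM-killer behaviour.
# Panic instead of picking a victim process:
sysctl -w vm.panic_on_oom=1
# Kill the task that triggered the OOM rather than scoring every process:
sysctl -w vm.oom_kill_allocating_task=1
# Exempt a specific process (e.g. the DataNode JVM) from OOM selection;
# -1000 exempts it entirely, 0 is the default (pid file path is a placeholder):
echo -1000 > /proc/$(cat /var/run/hadoop/hadoop-datanode.pid)/oom_score_adj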
The regular scheduled MapReduce jobs are battle-tested, absolutely loyal soldiers that have never caused a crash, so the culprit could only be a Python script someone was running on their own.
I asked around: one colleague had run a collaborative-filtering script at around 4 o'clock. The script apparently read a large dictionary straight into memory, which ate a lot of RAM, and Hadoop went down with an OOM.
After a restart and a cleanup of the dist-cache under mrlocal, service came back quickly.
I'm not pointing fingers here; a failure is actually a good thing, since small problems surfacing now beats holding everything in until it all blows up at once. What's worth writing down is the troubleshooting approach: make good use of the tools Linux ships with, such as strace, lsof, and sysstat, and learn to read the syslog. If you only stare at log4j output, you may never find the root cause, because log4j sits at the application layer and simply cannot record what happens at the lower levels.
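A few illustrative commands along those lines; the <datanode_pid> is a placeholder, and /var/log/messages assumes a RHEL-style syslog layout:

# OOM-killer activity shows up in syslog/dmesg, never in log4j output:
grep -i "out of memory" /var/log/messages
dmesg | grep -i oom
# See which syscalls a hung process is stuck in:
strace -f -p <datanode_pid>
# List the files and sockets it is holding open:
lsof -p <datanode_pid>
# sysstat's sar: CPU utilisation including %iowait, 10 samples at 5s intervals:
sar -u 5 10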
Another upside: I took the opportunity to convert all the failed old nodes to EasyHadoop. They had previously been installed by hand from tarballs, with no way to manage them through a UI.
One more interesting thing: after part of the cluster went down, the surviving servers started re-replicating blocks on a large scale, so network bandwidth spiked on every node. Even now the nodes are still rebalancing blocks, and a large share of CPU time is stuck in iowait.
The second failure: a datanode died not long after being restarted. The log showed the following:
2013-04-04 00:37:22,842 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-145226050820856206_12015772 src: /192.168.1.51:9879 dest: /192.168.1.51:50010
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_2613577836754795446_11189945 is valid, and cannot be written to.
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1402)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:99)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:299)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
        at java.lang.Thread.run(Thread.java:662)
----
2013-04-04 00:43:29,975 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.51:50010, storageID=DS-1881855868-192.168.1.51-50010-1338289396565, infoPort=50075, ipcPort=50020):DataXceiver
java.lang.OutOfMemoryError: GC overhead limit exceeded
2013-04-04 00:43:33,082 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.51:50010, dest: /192.168.1.46:17879, bytes: 6175632, op: HDFS_READ, cliID: DFSClient_1354167993, offset: 0, srvID: DS-1881855868-192.168.1.51-50010-1338289396565, blockid: blk_890732859468346910_12015888, duration: 21544233000
2013-04-04 00:43:34,336 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2013-04-04 00:43:36,872 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at hadoop-node-51/192.168.1.51
************************************************************/
Once the errors shown in the first part of the log accumulate, the datanode runs out of memory and shuts down.
I first looked at DataXceiver.java and DataNode.java in the source but found nothing useful. Then, following the FSDataset.java frame in the stack trace and BlockAlreadyExistsException.java, the class that throws the exception, I found what I needed.
FSDataset.java contains this snippet:
if (isValidBlock(b)) {
  if (!isRecovery) {
    throw new BlockAlreadyExistsException("Block " + b + " is valid, and cannot be written to.");
  }
  // If the block was successfully finalized because all packets
  // were successfully processed at the Datanode but the ack for
  // some of the packets were not received by the client. The client
  // re-opens the connection and retries sending those packets.
  // The other reason is that an "append" is occurring to this block.
  detachBlock(b, 1);
}
And BlockAlreadyExistsException.java contains this snippet, which is in fact pretty much the whole file:
class BlockAlreadyExistsException extends IOException {
  private static final long serialVersionUID = 1L;

  public BlockAlreadyExistsException() {
    super();
  }

  public BlockAlreadyExistsException(String msg) {
    super(msg);
  }
}