Recently our Hadoop cluster kept hitting TaskTracker errors while running jobs, and nodes would drop out. The tasktracker log showed the following:
2012-07-14 10:43:41,492 WARN org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: org.apache.hadoop.util.DiskChecker$DiskErrorException: can not create directory: /home/hadoop/var/hadoop/tmp/mapred/local/taskTracker/archive/bfdbjc1/home/hadoop/var/hadoop/tmp/mapred/system/job_201207132043_0026/libjars
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:73)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createPath(LocalDirAllocator.java:253)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:331)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:143)
2012-07-14 10:43:41,493 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201207132043_0026_m_000198_1 Child Error
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/archive/bfdbjc1/home/hadoop/var/hadoop/tmp/mapred/system/job_201207132043_0026/libjars/udf.jar
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:143)
The TaskTracker cannot create a directory for the current job while distributing the job's files (here, the libjars for the job).
Going to the directory in question and trying to create a folder by hand fails the same way:
cd /home/hadoop/var/hadoop/tmp/mapred/local/taskTracker/archive/bfdbjc1/home/hadoop/var/hadoop/tmp/mapred/system
mkdir aaa
mkdir: cannot create directory `aaa': Too many links
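To confirm that the directory has really hit the cap, check its hard-link count: on ext3 a directory's link count is 2 ('.' plus its entry in the parent) plus one per subdirectory, so a count of 32000 means no more subdirectories can be created. A quick check from inside the directory above:
stat -c %h .           # hard-link count of the directory; 32000 = ext3 cap reached
ls -l | grep -c '^d'   # equivalent: count first-level subdirectories (should print 31998)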
This happens because ext3 caps the number of first-level subdirectories in a single directory at 31998. More precisely, a directory's hard-link count is limited to 32000, and two links are always taken by '.' and the directory's entry in its parent, leaving room for 31998 subdirectories.
The kernel imposes this limit for lookup efficiency; raising it would require recompiling the kernel. The relevant definitions appear in the kernel headers:
include/linux/ext2_fs.h:#define EXT2_LINK_MAX 32000
include/linux/ext3_fs.h:#define EXT3_LINK_MAX 32000
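The limit is easy to reproduce on any ext3 mount (a sketch; /mnt/ext3test is a hypothetical empty directory on an ext3 filesystem):
cd /mnt/ext3test
for i in $(seq 1 32000); do mkdir "d$i" || { echo "failed at $i"; break; }; done
# prints "failed at 31999": 31998 subdirectories succeed, then mkdir hits EMLINK (Too many links)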
The fix is simply to delete the stale per-job temporary directories under that path. After cleaning up and restarting the cluster, jobs ran without further errors.
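The cleanup was along these lines (a sketch, not an exact transcript; stop the TaskTracker first so no running job loses its local files, and keep any job directories that are still in use):
cd /home/hadoop/var/hadoop/tmp/mapred/local/taskTracker/archive/bfdbjc1/home/hadoop/var/hadoop/tmp/mapred/system
rm -rf job_*   # remove the accumulated per-job temp directories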