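Today I found that Hive on our Kerberos-enabled Hadoop test cluster had stopped working. Queries that do not launch a MapReduce job, such as select * limit, ran fine, but any query that did launch a MapReduce job failed with a rather odd error (Unable to retrieve URL for Hadoop Task logs. Unable to find job tracker info port.). I confirmed that the JobTracker was healthy and that its configuration was normal, so the JobTracker did not seem to be the cause. Digging into the TaskTracker log turned up the following error: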
2014-03-26 17:28:02,048 WARN org.apache.hadoop.mapred.TaskTracker: Exception while localization java.io.IOException: Job initialization failed (24) with output: File /home/test/platform must be owned by root, but is owned by 501
at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:194)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1420)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1407)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1395)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1310)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2727)
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2691)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:261)
at org.apache.hadoop.util.Shell.run(Shell.java:188)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:381)
at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:187)
... 8 more
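Here /home/test/platform is the directory where the MapReduce programs live. Changing the owner of /home/test/platform to root fixed the problem, but why does this directory have to be owned by root?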
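The stack trace shows that the failure occurred in initializeJob of the LinuxTaskController class (with Kerberos enabled, this is the task controller class that has to be used). initializeJob performs the initial setup for a job: it takes the user, job ID, token, mapred local dirs and other parameters, assembles them into a command array, constructs a ShellCommandExecutor from that array, and finally calls the executor's execute method.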
public void initializeJob(String user, String jobid, Path credentials,
                          Path jobConf, TaskUmbilicalProtocol taskTracker,
                          InetSocketAddress ttAddr
                          ) throws IOException {
  List<String> command = new ArrayList<String>(
    Arrays.asList(taskControllerExe, // task-controller
                  user,
                  localStorage.getDirsString(), // mapred.local.dir
                  Integer.toString(Commands.INITIALIZE_JOB.getValue()),
                  jobid,
                  credentials.toUri().getPath().toString(), // jobToken
                  jobConf.toUri().getPath().toString())); // job.xml
  File jvm = // use same jvm as parent
    new File(new File(System.getProperty("java.home"), "bin"), "java");
  command.add(jvm.toString());
  command.add("-classpath");
  command.add(System.getProperty("java.class.path"));
  command.add("-Dhadoop.log.dir=" + TaskLog.getBaseLogDir());
  command.add("-Dhadoop.root.logger=INFO,console");
  command.add(JobLocalizer.class.getName()); // main of JobLocalizer
  command.add(user);
  command.add(jobid);
  // add the task tracker's reporting address
  command.add(ttAddr.getHostName());
  command.add(Integer.toString(ttAddr.getPort()));
  String[] commandArray = command.toArray(new String[0]);
  ShellCommandExecutor shExec = new ShellCommandExecutor(commandArray);
  if (LOG.isDebugEnabled()) {
    LOG.debug("initializeJob: " + Arrays.toString(commandArray)); // commandArray
  }
  try {
    shExec.execute();
    if (LOG.isDebugEnabled()) {
      logOutput(shExec.getOutput());
    }
  } catch (ExitCodeException e) {
    int exitCode = shExec.getExitCode();
    logOutput(shExec.getOutput());
    throw new IOException("Job initialization failed (" + exitCode +
                          ") with output: " + shExec.getOutput(), e);
  }
}
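To see how a non-zero exit code from the helper ends up as "Job initialization failed (24) with output: ..." in the log, here is a minimal standalone sketch of the same run-and-wrap pattern. It is not Hadoop's implementation: the class HelperLauncher and its run method are made up for illustration, the plain JDK ProcessBuilder stands in for ShellCommandExecutor, merging stderr into stdout is a choice of this sketch, and the demo command assumes a Unix-like system with sh on the PATH.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the execute-and-check step in initializeJob.
public class HelperLauncher {

    /** Runs the command and returns its output; throws if the exit code is non-zero. */
    static String run(List<String> command) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // sketch-only choice: merge stderr into stdout
        Process process = pb.start();

        // Collect everything the child process prints.
        StringBuilder output = new StringBuilder();
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(process.getInputStream()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        } finally {
            reader.close();
        }

        int exitCode;
        try {
            exitCode = process.waitFor();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for the command", e);
        }

        // Same message shape as the IOException thrown by initializeJob.
        if (exitCode != 0) {
            throw new IOException("Job initialization failed (" + exitCode
                + ") with output: " + output);
        }
        return output.toString();
    }

    public static void main(String[] args) {
        // A fake helper that prints an ownership complaint and exits with 24.
        List<String> command = Arrays.asList("sh", "-c",
            "echo 'File /tmp/demo must be owned by root, but is owned by 501'; exit 24");
        try {
            run(command);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that the Java side only relays whatever the native helper printed, together with its exit code, so the real reason for the failure has to be looked up in the helper itself.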
Enabling DEBUG logging on the TaskTracker reveals the actual contents of commandArray:
2014-03-26 19:49:02,489 DEBUG org.apache.hadoop.mapred.LinuxTaskController: initializeJob:
[/home/test/platform/hadoop-2.0.0-mr1-cdh4.2.0/bin/../sbin/Linux-amd64-64/task-controller,
hdfs, xxxxxxx, 0, job_201403261945_0002, xxxxx/jobToken, xxxx/job.xml, /usr/local/jdk1.6.0_37/jre/bin/java,
-classpath,xxxxxx.jar, -Dhadoop.log.dir=/home/test/logs/hadoop/mapred, -Dhadoop.root.logger=INFO,console,
org.apache.hadoop.mapred.JobLocalizer, hdfs, job_201403261945_0002, localhost.localdomain, 57536]
The important argument here is taskControllerExe, which is the path to the task-controller executable (in this case /home/test/platform/hadoop-2.0.0-mr1-cdh4.2.0/bin/../sbin/Linux-amd64-64/task-controller).
The execute method ultimately runs task-controller.
The source code of task-controller is under src/c++/task-controller.
configuration.c defines the ownership check:
static int is_only_root_writable(const char *file) {
  .......
  if (file_stat.st_uid != 0) {
    fprintf(LOGFILE, "File %s must be owned by root, but is owned by %d\n",
            file, file_stat.st_uid);
    return 0;
  }
  .......
If the file being checked is not owned by root, the function logs this error and returns 0.
The code that calls this function:
int check_configuration_permissions(const char* file_name) {
  // copy the input so that we can modify it with dirname
  char* dir = strdup(file_name);
  char* buffer = dir;
  do {
    if (!is_only_root_writable(dir)) {
      free(buffer);
      return -1;
    }
    dir = dirname(dir);
  } while (strcmp(dir, "/") != 0);
  free(buffer);
  return 0;
}
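The loop checks the file itself and then every parent directory up to, but not including, "/". That would explain why /home/test/platform was flagged even though it is only an ancestor of the checked file. The following Java sketch mirrors that walk so it can be tried without building task-controller. It is only an illustration under assumptions: the class OwnerWalk, the method firstNotOwnedByRoot, and the path /home/test/platform/conf/taskcontroller.cfg are made up for the example, the input path is assumed to be absolute, and only the owner check is reproduced (the elided parts of is_only_root_writable are not).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.ToIntFunction;

// Hypothetical re-creation of the ancestor walk in check_configuration_permissions.
public class OwnerWalk {

    /**
     * Starting at the file and moving up through its parents, returns the first
     * path whose owner uid is not 0, or null if every checked path is owned by root.
     * Like the C loop, the walk stops before the root directory "/".
     */
    static Path firstNotOwnedByRoot(Path file, ToIntFunction<Path> uidOf) {
        Path dir = file.toAbsolutePath().normalize();
        do {
            if (uidOf.applyAsInt(dir) != 0) {
                return dir;
            }
            dir = dir.getParent();
            // A root path has no parent, so this condition ends the loop at "/".
        } while (dir != null && dir.getParent() != null);
        return null;
    }

    /** Reads the real owner uid; the "unix:uid" attribute exists only on Unix-like systems. */
    static int realUid(Path path) {
        try {
            return (Integer) Files.getAttribute(path, "unix:uid");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up file path under the directory from the log message.
        Path file = Paths.get("/home/test/platform/conf/taskcontroller.cfg");

        // Fake ownership table: everything is owned by root except /home/test/platform.
        ToIntFunction<Path> fakeUid =
            p -> p.toString().equals("/home/test/platform") ? 501 : 0;

        Path offender = firstNotOwnedByRoot(file, fakeUid);
        if (offender != null) {
            System.out.println("File " + offender + " must be owned by root, but is owned by "
                + fakeUid.applyAsInt(offender));
        }
    }
}
```

With the fake ownership table, main prints the same message as the TaskTracker log. To inspect a real path on a Unix-like system, pass OwnerWalk::realUid as the second argument instead of the fake lookup.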