Resolving Hadoop MapReduce jar dependency problems

When using Hadoop, we sometimes write our own MapReduce applications, and they may depend on third-party jars. If nothing is done about those dependencies, the job will throw a ClassNotFoundException at execution time.
There are four ways to handle this:

1. Deploy the dependency jars to every tasktracker node

This is the simplest approach, but the jars have to be deployed to every tasktracker, and it can cause library pollution: if application A and application B depend on different versions of the same library, the two versions will conflict.

2. Merge the dependency jars directly into the MapReduce job jar

The problem with this approach is that the merged jar can become very large, and it also makes upgrading individual dependencies inconvenient.

3. Use the DistributedCache

With this approach, the jars are first uploaded to HDFS, which only needs to be done once (for example, when the application starts). Then, when the job is submitted, the HDFS paths are added to the classpath.
Example:

$ bin/hadoop fs -copyFromLocal lib/protobuf-java-2.0.3.jar /myapp/protobuf-java-2.0.3.jar

// Set up the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addFileToClassPath(new Path("/myapp/protobuf-java-2.0.3.jar"), job);
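
For reference, a fuller sketch of the submit-time side of this approach (a minimal, hedged example: the driver class name, job name, and the identity mapper/reducer are placeholders rather than anything from the original post; it sticks to the old mapred API used in the snippet above):

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ProtoJobDriver {
    public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(ProtoJobDriver.class);
        job.setJobName("proto-job");

        // The jar was uploaded to HDFS once beforehand (see the
        // copyFromLocal command above); at submit time we only add
        // its HDFS path to the tasks' classpath.
        DistributedCache.addFileToClassPath(
                new Path("/myapp/protobuf-java-2.0.3.jar"), job);

        // Identity mapper/reducer are placeholders to keep the sketch complete.
        job.setMapperClass(IdentityMapper.class);
        job.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        JobClient.runJob(job);
    }
}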

4. When the number of external jars is large, approach 3 becomes cumbersome. The following discussion is worth a look:

One of the disadvantages of setting up a Hadoop development environment in Eclipse is that I have been dependent on Eclipse to take care of job submission for me, and so I had never worried about doing it by hand. I have been developing mostly on a single-node cluster (i.e. my laptop), which meant I never had the need to submit a job to an actual cluster, a remote cluster in this case. Also, the first MapReduce programs I wrote and ran on the cluster (more to follow) did not depend on third-party jars. However, the program I am working on depends on a third-party XML parser, which in turn depends on another jar.

As it turns out, I had to specify three external jars every time I submitted a job. I knew there was a -libjars option you could use, as I had seen it somewhere (including in the Hadoop help when you don't specify all arguments for a command), but I had not paid attention since I did not need it then. Googling around, I found a mention of copying the jars to the lib folder of the Hadoop installation. It seemed a good solution until you think about a multi-node cluster, which means you have to copy the libraries to every node. Also, what if you do not have complete control of the cluster? Will you have write permission to the lib folder?
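
One detail worth noting: -libjars is one of Hadoop's generic options, so it only takes effect if the driver lets Hadoop parse those options, typically by implementing Tool and launching through ToolRunner. A minimal sketch (the class name, jar names, and paths below are hypothetical, not from the original post):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class XmlJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects whatever -libjars/-D options
        // ToolRunner parsed off the command line.
        JobConf job = new JobConf(getConf(), XmlJobDriver.class);
        job.setMapperClass(IdentityMapper.class);   // placeholder mapper
        job.setReducerClass(IdentityReducer.class); // placeholder reducer
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Submission would then look roughly like (hypothetical jar names):
        //   hadoop jar myjob.jar XmlJobDriver \
        //       -libjars xml-parser.jar,parser-dep.jar input output
        System.exit(ToolRunner.run(new XmlJobDriver(), args));
    }
}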

Luckily, I bumped into a solution suggested by Doug Cutting as an answer to someone who had a similar predicament. The solution was to create a “lib” folder in your project and copy all the external jars into this folder. According to Doug, Hadoop will look for third-party jars in this folder. It works great!

Hadoop: The Definitive Guide also covers how to package dependencies into the job jar. The relevant passage:

"Any JAR files the job depends on must be packaged into the lib directory of the job's JAR file. (This is similar to a Java web application archive, or WAR file, except that in the WAR case the JAR files go in the WEB-INF/lib subdirectory.)"
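
As a concrete illustration (the class and jar names here are made up), listing the contents of a job jar packaged this way would show the dependencies under lib/:

$ jar tf myjob.jar
META-INF/MANIFEST.MF
com/example/XmlJobDriver.class
com/example/XmlMapper.class
lib/xml-parser.jar
lib/xml-parser-dependency.jar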
