Hadoop lzo文件的并行map处理(2)

JAVA_LIBRARY_PATH=''
because it keeps Hadoop from keeping the alteration you made to JAVA_LIBRARY_PATH above. (Update: seehttps://issues.apache.org/jira/browse/HADOOP-6453). Make sure you restart your jobtrackers and tasktrackers after uploading and changing configs so that they take effect.

Using Hadoop and LZO
Reading and Writing LZO Data
The project provides LzoInputStream and LzoOutputStream wrapping regular streams, to allow you to easily read and write compressed LZO data.

Indexing LZO Files
At this point, you should also be able to use the indexer to index lzo files in Hadoop (recall: this makes them splittable, so that they can be analyzed in parallel in a mapreduce job). Imagine that big_file.lzo is a 1 GB LZO file. You have two options:

index it in-process via:

hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer big_file.lzo
index it in a map-reduce job via:

hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer big_file.lzo
Either way, after 10-20 seconds there will be a file named big_file.lzo.index. The newly-created index file tells the LzoTextInputFormat's getSplits function how to break the LZO file into splits that can be decompressed and processed in parallel. Alternatively, if you specify a directory instead of a filename, both indexers will recursively walk the directory structure looking for .lzo files, indexing any that do not already have corresponding .lzo.index files.

Running MR Jobs over Indexed Files
Now run any job, say wordcount, over the new file. In Java-based M/R jobs, just replace any uses of TextInputFormat by LzoTextInputFormat. In streaming jobs, add "-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat" (streaming still uses the old APIs, and needs a class that inherits from org.apache.hadoop.mapred.InputFormat). For Pig jobs, email me or check the pig list -- I have custom LZO loader classes that work but are not (yet) contributed back.

Note that if you forget to index an .lzo file, the job will work but will process the entire file in a single split, which will be less efficient.

内容版权声明:除非注明,否则皆为本站原创文章。

转载注明出处:http://www.heiqu.com/9e552b4e589c6cc4f7007ade54ce3139.html