sort by value
Hadoop 不提供直接的sort by value方法,因为这样会降低mapreduce性能。
但可以用组合的办法来实现,具体实现方法见《the definitive guide》, P250
基本思想:
1. 组合key/value作为新的key;
2. 重载partitioner,根据old key来分割;
conf.setPartitionerClass(FirstPartitioner.class);
3. 自定义keyComparator:先根据old key排序,再根据old value排序;
conf.setOutputKeyComparatorClass(KeyComparator.class);
4. 重载GroupComparator, 也根据old key 来组合; conf.setOutputValueGroupingComparator(GroupComparator.class);
small input files的处理
对于一系列的small files作为input file,会降低hadoop效率。
有3种方法可以将small file合并处理:
1. 将一系列的small files合并成一个sequneceFile,加快mapreduce速度。
详见WholeFileInputFormat及SmallFilesToSequenceFileConverter,《the definitive guide》, P194
2. 使用CombineFileInputFormat集成FileinputFormat,但是未实现过;
3. 使用hadoop archives(类似打包),减少小文件在namenode中的metadata内存消耗。(这个方法不一定可行,所以不建议使用)
方法:
将/my/files目录及其子目录归档成files.har,然后放在/my目录下
bin/hadoop archive -archiveName files.har /my/files /my
查看files in the archive:
bin/hadoop fs -lsr har://my/files.har
skip bad records
JobConf conf = new JobConf(ProductMR.class);
conf.setJobName("ProductMR");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Product.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setMapOutputCompressorClass(DefaultCodec.class);
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
String objpath = "abc1";
SequenceFileInputFormat.addInputPath(conf, new Path(objpath));
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
SkipBadRecords.setSkipOutputPath(conf, new Path("data/product/skip/"));
String output = "abc";
SequenceFileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);
For skipping failed tasks try : mapred.max.map.failures.percent
restart 单个datanode
如果一个datanode 出现问题,解决之后需要重新加入cluster而不重启cluster,方法如下:
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start jobtracker
reduce exceed 100%
"Reduce Task Progress shows > 100% when the total size of map outputs (for a
single reducer) is high "
造成原因:
在reduce的merge过程中,check progress有误差,导致status > 100%,在统计过程中就会出现以下错误:java.lang.ArrayIndexOutOfBoundsException: 3
at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.getReduceAvarageProgresses(StatusHttpServer.java:228)
at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.doGet(StatusHttpServer.java:159)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
at org.mortbay.http.HttpServer.service(HttpServer.java:954)
jira地址:
counters
3中counters:
1. built-in counters: Map input bytes, Map output records...
2. enum counters
调用方式:
enum Temperature {
MISSING,
MALFORMED
}
reporter.incrCounter(Temperature.MISSING, 1)
结果显示:
09/04/20 06:33:36 INFO mapred.JobClient: Air Temperature Recor
09/04/20 06:33:36 INFO mapred.JobClient: Malformed=3
09/04/20 06:33:36 INFO mapred.JobClient: Missing=66136856
3. dynamic countes:
调用方式:
reporter.incrCounter("TemperatureQuality", parser.getQuality(),1);
结果显示:
09/04/20 06:33:36 INFO mapred.JobClient: TemperatureQuality
09/04/20 06:33:36 INFO mapred.JobClient: 2=1246032
09/04/20 06:33:36 INFO mapred.JobClient: 1=973422173
09/04/20 06:33:36 INFO mapred.JobClient: 0=1
7: Namenode in safe mode
解决方法
bin/hadoop dfsadmin -safemode leave
8:java.net.NoRouteToHostException: No route to host
解决方法:
sudo /etc/init.d/iptables stop
9:更改namenode后,在hive中运行select 依旧指向之前的namenode地址
这是因为:When youcreate a table, hive actually stores the location of the table (e.g.
hdfs://ip:port/user/root/...) in the SDS and DBS tables in the metastore . So when I bring up a new cluster the master has a new IP, but hive's metastore is still pointing to the locations within the old
cluster. I could modify the metastore to update with the new IP everytime I bring up a cluster. But the easier and simpler solution was to just use an elastic IP for the master
所以要将metastore中的之前出现的namenode地址全部更换为现有的namenode地址
10:Your DataNode is started and you can create directories with bin/hadoop dfs -mkdir, but you get an error message when you try to put files into the HDFS (e.g., when you run a command like bin/hadoop dfs -put).
解决方法:
Go to the HDFS info web page (open your web browser and go to :dfs_info_port where namenode is the hostname of your NameNode and dfs_info_port is the port you chose dfs.info.port; if followed the QuickStart on your personal computer then this URL will be :50070). Once at that page click on the number where it tells you how many DataNodes you have to look at a list of the DataNodes in your cluster.
If it says you have used 100% of your space, then you need to free up room on local disk(s) of the DataNode(s).
If you are on Windows then this number will not be accurate (there is some kind of bug either in Cygwin's df.exe or in Windows). Just free up some more space and you should be okay. On one Windows machine we tried the disk had 1GB free but Hadoop reported that it was 100% full. Then we freed up another 1GB and then it said that the disk was 99.15% full and started writing data into the HDFS again. We encountered this bug on Windows XP SP2.
11:Your DataNodes won't start, and you see something like this in logs/*datanode*:
Incompatible namespaceIDs in /tmp/hadoop-ross/dfs/data
原因:
Your Hadoop namespaceID became corrupted. Unfortunately the easiest thing to do reformat the HDFS.
解决方法:
You need to do something like this:
bin/stop-all.sh
rm -Rf /tmp/hadoop-your-username/*
bin/hadoop namenode -format
12:You can run Hadoop jobs written in Java (like the grep example), but your HadoopStreaming jobs (such as the Python example that fetches web page titles) won't work.
原因:
You might have given only a relative path to the mapper and reducer programs. The tutorial originally just specified relative paths, but absolute paths are required if you are running in a real cluster.
解决方法:
Use absolute paths like this from the tutorial:
bin/hadoop jar contrib/hadoop-0.15.2-streaming.jar \
-mapper $HOME/proj/hadoop/multifetch.py \
-reducer $HOME/proj/hadoop/reducer.py \
-input urls/* \
-output titles
13: 2009-01-08 10:02:40,709 ERROR metadata.Hive (Hive.java:getPartitions(499)) - javax.jdo.JDODataStoreException: Required table missing : ""PARTITIONS"" in Catalog "" Schema "". JPOX requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "org.jpox.autoCreateTables"
原因:就是因为在 hive-default.xml 里把 org.jpox.fixedDatastore 设置成 true 了
starting namenode, logging to /home/hadoop/HadoopInstall/hadoop/bin/../logs/hadoop-hadoop-namenode-hadoop.out
localhost: starting datanode, logging to /home/hadoop/HadoopInstall/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop.out
localhost: starting secondarynamenode, logging to /home/hadoop/HadoopInstall/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-hadoop.out
localhost: Exception in thread "main" java.lang.NullPointerException
localhost: at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:130)
localhost: at org.apache.hadoop.dfs.NameNode.getAddress(NameNode.java:116)
localhost: at org.apache.hadoop.dfs.NameNode.getAddress(NameNode.java:120)
localhost: at org.apache.hadoop.dfs.SecondaryNameNode.initialize(SecondaryNameNode.java:124)
localhost: at org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:108)
localhost: at org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:460)
14:09/08/31 18:25:45 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:Bad connect ack with firstBadLink 192.168.1.11:50010
> 09/08/31 18:25:45 INFO hdfs.DFSClient: Abandoning block blk_-8575812198227241296_1001
> 09/08/31 18:25:51 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.16:50010
> 09/08/31 18:25:51 INFO hdfs.DFSClient: Abandoning block blk_-2932256218448902464_1001
> 09/08/31 18:25:57 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.11:50010
> 09/08/31 18:25:57 INFO hdfs.DFSClient: Abandoning block blk_-1014449966480421244_1001
> 09/08/31 18:26:03 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.16:50010
> 09/08/31 18:26:03 INFO hdfs.DFSClient: Abandoning block blk_7193173823538206978_1001
> 09/08/31 18:26:09 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable
to create new block.
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2731)
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2182)
>
> 09/08/31 18:26:09 WARN hdfs.DFSClient: Error Recovery for block blk_7193173823538206978_1001
bad datanode[2] nodes == null
> 09/08/31 18:26:09 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/umer/8GB_input"
- Aborting...
> put: Bad connect ack with firstBadLink 192.168.1.16:50010
解决方法:
I have resolved the issue:
What i did:
1) '/etc/init.d/iptables stop' -->stopped firewall
2) SELINUX=disabled in '/etc/selinux/config' file.-->disabled selinux
I worked for me after these two changes
解决jline.ConsoleReader.readLine在Windows上不生效问题方法
在 CliDriver.java的main()函数中,有一条语句reader.readLine,用来读取标准输入,但在Windows平台上该语句总是 返回null,这个reader是一个实例jline.ConsoleReader实例,给Windows Eclipse调试带来不便。
我们可以通过使用java.util.Scanner.Scanner来替代它,将原来的
while ((line=reader.readLine(curPrompt+"> ")) != null)
复制代码
替换为:
Scanner sc = new Scanner(System.in);
while ((line=sc.nextLine()) != null)
复制代码
重新编译发布,即可正常从标准输入读取输入的SQL语句了。