Today I'll summarize some optimization tips I've picked up while working with Hive, in the hope that they help. Hive optimization is where a programmer's technical skill really shows, and it is one of interviewers' favorite topics.
Tip 1: Controlling the number of reducers

Every time we execute SQL in the Hive CLI, the following is printed:
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>

Many people wonder what this output is for. Let's go through it one item at a time, starting with:
set hive.exec.reducers.bytes.per.reducer=<number>

This Hive command sets the maximum number of bytes each reducer processes while the SQL executes. It can be set in the configuration file or directly on the command line. Whenever the input data exceeds <number>, an additional reducer is generated. For example, with <number> = 1024K and 10M of input, 10 reducers are generated. Let's verify whether this is actually the case:
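To make the arithmetic concrete, here is a minimal sketch; the orders table is the one used in the runs below, and the 1 GB input size is purely an illustrative assumption:

  -- Assume orders holds roughly 1 GB of input (illustrative figure, not measured).
  -- At 256 MB per reducer, Hive should estimate about ceil(1024 MB / 256 MB) = 4 reducers.
  set hive.exec.reducers.bytes.per.reducer=268435456;  -- 256 MB expressed in bytes
  select user_id, count(1) as cnt
  from orders
  group by user_id;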
Run the command set hive.exec.reducers.bytes.per.reducer=200000; to cap the number of bytes each reducer processes at 200000.
Then run this SQL:
select user_id,count(1) as cnt from orders group by user_id limit 20;

While the SQL above runs, the console prints:
Number of reduce tasks not specified. Estimated from input data size: 159
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0020, Tracking URL = :8088/proxy/application_1538917788450_0020/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 159

The first line of this output, Number of reduce tasks not specified. Estimated from input data size: 159, tells us that since no reducer count was specified, Hive estimated 159 reducer tasks from the input data size. The last line, number of mappers: 1; number of reducers: 159, confirms that the SQL ultimately ran with 159 reducers. So as long as we know the size of the data, we can control the number of reducers simply by setting the per-reducer data size with set hive.exec.reducers.bytes.per.reducer.
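As a sanity check, the estimate is consistent with the input size. A rough sketch (the warehouse path below is hypothetical; adjust it to wherever the orders table actually lives):

  -- Inspect the table's footprint on HDFS from inside the Hive CLI.
  -- The path is a hypothetical default warehouse location.
  dfs -du -s -h /user/hive/warehouse/orders;
  -- If the input is about 159 * 200000 B ≈ 31.8 MB, then
  -- ceil(31.8 MB / 200000 B) = 159 reducers, matching the log above.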
Next, look at:
set hive.exec.reducers.max=<number>

This is also a Hive command; it sets the maximum number of reducers Hive may use. If we set number to 50, the reducer count can be at most 50.
Let's verify whether this is correct:
Run the command set hive.exec.reducers.max=8; to cap the number of reducers at 8.
Then run the same SQL:
select user_id,count(1) as cnt from orders group by user_id limit 20;

The console prints the following:
Number of reduce tasks not specified. Estimated from input data size: 8
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0020, Tracking URL = :8088/proxy/application_1538917788450_0020/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 8

The first line of the output, Number of reduce tasks not specified. Estimated from input data size: 8, shows the reducer count is now 8, which confirms the claim: set hive.exec.reducers.max=8; sets an upper bound on the number of reducers.
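Putting the two settings together: the size-based estimate is clamped by the cap, which is why the same query that previously estimated 159 reducers now reports 8. A minimal sketch reproducing the runs above:

  -- The size-based estimate (159 for this data) is clamped by the cap.
  set hive.exec.reducers.bytes.per.reducer=200000;  -- alone, this estimates 159 reducers
  set hive.exec.reducers.max=8;                     -- clamps the estimate to 8
  select user_id, count(1) as cnt from orders group by user_id limit 20;
  -- Expected log line: "Estimated from input data size: 8"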
Finally, look at the command set mapreduce.job.reduces=<number>. This Hive command fixes the number of reducers: exactly that many reducers are launched to process the data when the SQL executes. We verify it the same way as before.
Run the command set mapreduce.job.reduces=5; to set the number of reducers to 5.
Then run the SQL:
select user_id,count(1) as cnt from orders group by user_id limit 20;

The console prints the following:
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0026, Tracking URL = :8088/proxy/application_1538917788450_0026/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0026
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5

From Number of reduce tasks not specified. Defaulting to jobconf value of: 5 and number of mappers: 1; number of reducers: 5, we can tell that exactly 5 reducers were generated.
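One practical follow-up: a constant reducer count sticks for the rest of the session and overrides the size-based estimate for every subsequent query, so it is worth switching back to automatic estimation when done. A short sketch, assuming -1 is the "let Hive estimate" default (which matches Hive's shipped configuration, to the best of my knowledge):

  -- Restore automatic estimation after using a fixed count.
  set mapreduce.job.reduces=-1;  -- assumed default: estimate from input size again
  -- Running set with no value prints a setting's current value:
  set mapreduce.job.reduces;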