Hive分析函数和窗口函数

日期：2020-06-12 栏目：程序人生浏览：次

Hive分析函数和窗口函数

在Hive 0.11之后支持的，扫描多个输入的行计算每行的结果。通常和OVER,PARTITION BY, ORDER BY, WINDOWING配合使用。和传统的分组结果不一样，传统的结果每组中只有一个结果。分析函数的结果会出现多次，和每条记录都连接输出。

语法形式如下：

Function(arg1,....argn) OVER([PARTITION BY<...>] [ORDER BY<...>] [window_clause]) 窗口函数函数名说明
FIRST_VALUE 取出分组内排序后，截止到当前行，第一个值
LAST_VALUE 取出分组内排序后，截止到当前行，最后一个值
LEAD(col, n, DEFAULT) 用于统计窗口内往下第n行的值。第一个参数为列名，第二个参数为往下第n行(可选，默认为1)，第三个参数为默认值(当往下第n行为NULL时，取默认值)
LAG(col,n,DEFAULT) 与lead相反，用于统计窗口内往下第n个值。第一个参数为列名，第二个参数为往上第n行(可选，默认为1)
OVER从句

使用标准的聚合函数COUNT,SUM,MIN,MAX,AVG

使用PARTITION BY语句，使用一个或者多个原始数据类型的列

使用PARTITION BY与ORDER BY语句，使用一个或者多个数据类型的分区或者拍序列

使用窗口规范，窗口规范支持一下格式：

当ORDER BY后面缺少窗口从句条件，窗口规范默认是

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

当ORDER BY和窗口从句都缺失，窗口规范默认是：

ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 分析函数函数说明
ROW_NUMBER() 从1开始，按照顺序，生成分组内记录的序列，比如，按照pv降序排列，生成分组内每天的pv名次，ROW_NUMBER()的应用场景非常多，再比如，获取分组内排序第一的记录，获取一个session中的第一条refer等
RANK() 生成数据项在分组中的排名，排名相等会在名次中留下空位
DENSE_RANK() 生成数据项在分组中的排名，排名相等会在名次中不会留下空位
CUME_DIST() 小于等于当前值的行数除以分组内总行数。比如，统计小于等于当前薪水的人数所占总人数的比例
PERCENT_RANK() 分组内当前行的RANK值-1/分组内总行数-1
NTILE(n) 用于将分组数据按照顺序切分成n片，返回当前切片值，如果切片不均匀，默认增加第一个切片的分布。NTILE不支持ROWS BETWEEN,比如NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PERCEDING AND CURRENT ROW)
Hive2.1.0及以后版本支持distinct

在聚合函数(sum, count, avg)中支持distinct，但是在order by或者窗口限制中不支持。
conut(distinct a) over(partition by c)

Hive2.1.0以后支持在OVER从句中支持聚合函数 select rank() over(order by sum(b)) Hive2.2.0中在使用ORDER BY和窗口限制时支持distinct

count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)

通过实例深入理解窗口函数和分析函数 COUNT、SUM、MIN、MAX、AVG案例分析 ## 创建数据表 create table orders( user_id string, device_id string, user_type string, price float, sales int); ## 添加数据orders.txt zhangsa test1 new 67.1 2 lisi test2 old 43.32 1 wanger test3 new 88.88 3 liliu test4 new 66.0 1 tom test5 new 54.32 1 tomas test6 old 77.77 2 tomson test7 old 88.44 3 tom1 test8 new 56.55 6 tom2 test9 new 88.88 5 tom3 test10 new 66.66 5 ## 开窗函数案例 select user_id, user_type, sales, -- 默认从起点到当前所有重复行 sum(sales) over(partition by user_type order by sales asc) as sales_1, -- 从起点到当前所有重复行与sales_1结果相同 sum(sales) over(partition by user_type order by sales asc range between unbounded preceding and current row) as sales_2, -- 从起点到当前行，结果与sale_1结果不同 sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) as sales_3, -- 当前行加上往前3行 sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) as sales_4, -- 当前范围往上加3行 sum(sales) over(partition by user_type order by sales asc range between 3 preceding and current row) as sales_5, -- 当前行+往前3行+往后1行 sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) as sales_6, -- sum(sales) over(partition by user_type order by sales asc range between 3 preceding and 1 following) as sales_7, -- 当前行+之后所有行 sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) as sales_8, -- sum(sales) over(partition by user_type order by sales asc range between current row and unbounded following) as sales_9, -- 分组内所有行 sum(sales) over(partition by user_type) as sales_10 from orders order by user_type, sales, user_id; ##上述查询结果如下： | user_id | user_type | sales | sales_1 | sales_2 | sales_3 | sales_4 | sales_5 | sales_6 | sales_7 | sales_8 | sales_9 | sales_10 | |----------|------------|--------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----------| | liliu | new | 1 | 2 | 2 | 2 | 2 | 2 | 4 | 4 | 22 | 23 | 23 | | tom | new | 1 | 2 | 2 | 1 | 1 | 2 | 2 | 4 | 23 | 23 | 23 | | zhangsa | new | 2 | 4 | 4 | 4 | 4 | 4 | 7 | 7 | 21 | 21 | 23 | | wanger | new | 3 | 7 | 7 | 7 | 7 | 7 | 12 | 7 | 19 | 19 | 23 | | tom2 | new | 5 | 17 | 17 | 17 | 15 | 15 | 21 | 21 | 11 | 16 | 23 | | tom3 | new | 5 | 17 | 17 | 12 | 11 | 15 | 16 | 21 | 16 | 16 | 23 | | tom1 | new | 6 | 23 | 23 | 23 | 19 | 19 | 19 | 19 | 6 | 6 | 23 | | lisi | old | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 6 | 6 | 6 | | tomas | old | 2 | 3 | 3 | 3 | 3 | 3 | 6 | 6 | 5 | 5 | 6 | | tomson | old | 3 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 3 | 3 | 6 |

注意

结果和ORDER BY相关,默认为升序

如果不指定ROWS BETWEEN,默认为从起点到当前行;

如果不指定ORDER BY，则将分组内所有值累加;

PRECEDING：往前

FOLLOWING：往后

CURRENT ROW：当前行

UNBOUNDED：无界限（起点或终点）

UNBOUNDED PRECEDING：表示从前面的起点

UNBOUNDED FOLLOWING：表示到后面的终点

转载注明出处：https://www.heiqu.com/84efcc956ea3ce7c9409b64b993e284a.html

Hive分析函数和窗口函数

相关推荐