Impala Usage Guide (2)


8.2 Compression formats


9 Impala partitions

9.1 Creating partitions

Use partitioned by when creating a table to declare the partition column(s):

create table t_person(id int, name string, age int) partitioned by (type string);

Use alter table to add and drop partitions:

alter table t_person add partition (type='boss');
alter table t_person drop partition (type='boss');
-- for a table partitioned by more than one column, list every partition key
alter table t_person drop partition (sex='man', type='boss');

9.2 Inserting data into a partition

insert into t_person partition (type='boss') values (1,'zhangsan',18),(2,'lisi',23);
insert into t_person partition (type='coder') values (3,'wangwu',22),(4,'zhaoliu',28),(5,'tianqi',24);

9.3 Querying a specific partition

select id, name from t_person where type='coder';
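To verify which partitions exist on the table, Impala's show partitions statement can be used:

show partitions t_person;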

10 Differences between Hive and Impala

10.1 Data types

Impala supports fewer data types than Hive.
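As a rough illustration (the table and column names below are made up for this sketch, and the exact behaviour depends on the Impala release), a DDL that Hive accepts may be rejected or only partially usable in Impala:

-- All of these column types are valid in Hive; Impala support varies by version:
-- ARRAY/MAP/STRUCT became queryable only in Impala 2.3+ (Parquet, read-only),
-- DATE was not added until Impala 3.3, and Hive's UNIONTYPE is not supported at all.
CREATE TABLE type_demo (
  id       INT,
  tags     ARRAY<STRING>,
  attrs    MAP<STRING,STRING>,
  addr     STRUCT<city:STRING, zip:STRING>,
  birthday DATE
)
STORED AS PARQUET;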

10.2 HiveQL features Impala does not support

Extensibility mechanisms (such as TRANSFORM, custom SerDes, and custom file formats)

XML and JSON functions

Some Hive aggregate functions (such as collect_set and percentile_approx)

Multiple DISTINCT aggregates in a single query (see the sketch below)
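For example, the following query runs as-is in Hive, but classic Impala rejects more than one COUNT(DISTINCT ...) on different columns in the same query. The workaround shown is a sketch that relies on the APPX_COUNT_DISTINCT query option, which trades exact counts for an NDV()-based approximation; the table is the access_log_text example from section 11:

-- Hive executes this directly; Impala reports an error for two DISTINCT
-- aggregates on different columns:
select count(distinct src_ip_long), count(distinct dest_ip_long)
from access_log_text;

-- Impala workaround: allow COUNT(DISTINCT) to be rewritten to the approximate NDV():
set APPX_COUNT_DISTINCT=true;
select count(distinct src_ip_long), count(distinct dest_ip_long)
from access_log_text;

If exact counts are required, each DISTINCT aggregate can instead be computed in its own subquery and the results joined.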

11 Common SQL

-- create a database
create database db1;
-- switch to the database
use db1;
-- drop the database
drop database db1;

-- create an external table
CREATE EXTERNAL TABLE table01 (
  id INT, name STRING, confirm_flag INT, create_date TIMESTAMP, create_user_id STRING,
  license_num STRING, postcode STRING, update_date TIMESTAMP, update_user_id STRING, valid_flag INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
WITH SERDEPROPERTIES ('serialization.format' = '|', 'field.delim' = '|')
STORED AS PARQUET
LOCATION 'hdfs://node03:8020/user/hive/warehouse/table01';

-- create a table from an existing table's definition
create table tab_3 like tab_1;

-- create an internal (managed) table
CREATE TABLE table01 (
  id INT, name STRING, confirm_flag INT, create_date TIMESTAMP, create_user_id STRING,
  license_num STRING, postcode STRING, update_date TIMESTAMP, update_user_id STRING, valid_flag INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
WITH SERDEPROPERTIES ('serialization.format' = '|', 'field.delim' = '|')
STORED AS PARQUET
LOCATION 'hdfs://node03:8020/test/table01';

-- create a view
create view v1 as select count(id) as total from tab_3;
-- query the view
select * from v1;
-- show the view definition
describe formatted v1;

Data loading:
* insert statements: every inserted row produces its own data file, so this approach is not recommended for loading data.
* Bulk loading:
  - load data: the more suitable choice for batch inserts.
  - From an intermediate table: suited to reading a large table made up of many small files and writing it into a new table that ends up with only a few data files; it can also be used to convert between file formats.
* NULL handling: in text data Impala represents NULL as "\N"; when used together with Sqoop, filter empty fields accordingly, or handle it with:
  alter table name set tblproperties ('serialization.null.format'='null');

-- example: text-format table
CREATE TABLE access_log_text (
  log_id BIGINT, idc_id STRING, house_id BIGINT, src_port BIGINT, dest_port BIGINT,
  protocol_type INT, access_time STRING, src_ip_long BIGINT, dest_ip_long BIGINT, url64 STRING
)
PARTITIONED BY (hn BIGINT, ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

alter table access_log_text add partition(hn=91014, ds='2017-05-18-11');

insert into access_log_text partition(hn=91014, ds='2017-05-18-11')
values(12345, "A2.B1.B2-20090003", 91014, 53487, 80, 1, "2017-05-18 11:00:19", 2076024904, 2086642602, "aHR0cDovL3d4MS5zaW5haW1nLmNuL3RodW1iMTUwLzY5MjdlN2E1bHkxZmVkdnU4ejkxNmoyMGM4MGVzM3prLmpwZw==");

-- internal table stored as Parquet
CREATE TABLE access_log_parquet (
  log_id BIGINT, idc_id STRING, house_id BIGINT, src_port BIGINT, dest_port BIGINT,
  protocol_type INT, access_time STRING, src_ip_long BIGINT, dest_ip_long BIGINT, url64 STRING
)
PARTITIONED BY (hn BIGINT, ds STRING)
STORED AS PARQUET;

alter table access_log_parquet add partition(hn=91014, ds='2017-05-18-11');
alter table access_log_parquet add partition(hn=91014, ds='2017-05-18-12');

insert into access_log_parquet partition(hn=91014, ds='2017-05-18-11')
values(12345, "A2.B1.B2-20090003", 91014, 53487, 80, 1, "2017-05-18 11:00:19", 2076024904, 2086642602, "aHR0cDovL3d4MS5zaW5haW1nLmNuL3RodW1iMTUwLzY5MjdlN2E1bHkxZmVkdnU4ejkxNmoyMGM4MGVzM3prLmpwZw==");

insert into access_log_parquet partition(hn=91014, ds='2017-05-18-12')
select log_id, idc_id, house_id, src_port, dest_port, protocol_type, access_time, src_ip_long, dest_ip_long, url64
from access_log_text where hn=91014 and ds='2017-05-18-11';

-- overwrite an existing partition
insert overwrite table access_log_parquet (log_id, idc_id, house_id, src_port, dest_port, protocol_type, access_time, src_ip_long, dest_ip_long, url64)
partition (hn=91014, ds='2017-05-18-13')
select log_id, idc_id, house_id, src_port, dest_port, protocol_type, access_time, src_ip_long, dest_ip_long, url64
from access_log_parquet where hn=91014 and ds='2017-05-18-11';

-- load data inpath
load data inpath '/input/access/12345.txt' into table access_log_text partition (hn=91014, ds='2017-05-18-11');

-- copying data files in directly requires a metadata refresh
-- external table
CREATE EXTERNAL TABLE access_log_parquet_external (
  log_id BIGINT, idc_id STRING, house_id BIGINT, src_port BIGINT, dest_port BIGINT,
  protocol_type INT, access_time STRING, src_ip_long BIGINT, dest_ip_long BIGINT, url64 STRING
)
PARTITIONED BY (hn BIGINT, ds STRING)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/access_log_parquet_external';

alter table access_log_parquet_external add partition(hn=91014, ds='2017-05-17-11');
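As the comment above notes, when data files are copied straight into a table or partition directory (for example with hdfs dfs -put), Impala will not see them until its catalog metadata is reloaded. A minimal sketch, using the external table above (the copied file itself is only illustrative):

-- after copying Parquet files into
-- /user/hive/warehouse/access_log_parquet_external/hn=91014/ds=2017-05-17-11/
refresh access_log_parquet_external;
-- if the table itself was created or altered outside Impala (e.g. through Hive),
-- use the heavier option instead:
invalidate metadata access_log_parquet_external;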
