Sqoop1.4.4 实现将 Oracle10g 中的增量数据导入 Hive0

增量表:增量临时表

更新后的全量表:更新后的全量数据表

 

步骤:

通过Sqoop将Oracle中的表导入Hive,模拟全量表和增量表

通过Hive将“全量表+增量表”合并为“更新后的全量表”,覆盖当前的全量表

步骤1:通过Sqoop将Oracle中表的导入Hive,模拟全量表和增量表

为了模拟场景,需要一张全量表,和一张增量表,由于数据源有限,所以两个表都来自Oracle中的OMP_SERVICE,全量表包含所有数据,在Hive中名称叫service_all,增量表包含部分时间段数据,在Hive中名称叫service_tmp。

(1)全量表导入:导出所有数据,只要部分字段,导入到Hive指定表里

为实现导入Hive功能,需要先配置HCatalog(HCatalog是Hive子模块)的环境变量,/etc/profile中新增:

export HCAT_HOME=/home/fulong/Hive/apache-hive-0.13.1-bin/hcatalog

 

执行以下命令导入数据:

fulong@FBI006:~/Sqoop/sqoop-1.4.4/bin$ ./sqoop import \

> --connect jdbc:oracle:thin:@192.168.0.147:1521:ORCLGBK  --username SP --password fulong \

> --table OMP_SERVICE \

> --columns "SERVICE_CODE,SERVICE_NAME,SERVICE_PROCESS,CREATE_TIME,ENABLE_ORG,ENABLE_PLATFORM,IF_DEL" \

> --hive-import --hive-table SERVICE_ALL

 

注意:用户名必须大写

 

(2)增量表导入:只导出所需时间范围内的数据,只要部分字段,导入到Hive指定表里

使用以下命令导入数据:

fulong@FBI006:~/Sqoop/sqoop-1.4.4/bin$ ./sqoop import \

> --connect jdbc:oracle:thin:@192.168.0.147:1521:ORCLGBK  --username SP --password fulong \

> --table OMP_SERVICE \

> --columns "SERVICE_CODE,SERVICE_NAME,SERVICE_PROCESS,CREATE_TIME,ENABLE_ORG,ENABLE_PLATFORM,IF_DEL" \

> --where "CREATE_TIME > to_date('2012/12/4 17:00:00','yyyy-mm-dd hh24:mi:ss') and CREATE_TIME < to_date('2012/12/4 18:00:00','yyyy-mm-dd hh24:mi:ss')" \

> --hive-import --hive-overwrite --hive-table SERVICE_TMP

 

注意:

由于使用了--hive-overwrite参数,所以该语句可反复执行,往service_tmp表中覆盖插入最新的增量数据;

Sqoop还支持使用复杂Sql语句查询数据导入,相亲参见的“7.2.3.Free-form Query Imports”章节

(3)验证导入结果:列出所有表,统计行数,查看表结构

hive> show tables;

OK

searchlog

searchlog_tmp

service_all

service_tmp

Time taken: 0.04 seconds, Fetched: 4 row(s)

hive> select count(*) from service_all;

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1407233914535_0013, Tracking URL = :8088/proxy/application_1407233914535_0013/

Kill Command = /home/fulong/Hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1407233914535_0013

Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1

2014-08-21 16:51:47,389 Stage-1 map = 0%,  reduce = 0%

2014-08-21 16:51:59,816 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 1.36 sec

2014-08-21 16:52:01,996 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 2.45 sec

2014-08-21 16:52:07,877 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.96 sec

2014-08-21 16:52:17,639 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.29 sec

MapReduce Total cumulative CPU time: 5 seconds 290 msec

Ended Job = job_1407233914535_0013

MapReduce Jobs Launched:

Job 0: Map: 3  Reduce: 1   Cumulative CPU: 5.46 sec   HDFS Read: 687141 HDFS Write: 5 SUCCESS

Total MapReduce CPU Time Spent: 5 seconds 460 msec

OK

6803

Time taken: 59.386 seconds, Fetched: 1 row(s)

hive> select count(*) from service_tmp;

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1407233914535_0014, Tracking URL = :8088/proxy/application_1407233914535_0014/

Kill Command = /home/fulong/Hadoop/hadoop-2.2.0/bin/hadoop job  -kill job_1407233914535_0014

Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1

2014-08-21 16:53:03,951 Stage-1 map = 0%,  reduce = 0%

2014-08-21 16:53:15,189 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 2.17 sec

2014-08-21 16:53:16,236 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.38 sec

2014-08-21 16:53:57,935 Stage-1 map = 100%,  reduce = 22%, Cumulative CPU 3.78 sec

2014-08-21 16:54:01,811 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.34 sec

MapReduce Total cumulative CPU time: 5 seconds 340 msec

Ended Job = job_1407233914535_0014

MapReduce Jobs Launched:

Job 0: Map: 3  Reduce: 1   Cumulative CPU: 5.66 sec   HDFS Read: 4720 HDFS Write: 3 SUCCESS

Total MapReduce CPU Time Spent: 5 seconds 660 msec

OK

13

Time taken: 75.856 seconds, Fetched: 1 row(s)

hive> describe service_all;

OK

service_code            string

service_name            string

service_process         string

create_time             string

enable_org              string

enable_platform         string

if_del                  string

Time taken: 0.169 seconds, Fetched: 7 row(s)

hive> describe service_tmp;

OK

service_code            string

service_name            string

service_process         string

create_time             string

enable_org              string

enable_platform         string

if_del                  string

Time taken: 0.117 seconds, Fetched: 7 row(s)

内容版权声明:除非注明,否则皆为本站原创文章。

转载注明出处:https://www.heiqu.com/69bf62d7b174540dd24684f896eef720.html