Sqoop1.4.4原生增量导入特性探秘

要想实现增量导入,完全可以不使用Sqoop的原生增量特性,仅使用shell脚本生成一个以当前时间为基准的固定时间范围,然后拼接Sqoop命令语句即可。

原生增量导入特性简介

Sqoop提供了原生增量导入的特性,包含以下三个关键参数:

ArgumentDescription
--check-column (col)   指定一个“标志列”用于判断增量导入的数据范围,该列不能是字符型,最好是数字或者日期型(这个很好理解吧)。  
--incremental (mode)   指定增量模式,包含“追加模式” append 和“最后修改模式” lastmodified (该模式更满足常见需求)
 
--last-value (value)   指定“标志列”上次导入的上界。如果“标志列”是最后修改时间,则--last-value为上次执行导入脚本的时间。  

 

结合Saved Jobs机制,可以实现重复调度增量更新Job时 --last-value 字段的自动更新赋值,再结合cron或者oozie的定时调度,可实现真正意义的增量更新。

实验:增量job的创建和执行   

创建增量更新job:

fulong@FBI006:~/Sqoop/sqoop-1.4.4/bin$ sqoop job --create incretest -- import --connect jdbc:Oracle:thin:@192.168.0.138:1521:orcl  --username HIVE --password hivefbi --table FBI_SQOOPTEST --hive-import --hive-table INCRETEST --incremental lastmodified --check-column LASTMODIFIED --last-value '2014/8/27 13:00:00'

14/08/27 17:29:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

14/08/27 17:29:37 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override

14/08/27 17:29:37 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.

14/08/27 17:29:37 WARN tool.BaseSqoopTool: It seems that you've specified at least one of following:

14/08/27 17:29:37 WARN tool.BaseSqoopTool:      --hive-home

14/08/27 17:29:37 WARN tool.BaseSqoopTool:      --hive-overwrite

14/08/27 17:29:37 WARN tool.BaseSqoopTool:      --create-hive-table

14/08/27 17:29:37 WARN tool.BaseSqoopTool:      --hive-table

14/08/27 17:29:37 WARN tool.BaseSqoopTool:      --hive-partition-key

14/08/27 17:29:37 WARN tool.BaseSqoopTool:      --hive-partition-value

14/08/27 17:29:37 WARN tool.BaseSqoopTool:      --map-column-hive

14/08/27 17:29:37 WARN tool.BaseSqoopTool: Without specifying parameter --hive-import. Please note that

14/08/27 17:29:37 WARN tool.BaseSqoopTool: those arguments will not be used in this session. Either

14/08/27 17:29:37 WARN tool.BaseSqoopTool: specify --hive-import to apply them correctly or remove them

14/08/27 17:29:37 WARN tool.BaseSqoopTool: from command line to remove this warning.

14/08/27 17:29:37 INFO tool.BaseSqoopTool: Please note that --hive-home, --hive-partition-key,

14/08/27 17:29:37 INFO tool.BaseSqoopTool:       hive-partition-value and --map-column-hive options are

14/08/27 17:29:37 INFO tool.BaseSqoopTool:       are also valid for HCatalog imports and exports

 

执行Job:

fulong@FBI006:~/Sqoop/sqoop-1.4.4/bin$ ./sqoop job --exec incretest

 

注意日志中显示的SQL语句:

14/08/27 17:36:23 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM FBI_SQOOPTEST WHERE ( LASTMODIFIED >= TO_DATE('2014/8/27 13:00:00', 'YYYY-MM-DD HH24:MI:SS') AND LASTMODIFIED < TO_DATE('2014-08-27 17:36:23', 'YYYY-MM-DD HH24:MI:SS') )

其中,LASTMODIFIED的下界是创建job的语句中指定的,上界是当前时间2014-08-27 17:36:23

 

验证:

hive> select * from incretest;

OK

2       lion    2014-08-27

Time taken: 0.085 seconds, Fetched: 1 row(s)

 

然后我向Oracle中插入一条数据:

 

Sqoop1.4.4原生增量导入特性探秘

 

再执行一次:

fulong@FBI006:~/Sqoop/sqoop-1.4.4/bin$ ./sqoop job --exec incretest

 

日志中显示的SQL语句: 

内容版权声明:除非注明,否则皆为本站原创文章。

转载注明出处:https://www.heiqu.com/1badf4744c9d2b1a941f86767bca6b8c.html