HASH JOIN与NETSTED LOOP、SORT MERGE JOIN连接方式对比:
介绍Hash Join算法步骤:
1、Hash Join连接对象依然是两个数据表,首先选择出其中一个“小表”。这里的小表,就是参与连接操作的数据集合数据量小。对连接列字段的所有数据值,进行Hash函数操作。Hash函数是计算机科学中经常使用到的一种处理函数,利用Hash值的快速搜索算法已经被认为是成熟的检索手段。Hash函数处理过的数据特征是“相同数据值的Hash函数值一定相同,不同数据值的Hash函数值可能相同”;
2、经过Hash处理过的小表连接列,连同数据一起存放到Oracle PGA空间中。PGA中存在一块空间为hash_area,专门存放此类数据。并且,依据不同的Hash函数值,进行划分Bucket操作。每个Bucket中包括所有相同hash函数值的小表数据。同时建立Hash键值对应位图。
3、之后对进行Hash连接大表数据连接列依次读取,并且将每个Hash值进行Bucket匹配,定位到适当的Bucket上(应用Hash检索算法);
4、在定位到的Bucket中,进行小规模的精确匹配。因为此时的范围已经缩小,进行匹配的成功率精确度高。同时,匹配操作是在内存中进行,速度较Merge Sort Join时要快很多;
案例:
12:16:16 SYS@ prod>create table sgtb as select * from dba_segments where owner='SYS';
Table created.
Elapsed: 00:00:00.73
12:17:05 SYS@ prod>create table obtb as select * from dba_objects where owner='SYS';
Table created.
Elapsed: 00:00:01.02
12:17:30 SYS@ prod>SELECT count(*) from sgtb;
COUNT(*)
----------
2312
Elapsed: 00:00:00.02
12:17:41 SYS@ prod>SELECT count(*) from obtb;
COUNT(*)
----------
30928
Elapsed: 00:00:00.04
12:17:51 SYS@ prod>
12:17:51 SYS@ prod>create index seg_name_ind on sgtb (segment_name);
Index created.
Elapsed: 00:00:00.27
12:19:00 SYS@ prod>create index ob_name_ind on obtb(object_name);
Index created.
Elapsed: 00:00:00.32
12:19:29 SYS@ prod>
12:19:29 SYS@ prod>exec dbms_stats.gather_table_stats(user,'SGTB',CASCADE=>TRUE);
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.46
12:20:49 SYS@ prod>exec dbms_stats.gather_table_stats(user,'OBTB',CASCADE=>TRUE);
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.33
HASH JOIN:
12:21:03 SYS@ prod>SET autotrace trace
12:21:32 SYS@ prod>select * from sgtb a,obtb b where a.segment_name=b.object_name;
2528 rows selected.
Elapsed: 00:00:00.14
Execution Plan
----------------------------------------------------------
Plan hash value: 1028776806
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2401 | 515K| 134 (1)| 00:00:02 |
|* 1 | HASH JOIN | | 2401 | 515K| 134 (1)| 00:00:02 |
| 2 | TABLE ACCESS FULL| SGTB | 2312 | 279K| 13 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL| OBTB | 30928 | 2899K| 121 (1)| 00:00:02 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("A"."SEGMENT_NAME"="B"."OBJECT_NAME")
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
650 consistent gets
0 physical reads
0 redo size
223156 bytes sent via SQL*Net to client
2371 bytes received via SQL*Net from client
170 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2528 rows processed
NETSTED LOOP:
12:22:41 SYS@ prod>select /*+ use_nl(a b) */ * from sgtb a,obtb b where a.segment_name=b.object_name;
2528 rows selected.
Elapsed: 00:00:00.09
Execution Plan
----------------------------------------------------------
Plan hash value: 2080873268
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2401 | 515K| 4638 (1)| 00:00:56 |
| 1 | NESTED LOOPS | | | | | |
| 2 | NESTED LOOPS | | 2401 | 515K| 4638 (1)| 00:00:56 |
| 3 | TABLE ACCESS FULL | SGTB | 2312 | 279K| 13 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN | OB_NAME_IND | 1 | | 1 (0)| 00:00:01 |
| 5 | TABLE ACCESS BY INDEX ROWID| OBTB | 1 | 96 | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("A"."SEGMENT_NAME"="B"."OBJECT_NAME")
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3065 consistent gets
0 physical reads
0 redo size
213135 bytes sent via SQL*Net to client
2371 bytes received via SQL*Net from client
170 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2528 rows processed
SORT MERGE JOIN:
12:24:30 SYS@ prod>select /*+ use_merge(a b) */ * from sgtb a,obtb b where a.segment_name=b.object_name;
2528 rows selected.
Elapsed: 00:00:00.16
Execution Plan
----------------------------------------------------------
Plan hash value: 2191280214
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2401 | 515K| | 883 (1)| 00:00:11 |
| 1 | MERGE JOIN | | 2401 | 515K| | 883 (1)| 00:00:11 |
| 2 | SORT JOIN | | 2312 | 279K| 840K| 80 (2)| 00:00:01 |
| 3 | TABLE ACCESS FULL| SGTB | 2312 | 279K| | 13 (0)| 00:00:01 |
|* 4 | SORT JOIN | | 30928 | 2899K| 8136K| 803 (1)| 00:00:10 |
| 5 | TABLE ACCESS FULL| OBTB | 30928 | 2899K| | 121 (1)| 00:00:02 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("A"."SEGMENT_NAME"="B"."OBJECT_NAME")
filter("A"."SEGMENT_NAME"="B"."OBJECT_NAME")
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
485 consistent gets
0 physical reads
0 redo size
235884 bytes sent via SQL*Net to client
2371 bytes received via SQL*Net from client
170 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
2528 rows processed
成本对比:
连接方式 读取块数 CPU 排序 时间netsted loop 3065 4638 0 0.56
hash join 650 134 0 0.02
sort merge join 485 883 2 0.11
最后,Hash Join使用的场景是有限制的。其中最大的一个就是连接操作仅能使用“=”连接。因为Hash匹配的过程只能支持相等操作。还有就是连接列的数据分布要尽量做到数据分布均匀,这样产生的Bucket也会尽可能均匀。这样限制匹配的速度才有保证。如果数据列分布偏移严重,Hash Join算法效率会有退化趋势。