ORA-30036 Troubleshooting Case Study

Date: 2020-06-02  Category: Programmer Life (程序人生)

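Symptom: the application team for a provincial settlement database got the following error while running a stored procedure: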

ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDOTBS1'

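The following checks and analysis were then carried out.
The undo tablespace was 100% used, and the alert log contained a large number of transaction entries together with ORA-01555 errors: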
Wed Jan 30 04:32:01 GMT+08:00 2013
ORA-01555 caused by SQL statement below (SQL ID: 4ds6qq0mfac2t, Query Duration=5 sec, SCN: 0x0c00.d915c977):
Wed Jan 30 04:32:01 GMT+08:00 2013
SELECT distinct FILE_NAME,to_char(file_name_time,:"SYS_B_00"),A.operation_type_grade FROM T_FILE_INFO_81 A,T_FILE_CLASS B ,T_FILE_STATE C WHERE ((A.RAT_FIRST_STARTTIME >=:"SYS_B_01" and A.RAT_FIRST_STARTTIME <=:"SYS_B_02" )
or (A.RAT_LAST_ENDTIME >=:"SYS_B_03" and A.RAT_LAST_ENDTIME <=:"SYS_B_04" )
or (A.RAT_FIRST_STARTTIME <=:"SYS_B_05" and A.RAT_LAST_ENDTIME >= :"SYS_B_06" ) )
and ((A.RAT_FIRST_CALLING_NBR >=:"SYS_B_07" and A.RAT_FIRST_CALLING_NBR <=:"SYS_B_08")
or (A.RAT_LAST_CALLING_NBR >= :"SYS_B_09" and A.RAT_LAST_CALLING_NBR <= :"SYS_B_10" )
or (A.RAT_FIRST_CALLING_NBR <=:"SYS_B_11" and A.RAT_LAST_CALLING_NBR >=:"SYS_B_12" ))
and A.FILE_NAME <>:"SYS_B_13" and A.File_Class_Id = B.File_Class_Id and B.operation_type_id=:"SYS_B_14" and (A.FILE_NAME_TIME + interval :"SYS_B_15" day ) > TO_DATE(:"SYS_B_16",:"SYS_B_17")
and A.city_id =:"SYS_B_18" and a.state in (:"SYS
Wed Jan 30 08:49:12 GMT+08:00 2013
insert into STL_GX.T_all_file_TOTAL
select aa.province_id,
dd.name,
aa.BILL_DATE,
aa.OPERATION_TYPE_GRADE,
:"SYS_B_00",
aa.cj_files,
bb.pj_files,
bb.org_counts,
bb.rate_counts,
bb.inp_counts,
bb.inpc_counts,
cc.jf_files,
cc.jf_counts,
:"SYS_B_01",
:"SYS_B_02",
aa.is_rate,
aa.is_billtag,
aa.is_insert,
bb.ERR_COUNTS
from (SELECT a.province_code as province_id,
substr(A.ORG_FILENAME, :"SYS_B_03", :"SYS_B_04") as BILL_DATE,
b.operation_type_grade,
COUNT(*) as cj_files,
b.is_rate as is_rate,
b.is_billtag as is_billtag,
b.is_insert as is_insert
FROM STL_PARA.T_LOG_COLLECT_76@pub_PARA A,
stl_gx.tmp_cj_info b
WHERE substr(A.ORG_FILENAME, :"SYS_B_05", :"SYS_B_06") = :"SYS_B_07"
and b.province_id = :"S
Wed Jan 30 09:44:56 GMT+08:00 2013
Thread 1 advanced to log sequence 149384 (LGWR switch)
Current log# 6 seq# 149384 mem# 0: /dev/rjs_redolog06
Wed Jan 30 09:50:00 GMT+08:00 2013
ORA-01555 caused by SQL statement below (SQL ID: 7t0bjnwxt9ufv, Query Duration=180 sec, SCN: 0x0c00.e294a117):
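The cause looks clear: a large volume of INSERT activity was generating undo for transactions that had not committed. Because of this database's particular environment, undo_retention had been adjusted at some earlier point. The current setting is shown below.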
SQL> show parameter undo

NAME                                 TYPE                             VALUE
------------------------------------ -------------------------------- ------------------------------
undo_management                      string                           AUTO
undo_retention                       integer                          0
undo_tablespace                      string                           UNDOTBS1

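With undo_management set to AUTO and undo_retention set to 0, Oracle automatically tunes how long undo is retained after commit. In automatic undo management mode before Oracle 10g, the undo_retention parameter controlled how long undo information was kept after a transaction committed. That undo is used to construct consistent reads and supports a range of flashback operations, and having enough of it also reduces the occurrence of ORA-01555 errors. In Oracle 9i Release 1 the default was 900 seconds; Oracle 9i Release 2 raised it to 10800 seconds.

Even when undo_retention is set, by default it is only a NOGUARANTEE limit. Suppose undo_retention=10800: one might expect that after a transaction commits, its undo is kept for 10800 seconds before the DML of other transactions can overwrite it. In fact, if another transaction's DML needs undo space at a moment when there is not enough free undo space (that is, the undo tablespace is not allowed to autoextend), then because retention is NOGUARANTEE that DML ignores the retention limit and wraps around to overwrite the unexpired undo. In many cases this is not the desired outcome.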
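Whether the undo tablespace can autoextend is easy to check from the data dictionary. The query below is a minimal sketch that assumes the undo tablespace is named UNDOTBS1, as in this case; the column aliases are illustrative only.

```sql
-- Assumption: the undo tablespace is UNDOTBS1 (adjust as needed).
-- AUTOEXTENSIBLE = 'NO' means the file cannot grow, so under NOGUARANTEE
-- unexpired undo may be overwritten when space runs short.
SELECT file_name,
       ROUND(bytes / 1024 / 1024)    AS size_mb,
       autoextensible,
       ROUND(maxbytes / 1024 / 1024) AS max_mb
  FROM dba_data_files
 WHERE tablespace_name = 'UNDOTBS1';
```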
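Starting with Oracle 10g, Oracle introduced the undo retention guarantee, which forces Oracle to honor the retention period. If a session's DML needs undo space and there is not enough, the session does not overwrite undo protected by undo_retention; instead the DML fails with an error for lack of undo space and stops. The retention mode of each tablespace is checked below. Note that a comma is missing after extent_management in the query as run, so segment_space_management acts as a column alias and the SEGMENT_SP column in the output actually holds the EXTENT_MANAGEMENT values.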
SQL> select tablespace_name,block_size,extent_management
2 segment_space_management,contents,retention
3 from dba_tablespaces;

TABLESPACE_NAME                BLOCK_SIZE SEGMENT_SP CONTENTS  RETENTION
------------------------------ ---------- ---------- --------- -----------
SYSTEM                               8192 LOCAL      PERMANENT NOT APPLY
UNDOTBS1                             8192 LOCAL      UNDO      NOGUARANTEE
SYSAUX                               8192 LOCAL      PERMANENT NOT APPLY
TEMP                                 8192 LOCAL      TEMPORARY NOT APPLY
USERS                                8192 LOCAL      PERMANENT NOT APPLY
COMMDATA                             8192 LOCAL      PERMANENT NOT APPLY
SETTLEINDEX                          8192 LOCAL      PERMANENT NOT APPLY
SETTLEDATA                           8192 LOCAL      PERMANENT NOT APPLY
STATDATA                             8192 LOCAL      PERMANENT NOT APPLY
STATINDEX                            8192 LOCAL      PERMANENT NOT APPLY
COMMINDEX                            8192 LOCAL      PERMANENT NOT APPLY

TABLESPACE_NAME                BLOCK_SIZE SEGMENT_SP CONTENTS  RETENTION
------------------------------ ---------- ---------- --------- -----------
RMAN_TBS                             8192 LOCAL      PERMANENT NOT APPLY

12 rows selected.
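The output shows UNDOTBS1 running with RETENTION NOGUARANTEE. For reference, the statements below sketch how the retention mode of an undo tablespace can be inspected and switched. They were not part of the original troubleshooting session. Enabling GUARANTEE on a fixed-size undo tablespace makes it more likely that DML fails for lack of undo space, so treat this as an illustration rather than a recommendation.

```sql
-- Show the retention mode of undo tablespaces only.
SELECT tablespace_name, retention
  FROM dba_tablespaces
 WHERE contents = 'UNDO';

-- Switch the retention mode (10g and later).
ALTER TABLESPACE undotbs1 RETENTION GUARANTEE;
ALTER TABLESPACE undotbs1 RETENTION NOGUARANTEE;
```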
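The next thought was: since the resources could not be committed, could restarting the database release them? So the database restart began at 15:20. The plan was to find another data file, create a new undo tablespace, switch from UNDOTBS1 to UNDOTBS2, take UNDOTBS1 offline and drop it, and then switch back to UNDOTBS1 so that the space would be released.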

create undo tablespace UNDOTBS2 datafile '/dev/untb03.dbf' size 32700M;
alter system set undo_tablespace=UNDOTBS2 scope=both;
Take the original undo tablespace offline:
alter tablespace UNDOTBS1 offline;
Drop the original undo tablespace:
drop tablespace UNDOTBS1 including contents AND DATAFILES CASCADE CONSTRAINTS ;
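Before taking the old undo tablespace offline and dropping it, it helps to confirm that none of its undo segments are still in use, since the drop is expected to fail while the tablespace still holds undo for active or recovering transactions. A minimal sketch, assuming the old tablespace is UNDOTBS1:

```sql
-- Any row returned here means UNDOTBS1 still has segments that are
-- online or need recovery; wait (or investigate) before dropping it.
SELECT segment_name, status
  FROM dba_rollback_segs
 WHERE tablespace_name = 'UNDOTBS1'
   AND status <> 'OFFLINE';
```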
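Wed Jan 30 15:33:39 GMT+08:00 2013
ALTER DATABASE OPEN
After the restart, undo usage was still 100%, which means undo_retention=0 had not taken effect.
Wed Jan 30 17:27:06 GMT+08:00 2013
ALTER SYSTEM SET undo_retention=900 SCOPE=BOTH;
With retention set to 15 minutes, ACTIVE undo in the database was nevertheless as high as 62GB (the undo tablespace totals 64GB).
At that point there were two dead transactions in the database: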
SQL> select ADDR,KTUXEUSN,KTUXESLT,KTUXESQN,KTUXESIZ from x$ktuxe where KTUXECFL='DEAD';
ADDR KTUXEUSN KTUXESLT KTUXESQN KTUXESIZ
---------------- ---------- ---------- ---------- ----------
00000001108C63F0 75 27 2368514 795545
00000001108C5D10 644 7 57597 0
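X$KTUXE is an internal fixed table and is only visible when connected as SYS. KTUXESIZ is commonly read as the number of undo blocks the transaction still has to roll back. As a documented alternative, V$FAST_START_TRANSACTIONS reports the progress of transactions that Oracle is recovering. The query below is a sketch; depending on how the rollback is being performed, the view may return no rows.

```sql
-- Progress of transaction recovery: compare UNDOBLOCKSDONE with
-- UNDOBLOCKSTOTAL for each (USN, SLT, SEQ) to estimate remaining work.
SELECT usn, slt, seq, state,
       undoblocksdone, undoblockstotal
  FROM v$fast_start_transactions;
```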
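An Oracle ACS service request had already been raised. By the time the Oracle engineer arrived, the dead transaction 75-27 (undo segment 75, slot 27) no longer existed (the application team had stopped the related applications at around 18:35).
Checking again, UNEXPIRED undo stood at 63GB and ACTIVE undo at 1GB.
So the dead transaction should have been released. Checking again: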
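A breakdown of undo space into ACTIVE, UNEXPIRED and EXPIRED can be obtained from DBA_UNDO_EXTENTS. This is a sketch that assumes the tablespace name UNDOTBS1; substitute whichever undo tablespace is current.

```sql
-- Undo space by extent status for one undo tablespace.
SELECT status,
       COUNT(*)                        AS extents,
       ROUND(SUM(bytes) / 1024 / 1024) AS size_mb
  FROM dba_undo_extents
 WHERE tablespace_name = 'UNDOTBS1'
 GROUP BY status;
```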
SQL> select ADDR,KTUXEUSN,KTUXESLT,KTUXESQN,KTUXESIZ from x$ktuxe where KTUXECFL='DEAD';

ADDR KTUXEUSN KTUXESLT KTUXESQN KTUXESIZ
---------------- ---------- ---------- ---------- ----------
0000000110845CF0 644 7 57597 0
So transaction 75-27 had definitely been released. Next, the recent undo activity:
SQL> alter session set nls_date_format='mm/dd/yy hh24:mi:ss';

Session altered.
SQL> select begin_time,end_time,UNXPSTEALCNT from v$undostat;

BEGIN_TIME END_TIME UNXPSTEALCNT
----------------- ----------------- ------------
01/30/13 20:53:28 01/31/13 11:09:42 71794
01/30/13 20:43:28 01/30/13 20:53:28 110035
01/30/13 20:33:28 01/30/13 20:43:28 15240
01/30/13 20:23:28 01/30/13 20:33:28 25489
01/30/13 20:13:28 01/30/13 20:23:28 11936
01/30/13 20:03:28 01/30/13 20:13:28 2950
01/30/13 19:53:28 01/30/13 20:03:28 707
01/30/13 19:43:28 01/30/13 19:53:28 0
01/30/13 19:33:28 01/30/13 19:43:28 0
01/30/13 19:23:28 01/30/13 19:33:28 1271
01/30/13 19:13:28 01/30/13 19:23:28 29187

BEGIN_TIME END_TIME UNXPSTEALCNT
----------------- ----------------- ------------
01/30/13 19:03:28 01/30/13 19:13:28 19976
01/30/13 18:53:28 01/30/13 19:03:28 1365
01/30/13 18:43:28 01/30/13 18:53:28 6235
01/30/13 18:33:28 01/30/13 18:43:28 24651
01/30/13 18:23:28 01/30/13 18:33:28 38220
01/30/13 18:13:28 01/30/13 18:23:28 49888
01/30/13 18:03:28 01/30/13 18:13:28 29815
01/30/13 17:53:28 01/30/13 18:03:28 43678
01/30/13 17:43:28 01/30/13 17:53:28 104834
01/30/13 17:33:28 01/30/13 17:43:28 101518
01/30/13 17:23:28 01/30/13 17:33:28 45838

BEGIN_TIME END_TIME UNXPSTEALCNT
----------------- ----------------- ------------
01/30/13 17:13:28 01/30/13 17:23:28 30964
01/30/13 17:03:28 01/30/13 17:13:28 43876
01/30/13 16:53:28 01/30/13 17:03:28 15455
01/30/13 16:43:28 01/30/13 16:53:28 7839
01/30/13 16:33:28 01/30/13 16:43:28 24606
01/30/13 16:23:28 01/30/13 16:33:28 40497
01/30/13 16:13:28 01/30/13 16:23:28 34759
01/30/13 16:03:28 01/30/13 16:13:28 118142
01/30/13 15:53:28 01/30/13 16:03:28 107958
01/30/13 15:43:28 01/30/13 15:53:28 20249
01/30/13 15:33:28 01/30/13 15:43:28 0

33 rows selected.
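UNXPSTEALCNT counts attempts to obtain undo space by stealing unexpired extents from other transactions, so the consistently non-zero values above indicate sustained space pressure. A wider look at the same view, sketched below, puts those counts next to the workload and error counters for each 10-minute interval.

```sql
-- SSOLDERRCNT counts ORA-01555 errors and NOSPACEERRCNT counts
-- out-of-space errors raised in each interval.
SELECT begin_time, end_time,
       undoblks, txncount, maxquerylen,
       unxpstealcnt, unxpblkreucnt, expstealcnt,
       ssolderrcnt, nospaceerrcnt
  FROM v$undostat
 ORDER BY begin_time DESC;
```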
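The open question was handed to the Oracle engineer: why did setting undo_retention to 900 seconds not take effect immediately?
A search on MetaLink turned up a known bug: Bug 5387030 - Automatic tuning of undo_retention causes unusual extra space allocation [ID 5387030.8]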
Product (Component)
Oracle Server (Rdbms)

Range of versions believed to be affected
Versions >= 10.2.0.1 but BELOW 11.1

Versions confirmed as being affected
10.2.0.3
10.2.0.2
10.2.0.1

Description
When undo tablespace is using NON-AUTOEXTEND datafiles,
V$UNDOSTAT.TUNED_UNDORETENTION may be calculated too high preventing
undo block from being expired and reused. In extreme cases the undo
tablespace could be filled to capacity by these unexpired blocks.

An alert may be posted on DBA_ALERT_HISTORY that advises to increase
the space when it is not really necessary if this fix is applied.
If the user sets their own alert thresholds for undo tablespaces the
bug may prevent alerts from being produced.

Workaround
alter system set "_smu_debug_mode" = 33554432;
This causes the v$undostat.tuned_undoretention to be calculated as
the maximum of:
maxquerylen secs + 300
undo_retention specified in init.ora

Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.
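To see whether the symptom in the note matches, the tuned retention can be compared with the value the workaround formula would give. The query below is only a sketch: the literal 900 is the undo_retention value set in this case, and the alias expected_with_workaround is made up for illustration.

```sql
-- If TUNED_UNDORETENTION is far above GREATEST(maxquerylen + 300, 900)
-- on a non-autoextend undo tablespace, the behaviour resembles the bug.
SELECT begin_time,
       maxquerylen,
       tuned_undoretention,
       GREATEST(maxquerylen + 300, 900) AS expected_with_workaround,
       activeblks, unexpiredblks, expiredblks
  FROM v$undostat
 ORDER BY begin_time DESC;
```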

Please credit the source when reposting: https://www.heiqu.com/dcbb2e421b02004772d78aa97c46a98c.html
