Environment: AIX 6.1 + Oracle 11.2.0.4 RAC (2 nodes)
1. Symptoms
2. Locating the problem
3. Resolving the problem
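1. Symptoms
Checking the status of the cluster resources with crsctl fails immediately on either node with CRS-4535 and CRS-4000, yet the database itself can still be accessed normally.
The symptoms in detail:
Likewise, crs_stat -t fails as well, this time with error code CRS-0184: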
root@bjdb1:/>crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
Node 2 behaves the same way.
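The check above can be wrapped in a small helper so that the same test is applied on every node. This is only a minimal sketch: the function name crs_unreachable is hypothetical, and it does nothing more than look for the three error codes quoted in this article (CRS-0184, CRS-4535, CRS-4000) in whatever text it is given.

```shell
#!/bin/sh
# Hypothetical helper: reads command output on stdin and succeeds (exit 0)
# if it contains one of the "cannot reach CRS" codes quoted in this article.
crs_unreachable() {
    grep -Eq 'CRS-(0184|4535|4000)'
}

# Intended use on a live node (commented out, since it needs a running cluster):
# crs_stat -t 2>&1 | crs_unreachable && echo "crsd is not reachable on $(hostname)"
```

Running the commented line on each node in turn gives a quick yes/no answer without reading the full output.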
Confirm that the database is still accessible at this point:
#On node 2, simulate a client logging in to the RAC cluster through the SCAN IP; the database is reachable as normal
oracle@bjdb2:/home/oracle>sqlplus jingyu/jingyu@192.168.103.31/bjdb
SQL*Plus: Release 11.2.0.4.0 Production on Mon Oct 10 14:24:47 2016
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL>
Relevant entries in /etc/hosts for the RAC environment:
#scan
192.168.103.31 scan-ip
2. Locating the problem
First, check the cluster logs on node 1:
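Clusterware (GI) logs are stored under $GRID_HOME/log/nodename;
Clusterware (GI) has several key background processes (css, crs, and evm), whose logs are kept in the cssd, crsd, and evmd directories respectively;
Logs on node 1: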
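The directory layout just described can be spelled out with a short helper. This is a sketch under the assumption that the layout is exactly $GRID_HOME/log/nodename/{cssd,crsd,evmd}; the function name gi_log_dirs is hypothetical, and the Grid home and node name in the example call are the ones that appear in the shell prompts in this article.

```shell
#!/bin/sh
# Print the Clusterware log directories for one node, following the layout
# described above: <grid_home>/log/<nodename>/{cssd,crsd,evmd}.
gi_log_dirs() {
    grid_home=$1
    node=$2
    for daemon in cssd crsd evmd; do
        printf '%s/log/%s/%s\n' "$grid_home" "$node" "$daemon"
    done
}

# Example with the Grid home and node name seen in this article's prompts.
gi_log_dirs /opt/u01/app/11.2.0/grid bjdb1
```

The example call prints one directory per daemon, which can then be passed to ls or tail when looking for the relevant log.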
#Check the GI alert log. The most recent entries only warn that the storage used by GI is nearly full. That can be cleaned up later, and there is still some free space left, so it is clearly not the cause of this failure.
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 14:18:26.125: [crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:23:31.125: [crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:28:36.125: [crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:33:41.125: [crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:38:46.125: [crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
#Because crsctl is unusable, look at the crs log next. Errors were already being logged on October 3: the raw device could not be opened, so the OCR could not be initialized. Reading further, the error shows that access to the shared storage failed at that moment. The storage is suspected to have had a problem then, so it needs to be confirmed with the on-site staff whether any storage-related work was carried out at that time.
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1/crsd>tail -f crsd.log
2016-10-03 18:04:40.248: [ OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:40.248: [ OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:40.252: [ OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:40.253: [ CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [ CRSD][1] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [ CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:40.253: [ CRSD][1] Done.
Logs on node 2:
#Check the GI alert log. On node 2, ctss reports CRS-2409. MOS Doc ID 1135337.1 explains: "This is not an error. ctssd is reporting that there is a time difference and it is not doing anything about it as it is running in observer mode." According to that note, it is enough to check whether the clocks on the two nodes agree; in this case, however, the clocks on both nodes were checked and found to be consistent:
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2>tail -f alert*.log
2016-10-10 12:29:22.145: [ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 12:59:38.799: [ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 13:34:11.402: [ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:12:44.168: [ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:44:04.824: [ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
#Check the crs log on node 2. At almost the same time as on node 1, access to the shared storage also failed here, so the OCR could not be initialized
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2/crsd>tail -f crsd.log
2016-10-03 18:04:31.077: [ OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:31.077: [ OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:31.081: [ OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:31.081: [ CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:31.082: [ CRSD][1] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:31.082: [ CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:31.082: [ CRSD][1] Done.
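Since the diagnosis rests on both nodes hitting PROC-26 at nearly the same moment, it can help to pull out just the timestamp of the first PROC-26 entry from each crsd.log and compare them side by side. The following is a minimal sketch: the function name first_proc26 is hypothetical, and it assumes each log entry starts with a date and a time field followed by a colon, as in the excerpts above.

```shell
#!/bin/sh
# Print the date and time of the first line on stdin that mentions PROC-26.
# Assumes entries begin with "YYYY-MM-DD HH:MM:SS.mmm:" as in the excerpts above.
first_proc26() {
    awk '/PROC-26/ { sub(/:$/, "", $2); print $1, $2; exit }'
}

# Demo on two lines taken from the node 1 excerpt; prints 2016-10-03 18:04:40.253
first_proc26 <<'EOF'
2016-10-03 18:04:40.248: [ OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:40.253: [ CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
EOF
```

On a real system the function would be fed each node's crsd.log on stdin instead of the inline sample. For the excerpts in this article the two results are 2016-10-03 18:04:40.253 on node 1 and 2016-10-03 18:04:31.081 on node 2, about nine seconds apart.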