###db01 messages
Sep 29 15:29:27 db01 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Sep 29 15:29:27 db01 kernel: bonding: bond1: making interface eth2 the new active one.
Sep 29 15:29:31 db01 kernel: igb: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 15:29:31 db01 kernel: bonding: bond1: link status definitely up for interface eth3.
Sep 29 15:31:28 db01 kernel: igb: eth2 NIC Link is Down
Sep 29 15:31:28 db01 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
Sep 29 15:31:28 db01 kernel: bonding: bond1: making interface eth3 the new active one.
Sep 29 15:31:28 db01 kernel: igb: eth3 NIC Link is Down
Sep 29 15:31:29 db01 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Sep 29 15:31:29 db01 kernel: bonding: bond1: now running without any active interface !
Sep 29 15:31:54 db01 kernel: igb: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 15:31:54 db01 kernel: bonding: bond1: link status definitely up for interface eth2.
Sep 29 15:31:54 db01 kernel: bonding: bond1: making interface eth2 the new active one.
Sep 29 15:31:54 db01 kernel: bonding: bond1: first active interface up!
Sep 29 15:31:54 db01 kernel: igb: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 15:31:54 db01 kernel: bonding: bond1: link status definitely up for interface eth3.
Sep 29 15:36:10 db01 shutdown[17047]: shutting down for system reboot
Sep 29 15:36:11 db01 gconfd (root-6536): Received signal 15, shutting down cleanly
Sep 29 15:36:11 db01 gconfd (root-6536): Exiting
###db02 messages
Sep 29 15:36:54 db02 kernel: igb: eth2 NIC Link is Down
Sep 29 15:36:54 db02 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
Sep 29 15:36:54 db02 kernel: bonding: bond1: making interface eth3 the new active one.
Sep 29 15:36:55 db02 kernel: igb: eth3 NIC Link is Down
Sep 29 15:36:55 db02 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Sep 29 15:36:55 db02 kernel: bonding: bond1: now running without any active interface !
Sep 29 15:37:10 db02 kernel: igb: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 15:37:10 db02 kernel: igb: eth3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 15:37:10 db02 kernel: bonding: bond1: link status definitely up for interface eth2.
Sep 29 15:37:10 db02 kernel: bonding: bond1: making interface eth2 the new active one.
Sep 29 15:37:10 db02 kernel: bonding: bond1: first active interface up!
Sep 29 15:37:10 db02 kernel: bonding: bond1: link status definitely up for interface eth3.
Problem analysis:
From the logs above we can reconstruct the sequence of events: after the shutdown was issued on db01, the ocssd and crsd processes notified the remote node that the local node was about to go down, and the individual processes were then stopped.
On node 2, the messages log shows that the private-interconnect bond1 interface had been Down; at 15:37, right after node 1 was shut down, bond1 unexpectedly came back up on its own. Immediately afterwards, the ocssd and crsd logs show the cluster processes starting.
At this point we could narrow the problem down to the private interconnect, most likely a NIC bonding issue.
Troubleshooting:
Since the logs pointed to the network, that is where we started. After node 1 came back from the reboot, we first used ping over the private network to confirm the state: node 1 was up, but the cluster services had still not started.
Pinging node 2's private address from node 1 failed:
[root@db01 ~]# ping pri02
PING pri02.xmtvdb.com (10.10.11.2) 56(84) bytes of data.
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=193 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=194 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=195 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=197 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=198 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=199 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=201 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=202 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=203 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=204 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=205 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=206 Destination Host Unreachable
We checked the bonding status; it looked fine on both nodes:
###db01
[root@db01 ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 40:f2:e9:db:c9:c4
Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 40:f2:e9:db:c9:c5
###db02
[root@db02 ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 40:f2:e9:db:c9:fc
Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 40:f2:e9:db:c9:fd
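The manual check above can be scripted. A minimal sketch that extracts the active slave and each slave's MII status; the heredoc holds a sample mirroring the `/proc/net/bonding/bond1` output shown above, and on a real host you would read that file directly instead:

```shell
# Parse bonding status: print the currently active slave, then one
# "<slave>=<mii status>" line per slave interface. The first "MII Status"
# line belongs to the bond itself and is skipped (slave is still unset).
awk -F': ' '
  /^Currently Active Slave/ { print "active=" $2 }
  /^Slave Interface/        { slave = $2 }
  /^MII Status/ && slave    { print slave "=" $2 }
' <<'EOF'
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth2
MII Status: up
Slave Interface: eth2
MII Status: up
Slave Interface: eth3
MII Status: up
EOF
```

On the real hosts the input would be `/proc/net/bonding/bond1`; a non-"up" MII status on a slave would stand out immediately in the output.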
We then tried taking eth3 (a member of bond1) down on node 2; after that, the ping went through and the cluster was able to come up.
###db02
[root@db02 ~]# ifdown eth3
Sep 29 15:40:55 db02 kernel: bonding: bond1: Removing slave eth3
###db01
[root@db01 ~]# ping pri02
PING pri02.xmtvdb.com (10.10.11.2) 56(84) bytes of data.
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=1 ttl=64 time=0.071 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=2 ttl=64 time=0.122 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=3 ttl=64 time=0.134 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=4 ttl=64 time=0.098 ms
At the same time, the cluster services also came up:
[root@db01 ~]# su - grid -c "crsctl status res -t"
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.BAK001.dg
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.DATA001.dg
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.FRA001.dg
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.LISTENER.lsnr
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.OCR_VOTE.dg
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.asm
ONLINE ONLINE db01 Started
ONLINE ONLINE db02 Started
ora.gsd
OFFLINE OFFLINE db01
OFFLINE OFFLINE db02
ora.net1.network
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.ons
ONLINE ONLINE db01
ONLINE ONLINE db02
ora.registry.acfs
ONLINE ONLINE db01
ONLINE ONLINE db02
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE db01
ora.cvu
1 ONLINE ONLINE db01
ora.db01.vip
1 ONLINE ONLINE db01
ora.db02.vip
1 ONLINE ONLINE db02
ora.oc4j
1 ONLINE ONLINE db01
ora.scan1.vip
1 ONLINE ONLINE db01
ora.xmman.db
1 ONLINE ONLINE db01 Open
2 ONLINE ONLINE db02 Open
ora.xmman.taf.svc
1 ONLINE ONLINE db01
2 ONLINE ONLINE db02
Bringing eth3 back up did not affect connectivity:
###db02
[root@db02 ~]# ifup eth3
###db01
[root@db01 ~]# ping pri02
PING pri02.xmtvdb.com (10.10.11.2) 56(84) bytes of data.
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=1 ttl=64 time=0.161 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=2 ttl=64 time=0.022 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=3 ttl=64 time=0.034 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=4 ttl=64 time=0.196 ms
Following Oracle best practice, we then routed the two directly-cabled heartbeat links through a switch, and the problem has not recurred. The root cause remains unknown; if anyone knows, please share.
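For reference, an active-backup bonding setup like the one shown in the `/proc/net/bonding/bond1` output above is typically configured as follows on RHEL 5/6. This is an illustrative sketch only; the IP address and file contents are assumptions, not taken from the affected hosts:

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond1 (illustrative values)
DEVICE=bond1
IPADDR=10.10.11.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
# mode=active-backup matches "fault-tolerance (active-backup)" above;
# miimon=100 matches the 100 ms MII polling interval shown in the output
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth2 (ifcfg-eth3 is analogous)
DEVICE=eth2
MASTER=bond1
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

Note that MII monitoring only watches carrier state on the local NIC, which is worth keeping in mind when the interconnect is cabled back-to-back rather than through a switch.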