Some concepts first
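DRAM (Dynamic Random Access Memory) is the most common type of system memory. ECC stands for "Error Checking and Correcting" (also commonly expanded as "Error-Correcting Code"). ECC memory is a memory module that implements this error checking and correcting technique. EDAC stands for Error Detection And Correction; in Linux it is the kernel subsystem that reports these memory errors.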
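Memory errors come in two types: CE (Correctable Error) and UE (Uncorrectable Error). A CE is corrected by ECC and does not immediately affect normal operation, so the affected module can be replaced at a convenient maintenance window. A UE cannot be corrected and usually brings the system down.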
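As a rough illustration of the CE/UE distinction, the sketch below counts EDAC events per memory controller in kernel log text. It assumes the message shapes seen in the logs in this post ("EDAC MC0: 1 CE ..." and "EDAC MC2: CE page ..."); other EDAC drivers may format their messages differently, and the function name and sample text are only illustrative.

```python
import re
from collections import Counter

# Matches "EDAC MC<n>: [<count> ]CE|UE", the shape seen in this post's logs.
# Assumption: other EDAC drivers may use a different message layout.
EDAC_EVENT = re.compile(r"EDAC MC(\d+): (?:(\d+) )?(CE|UE)\b")


def count_edac_events(log_text):
    """Return a Counter keyed by (memory controller index, 'CE' or 'UE')."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EDAC_EVENT.search(line)
        if match:
            mc, n, kind = match.groups()
            # Some drivers prefix the event with an explicit count; default to 1.
            counts[(int(mc), kind)] += int(n) if n else 1
    return counts


# Sample lines copied from the logs in this post.
SAMPLE = """\
Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:0 ha:1 channel_mask:2 rank:0)
Jun 27 13:53:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8de3b1960
Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
"""

if __name__ == "__main__":
    for (mc, kind), n in sorted(count_edac_events(SAMPLE).items()):
        print(f"MC{mc}: {n} {kind}")
```

The "EDAC amd64 MC2: CE ERROR_ADDRESS=" line is deliberately not matched, so each event is counted once rather than twice.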
System messages log

[root@my-host mg4a]# grep kernel /var/log/messages
Jan 14 19:01:11 my-host kernel: mce: [Hardware Error]: Machine check events logged
Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:0 ha:1 channel_mask:2 rank:0)
[root@my-host mg4a]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch5_ce_count:1
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch5_ce_count:0
[root@my-host mg4a]# dmidecode -t 1
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0044, DMI type 1, 27 bytes
System Information
    Manufacturer: LENOVO
    Product Name: Lenovo System x3750 M4 -[8753IH5]-
    Version: 03
    Serial Number: 06FF367
    UUID: C4EF8080-7926-11E5-8B14-6C0B849B418E
    Wake-up Type: Other
    SKU Number: XxXxXxX
    Family: System X

This is the messages log from another machine:
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 27 13:53:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8de3b1960
Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080a13
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008de3b1960
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Jun 27 14:19:27 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 19:09:23 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 23:59:21 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 02:15:55 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8d9ea5960
Jun 28 02:15:55 irora30 kernel: EDAC MC2: CE page 0x8d9ea5, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008d9ea5960
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 03:08:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8ded39960
Jun 28 03:08:25 irora30 kernel: EDAC MC2: CE page 0x8ded39, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008ded39960
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:45:13 irora30 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Jun 28 04:44:25 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 09:34:22 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 10:02:30 irora30 ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=df -hl /var|awk \'NR>1 && int($5) > 80\' removes=None creates=None chdir=None
Jun 28 14:23:49 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 19:09:25 irora30 auditd[5571]: Audit daemon rotating log files

Confirming the fault and locating the faulty memory slot

[root@irora30 ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow5/ch0_ce_count:294
/sys/devices/system/edac/mc/mc3/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow5/ch0_ce_count:0
[root@irora30 ~]#

count: any line with a non-zero value means corrected memory errors have been recorded at that location.
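mc: the memory controller index, which usually identifies the CPU socket or node the memory is attached to.
csrow: the chip-select row on that controller (roughly a DIMM rank), rather than a memory channel.
ch*: the channel within that csrow. Together with csrow it identifies the DIMM; the exact mapping to a physical slot depends on the EDAC driver and the motherboard.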
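To pick out the affected location without scanning the output by eye, the non-zero counters can be filtered programmatically. This is a minimal sketch that assumes the `path:count` format produced by the `grep` command in this post; the function name and sample text are illustrative only.

```python
import re

# Path layout as printed by the grep command in this post:
#   /sys/devices/system/edac/mc/mc<M>/csrow<R>/ch<C>_ce_count:<N>
COUNTER_LINE = re.compile(r"/mc/mc(\d+)/csrow(\d+)/ch(\d+)_ce_count:(\d+)$")


def nonzero_ce_counters(grep_output):
    """Return (mc, csrow, ch, count) tuples for every counter above zero."""
    hits = []
    # Splitting on whitespace also copes with output pasted onto one line.
    for entry in grep_output.split():
        match = COUNTER_LINE.search(entry)
        if match:
            mc, csrow, ch, count = map(int, match.groups())
            if count > 0:
                hits.append((mc, csrow, ch, count))
    return hits


# A few lines copied from the second machine's output above.
SAMPLE = """\
/sys/devices/system/edac/mc/mc2/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow5/ch0_ce_count:294
/sys/devices/system/edac/mc/mc3/csrow4/ch0_ce_count:0
"""

if __name__ == "__main__":
    for mc, csrow, ch, count in nonzero_ce_counters(SAMPLE):
        print(f"mc{mc} csrow{csrow} ch{ch}: {count} corrected errors")
```

For the sample this reports mc2, csrow5, ch0, which agrees with the "MC2 ... row 5, channel 0" fields in the kernel log lines from the same machine.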