MS Status is DOWN & Ping is SUCCESS in Exadata.
MS is the Management Server, one of the services that runs on each Exadata storage cell. This alert was received from the OEM server.
This article shows how to verify this kind of alert, error, or warning.
Fatal Error Message
Message=ORPRD_CELLSRVR10 is down. MS Status is DOWN and Ping Status is SUCCESS.
Below is the step-by-step procedure for verifying the alert using the dcli command on the Exadata server.
Verification:
Status of cell services on the Exadata server
dcli command for uptime and its output
[root@ORPRD_SRVR01 ~]# dcli -l root -g all_group "uptime"
output from compute nodes
ORPRD_SRVR01: 02:39:30 up 316 days, 10:46, 3 users, load average: 2.98, 2.97, 2.95
ORPRD_SRVR02: 02:39:31 up 316 days, 10:52, 0 users, load average: 5.99, 4.82, 4.32
ORPRD_SRVR03: 02:39:30 up 316 days, 11:04, 0 users, load average: 2.89, 2.84, 2.91
ORPRD_SRVR04: 02:39:30 up 374 days, 23:20, 0 users, load average: 2.17, 2.28, 2.24
ORPRD_SRVR05: 02:39:30 up 374 days, 22:49, 0 users, load average: 2.43, 2.28, 2.24
ORPRD_SRVR06: 02:39:30 up 33 days, 23:40, 0 users, load average: 2.03, 1.91, 1.77
ORPRD_SRVR07: 02:39:30 up 374 days, 22:46, 0 users, load average: 179.46, 178.36, 178.11
ORPRD_SRVR08: 02:39:30 up 374 days, 22:26, 0 users, load average: 1.94, 2.03, 2.02
output from cell nodes
ORPRD_CELLSRVR01: 02:39:30 up 686 days, 9:31, 0 users, load average: 0.71, 0.77, 0.84
ORPRD_CELLSRVR02: 02:39:30 up 686 days, 9:27, 0 users, load average: 0.99, 0.93, 0.88
ORPRD_CELLSRVR03: 02:39:30 up 686 days, 9:32, 0 users, load average: 0.97, 0.91, 0.89
ORPRD_CELLSRVR04: 02:39:30 up 686 days, 9:34, 0 users, load average: 0.96, 1.05, 0.96
ORPRD_CELLSRVR05: 02:39:30 up 507 days, 14:31, 0 users, load average: 1.00, 0.93, 0.85
ORPRD_CELLSRVR06: 02:39:30 up 686 days, 9:32, 0 users, load average: 0.72, 0.89, 0.93
ORPRD_CELLSRVR07: 02:39:30 up 686 days, 9:36, 0 users, load average: 0.77, 0.92, 0.93
ORPRD_CELLSRVR08: 02:39:30 up 686 days, 9:39, 0 users, load average: 1.07, 0.91, 0.91
ORPRD_CELLSRVR09: 02:39:30 up 686 days, 9:37, 1 user, load average: 1.24, 1.01, 0.97
ORPRD_CELLSRVR10: 02:39:30 up 686 days, 9:15, 0 users, load average: 0.85, 0.90, 0.92
ORPRD_CELLSRVR11: 02:39:30 up 521 days, 20:11, 0 users, load average: 0.81, 0.85, 0.85
ORPRD_CELLSRVR12: 02:39:30 up 686 days, 9:21, 0 users, load average: 1.06, 1.00, 0.98
ORPRD_CELLSRVR13: 02:39:30 up 686 days, 9:19, 0 users, load average: 0.76, 0.89, 0.86
ORPRD_CELLSRVR14: 02:39:30 up 686 days, 9:21, 0 users, load average: 1.30, 1.04, 0.96
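One reason to look at uptime here is to rule out a recent reboot of the alerting cell. The following is a minimal sketch, not part of the original procedure, that filters dcli uptime output for hosts that have been up for fewer than a given number of days. The function name recently_booted and the threshold argument are hypothetical; it only assumes the "host: time up N days, ..." layout shown above.

```shell
#!/bin/sh
# recently_booted MIN_DAYS
# Reads dcli-style uptime output on stdin and prints "host days" for every
# host whose uptime is below MIN_DAYS. Hosts up for less than one day have
# no "N days," field in uptime output and are treated as 0 days.
recently_booted() {
    awk -v min="$1" '
        $3 == "up" {
            days = ($5 ~ /^days?,$/) ? $4 + 0 : 0
            if (days < min + 0) {
                host = $1
                sub(/:$/, "", host)   # drop the trailing colon dcli adds
                print host, days
            }
        }'
}

# Hypothetical usage on a live system (group file name taken from the article):
#   dcli -l root -g all_group "uptime" | recently_booted 1
```

With the output shown above and a threshold of 1 day, nothing would be printed, which matches the conclusion that ORPRD_CELLSRVR10 itself was not restarted.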
dcli command for "service celld status" and its output
[root@ORPRD_SRVR01 ~]# dcli -l root -g cell_group "service celld status"
ORPRD_CELLSRVR01: rsStatus: running
ORPRD_CELLSRVR01: msStatus: running
ORPRD_CELLSRVR01: cellsrvStatus: running
ORPRD_CELLSRVR02: rsStatus: running
ORPRD_CELLSRVR02: msStatus: running
ORPRD_CELLSRVR02: cellsrvStatus: running
ORPRD_CELLSRVR03: rsStatus: running
ORPRD_CELLSRVR03: msStatus: running
ORPRD_CELLSRVR03: cellsrvStatus: running
ORPRD_CELLSRVR04: rsStatus: running
ORPRD_CELLSRVR04: msStatus: running
ORPRD_CELLSRVR04: cellsrvStatus: running
ORPRD_CELLSRVR05: rsStatus: running
ORPRD_CELLSRVR05: msStatus: running
ORPRD_CELLSRVR05: cellsrvStatus: running
ORPRD_CELLSRVR06: rsStatus: running
ORPRD_CELLSRVR06: msStatus: running
ORPRD_CELLSRVR06: cellsrvStatus: running
ORPRD_CELLSRVR07: rsStatus: running
ORPRD_CELLSRVR07: msStatus: running
ORPRD_CELLSRVR07: cellsrvStatus: running
ORPRD_CELLSRVR08: rsStatus: running
ORPRD_CELLSRVR08: msStatus: running
ORPRD_CELLSRVR08: cellsrvStatus: running
ORPRD_CELLSRVR09: rsStatus: running
ORPRD_CELLSRVR09: msStatus: running
ORPRD_CELLSRVR09: cellsrvStatus: running
ORPRD_CELLSRVR10: rsStatus: running
ORPRD_CELLSRVR10: msStatus: running
ORPRD_CELLSRVR10: cellsrvStatus: running
ORPRD_CELLSRVR11: rsStatus: running
ORPRD_CELLSRVR11: msStatus: running
ORPRD_CELLSRVR11: cellsrvStatus: running
ORPRD_CELLSRVR12: rsStatus: running
ORPRD_CELLSRVR12: msStatus: running
ORPRD_CELLSRVR12: cellsrvStatus: running
ORPRD_CELLSRVR13: rsStatus: running
ORPRD_CELLSRVR13: msStatus: running
ORPRD_CELLSRVR13: cellsrvStatus: running
ORPRD_CELLSRVR14: rsStatus: running
ORPRD_CELLSRVR14: msStatus: running
ORPRD_CELLSRVR14: cellsrvStatus: running
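With 14 cells and three services each, a non-running service is easy to overlook in the listing above. As a rough sketch (the function name not_running is hypothetical), the output can be filtered so that only services in a state other than "running" are printed:

```shell
#!/bin/sh
# not_running
# Reads dcli-style "host: service: state" lines on stdin and prints only
# those whose last field is not "running". Blank lines are skipped.
not_running() {
    awk 'NF && $NF != "running"'
}

# Hypothetical usage on a live system (group file name taken from the article):
#   dcli -l root -g cell_group "service celld status" | not_running
```

Empty output means every listed service reported "running", as in the listing above.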
dcli command for "service celld status" line count and its output
[root@ORPRD_SRVR01 ~]# dcli -l root -g cell_group "service celld status | wc -l"
ORPRD_CELLSRVR01: 3
ORPRD_CELLSRVR02: 3
ORPRD_CELLSRVR03: 3
ORPRD_CELLSRVR04: 3
ORPRD_CELLSRVR05: 3
ORPRD_CELLSRVR06: 3
ORPRD_CELLSRVR07: 3
ORPRD_CELLSRVR08: 3
ORPRD_CELLSRVR09: 3
ORPRD_CELLSRVR10: 3
ORPRD_CELLSRVR11: 3
ORPRD_CELLSRVR12: 3
ORPRD_CELLSRVR13: 3
ORPRD_CELLSRVR14: 3
Verification in the alert log file
All services are up and running fine. The only relevant entries in the cell alert log file are the ones below, which show RS cleanly stopping MS and then starting it again.
[RS] Process /opt/oracle/cell/cellsrv/bin/cellrsmmt (pid: 23167) received clean shutdown signal from pid: 22903, uid: 0
[RS] Stopped Service MS
[RS] Started monitoring process /opt/oracle/cell/cellsrv/bin/cellrsmmt with pid 2556
[RS] Started Service MS with pid 2629
[RS] Process /opt/oracle/cell/cellsrv/bin/cellrsmmt (pid: 4442) received clean shutdown signal from pid: 6616, uid: 0
[RS] Stopped Service MS
[RS] Started monitoring process /opt/oracle/cell/cellsrv/bin/cellrsmmt with pid 7010
[RS] Started Service MS with pid 7079
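To confirm that every MS stop in the log was followed by a start, the entries can be counted rather than read by eye. This is a minimal sketch that assumes the message wording shown above; the function name ms_restart_summary and the ALERT_LOG variable in the usage comment are hypothetical, and the alert log location depends on the cell.

```shell
#!/bin/sh
# ms_restart_summary
# Reads cell alert-log text on stdin and reports how many times RS logged
# a stop and a start of the MS service. The "Started monitoring process"
# lines are deliberately not counted.
ms_restart_summary() {
    awk '
        /\[RS\] Stopped Service MS/      { stopped++ }
        /\[RS\] Started Service MS with/ { started++ }
        END { printf "stopped=%d started=%d\n", stopped, started }
    '
}

# Hypothetical usage, with ALERT_LOG set to the cell alert log path:
#   ms_restart_summary < "$ALERT_LOG"
```

Equal stop and start counts suggest that each MS shutdown was followed by a restart. A higher stop count would be worth a closer look.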
dcli command for "list physicaldisk" count and its output
[root@ORPRD_SRVR01 ~]# dcli -l root -g ~/cell_group 'cellcli -e list physicaldisk | grep normal | wc -l'
ORPRD_CELLSRVR01: 16
ORPRD_CELLSRVR02: 16
ORPRD_CELLSRVR03: 16
ORPRD_CELLSRVR04: 16
ORPRD_CELLSRVR05: 16
ORPRD_CELLSRVR06: 16
ORPRD_CELLSRVR07: 16
ORPRD_CELLSRVR08: 16
ORPRD_CELLSRVR09: 16
ORPRD_CELLSRVR10: 16
ORPRD_CELLSRVR11: 16
ORPRD_CELLSRVR12: 16
ORPRD_CELLSRVR13: 16
ORPRD_CELLSRVR14: 16
dcli command for "list griddisk attributes asmmodestatus" and its output
[root@ORPRD_SRVR01 ~]# dcli -l root -g ~/cell_group 'cellcli -e list griddisk attributes asmmodestatus | grep ONLINE | wc -l'
ORPRD_CELLSRVR01: 34
ORPRD_CELLSRVR02: 34
ORPRD_CELLSRVR03: 34
ORPRD_CELLSRVR04: 34
ORPRD_CELLSRVR05: 34
ORPRD_CELLSRVR06: 34
ORPRD_CELLSRVR07: 34
ORPRD_CELLSRVR08: 34
ORPRD_CELLSRVR09: 34
ORPRD_CELLSRVR10: 34
ORPRD_CELLSRVR11: 34
ORPRD_CELLSRVR12: 34
ORPRD_CELLSRVR13: 34
ORPRD_CELLSRVR14: 34
dcli command for "list griddisk attributes asmdeactivationoutcome" and its output
[root@ORPRD_SRVR01 ~]# dcli -l root -g ~/cell_group 'cellcli -e list griddisk attributes asmdeactivationoutcome | grep Yes | wc -l'
ORPRD_CELLSRVR01: 34
ORPRD_CELLSRVR02: 34
ORPRD_CELLSRVR03: 34
ORPRD_CELLSRVR04: 34
ORPRD_CELLSRVR05: 34
ORPRD_CELLSRVR06: 34
ORPRD_CELLSRVR07: 34
ORPRD_CELLSRVR08: 34
ORPRD_CELLSRVR09: 34
ORPRD_CELLSRVR10: 34
ORPRD_CELLSRVR11: 34
ORPRD_CELLSRVR12: 34
ORPRD_CELLSRVR13: 34
ORPRD_CELLSRVR14: 34
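The count-based checks in this article (3 service lines, 16 normal physical disks, and 34 grid disks per cell) all follow the same pattern: every cell should report the same number. A small helper can print only the cells that deviate. This is a sketch under that assumption; the function name count_mismatch is hypothetical, and the expected values are specific to this rack.

```shell
#!/bin/sh
# count_mismatch EXPECTED
# Reads dcli-style "host: N" lines on stdin and prints "host N" for every
# host whose count differs from EXPECTED.
count_mismatch() {
    awk -v want="$1" '
        NF >= 2 && $2 + 0 != want + 0 {
            host = $1
            sub(/:$/, "", host)   # drop the trailing colon dcli adds
            print host, $2
        }'
}

# Hypothetical usage on a live system, reusing a command from the article:
#   dcli -l root -g ~/cell_group 'cellcli -e list griddisk attributes asmdeactivationoutcome | grep Yes | wc -l' | count_mismatch 34
```

Empty output means all cells returned the expected count.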
dcli command for "list metriccurrent" and its output
[root@ORPRD_SRVR01 ~]# dcli -l root -g /root/cell_group "cellcli -e list metriccurrent | grep CL_MEMUT_MS | grep -v grep"
ORPRD_CELLSRVR01: CL_MEMUT_MS ORPRD_CELLSRVR01 0.5 %
ORPRD_CELLSRVR02: CL_MEMUT_MS ORPRD_CELLSRVR02 0.5 %
ORPRD_CELLSRVR03: CL_MEMUT_MS ORPRD_CELLSRVR03 0.5 %
ORPRD_CELLSRVR04: CL_MEMUT_MS ORPRD_CELLSRVR04 0.5 %
ORPRD_CELLSRVR05: CL_MEMUT_MS ORPRD_CELLSRVR05 0.5 %
ORPRD_CELLSRVR06: CL_MEMUT_MS ORPRD_CELLSRVR06 0.5 %
ORPRD_CELLSRVR07: CL_MEMUT_MS ORPRD_CELLSRVR07 0.5 %
ORPRD_CELLSRVR08: CL_MEMUT_MS ORPRD_CELLSRVR08 0.5 %
ORPRD_CELLSRVR09: CL_MEMUT_MS ORPRD_CELLSRVR09 0.5 %
ORPRD_CELLSRVR10: CL_MEMUT_MS ORPRD_CELLSRVR10 0.5 %
ORPRD_CELLSRVR11: CL_MEMUT_MS ORPRD_CELLSRVR11 0.5 %
ORPRD_CELLSRVR12: CL_MEMUT_MS ORPRD_CELLSRVR12 0.5 %
ORPRD_CELLSRVR13: CL_MEMUT_MS ORPRD_CELLSRVR13 0.5 %
ORPRD_CELLSRVR14: CL_MEMUT_MS ORPRD_CELLSRVR14 0.5 %
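If the CL_MEMUT_MS values need to be checked against a limit rather than read by eye, the output can be filtered numerically. The following sketch assumes the five-column layout shown above (host prefix, metric name, cell name, value, unit). The function name ms_mem_over and the limit are hypothetical and do not come from Oracle documentation.

```shell
#!/bin/sh
# ms_mem_over LIMIT
# Reads dcli-style "host: CL_MEMUT_MS cell value %" lines on stdin and
# prints "cell value" for every cell whose value exceeds LIMIT.
ms_mem_over() {
    awk -v limit="$1" '
        $2 == "CL_MEMUT_MS" && $4 + 0 > limit + 0 { print $3, $4 }'
}

# Hypothetical usage on a live system, reusing the command from the article:
#   dcli -l root -g /root/cell_group "cellcli -e list metriccurrent | grep CL_MEMUT_MS" | ms_mem_over 5
```

In the output above every cell reports 0.5 %, so any reasonable limit would print nothing.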
[root@ORPRD_SRVR01 ~]#
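Action taken:
The alert log shows that the MS service was stopped and then restarted automatically by RS, and all of the checks above came back clean, so the alert can be ignored.
The complete original message from the OEM alert is shown below.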
Subject: EM Event: Fatal:ORPRD_CELLSRVR10 – ORPRD_CELLSRVR10 is down. MS Status is DOWN and Ping Status is SUCCESS.
Host=ORPRD_SRVR08
Target type=Oracle Exadata Storage Server
Target name=ORPRD_CELLSRVR10
Categories=Availability
Message=ORPRD_CELLSRVR10 is down. MS Status is DOWN and Ping Status is SUCCESS.
Severity=Fatal
Operating System=Linux
Platform=x86_64
Associated Incident Id=149882
Associated Incident Status=New
Associated Incident Owner=
Associated Incident Acknowledged By Owner=No
Associated Incident Priority=None
Associated Incident Escalation Level=0
Event Type=Target Availability
Event name=Status
Availability status=Down
Root Cause Analysis Status=Cause
Causal analysis result=Identified as a cause to 1 symptoms
Rule Name=Incident management rule set for all targets,Incident creation rule for a Target Down availability status
Rule Owner=System Generated