Exadata node rebooted due to hardware failure

The Exadata node rebooted due to hardware failure.

Message from ASR:
An integrated I/O fatal error in a downstream PCIE device has been detected.
Alert from OEM:
EM Event: Fatal:Node3 – The current status of the target is Down

Sample output from the node 3 ILOM looks like this:

-> show faulty
Target                            | Property                               | Value
----------------------------------+----------------------------------------+---------------------------------------
/SP/faultmgmt/0                   | fru                                    | /SYS/MB
/SP/faultmgmt/0/faults/0          | class                                  | fault.io.intel.iio.pcie-fatal
/SP/faultmgmt/0/faults/0          | component                              | /SYS/MB
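To see more detail on a listed fault (UUID, timestamp, message ID and suspect component), the ILOM fault management shell can be used. A minimal sketch, assuming the faultmgmt shell is available on this ILOM firmware version:

-> start /SP/faultmgmt/shell
faultmgmtsp> fmadm faulty          # full fault records: UUID, time, message ID, suspect FRU
faultmgmtsp> exit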

How to verify that the node got rebooted?

dcli -l root -g dbs_group "uptime"
OPRD_SRVR01: 14:36:00 up 319 days,  7:26,  3 users,  load average: 4.55, 5.60, 6.17
OPRD_SRVR02: 14:36:00 up 319 days,  8:04,  0 users,  load average: 22.05, 21.95, 22.24
OPRD_SRVR03: 14:36:00 up 14 min,  4 users,  load average: 10.57, 9.80, 5.28                 =====> showing uptime : 14 min
OPRD_SRVR04: 14:36:00 up 319 days,  7:37,  0 users,  load average: 5.09, 5.11, 5.20
OPRD_SRVR05: 14:36:00 up 319 days,  7:28,  0 users,  load average: 2.69, 2.45, 2.43
OPRD_SRVR06: 14:36:00 up 319 days,  7:09,  0 users,  load average: 2.72, 2.76, 2.87
OPRD_SRVR07: 14:36:00 up 319 days,  7:04,  0 users,  load average: 4.60, 5.20, 5.23
OPRD_SRVR08: 14:36:00 up 319 days,  8:51,  0 users,  load average: 3.63, 4.21, 4.73
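Uptime alone shows the node came back about 14 minutes ago; the exact reboot time can be cross-checked from the wtmp reboot records on the affected node. A minimal sketch (node name taken from the listing above):

[root@OPRD_SRVR03 ~]# last reboot | head -3        # most recent reboot entries with timestamps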

Verify whether the DB is running on node 3

The Oracle base has been set to /u01/app/oracle
[oracle@OPRD_SRVR03 ~]$ srvctl status database -d OPRDP1
Instance OPRDP11 is running on node OPRD_SRVR03  ====> Instance is running on the rebooted node
Instance OPRDP12 is running on node OPRD_SRVR04
Instance OPRDP13 is running on node OPRD_SRVR05
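Beyond this single-database check, the overall clusterware resource state on the rebooted node can be confirmed from the Grid Infrastructure home. A minimal sketch, assuming $GRID_HOME points to the Grid Infrastructure installation on this system:

[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl check crs      # CRS, CSS and EVM daemon status
[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl stat res -t    # status of all cluster resources, node by node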
Root Cause:
Hardware failure. The node rebooted due to a faulty RAID HBA in PCIE Slot 4.
Action Plan:
Uploaded the requested logs to Oracle Support.
Engaged a field engineer from Oracle Support to replace the faulty hardware.

Plan for replacement of HBA

1. Shut down cluster services on node 3 (a command sketch for steps 1-6 follows this list).
Database instances on this node will be shut down.
Applications will see an outage while connections to these instances are failed over to the other nodes.
2. Disable cluster auto start
3. Hand over to the Oracle field engineer for replacement of the Host Bus Adapter in slot 4
4. Reboot the node, then verify and clear ILOM faults, if any
5. Enable cluster services to auto start
6. Start cluster services on node 3
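A minimal command sketch for the steps above, assuming $GRID_HOME points to the Grid Infrastructure installation and that the /SYS/MB fault shown earlier is the only one to clear (whether fmadm acquit or fmadm repaired is the right choice depends on what was actually replaced):

# Steps 1-2: stop clusterware on node 3 and disable auto start (as root on OPRD_SRVR03)
[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl stop crs
[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl disable crs

# Step 4: after the HBA replacement and reboot, re-check and clear the fault from the ILOM
-> start /SP/faultmgmt/shell
faultmgmtsp> fmadm faulty
faultmgmtsp> fmadm acquit /SYS/MB
faultmgmtsp> exit

# Steps 5-6: re-enable auto start and bring clusterware back up
[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl enable crs
[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl start crs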

What can be done proactively to avoid these issues?

1) For hardware failures we do not get proactive alerts, but we do receive alerts for the node reboot and
need to start working on them immediately.
2) Configure the DB on multiple nodes (in Exadata) according to its criticality.
3) The application team/vendors should get TAF enabled for the apps accessing this DB, as sketched below.
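A minimal sketch of a TAF-enabled service, assuming 11gR2-style srvctl options; the service name app_svc and the preferred/available instance split are hypothetical and should follow the application's requirements:

# add a service with BASIC TAF policy, SELECT failover type and BASIC failover method
[oracle@OPRD_SRVR03 ~]$ srvctl add service -d OPRDP1 -s app_svc -r OPRDP11,OPRDP12 -a OPRDP13 -P BASIC -e SELECT -m BASIC
[oracle@OPRD_SRVR03 ~]$ srvctl start service -d OPRDP1 -s app_svc

Clients then connect through the app_svc service name, so sessions can resume on a surviving instance if one node goes down.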
Glossary
ASR: Auto Service Request
OEM: Oracle Enterprise Manager
RAID: Redundant Array of Independent Disks
HBA: Host Bus Adapter
PCIE: Peripheral Component Interconnect Express
TAF: Transparent Application Failover
See Also: