Exadata node rebooted due to hardware failure

The Exadata node rebooted due to hardware failure.

Message from ASR:
An integrated I/O fatal error in a downstream PCIE device has been detected.
Alert from OEM:
EM Event: Fatal:Node3 – The current status of the target is Down

Sample output from the node 3 ILOM looks like this:

-> show faulty
Target                            | Property                               | Value
----------------------------------+----------------------------------------+---------------------------------------
/SP/faultmgmt/0                   | fru                                    | /SYS/MB
/SP/faultmgmt/0/faults/0          | class                                  | fault.io.intel.iio.pcie-fatal
/SP/faultmgmt/0/faults/0          | component                              | /SYS/MB
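To see more detail on a listed fault (UUID, timestamp, message ID and suspect component), the ILOM fault management shell can be used. A minimal sketch, assuming the faultmgmt shell is available on this ILOM firmware version:

-> start /SP/faultmgmt/shell
faultmgmtsp> fmadm faulty          # full fault records: UUID, time, message ID, suspect FRU
faultmgmtsp> exit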

How to verify that the node got rebooted?

dcli -l root -g dbs_group "uptime"
OPRD_SRVR01: 14:36:00 up 319 days,  7:26,  3 users,  load average: 4.55, 5.60, 6.17
OPRD_SRVR02: 14:36:00 up 319 days,  8:04,  0 users,  load average: 22.05, 21.95, 22.24
OPRD_SRVR03: 14:36:00 up 14 min,  4 users,  load average: 10.57, 9.80, 5.28                 =====> showing uptime : 14 min
OPRD_SRVR04: 14:36:00 up 319 days,  7:37,  0 users,  load average: 5.09, 5.11, 5.20
OPRD_SRVR05: 14:36:00 up 319 days,  7:28,  0 users,  load average: 2.69, 2.45, 2.43
OPRD_SRVR06: 14:36:00 up 319 days,  7:09,  0 users,  load average: 2.72, 2.76, 2.87
OPRD_SRVR07: 14:36:00 up 319 days,  7:04,  0 users,  load average: 4.60, 5.20, 5.23
OPRD_SRVR08: 14:36:00 up 319 days,  8:51,  0 users,  load average: 3.63, 4.21, 4.73
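Uptime alone shows the node came back about 14 minutes ago; the exact reboot time can be cross-checked from the wtmp reboot records on the affected node. A minimal sketch (node name taken from the listing above):

[root@OPRD_SRVR03 ~]# last reboot | head -3        # most recent reboot entries with timestamps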

Verify whether the DB is running on node 3

The Oracle base has been set to /u01/app/oracle
[oracle@OPRD_SRVR03 ~]$ srvctl status database -d OPRDP1
Instance OPRDP11 is running on node OPRD_SRVR03  ====> Instance is running on the rebooted node
Instance OPRDP12 is running on node OPRD_SRVR04
Instance OPRDP13 is running on node OPRD_SRVR05
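Beyond this single-database check, the overall clusterware resource state on the rebooted node can be confirmed from the Grid Infrastructure home. A minimal sketch, assuming $GRID_HOME points to the Grid Infrastructure installation on this system:

[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl check crs      # CRS, CSS and EVM daemon status
[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl stat res -t    # status of all cluster resources, node by node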
Root Cause:
Hardware failure. The node rebooted due to a faulty RAID HBA in PCIE Slot 4.
Action Plan:
Uploaded the requested logs to Oracle Support.
Engaged a field engineer from Oracle Support to replace the faulty hardware.

Plan for replacement of HBA

1. Shut down cluster services on node 3 (a command sketch for steps 1-6 follows this list).
Database instances on this node will be shut down.
Applications will see an outage while connections to these instances are failed over to the other nodes.
2. Disable cluster auto start
3. Hand over to the Oracle field engineer for replacement of the Host Bus Adapter in slot 4
4. Reboot the node, then verify and clear ILOM faults, if any
5. Enable cluster services to auto start
6. Start cluster services on node 3
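A minimal command sketch for the steps above, assuming $GRID_HOME points to the Grid Infrastructure installation and that the /SYS/MB fault shown earlier is the only one to clear (whether fmadm acquit or fmadm repaired is the right choice depends on what was actually replaced):

# Steps 1-2: stop clusterware on node 3 and disable auto start (as root on OPRD_SRVR03)
[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl stop crs
[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl disable crs

# Step 4: after the HBA replacement and reboot, re-check and clear the fault from the ILOM
-> start /SP/faultmgmt/shell
faultmgmtsp> fmadm faulty
faultmgmtsp> fmadm acquit /SYS/MB
faultmgmtsp> exit

# Steps 5-6: re-enable auto start and bring clusterware back up
[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl enable crs
[root@OPRD_SRVR03 ~]# $GRID_HOME/bin/crsctl start crs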

What can be done proactively to avoid these issues?

1) For hardware failures we do not get proactive alerts, but we do receive alerts for the node reboot and
need to start working on them immediately.
2) Configure the DB on multiple nodes (in Exadata) according to its criticality.
3) The application team/vendors should get TAF enabled for the apps accessing this DB, as sketched below.
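A minimal sketch of a TAF-enabled service, assuming 11gR2-style srvctl options; the service name app_svc and the preferred/available instance split are hypothetical and should follow the application's requirements:

# add a service with BASIC TAF policy, SELECT failover type and BASIC failover method
[oracle@OPRD_SRVR03 ~]$ srvctl add service -d OPRDP1 -s app_svc -r OPRDP11,OPRDP12 -a OPRDP13 -P BASIC -e SELECT -m BASIC
[oracle@OPRD_SRVR03 ~]$ srvctl start service -d OPRDP1 -s app_svc

Clients then connect through the app_svc service name, so sessions can resume on a surviving instance if one node goes down.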
Glossary
ASR: Auto Service Request
OEM: Oracle Enterprise Manager
RAID: Redundant Array of Independent Disks
HBA: Host Bus Adapter
PCIE: Peripheral Component Interconnect Express
TAF: Transparent Application Failover
See Also: