Connection issue between DB Nodes and Cell Nodes

Intermittent connection issue between DB Nodes and Cell Nodes caused by the InfiniBand (IB) switch.

Overview of the issue

Connections between the DB Nodes and Cell Nodes intermittently fail to establish, and users are seeing timeouts as a result. The ‘df -h’ command also hangs, the ASM alert log shows network errors, and the Cell alert log shows node-level implicit fencing messages.

Action Taken

Based on the ASM alert log messages, the issue was mainly with DB Node 3 and Node 7. We rebooted those nodes and tried to bring CRS up, but CRS hung and the ASM resources could not be started.

We involved Oracle Support. After checking the logs, they confirmed this as a bug in InfiniBand switch firmware version 2.1.5-1.

Bug Details

Bug 17482244: Cannot establish new connections until SM (Subnet Manager) is manually restarted.

The workaround is to log in to one of the InfiniBand switches, run the getmaster command to identify the current master, and issue the disablesm command to force the master role to fail over to one of the other switches.

Once the master has failed over to another switch, run the enablesm command on that switch to bring it back online as a STANDBY switch.

Solution

The fix for this bug is included in InfiniBand switch firmware 2.1.6 and higher.
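
As a rough illustration of the version check implied above, the following Python sketch compares a firmware version string against the 2.1.6 threshold. The function names (parse_firmware, has_fix) are hypothetical helpers, not part of any Oracle tool, and the sketch assumes you pass only the version number itself (for example "2.1.5-1"), not the full output of the switch's version command.

```python
import re

# Per the Solution section: the fix is in firmware 2.1.6 or higher.
FIXED_IN = (2, 1, 6)


def parse_firmware(version: str) -> tuple:
    """Turn a version string such as '2.1.5-1' into (2, 1, 5, 1)."""
    parts = re.findall(r"\d+", version)
    if not parts:
        raise ValueError(f"no numeric version found in {version!r}")
    return tuple(int(p) for p in parts)


def has_fix(version: str) -> bool:
    """Return True if the firmware is at or above the fixed release.

    Only the first three components are compared, so the build
    suffix after the dash (the '-1' in '2.1.5-1') is ignored.
    """
    return parse_firmware(version)[:3] >= FIXED_IN
```

For example, has_fix("2.1.5-1") evaluates to False and has_fix("2.1.6") evaluates to True. This only tells you whether the fix is present; it does not say which older releases are affected.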

Workaround for this kind of issue

If this kind of issue occurs on any machine where nodes cannot establish connections, always perform the steps below.

  • Log in to the InfiniBand switch and run the ‘getmaster -l’ command to verify the state of the switch (MASTER or STANDBY).
  • Run ‘ibcheckerrors -v’ to check for errors.
  • If no subnet master is seen in the system, run the disablesm command to force the master role to another switch, then run enablesm to bring the current switch back online.

Preventive Actions to avoid these issues

  • Do not run IB switch pre-checks on any of your Exadata nodes if the IB switch firmware is at version “2.1.5-1”. To find the version, log in to the IB switch and run “version”.
  • Before running pre-checks for the IB switches, first check how many switches there are, which one is master and which are standby, and where the subnet manager is running.
  • After running the pre-checks, verify that the status is unchanged. Also check the /var/log/messages file on all Exadata compute nodes and the ASM alert log.
  • On all compute nodes, if you see any error pointing to RDS/IB (packet drop, network reconnecting, or connecting), there is a problem with the IB switches.
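
The log check in the last two points can be sketched in Python as below. This is a minimal example, not an Oracle-provided script: the function name find_ib_symptoms is hypothetical, and it simply flags any line that mentions "RDS/IB" (case-insensitive), on the assumption that such lines are the ones worth reviewing by hand. The exact message wording varies, so treat the matches as candidates rather than confirmed errors.

```python
import re

# Assumption: any line mentioning RDS/IB is worth a manual look.
IB_PATTERN = re.compile(r"RDS/IB", re.IGNORECASE)


def find_ib_symptoms(lines):
    """Return (line_number, text) pairs for lines that mention RDS/IB."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if IB_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it on a compute node, open /var/log/messages (this normally requires root), pass the file object to find_ib_symptoms, and review the returned lines. The same function can be pointed at the ASM alert log.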

Reference Documents

http://support.oracle.com

  1. Infiniband Switch Replacement – Overview and guide to key articles (Doc ID 2125242.1)
  2. How to Prepare an Infiniband Switch for Replacement (Doc ID 1636229.1)
  3. How to Prepare an Infiniband (IB) Fabric for Planned Outage of an IB Switch (Doc ID 2140928.1)
