Friday, May 14, 2010

Dell 710 (Broadcom NIC) running RH kernel 2.6 and MSI setting

Our Dev team first reported that they could not ssh to the server (Dell 710). We had to recycle the server since we could get to the server through DRAC or remotely. After that day they reported the same problem several times. This server is used by the team for application profiling and load testing. The server's network subsystem intermittently stops responding.


At first glance, troubleshooting showed that the server was up and available but its network services were completely stalled. No packets were getting out of the server.

When we ran tcpdump on the interface, it showed only ARP broadcasts and no responses.
After digging through the logs and system configuration files for hours, we could not establish a pattern for when the server loses connectivity. Nothing in the logs suggested a hardware error or a kernel-related problem.
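To put a number on the "only ARP broadcasts and no responses" observation, a small helper along these lines can summarise a capture. This is only a sketch: the function name arp_summary is made up for illustration, and it assumes tcpdump prints "who-has" for ARP requests and "is-at" for replies (the rest of the line differs between tcpdump versions).

```shell
#!/bin/bash
# Summarise ARP traffic from tcpdump text output read on stdin.
# Assumption: request lines contain "who-has", reply lines contain "is-at".
arp_summary() {
    awk '
        /who-has/ { requests++ }
        /is-at/   { replies++ }
        END { printf "requests=%d replies=%d\n", requests, replies }
    '
}

# Example (needs root; run it from the console, since the network is down):
#   tcpdump -n -i eth0 -c 50 arp 2>/dev/null | arp_summary
```

A result with many requests and zero replies matches the stalled state described above.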

Our first suspect was firmware, so we upgraded it to the latest version available on the Dell site. After the firmware upgrade, the user reported a very weird timeout, so we upgraded the Red Hat kernel:
kernel-2.6.18-164.el5 -> kernel-2.6.18-194.el5

We opened a case with Dell, and they gave us another firmware upgrade to apply. After that we:
Changed the server-side Ethernet port
Changed the switch port
Changed the CAT5e cable
Finally, we asked Dell tech support to replace the hardware.

We then asked the Net Ops team to review the switch configuration for any setting that could shut down the port when it sees heavy traffic coming from the interface.

Brainstorming sessions identified that the problem lay at layer 3:

N - network <<------ problem is here
D - data link
P - physical

We reviewed the Hyperic data and concluded that heavy traffic was not the actual cause, but the failures do correlate directly with network throughput: the more traffic, the higher the probability that the interface will stop responding.
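Our throughput numbers came from Hyperic. For anyone without a monitoring agent, a rough sketch of collecting similar data is to log the interface byte counters from /proc/net/dev at a fixed interval, so the last samples before a hang show how busy the link was. The function name iface_bytes and the log path in the example are made up for illustration.

```shell
#!/bin/bash
# Print "rx_bytes tx_bytes" for one interface.
# $1 = interface name; reads /proc/net/dev-format text on stdin.
iface_bytes() {
    awk -v ifc="$1" '
        { sub(/:/, " ") }             # handles both "eth0:123" and "eth0: 123"
        $1 == ifc { print $2, $10 }   # field 2 = RX bytes, field 10 = TX bytes
    '
}

# Example: one timestamped sample every 10 seconds (hypothetical log path):
#   while sleep 10; do
#       echo "$(date +%s) $(iface_bytes eth0 < /proc/net/dev)"
#   done >> /var/tmp/eth0-bytes.log
```

The difference between two consecutive samples divided by the interval gives bytes per second for that window.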

Server hardware and Red Hat version info:
Hardware:
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R710
Version: Not Specified
Serial Number: 848KVH1

Base Board Information
Manufacturer: Dell Inc.
Product Name: 0YDJK3
Version: A09
Serial Number: ..CN1374003900LF.

BIOS Information
Vendor: Dell Inc.
Version: 2.0.11
Release Date: 02/26/2010
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 4096 kB

Red Hat release info:
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)

Kernel info:
# uname -a
Linux md000ystls02 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Ethernet info:
On Board Device 2 Information
Type: Ethernet
Status: Enabled
Description: Embedded Broadcom 5709C NIC 1
Ethernet driver version:
# ethtool -i eth0
driver: bnx2
version: 2.0.2
firmware-version: 5.0.11 NCSI 2.0.5
bus-info: 0000:01:00.0


Identified Cause:
We found this behavior only on Broadcom network cards running on kernel 2.6. It is one of those problems that cannot be reproduced with a predefined set of steps.
MSI (Message Signaled Interrupts) is enabled by default on kernel 2.6 (it is not supported on 2.4), and that is what causes the intermittent network drops on Broadcom cards.

Disabling MSI on Broadcom bnx2 module resolves this problem.
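To see which interrupt mode an interface is actually using, /proc/interrupts can be inspected before and after the change. The helper below is a sketch with a made-up name (irq_mode). It assumes the interface name is the last field on its /proc/interrupts line and that MSI lines carry the string "MSI" (for example PCI-MSI), so a line where the interface shares an IRQ and is not listed last will be missed.

```shell
#!/bin/bash
# Report "msi", "legacy" or "unknown" for one interface.
# $1 = interface name; reads /proc/interrupts-format text on stdin.
irq_mode() {
    awk -v ifc="$1" '
        $NF == ifc || $NF ~ ("^" ifc "-") {
            if ($0 ~ /MSI/) print "msi"; else print "legacy"
            found = 1
            exit
        }
        END { if (!found) print "unknown" }
    '
}

# Example:
#   irq_mode eth0 < /proc/interrupts
```

After the module is loaded with disable_msi=1, the expectation is that eth0 no longer reports "msi".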


Solution:
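Here is the fix. Load the module with MSI disabled:
modprobe bnx2 disable_msi=1

If bnx2 is already loaded, the option only takes effect once the module has been unloaded and loaded again. Unloading it drops every bnx2 interface, so do this from the console or DRAC.

For a permanent setting, add the line options bnx2 disable_msi=1 to /etc/modprobe.conf: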

# cat /etc/modprobe.conf
alias scsi_hostadapter megaraid_sas
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 usb-storage
alias eth0 bnx2
alias eth2 bnx2
alias eth1 bnx2
alias eth3 bnx2
options bnx2 disable_msi=1
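When the same change has to go onto several servers, a quick check that the option is really present can save a surprise after the next reboot. This is a sketch only; the function name has_disable_msi is made up, and it simply looks for an uncommented "options bnx2" line that sets disable_msi=1.

```shell
#!/bin/bash
# Succeeds (exit status 0) if the modprobe.conf-format text on stdin
# contains an uncommented "options bnx2 ... disable_msi=1" line.
has_disable_msi() {
    grep -Eq '^[[:space:]]*options[[:space:]]+bnx2[[:space:]].*disable_msi=1([[:space:]]|$)'
}

# Example:
#   if has_disable_msi < /etc/modprobe.conf; then
#       echo "bnx2: MSI will be disabled on next module load"
#   else
#       echo "bnx2: disable_msi=1 is NOT set"
#   fi
```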

4 comments:

Anonymous said...

Hi Pankaj,

Thanks for uploading this Post,
It was of great help for resolving our issue for Broadcom 5709C driver installation.

Pankaj Gautam said...

You are welcome, glad to be of some help

Anonymous said...

Thank you very much. We are having the same problem with DELL R810. Now we are testing the solution. We will post the result.

best regards,
Portella

Anonymous said...

Could this be related to the Broadcom NIC or switch flow control? We've been having problems with that recently, and finally got to the bottom of it. See http://lists.us.dell.com/pipermail/linux-poweredge/2011-October/045485.html

Sven