my tech scribbling

Friday, May 14, 2010

Dell 710 (Broadcom NIC) running RH kernel 2.6 and MSI setting

One of our dev teams first reported that they could not ssh to a server (Dell PowerEdge R710). We had to recycle the server since we could not get to it through DRAC or remotely. After that day they reported the same problem several times. The team uses this server for application profiling and load testing. The server's network subsystem intermittently stops responding.

After a first round of troubleshooting, we noticed that the server was up and available, but its network services were completely stalled. The server could not send any packet to the outside.

When we ran tcpdump on the interface, it showed only ARP broadcasts and no responses.
After digging through the logs and system configuration files for hours, we couldn't establish a pattern for when the server loses connectivity. Nothing in the logs suggested a hardware error or a kernel problem.

Our first suspect was firmware, so we upgraded it to the latest version available on the Dell site. After the firmware upgrade, the user reported a very weird timeout, so we also upgraded the Red Hat kernel:
kernel-2.6.18-164.el5 -> kernel-2.6.18-194.el5

We opened a case with Dell. They gave us another firmware upgrade to apply. Then we swapped out components one at a time:
Changed the server-side Ethernet port
Changed the switch port
Changed the CAT5e cable
Finally, asked Dell tech support to replace the hardware

Then we contacted the Net Ops team and asked them to check the switch configuration for any setting that could shut down a port when it sees heavy traffic coming from the interface.

Brainstorming sessions suggested that the problem was at layer 3:

N - network <<------ problem is here
D - data link
P - physical

We reviewed the Hyperic data and concluded that heavy traffic by itself was not the cause, but the failures correlate directly with network throughput: the more traffic, the higher the probability that the interface stops responding.

Server hardware and Red Hat version info:
Hardware:
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R710
Version: Not Specified
Serial Number: 848KVH1

Base Board Information
Manufacturer: Dell Inc.
Product Name: 0YDJK3
Version: A09
Serial Number: ..CN1374003900LF.

BIOS Information
Vendor: Dell Inc.
Version: 2.0.11
Release Date: 02/26/2010
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 4096 kB

Red Hat release info:
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)

Kernel info:
# uname -a
Linux md000ystls02 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Ethernet info:
On Board Device 2 Information
Type: Ethernet
Status: Enabled
Description: Embedded Broadcom 5709C NIC 1

Ethernet driver version:
# ethtool -i eth0
driver: bnx2
version: 2.0.2
firmware-version: 5.0.11 NCSI 2.0.5
bus-info: 0000:01:00.0

Identified Cause:
This behavior is only seen with Broadcom network cards running on kernel 2.6. It is one of those problems that cannot be reproduced with a predefined set of steps.
MSI (Message Signaled Interrupts) is enabled by default on kernel 2.6 (it is not supported on 2.4), and that is what causes these intermittent network drops on Broadcom cards.

Disabling MSI in the Broadcom bnx2 module resolves the problem.
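
To see which interrupt mode a NIC is actually using, before and after the change, /proc/interrupts can be checked. This is only a minimal sketch, assuming bash and awk: uses_msi is a hypothetical helper name, and it assumes the interface name appears as its own field on the interrupt line, which may not hold on newer kernels.

```shell
#!/bin/bash
# uses_msi IFACE: read /proc/interrupts-style text on stdin and succeed
# (exit 0) if a line whose fields include IFACE also mentions "MSI".
# Hypothetical helper, not part of any package.
uses_msi() {
    awk -v ifc="$1" '
        {
            for (i = 1; i <= NF; i++)
                if ($i == ifc && $0 ~ /MSI/) found = 1
        }
        END { exit(found ? 0 : 1) }'
}

# Typical use on a live system:
#   if uses_msi eth0 < /proc/interrupts; then echo "eth0 is using MSI"; fi
```

After the module has been reloaded with disable_msi=1, the same check should fail for the bnx2 interfaces.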

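Solution:
Here is the fix. Load the module with MSI disabled (the option only takes effect at load time, so an already loaded bnx2 has to be unloaded first, which drops the network on those interfaces):
modprobe bnx2 disable_msi=1

or, for a permanent setting, add options bnx2 disable_msi=1 to /etc/modprobe.conf: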

# cat /etc/modprobe.conf
alias scsi_hostadapter megaraid_sas
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 usb-storage
alias eth0 bnx2
alias eth2 bnx2
alias eth1 bnx2
alias eth3 bnx2
options bnx2 disable_msi=1
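
If the same change has to go onto several servers, it helps to make the edit idempotent so that running it twice does not duplicate the line. A minimal sketch, assuming bash and grep; ensure_bnx2_msi_disabled is a hypothetical helper name, and the file path is a parameter so it can be tried on a copy first.

```shell
#!/bin/bash
# ensure_bnx2_msi_disabled [FILE]: append the option line to FILE unless an
# identical line is already there. FILE defaults to /etc/modprobe.conf.
# Hypothetical helper for illustration.
ensure_bnx2_msi_disabled() {
    local conf="${1:-/etc/modprobe.conf}"
    local line='options bnx2 disable_msi=1'
    # -x: match the whole line, -F: fixed string, -q: no output
    if ! grep -qxF -- "$line" "$conf" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf"
    fi
}

# Example, against a scratch copy rather than the live file:
#   cp /etc/modprobe.conf /tmp/modprobe.conf.test
#   ensure_bnx2_msi_disabled /tmp/modprobe.conf.test
```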

IBM InfoSphere Information Server 8.1 installation hiccups

Recently, while installing IBM IIS, I came across some silly errors which are worth documenting. For some reason I didn't see these problems with 8.0.1. I think most of the problems arise if there is an old installation of IIS on the machine.

So, here we go.
#1. The installation wizard gives a weird error message that offers no clue, suggests fixing the issue before proceeding, and terminates.

The error is something like this:
This is the GUI error, and you see the same one using the console installation.

If you look at the log, it does not make any sense either.

Apr 29, 2010 2:22:05 PM , INFO: com.ascential.acs.installer.utils.uservalidation.UserValidationBuildAction execute
Apr 29, 2010 2:22:05 PM , SEVERE: com.ascential.acs.installer.utils.InstalledProductBeanWizardBeanCondition
ServiceException: (error code = 200; severity = 0; exception = [java.lang.NullPointerException])
at com.installshield.wizard.service.LocalImplementorProxy.invoke(Unknown Source)
at com.installshield.wizard.service.AbstractService.invokeImpl(Unknown Source)
at com.installshield.product.service.registry.GenericRegistryService.getSoftwareObject(Unknown Source)
at com.ascential.acs.installer.utils.ProductBeanUtil.isInstalled(ProductBeanUtil.java:106)

Solution:
Comment out the +@sys_admin entry in your /etc/passwd file as below:

[ibmpg:/opt/IBM] # cat /etc/passwd
root:!:0:0::/:/usr/bin/ksh
daemon:!:1:1::/etc:
bin:!:2:2::/bin:
sys:!:3:3::/usr/sys:
adm:!:4:4::/var/adm:
uucp:!:5:5::/usr/lib/uucp:
guest:!:100:100::/home/guest:
nobody:!:4294967294:4294967294::/:
lpd:!:9:4294967294::/:
lp:*:11:11::/var/spool/lp:/bin/false
invscout:*:6:12::/var/adm/invscout:/usr/bin/ksh
snapp:*:200:13:snapp login user:/usr/sbin/snapp:/usr/sbin/snappd
nuucp:*:7:5:uucp login user:/var/spool/uucppublic:/usr/sbin/uucp/uucico
ipsec:*:201:1::/etc/ipsec:/usr/bin/ksh
sshd:*:202:201::/var/empty:/usr/bin/ksh
#+@sys_admin::::::
esaadmin:*:811:0::/home/esaadmin:/usr/bin/ksh
isadmin:!:5474:5087:IBM IIS Admin:/home/isadmin:/usr/bin/ksh
wasadmin:!:5475:5088:IBM WebSphere Admin :/home/wasadmin:/usr/bin/ksh
db2as:!:209:5089::/db2home/db2as:/usr/bin/ksh
xmeta81:!:204:1:XMETA 81 Admin:/home/xmeta81:/usr/bin/ksh
x81inst1:!:206:1:XMETA81 Instance Owner:/home/x81inst1:/usr/bin/ksh
x81fenc1:!:207:1:XMETA81 Fence user:/home/x81fenc1:/usr/bin/ksh
x81das1:!:208:1:XMETA81 DAS user:/home/x81das1:/usr/bin/ksh

#2. Second error
When proceeding with the installation, it won't go beyond the fenced user prompt:

Fenced user [db2fenc2] x81fenc1
Press 1 for Next, 2 for Previous, 3 to Cancel or 5 to Redisplay [1]
-------------------------------------------------------------------------------
IBM Information Server - InstallShield Wizard

Errors occurred during the installation.
- null
- null
The following warnings were generated:
- null
Press 2 for Previous, 3 to Cancel or 5 to Redisplay [2]
-------------------------------------------------------------------------------
DB2 - InstallShield Wizard

Solution:
It seems the installer does not like the VPD files left over from the old installation. After talking to IBM, they suggested hiding these VPD files and the .dshome file. Here are the steps:

We need to move (rename) the following directory and file:
$ cd /usr/lib/objrepos/InstallShield/Universal/IBM/
$ mv InformationServer InformationServer801
$ cd /
$ mv .dshome .dshome.801

#3. Third error
Operating system information: AIX 5.3
ERROR: The DB2 Administration Server is already configured for this computer.

Solution:
This was simple enough compared to the first two.
This error also comes from the fact that there was already a DB2 installed on the server.

===== Part 1 : make sure it's stopped: =====
To stop the DB2 administration server:
1. Log in as the DB2 administration server owner.
2. Stop the DB2 administration server by entering the db2admin stop command.

===== Part 2 : remove DAS =====
To remove the DB2 administration server:
1. Log in as a user with root user authority.
2. Stop the DB2 administration server.
3. Remove the DB2 administration server by entering the following command:

DB2DIR/instance/dasdrop
where DB2DIR is the location you specified during the DB2 Version 9 installation.

#4. Fourth error
Finally, there was a WebSphere error complaining about VPD files from the old installation.
I don't have the exact error for this, but the Engine installation cannot proceed.

Solution:
[ibmpg:] # cd /usr/lib/objrepos
[ibmpg:] # mv vpd.properties vpd.properties.8.0.1

Friday, August 14, 2009

Dynamic DNS update for unix servers

I think that's an amazing way to create DNS entries for Unix servers if you don't have access to the DNS servers. I believe it's possible because of the automatic DNS update feature in MS DNS.

Create hostname.txt

server 113.167.14.63
zone ted.com
prereq nxdomain pankaj.ted.com
update add pankaj.ted.com 86400 A 211.216.153.900
show
send

# nsupdate -v /home/scripts/dns/hosts/hostname.txt
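
Writing the batch file by hand gets tedious with more than a couple of hosts, so here is a rough sketch of generating it. It assumes bash; make_nsupdate_batch and is_ipv4 are hypothetical helper names, and the addresses in the usage line are documentation placeholders (192.0.2.0/24, example.com). Note that the address shown in hostname.txt above has a last octet over 255, so it looks like a placeholder too; this sketch would reject it.

```shell
#!/bin/bash
# is_ipv4 ADDR: succeed if ADDR is four dot-separated decimal octets, each 0..255.
is_ipv4() {
    local ip="$1" octet
    [[ "$ip" =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]] || return 1
    local IFS=.
    for octet in $ip; do
        # 10# forces base 10 so that "08" is not treated as octal
        (( 10#$octet <= 255 )) || return 1
    done
}

# make_nsupdate_batch SERVER ZONE FQDN TTL IPV4
# Print an nsupdate batch that adds an A record only if FQDN does not exist yet.
make_nsupdate_batch() {
    local server="$1" zone="$2" fqdn="$3" ttl="$4" ip="$5"
    is_ipv4 "$ip" || { echo "invalid IPv4 address: $ip" >&2; return 1; }
    printf 'server %s\nzone %s\nprereq nxdomain %s\nupdate add %s %s A %s\nshow\nsend\n' \
        "$server" "$zone" "$fqdn" "$fqdn" "$ttl" "$ip"
}

# Example:
#   make_nsupdate_batch 192.0.2.53 example.com host1.example.com 86400 192.0.2.10 > hostname.txt
#   nsupdate -v hostname.txt
```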

You can similarly delete or modify resource records interactively (nothing is submitted until you type send):
# nsupdate
>update delete pankaj.ted.com 86400 A 211.216.153.900
>send

nsupdate is used to submit Dynamic DNS Update requests as defined in
RFC2136 to a name server. This allows resource records to be added or
removed from a zone without manually editing the zone file. A single
update request can contain requests to add or remove more than one
resource record.

Thursday, August 6, 2009

All about ntp

After a very grilling and fear-provoking experience with one of the consultants, who claimed that ntp is not properly configured on a few of our boxes and that this is one of the primary reasons nothing works in our environment :-), I had to get this right in my head...

Installing ntp on Linux/Solaris/AIX and OSX

A few facts about NTP:
NTP is OS independent
NTP uses UTC as reference time
Even when a network connection is temporarily unavailable, NTP can use measurements from the past to estimate current time and error

Stratum 0 clock ->> Reference clocks (cesium clock, GPS)
Stratum 1 clock ->> Top level NTP servers, directly connected to Stratum 0
Stratum 2 clock ->> Clients of Stratum 1
Stratum 3 clock ->> Clients of Stratum 2
---
---
Stratum 15 clock ->> Lowest valid level
Stratum 16 ->> Unsynchronized (no usable time source)

Peers: servers at the same stratum level that synchronize with each other. They
decide which of them has the higher quality of time and then synchronise to the
most accurate one.

NTP configuration models:
-Client-server model
-Peer to peer model
-Broadcast model: a server may broadcast time to a broadcast or multicast IP address,
and clients may be configured to synchronise to these broadcast time signals.

A few ntp commands:
#ntpq -p <-- show all configured peers together with their key performance data.

bash-3.00# ntpq -p
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
+pg913xs01.fe    pg000xscrp01.fe  5 u  876 1024  377     0.31  -15.702    9.32
*pg913xs02.fe    pg000xscrp02.fe  4 u  845 1024  377     0.23    5.291    4.26

Summary information includes the address of the remote peer,
the reference ID, the stratum of the remote peer,
the type of the peer (local, unicast, multicast or broadcast),
when the last packet was received, the polling interval, in seconds,
the reachability register, in octal, and the current estimated delay,
offset and dispersion of the peer, all in milliseconds.
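
The reach column is the one that is easiest to misread: it is an 8-bit shift register printed in octal, with one bit per poll, so 377 means the last eight polls were all answered. A small sketch of decoding it, assuming bash; reach_successes is a hypothetical helper name.

```shell
#!/bin/bash
# reach_successes OCTAL: print how many of the last eight polls were answered,
# given the "reach" value from ntpq -p. Hypothetical helper for illustration.
reach_successes() {
    local value=$(( 8#$1 ))   # interpret the argument as octal
    local count=0
    while (( value > 0 )); do
        count=$(( count + (value & 1) ))   # add the lowest bit
        value=$(( value >> 1 ))            # move on to the next poll
    done
    echo "$count"
}

# Example:
#   reach_successes 377
```

A peer that stays at 0, or never climbs towards 377, is not answering and is worth checking before looking at delay or offset.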

#ntpdc
ntpdc> peers

#ntpdate -d 134.126.23.62 <--- Query the ntp server in debug mode: shows the offset but does not set the clock (drop -d to actually set the time)

Setting up and troubleshooting on AIX:

#1. Edit /etc/ntp.conf
#broadcastclient
server timeserver1
server timeserver2
server timeserver3
server timeserver4
driftfile /etc/ntp.drift
tracefile /etc/ntp.trace

#2. ntpdate 134.126.23.62 ( this is only required if you are way off )
9 Jul 21:27:48 ntpdate[299236]: step time server 11.16.4.62 offset -6059.104933

The offset must be less than 1000 seconds for xntpd to sync.
If the offset is greater than 1000 seconds, change the time manually on the client and run ntpdate -d again to check the offset.
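
The 1000 second check can be scripted against the ntpdate output line. This is a minimal sketch, assuming bash and awk and assuming the line contains the word "offset" followed by the value in seconds, as in the output above; offset_too_large is a hypothetical helper name.

```shell
#!/bin/bash
# offset_too_large LINE [LIMIT]: succeed (exit 0) if the absolute offset in an
# ntpdate output line is greater than LIMIT seconds (default 1000).
# Exits 1 if the offset is within the limit, 2 if no offset field is found.
offset_too_large() {
    local line="$1" limit="${2:-1000}"
    printf '%s\n' "$line" | awk -v limit="$limit" '
        {
            for (i = 1; i < NF; i++)
                if ($i == "offset") { off = $(i + 1) + 0; seen = 1 }
        }
        END {
            if (!seen) exit 2
            if (off < 0) off = -off
            exit((off > limit) ? 0 : 1)
        }'
}

# Example:
#   if offset_too_large "$(ntpdate -d timeserver1 2>/dev/null | tail -1)"; then
#       echo "clock is more than 1000 seconds off, set it before starting xntpd"
#   fi
```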

#3. start xntpd
# startsrc -s xntpd
0513-059 The xntpd Subsystem has been started. Subsystem PID is 438386.

and

Uncomment this line in /etc/rc.tcpip so that xntpd starts at boot:
start /usr/sbin/xntpd -x "$src_running"

#4. Wait for at least 6 minutes before running:
lssrc -ls xntpd

Compare the Sys stratum value in the two outputs listed below.

bash-3.00# lssrc -ls xntpd
Program name: /usr/sbin/xntpd
Version: 3
Leap indicator: 00 (No leap second today.)
Sys peer: pg913xsfed02.ted.org
Sys stratum: 5 <------- this is good
Sys precision: -18
Debug/Tracing: DISABLED
Root distance: 0.152100
Root dispersion: 1.015091
Reference ID: 11.16.4.87
Reference time: ce014349.d0b4f000 Thu, Jul 9 2009 21:34:17.815
Broadcast delay: 0.003906 (sec)
Auth delay: 0.000122 (sec)
System flags: pll monitor filegen
System uptime: 279 (sec)
Clock stability: 0.000000 (sec)
Clock frequency: 0.000000 (sec)
Peer: time4.apple.com
flags: (configured)
stratum: 2, version: 3
our mode: client, his mode: server
Peer: pg913xsfed02.ted.org
flags: (configured)(sys peer)
stratum: 4, version: 3
our mode: client, his mode: server
Peer: pg913xsfed01.ted.org
flags: (configured)(sys peer)
stratum: 5, version: 3
our mode: client, his mode: server
Subsystem Group PID Status
xntpd tcpip 438386 active

bash-3.00# lssrc -ls xntpd
Program name: /usr/sbin/xntpd
Version: 3
Leap indicator: 11 (Leap indicator is insane.)
Sys peer: no peer, system is insane
Sys stratum: 16 <------- this is not good
Sys precision: -18
Debug/Tracing: DISABLED
Root distance: 0.000000
Root dispersion: 0.000000
Reference ID: no refid, system is insane
Reference time: no reftime, system is insane
Broadcast delay: 0.003906 (sec)
Auth delay: 0.000122 (sec)
System flags: pll monitor filegen
System uptime: 10 (sec)
Clock stability: 0.000000 (sec)
Clock frequency: 0.000000 (sec)
Peer: time4.apple.com
flags: (configured)
stratum: 16, version: 3
our mode: client, his mode: unspecified
Peer: pg913xsfed02.ted.org
flags: (configured)
stratum: 4, version: 3
our mode: client, his mode: server
Peer: pg913xsfed01.ted.org
flags: (configured)
stratum: 5, version: 3
our mode: client, his mode: server
Subsystem Group PID Status
xntpd tcpip 438386 active
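
Since the stratum is the one value that tells the two outputs apart, the check is easy to script for monitoring. A minimal sketch, assuming bash and awk and the lssrc output format shown above; sys_stratum and is_synchronized are hypothetical helper names.

```shell
#!/bin/bash
# sys_stratum: read "lssrc -ls xntpd" output on stdin and print the number
# from the "Sys stratum:" line (the per-peer "stratum:" lines are ignored).
sys_stratum() {
    awk -F': *' '/^ *Sys stratum:/ && !done {
        split($2, parts, " ")
        print parts[1]
        done = 1
    }'
}

# is_synchronized: succeed if the reported stratum is between 1 and 15.
# Stratum 16 means xntpd has not selected a time source.
is_synchronized() {
    local stratum
    stratum=$(sys_stratum)
    [[ "$stratum" =~ ^[0-9]+$ ]] && (( stratum >= 1 && stratum <= 15 ))
}

# Example:
#   if lssrc -ls xntpd | is_synchronized; then echo "in sync"; else echo "NOT in sync"; fi
```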

Setting up on Linux:
#1. Edit /etc/ntp.conf
server timehost1
server timehost2
server timehost3
server timehost4
driftfile /var/lib/ntp/drift

#2. /etc/init.d/ntpd start

Setting up on Solaris:
#1. Edit /etc/inet/ntp.conf
server timehost1
server timehost2
server timehost3
server timehost4
driftfile /var/lib/ntp/drift

#2. /etc/init.d/xntpd start
#3. svcadm refresh svc:/network/ntp

Setting up on OSX
#1. Edit /etc/ntp.conf
driftfile /var/lib/ntp/drift
server timehost1
server timehost2
server timehost3
server timehost4

#2. sudo /System/Library/StartupItems/NetworkTime/NetworkTime restart

---------------------------------------------------------------------------------------------------------

Problem: NTP daemon starts ok but dies after a few minutes
Solutions:
1. Check the date on the machine. If it shows a strange date they could be missing /unix or /vmunix.
2. Check the TZ variable. Often a timezone variable on the client that is different than the server can cause this problem.
3. Make sure the "broadcastclient" line is commented out in /etc/ntp.conf.
4. How much is the time off? If it is >1000 seconds then NTP won't stay active. To correct this, run ntpdate serveripaddress.

Problem: No server suitable for synchronization found.
Solution:
If you start xntpd on a server and run ntpdate on a client to set the client's time with that of the server,
it will not update the client unless the xntpd daemon has been active for 6 minutes or longer.