Friday, May 14, 2010

How to kill zombie CLOSE_WAIT() DataStage processes

When you restart DataStage on our AIX 5.3 machine with

$DSHOME/bin/uv -admin -stop

and then

$DSHOME/bin/uv -admin -start

there are no error messages, but the daemon is not running:
Status code = 81016

ps -ef | grep dsrpc shows leftover processes holding CLOSE_WAIT connections, which prevent DataStage from starting again.

Grep the output of the free 'lsof' utility for the status 'CLOSE_WAIT', use that to identify the process ID (PID), and kill that process.
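A minimal sketch of the lsof step, assuming 'lsof -i' prints its usual columns (COMMAND, PID, ..., NAME, with the TCP state in parentheses at the end). The function name close_wait_pids and the default 'dsrpc' command filter are illustrative choices, not part of DataStage or lsof.

```shell
# Print the unique PIDs of processes that hold sockets in CLOSE_WAIT.
# Reads `lsof -i` style output on stdin.
# $1 (optional): regex matched against the COMMAND column; defaults to
# "dsrpc", an assumption based on the process name used in this post.
close_wait_pids() {
    pattern=${1:-dsrpc}
    awk -v pat="$pattern" '/CLOSE_WAIT/ && $1 ~ pat { print $2 }' | sort -un
}

# Typical use as root. Review the list before killing anything:
#   lsof -i | close_wait_pids
#   kill <pid>        # escalate to kill -9 only if the process ignores TERM
```

Printing the PIDs first, instead of piping them straight into kill, leaves a chance to confirm that only stale DataStage processes are on the list.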

Dell 710 (Broadcom NIC) running RH kernel 2.6 and MSI setting

One of our Dev teams first reported that they could not ssh to the server (Dell 710). We had to recycle the server since we could get to the server thru DRAC or remotely. After that day they reported the same problem several times. The team uses this server for application profiling and load testing. The server's network subsystem intermittently stops responding.


At first glance, we noticed that the server was up and available but its network services were completely stalled. The server could not route any packet outside the server.

When we ran tcpdump on the interface, it revealed only ARP broadcasts and no responses.
After digging through the logs and system configuration files for hours and hours, we couldn’t establish a pattern for when it loses connectivity. Nothing in the logs suggested a hardware error or a kernel-related problem.

Our first suspect was firmware, so we upgraded it to the latest version available on the Dell site. After the firmware upgrade, the user reported a very weird timeout. So we upgraded the Red Hat kernel:
kernel-2.6.18-164.el5 -> kernel-2.6.18-194.el5

We opened a case with Dell. They gave us another upgraded firmware to apply. And then we changed the server side Ethernet port.
Also changed switch port
Changed the CAT5e cable
And, we finally requested Dell tech support to change the hardware.

Then we contacted the Net Ops team to check whether any setting on the switch side can shut down the interface if it sees heavy traffic coming from it.

Brainstorming sessions identified that the problem lies in layer 3:

N - network <<------ problem is here
D - data link
P - physical

We reviewed the Hyperic data and came to the conclusion that heavy traffic was not the actual cause, although the failures correlate directly with network throughput: the more traffic, the higher the probability that the interface will stop responding.

Server hardware and Redhat version info:
Hardware:
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R710
Version: Not Specified
Serial Number: 848KVH1

Base Board Information
Manufacturer: Dell Inc.
Product Name: 0YDJK3
Version: A09
Serial Number: ..CN1374003900LF.

BIOS Information
Vendor: Dell Inc.
Version: 2.0.11
Release Date: 02/26/2010
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 4096 kB

Redhat release info:
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)

Kernel info:
# uname -a
Linux md000ystls02 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Ethernet Info is as below:
On Board Device 2 Information
Type: Ethernet
Status: Enabled
Description: Embedded Broadcom 5709C NIC 1
Ethernet driver version:
# ethtool -i eth0
driver: bnx2
version: 2.0.2
firmware-version: 5.0.11 NCSI 2.0.5
bus-info: 0000:01:00.0


Identified Cause:
This behavior is only found on Broadcom network cards running on kernel 2.6. It is one of those problems that cannot be reproduced by a predefined set of steps.
By default, MSI (Message Signaled Interrupts) is enabled on kernel 2.6 (it is not supported on 2.4), and that causes this intermittent network drop on Broadcom cards.

Disabling MSI on Broadcom bnx2 module resolves this problem.


Solution:
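Here is the fix:
run modprobe bnx2 disable_msi=1 (the option only applies when the module is loaded, so unload bnx2 first if it is already loaded)

or, for a permanent setting, edit modprobe.conf and add options bnx2 disable_msi=1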

# cat /etc/modprobe.conf
alias scsi_hostadapter megaraid_sas
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 usb-storage
alias eth0 bnx2
alias eth2 bnx2
alias eth1 bnx2
alias eth3 bnx2
options bnx2 disable_msi=1
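
A small sketch for confirming that the permanent setting is actually present, for example before a reboot or when auditing several servers. The function name bnx2_msi_disabled_in is hypothetical; it only inspects the config file and does not tell you whether the running module was loaded with the option.

```shell
# Return 0 if the given modprobe config file contains an active
# (uncommented) "options bnx2 ... disable_msi=1" line, 1 otherwise.
# $1 (optional): path to the config file; defaults to /etc/modprobe.conf.
bnx2_msi_disabled_in() {
    conf=${1:-/etc/modprobe.conf}
    awk '$1 == "options" && $2 == "bnx2" {
             for (i = 3; i <= NF; i++)
                 if ($i == "disable_msi=1") found = 1
         }
         END { exit !found }' "$conf"
}

# Example:
#   bnx2_msi_disabled_in && echo "disable_msi=1 is set for bnx2"
```

A commented-out line such as '#options bnx2 disable_msi=1' is not counted, because its first field is '#options' rather than 'options'.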

IBM InfoSphere Information Server 8.1 installation hiccups

Recently, while installing IBM IIS, I came across some silly errors which are worth documenting. For some reason I didn't see these problems with 8.0.1. I think most of the problems arise if you have an old installation of IIS on the machine.


So, here we go
#1. The installation wizard gives a weird, no-clue error message and terminates,
suggesting that you fix the issue before proceeding.

The error is something like this
(this is the GUI error; you see the same with the console installation):




If you look at the log, it will not make any sense either.

Apr 29, 2010 2:22:05 PM , INFO: com.ascential.acs.installer.utils.uservalidation.UserValidationBuildAction execute
Apr 29, 2010 2:22:05 PM , SEVERE: com.ascential.acs.installer.utils.InstalledProductBeanWizardBeanCondition
ServiceException: (error code = 200; severity = 0; exception = [java.lang.NullPointerException])
at com.installshield.wizard.service.LocalImplementorProxy.invoke(Unknown Source)
at com.installshield.wizard.service.AbstractService.invokeImpl(Unknown Source)
at com.installshield.product.service.registry.GenericRegistryService.getSoftwareObject(Unknown Source)
at com.ascential.acs.installer.utils.ProductBeanUtil.isInstalled(ProductBeanUtil.java:106)

Solution:
Comment out the +@sys_admin entry in your /etc/passwd file as below:

[ibmpg:/opt/IBM] # cat /etc/passwd
root:!:0:0::/:/usr/bin/ksh
daemon:!:1:1::/etc:
bin:!:2:2::/bin:
sys:!:3:3::/usr/sys:
adm:!:4:4::/var/adm:
uucp:!:5:5::/usr/lib/uucp:
guest:!:100:100::/home/guest:
nobody:!:4294967294:4294967294::/:
lpd:!:9:4294967294::/:
lp:*:11:11::/var/spool/lp:/bin/false
invscout:*:6:12::/var/adm/invscout:/usr/bin/ksh
snapp:*:200:13:snapp login user:/usr/sbin/snapp:/usr/sbin/snappd
nuucp:*:7:5:uucp login user:/var/spool/uucppublic:/usr/sbin/uucp/uucico
ipsec:*:201:1::/etc/ipsec:/usr/bin/ksh
sshd:*:202:201::/var/empty:/usr/bin/ksh
#+@sys_admin::::::
esaadmin:*:811:0::/home/esaadmin:/usr/bin/ksh
isadmin:!:5474:5087:IBM IIS Admin:/home/isadmin:/usr/bin/ksh
wasadmin:!:5475:5088:IBM WebSphere Admin :/home/wasadmin:/usr/bin/ksh
db2as:!:209:5089::/db2home/db2as:/usr/bin/ksh
xmeta81:!:204:1:XMETA 81 Admin:/home/xmeta81:/usr/bin/ksh
x81inst1:!:206:1:XMETA81 Instance Owner:/home/x81inst1:/usr/bin/ksh
x81fenc1:!:207:1:XMETA81 Fence user:/home/x81fenc1:/usr/bin/ksh
x81das1:!:208:1:XMETA81 DAS user:/home/x81das1:/usr/bin/ksh

#2. Second error
When proceeding with the installation, it won't go beyond the fenced user prompt:

Fenced user [db2fenc2] x81fenc1
Press 1 for Next, 2 for Previous, 3 to Cancel or 5 to Redisplay [1]
-------------------------------------------------------------------------------
IBM Information Server - InstallShield Wizard

Errors occurred during the installation.
- null
- null
The following warnings were generated:
- null
Press 2 for Previous, 3 to Cancel or 5 to Redisplay [2]
-------------------------------------------------------------------------------
DB2 - InstallShield Wizard

Solution:
It seems the installer does not like the old VPD files. When we talked to IBM, they suggested hiding these VPD files and the .dshome file. Here are the steps:

We need to move (rename) the following directory and file:
$ cd /usr/lib/objrepos/InstallShield/Universal/IBM/
$ mv InformationServer InformationServer801
$ cd /
$ mv .dshome .dshome.801
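
The renames above can be wrapped in a small guard so that a rerun does not clobber an earlier backup. This is only a sketch; the helper name hide_path is made up for illustration and is not an IBM tool.

```shell
# Rename SRC to DEST. Skips quietly when SRC is absent (nothing to hide)
# and refuses to overwrite an existing DEST.
hide_path() {
    src=$1
    dest=$2
    if [ ! -e "$src" ]; then
        echo "skip: $src not found"
        return 0
    fi
    if [ -e "$dest" ]; then
        echo "refusing: $dest already exists" >&2
        return 1
    fi
    mv "$src" "$dest"
}

# Example, run as root with the paths from the steps above:
#   cd /usr/lib/objrepos/InstallShield/Universal/IBM/
#   hide_path InformationServer InformationServer801
#   hide_path /.dshome /.dshome.801
```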

#3. Third error
Operating system information: AIX 5.3
ERROR: The DB2 Administration Server is already configured for this computer.

Solution:
This was simple enough compared to the first two.
This error is also due to the fact that DB2 was already installed on the server.

===== Part 1 : make sure it's stopped: =====
To stop the DB2 administration server:
1. Log in as the DB2 administration server owner.
2. Stop the DB2 administration server by entering the db2admin stop command.

===== Part 2 : remove DAS =====
To remove the DB2 administration server:
1. Log in as a user with root user authority.
2. Stop the DB2 administration server.
3. Remove the DB2 administration server by entering the following command:

DB2DIR/instance/dasdrop
where DB2DIR is the location you specified during the DB2 Version 9 installation.

#4. Fourth error
Finally, there was a WebSphere error complaining about vpd files from the old installation.
I don't have the exact error for this, but when installing the Engine it cannot proceed.

[ibmpg:] #
cd /usr/lib/objrepos
mv vpd.properties vpd.properties.8.0.1