Tuesday, November 30, 2010

Core and threads

Courtesy Jason Lane.

Intel does it, AMD does not:
Intel and AMD have done their best to differentiate the x86 architecture as much as possible while retaining compatibility between their CPUs, but the differences between the two are growing. One key differentiator is hyperthreading.


To Windows threads are cores:
Multicore and HyperThreading (referred to as “HT”) are not the same, but you can be suckered into believing they are, because hyperthreading looks like a core to Windows. My computer is a Core i7-860, a quad-core design with two threads per core. To Windows 7, I have eight cores.
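You can see this from the operating system's side, too. Below is a quick sketch (Python, with Linux's /proc/cpuinfo assumed where available) that prints the logical processor count the OS schedules against; on an HT machine "siblings" (hardware threads per package) will be double "cpu cores" (physical cores):

```python
import os

# os.cpu_count() reports logical processors -- what the OS "sees" --
# so a quad-core chip with two threads per core reports 8.
logical = os.cpu_count()
print("logical processors:", logical)

# On Linux, /proc/cpuinfo distinguishes the two: "siblings" counts
# hardware threads per package, "cpu cores" counts physical cores.
try:
    with open("/proc/cpuinfo") as f:
        lines = f.read().splitlines()
    for key in ("siblings", "cpu cores"):
        match = next((l for l in lines if l.startswith(key)), None)
        if match:
            print(match)
except OSError:
    pass  # not a Linux box; skip the breakdown
```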


Multicore was designed when processors hit a clock speed wall
Multicore CPUs were introduced as a solution to the fact that around a decade ago, processors hit a clock speed wall. The CPUs just were not getting any faster, and could not do so without extreme cooling. Unable to get to 4GHz, 5GHz, and beyond, AMD and Intel turned to dual core designs.


There are two ways to look at computation: speed and parallelization. In some instances, such as tasks involving massive calculation, it makes sense to have an 8GHz core. The thing is, most business applications don’t really need this. In fact, there are vendors arguing that the Xeon is overkill for many server tasks.

So the solution was multicore. AMD was the first consumer chip vendor to release a multicore chip, in 2005, with the Athlon 64 X2. Intel followed in 2006 with the Core 2 Duo. (The first non-embedded dual-core CPU was IBM’s POWER4 processor in 2001.) If a vendor couldn’t improve performance with one 5GHz core, it could get there with two 2.5GHz cores.

Dual cores are like a two-lane highway:
To use a highway analogy, it’s the equivalent of going from a one-lane road to a two-lane road. Even if the two-lane road has a lower speed limit, more cars can travel to their destinations at any given time.


Dual cores are connected via an on-die interconnect rather than the motherboard
You may remember the days of symmetric multiprocessing (SMP), when computer systems had two physical CPUs on the motherboard. There’s no difference in execution between two single-core processors in an SMP configuration and a single dual-core CPU.


The difference, though, is that the dual-core CPU has much, much faster communication between the cores. That’s because they are on the same die, connected by a high-speed interconnect. In an SMP system, “communication” between the CPUs has to go out through the CPU socket, cross the motherboard, and go through the socket of the second CPU. So communication between cores on the same die is considerably faster.


Intel first introduced HyperThreading with the Pentium 4 processor in 2002 and that year’s model of Xeon processors. Intel dropped the technology when it introduced the Core architecture in 2006 and brought it back with the Nehalem generation in 2008. Intel remains firmly dedicated to HT and is even introducing it in its Atom processor, a tiny chip used in embedded systems and netbooks. As Tom’s Hardware found in tests, HyperThreading increased performance by 37%.


Why AMD doesn't believe in HT:
AMD has never embraced hyperthreading. In an interview with TechPulse 360, AMD’s director of business development Pat Patla and server product manager John Fruehe told the blog, “Real men use cores … HyperThreading requires the core logic to maintain two pipelines: its normal pipeline and its hyperthreaded pipeline. A management overhead that doesn’t give you a clear throughput.”


With the June 2009 release of the six-core “Istanbul” line of Opteron processors, AMD introduced something called “HT Assist,” a technology to map the contents of the L3 caches. It reserves a 1MB portion of each CPU’s L3 cache to act as a directory that tracks the contents of all of the other CPU caches in the system. AMD believes this will reduce latency because it creates a map of cache data, as opposed to having to search every single cache for data. It’s
not multithreading and shouldn’t be confused with it. It’s simply a means of making the server processor more efficient.


HyperThreading Deconstructed
HT is the technical process where two threads are executed on one processor core. To Windows, a core capable of executing two threads is seen as two processors, but it’s really not the same. A core is a physical unit you can see under a microscope. Threads are executed inside the core.

HT is like passing on the left on the highway. If a car ahead of you is going too slow, you pass it at your preferred speed. In HyperThreading, if a thread can’t finish immediately, it lets another run by it. But that’s a simplistic explanation.

Here’s how it works. The CPU has to constantly move data in and out of memory as it processes code. The CPU’s caches attempt to alleviate this. Within each CPU core is the L1 cache, which is usually very small (around 32KB). Outside of the CPU core, right next to it on the chip, is the L2 cache. This is larger, usually between 256KB and 512KB. The next cache is the L3 cache, which is shared by all of the cores and is several megabytes in size. L3 caches were added with the advent of multicore processors. After all, it is easier and more efficient to keep data in the very fast L3 cache than to let it go out to main memory.

The CPU executes instructions at a rate governed by its clock speed. Instructions take a varying number of cycles; some can be done in one cycle, others may require a dozen. It’s all based on the complexity of the task. At modern clock speeds, a cycle lasts a fraction of a nanosecond.

Every CPU core has what’s called a pipeline. Think of pipelines as the stages in an assembly line, except here the process is the assembly of an application task. At some point, the pipeline may stall: it has to wait for data, or for another hardware component in the computer. We’re not talking about a hung application; this is a delay of a few hundred nanoseconds while data is fetched from RAM. Still, other threads have to wait in a non-hyperthreaded pipeline, so it looks like:

thread1— thread1— (delay)— thread1—- thread2— (delay)— thread2— thread3— thread3— thread3—

With hyperthreading, when the core’s execution pipeline stalls, the core begins to execute another thread that’s waiting to run. Mind you, the first thread is not stopped. When it gets the data it wants, it resumes execution as well.

thread1— thread1— thread2— thread2— thread1— thread2— thread1— thread2— thread2—

The computer is not slowed by this; it simply doesn’t wait for one thread to complete before it starts executing a new one.
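The same interleaving can be modeled with ordinary software threads. This is only an analogy in Python (the sleeps stand in for pipeline stalls; real HT happens in hardware), but it shows why letting another thread run during a stall wins: the two waits overlap, so total wall time is close to one delay rather than two.

```python
import threading
import time

def stalled_thread(delay):
    # time.sleep stands in for a pipeline stall (e.g. waiting on RAM)
    time.sleep(delay)

start = time.time()
threads = [threading.Thread(target=stalled_thread, args=(0.2,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# Both 0.2s waits overlap, so elapsed is ~0.2s, not ~0.4s.
print(f"elapsed: {elapsed:.2f}s")
```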

HT in Practice

There are two ways HT comes into play. One is execution. A multithreaded computer boots much faster, since multiple libraries, services, and applications are loaded as fast as they can be read off the hard drive. You can start up several applications faster with an HT-equipped computer as well. That’s primarily done by Windows, which manages threads on its own. Windows 7 and Windows Server 2008 R2 were both written to better manage application execution, and both operating systems can see up to 256 cores/threads, more than we will see for a long time.

Of course, the same applies to cores. A quad-core processor is inherently faster than a dual core processor with HT, since full cores can do more than HT.

However, benchmarks have found that for system loads, you don’t gain much after four cores (or two cores with two threads each). More cores and threads cannot compensate for other bottlenecks in your computer, like the hard drive.

Then there’s application hyperthreading, wherein an application is written to perform tasks in parallel. That requires programming skill; in addition, the latest compilers search for code that can be parallelized. Parallel processing has been around for years, but up until the last few years it remained an extremely esoteric process done by very few people (who commanded a pretty penny). The advent of multicore processors and the push by Intel and AMD to support multithreading is bringing parallel processing to the masses.

Intel has never, ever claimed that HT will double performance, because applications have to be written to take advantage of HT to make the most of it. Even there, HT is not a linear doubling. Intel puts the performance gain of HT at between 20% and 40% under ideal circumstances.

Hardware threads can’t jump cores. Because the threads are hardwired to the core, chip vendors can’t put too many threads on a core or they slow the core down. That’s why Intel has just two per core.

Multithreading does not add genuine parallelism to the processing structure, because the core is not executing two threads at once. Basically, it lets whichever thread is ready to go run first. Under certain loads, it can make the processing pipeline more efficient and push multiple threads of execution through the processor a little faster.

So while multithreading is good for system level processing — loading applications and code, executing code, etc. — application-level multithreading is another matter. Multithreading is useless for single-threaded applications and can even degrade performance in certain circumstances.

AMD takes great delight in pointing this out. In a blog entry, “It’s all about cores,” the company cites examples from software vendors arguing against HT.

A consultant who deals with Cognos, business intelligence software owned by IBM, recommends disabling HyperThreading because it “frequently degrades performance and proves unstable.”
Microsoft recommends turning off HyperThreading when running PeopleSoft applications because “our lab testing has shown little or no improvement.”

A Microsoft TechNet article recommends disabling HT for production Microsoft Exchange servers and says it should “only [be] enabled if absolutely necessary as a temporary measure to increase CPU capacity until additional hardware can be obtained.”
Why? Because these applications have not been optimized for multithreading, for starters. And since two threads share the same circuitry in the processor, there is an odd chance of cache overwrites and data collisions. Even with Windows Server 2008’s multithreading management, the OS can’t fully control what the processor does.

It should be noted that these are exceptions and not the rule. Hypervisors, the software layer that manages a virtualized server, are HT-aware and make full use of the threads. HT provides the virtual machines with more execution scenarios than if HT was disabled, because the CPUs might otherwise be viewed as busy.

Applications that need lots of I/O (network and disk I/O in particular) can benefit from splitting operations into multiple threads on an HT system; by running disk and network I/O in separate threads, you might see some real performance gains.
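A sketch of that idea in Python, with a sleep standing in for a blocking disk or network read (the 0.1s delay and the names here are illustrative, not from any real workload): four waits issued through a thread pool overlap into roughly one wait of wall time.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(name):
    # stand-in for a blocking disk or network read
    time.sleep(0.1)
    return name.upper()

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, ["a", "b", "c", "d"]))
elapsed = time.time() - start

# The four 0.1s waits overlap, so elapsed is ~0.1s, not ~0.4s.
print(results, f"{elapsed:.2f}s")
```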

The bottom line is that when considering hyperthreading systems, you need to check with your software vendors to learn what they recommend. If a number of your applications are better off with HT disabled, that should play into your decision-making process.

A Hidden Cost

The threads vs. cores argument has one more element to consider. Enterprise software vendors have two pricing policies that could potentially impact you: by the core and by the processor. Both Intel and AMD have encouraged the industry to support pricing on a per-processor (or per-socket) basis.

Some do. Microsoft has a stated policy that it charges by the processor, not by the core. VMware’s software license is on a per-core basis. Oracle licenses its software both ways: its Standard Edition software is on a per-socket basis, while its Enterprise Editions are on a per-core basis. Fortunately, the per-core vendors charge by physical cores and don’t view HT threads as extra cores.

Because of this variance, it’s incumbent on every company making a purchase decision to ask the software providers if their pricing scheme is per-core or per-socket. A typical blade server from Dell has two sockets on it, and some have four, but CPUs are moving to four, six, eight, and 12 cores.

If you purchase an AMD blade server with two Opteron 6100 processors, it’s the difference between two processors or 24 cores. Or you may want either an Intel Xeon 5600 with six cores, or an older AMD Opteron 6000, which also had six cores but no HT. It certainly adds a nice layer of confusion, doesn’t it?
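The arithmetic is worth spelling out. With made-up license fees (both dollar figures below are hypothetical, purely for illustration), here is the gap between the two schemes on a two-socket box with 12-core CPUs:

```python
# Hypothetical license fees -- illustration only.
per_socket_fee = 2000
per_core_fee = 400

sockets = 2
cores_per_socket = 12  # e.g. a 12-core Opteron 6100

per_socket_total = sockets * per_socket_fee                  # 2 licenses
per_core_total = sockets * cores_per_socket * per_core_fee   # 24 licenses
print("per-socket:", per_socket_total)  # 4000
print("per-core:", per_core_total)      # 9600
```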

Tuesday, November 16, 2010

Avahi daemon on Redhat

If /var/log/messages looks something like this and you are wondering what it is:

Nov 16 19:44:21 server1 ntpd[28254]: synchronized to 10.418.5.205, stratum 2
Nov 16 19:45:43 server1 avahi-daemon[29476]: Invalid response packet.
Nov 16 19:47:39 server1 last message repeated 7 times
Nov 16 19:48:48 server1 last message repeated 49 times
Nov 16 19:50:49 server1 last message repeated 7 times
Nov 16 19:52:44 server1 last message repeated 7 times
Nov 16 19:53:54 server1 last message repeated 42 times
Nov 16 19:55:54 server1 last message repeated 7 times
Nov 16 19:57:50 server1 last message repeated 7 times
Nov 16 19:58:59 server1 last message repeated 35 times
Nov 16 20:00:29 server1 last message repeated 21 times
Nov 16 20:02:55 server1 last message repeated 14 times
Nov 16 20:03:04 server1 last message repeated 20 times
Nov 16 20:03:21 server1 ntpd[28254]: synchronized to 10.418.5.205, stratum 2
Nov 16 20:03:24 server1 avahi-daemon[29476]: Invalid response packet.
Nov 16 20:04:04 server1 last message repeated 35 times
Nov 16 20:06:04 server1 last message repeated 7 times
Nov 16 20:07:59 server1 last message repeated 7 times
Nov 16 20:09:09 server1 last message repeated 35 times
Nov 16 20:11:09 server1 last message repeated 7 times
Nov 16 20:13:04 server1 last message repeated 7 times
Nov 16 20:14:16 server1 last message repeated 35 times
Nov 16 20:16:13 server1 last message repeated 21 times
Nov 16 20:18:10 server1 last message repeated 7 times

It's Avahi!
If you have not customized your kickstart script carefully, Avahi gets installed and enabled by default on Red Hat. If this server is in your data center, you don't need it.

# /etc/init.d/avahi-daemon stop
# chkconfig --level 012345 avahi-daemon off

Avahi is an mDNS and DNS-SD daemon (that is, multicast DNS plus DNS service discovery) implementing Apple's Zeroconf architecture (also known as "Rendezvous" or "Bonjour").

Avahi-daemon interprets its configuration file /etc/avahi/avahi-daemon.conf and reads XML fragments from /etc/avahi/services/*.service, which may define static DNS-SD services. If you enable publish-resolv-conf-dns-servers in avahi-daemon.conf, the file /etc/resolv.conf will be read as well.

A very good explanation of mDNS and DNS-SD:

Multicast DNS means that each equipped host stores its own DNS records. Clients wishing to get the IP address of a given hostname send their query to a multicast address (224.0.0.251), and the host that owns that name responds with its IP address.

DNS-SD uses the same technology, but in addition to regular DNS information, hosts also publish service instance information: they announce what services they provide and how to contact those services. All of this is intended to mean that hosts and services can connect to one another without requiring any user configuration: known as Zeroconf sharing. Great for those who aren't comfortable doing manual setup -- or who are just lazy!

In truth, as yet there isn't that much Linux software that really uses mDNS. Apple has made rather more use of it: its implementation is called Bonjour, and it handles printer setup, music sharing via iTunes, photo sharing via iPhoto, iChat, and an array of other software services. However, in terms of the technical implementation, Avahi is an excellent piece of software, capable of doing everything that Bonjour does. It's been suggested that the Debian/Ubuntu dev teams are trying to give mDNS a bit of encouragement with the inclusion of Avahi.

So, what can you do with avahi on your Linux box? One possibility is to use it for networked music sharing. In particular, if some of your music is on laptops that appear and disappear from the network as they are moved around and shut down or booted up, auto music discovery is very handy. This is the same tech that Apple uses for iTunes. Since I have a Mac laptop and a couple of Debian desktops which live in another room, this sounded promising.

Unfortunately, it currently only works in one direction: rhythmbox can connect to an iTunes share but can't actually get at any of the music (this is due to a change in protocol from iTunes 7.0). This is enormously irritating and entirely Apple's fault. Sharing in the other direction works fine: use the "Plugins" menu to configure sharing via DAAP (remember to hit the "configure" button and then check the "share my music" box), and your share will be made available. It'll show up automatically in iTunes on a Mac; in rhythmbox you'll need to use the "Connect to DAAP share" option in the Music menu of rhythmbox, and give the hostname/IP address and port (3689) to connect to. If you add music it won't appear in the share until you either restart rhythmbox (client-side), or disconnect and reconnect the share in iTunes. (Note: if running a firewall, you'll need to open appropriate holes in it for outbound sharing, although not for inbound.)

Friday, October 29, 2010

Unable to authenticate local user

We encountered another weird problem today when someone reported that they could not log in as a local user, and su to that user also failed.

Errors were as follows:
#1.
$su - pankaj
su: incorrect password

#2.
users listed in /etc/passwd (local users) cannot log in to the server


Logs:
/var/log/messages shows
Oct 29 14:57:15 pgserver01 sshd[1457]: Address 11.22.20.130 maps to l4339284.federated.fds, but this does not map back to the address - POSSIBLE BREAKIN ATTEMPT!
Oct 29 14:57:15 pgserver01 sshd(pam_unix)[1465]: auth could not identify password for [pankaj]
Oct 29 14:57:15 pgserver01 sshd[1457]: error: PAM: Authentication failure for pankaj from 11.22.20.130
Oct 29 14:57:17 pgserver01 sshd(pam_unix)[1457]: auth could not identify password for [pankaj]
Oct 29 14:57:17 pgserver01 sshd[1457]: Failed password for pankaj from ::ffff:11.22.20.130 port 56138 ssh2


We did a lot of troubleshooting around the exact symptoms and restarted all the relevant services to clear out any hung auth modules.
A few of the steps taken:
/etc/init.d/vas restart
/etc/init.d/xinetd restart
/etc/init.d/sshd restart


Resolution:
We found /etc/pam.d/system-auth misconfigured with the option "use_first_pass" on the pam_unix auth line:

[root@esu1l101 ~]# cat /etc/pam.d/system-auth
#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth required /lib/security/$ISA/pam_env.so
auth [ignore=ignore success=done default=die] pam_vas3.so create_homedir
#auth sufficient /lib/security/$ISA/pam_unix.so likeauth nullok use_first_pass <-- replaced this line with line below
auth sufficient /lib/security/$ISA/pam_unix.so likeauth nullok
auth required /lib/security/$ISA/pam_deny.so

account [ignore=ignore success=done default=die] pam_vas3.so
account required /lib/security/$ISA/pam_unix.so
account sufficient /lib/security/$ISA/pam_succeed_if.so uid <>
account required /lib/security/$ISA/pam_permit.so

password [ignore=ignore success=done default=die] pam_vas3.so
password requisite /lib/security/$ISA/pam_cracklib.so retry=3
password sufficient /lib/security/$ISA/pam_unix.so nullok use_authtok md5 shadow nis
password required /lib/security/$ISA/pam_deny.so

session required /lib/security/$ISA/pam_limits.so
session required pam_vas3.so create_homedir
session required /lib/security/$ISA/pam_unix.so


Explanation of the PAM module option:

use_first_pass
The module should not prompt the user for a password. Instead, it should obtain the previously typed password
(from the preceding auth module), and use that.
If that doesn't work, then the user will not be authenticated.
(This option is intended for auth and password modules only).

Thursday, October 28, 2010

A single previous owner was found in the messaging engine's data store,

This error is from the service integration bus running on WebSphere 6.1.0.17 and DB2. We use this enterprise service bus for one of our projects.

You will see this error when the application starts, after it connects to the database. WebSphere starts OK, but the messaging bus never allows any new connections.
If you look at the log, it shows the bus in Starting state instead of Started state:
Messaging engine ibm61p2Node_mcomstars_QA.mcomstars_QA_s61p2-MCOMStarsBus is in state Starting.


Error:
A single previous owner was found in the messaging engine's data store, ME_UUID=3FD2CC33B88EB9E

Also referred to as WebSphere CWSIS1545I and CWSIS1537I errors in many IBM blogs.

SystemOut.log shows:
[10/18/10 18:21:05:580 EDT] 00000032 ManagedEsServ I com.ibm.wbiserver.sequencing.service.ManagedEsService startEsServiceWithConnRetries() cannot create connection. esApps.isEmpty=true esStarted=false wasStarted=false meStarted=false isActive=true

[10/18/10 15:05:54:639 EDT] 00000045 SibMessage I [MCOMStarsBus:mcomstars_cluster01.000-MCOMStarsBus] CWSIS1538I: The messaging engine, ME_UUID=3FD2CC33B88EB9E6, INC_UUID=7D467D46C0BBCCCD, is attempting to obtain an exclusive lock on the data store.

[10/18/10 15:05:54:732 EDT] 00000046 SibMessage I [MCOMStarsBus:mcomstars_cluster01.000-MCOMStarsBus] CWSIS1545I: A single previous owner was found in the messaging engine's data store, ME_UUID=3FD2CC33B88EB9E6, INC_UUID=2BA62BA6B8B62F0C

[10/18/10 15:05:55:139 EDT] 00000045 SibMessage I [MCOMStarsBus:mcomstars_cluster01.000-MCOMStarsBus] CWSIS1537I: The messaging engine, ME_UUID=3FD2CC33B88EB9E6, INC_UUID=7D467D46C0BBCCCD, has acquired an exclusive lock on the data store.

Exception: com.ibm.ws.sib.msgstore.PersistenceException: CWSIS1500E: The dispatcher cannot accept work.
[10/18/10 15:02:00:559 EDT] 00000045 SibMessage E [:] CWSIP0002E: An internal messaging error occurred in com.ibm.ws.sib.processor.impl.BaseDestinationHandler, 1:1977:1.692.1.7, com.ibm.wsspi.sib.core.exception.SIRollbackException: CWSIS1002E: An unexpected exception was caught during transaction completion. Exception: com.ibm.ws.sib.msgstore.PersistenceException: CWSIS1500E: The dispatcher cannot accept work.

After a successful start, SystemOut.log should look like this:
[10/22/10 15:27:54:215 EDT] 0000002a SibMessage I [MCOMStarsBus:mcomstars_cluster01.000-MCOMStarsBus] CWSIS1538I: The messaging engine, ME_UUID=3FD2CC33B88EB9E6, INC_UUID=671A671AD5695F63, is attempting to obtain an exclusive lock on the data store.

[10/22/10 15:27:54:304 EDT] 0000002b SibMessage I [MCOMStarsBus:mcomstars_cluster01.000-MCOMStarsBus] CWSIS1543I: No previous owner was found in the messaging engines data store.

[10/22/10 15:27:54:318 EDT] 0000002a SibMessage I [MCOMStarsBus:mcomstars_cluster01.000-MCOMStarsBus] CWSIS1537I: The messaging engine, ME_UUID=3FD2CC33B88EB9E6, INC_UUID=671A671AD5695F63, has acquired an exclusive lock on the data store.

[10/22/10 15:28:06:604 EDT] 0000001f SibMessage I [MCOMStarsBus:mcomstars_cluster01.000-MCOMStarsBus] CWSID0016I: Messaging engine mcomstars_cluster01.000-MCOMStarsBus is in state Started.



Resolution:
There are lots of similar situations around this error, depending upon how your environment is set up.
Run this against the database:

delete from ESBME1.SIB000;
delete from ESBME1.SIB001;
delete from ESBME1.SIB002;
delete from ESBME1.SIBCLASSMAP;
delete from ESBME1.SIBKEYS;
delete from ESBME1.SIBLISTING;
delete from ESBME1.SIBOWNER;
delete from ESBME1.SIBXACTS;


Here is one good explanation of the problem:
This means that either you have two messaging engines pointing at the same
database tables, or you had another messaging engine defined that used the
same tables before (e.g. you created a bus/messaging engine, started it,
stopped it, deleted the bus and recreated it, then pointed the new messaging
engine at the same database as the previous one).

If you have two messaging engines using the same tables, then the fix is to
point your messaging engine at a different set of tables (e.g. a different schema or database).
If you have deleted and recreated a bus/messaging engine, then clear out the
contents of the messaging engine's database tables and restart it. This is a feature that prevents two different messaging engines from using the same tables in the database.

--
Martin Phillips
mphillip at uk.ibm.com

Thursday, September 9, 2010

"could not open session" when using sudo

Last week we encountered another strange error. When sudoing to one particular user, we see this:

[root@server]# sudo su - pankaj
could not open session


Resolution:
cat /etc/security/limits.conf
Look for user pankaj in this file. For example, an entry like this (a nofile limit of 0) will prevent the session from opening; comment it out:

#pankaj - nofile 0


negative uptime ?

Today we noticed something very unusual.

Uptime shows negative 24855 days.
ps shows java processes running since year 1945 or 1944.

New files are created with a timestamp of year 1874.

Jboss and other processes are going haywire since they can't figure what files/processes need to be redeployed or cleaned.

Also, the load average is through the roof:

$ date
Fri Sep 10 01:12:09 EDT 2010

$ uptime
17:44:25 up -24855 days, -3:-14, 3 users, load average: 66.89, 67.11, 67.98

$ touch pankaj
$ ls -l pankaj
-rw-r--r-- 1 p139pkg ux_mdc 0 Aug 3 1874 pankaj

[root@server /]# uname -a
Linux esu1l100.federated.fds 2.6.9-67.0.7.ELsmp #1 SMP Wed Feb 27 04:47:23 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

[root@server /]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 6)


[app@server]$ ps -elf |grep java
0 S pankaj 4468 32506 0 76 0 - 325272 pipe_w 1945 ? 00:11:06 /usr/java/jdk1.5.0_18//bin/java -DCAV_MON_HOME=/home/pankaj/cavisson/monitors/ -DMON_TEST_RUN=3916 -DVECTOR_NAME=WS30 cm_java_gc_ex -f /www/a/logs/AppServer/gc.log -o 2 -i 10000
0 S pankaj 4554 32506 0 76 0 - 325528 pipe_w 1944 ? 00:11:21 /usr/java/jdk1.5.0_18//bin/java -DCAV_MON_HOME=/home/pankaj/cavisson/monitors/ -DMON_TEST_RUN=3532 -DVECTOR_NAME=WS30 cm_java_gc_ex -f /www/a/logs/AppServer/gc.log -o 2 -i 10000
0 S pankaj 7003 32506 0 76 0 - 326040 pipe_w 1944 ? 00:11:12 /usr/java/jdk1.5.0_18//bin/java -DCAV_MON_HOME=/home/pankaj/cavisson/monitors/ -DMON_TEST_RUN=3724 -DVECTOR_NAME=WS30 cm_java_gc_ex -f /www/a/logs/AppServer/gc.log -o 2 -i 10000
0 R pankaj 7103 32506 0 85 0 - 325784 - 1945 ? 00:16:38 /usr/java/jdk1.5.0_18//bin/java -DCAV_MON_HOME=/home/pankaj/cavisson/monitors/ -DMON_TEST_RUN=3919 -DVECTOR_NAME=WS30 cm_java_gc_ex -f /www/a/logs/AppServer/gc.log -o 2 -i 10000
0 S pankaj 9126 32506 0 76 0 - 312984 pipe_w 1944 ? 00:11:11 /usr/java/jdk1.5.0_18//bin/java -DCAV_MON_HOME=/home/pankaj/cavisson/monitors/ -DMON_TEST_RUN=3722 -DVECTOR_NAME=WS30 cm_java_gc_ex -f /www/a/logs/AppServer/gc.log -o 2 -i 10000
0 R app 11084 9687 0 78 0 - 12777 - 2081 pts/1 00:00:00 grep java
0 S pankaj 12101 32506 0 76 0 - 326296 pipe_w 1944 ? 00:11:11 /usr/java/jdk1.5.0_18//bin/java -DCAV_MON_HOME=/home/pankaj/cavisson/monitors/ -DMON_TEST_RUN=3729 -DVECTOR_NAME=WS30 cm_java_gc_ex -f /www/a/logs/AppServer/gc.log -o 2 -i 10000
0 S pankaj 12809 32506 0 76 0 - 326808 pipe_w 1944 ? 00:11:20 /usr/java/jdk1.5.0_18//bin/java -DCAV_MON_HOME=/home/pankaj/cavisson/monitors/ -DMON_TEST_RUN=3717 -DVECTOR_NAME=WS30 cm_java_gc_ex -f /www/a/logs/AppServer/gc.log -o 2 -i 10000
0 S pankaj 12900 32506 0 76 0 - 322456 pipe_w 1944 ? 00:11:19 /usr/java/jdk1.5.0_18//bin/java -DCAV_MON_HOME=/home/pankaj/cavisson/monitors/ -DMON_TEST_RUN=3718 -DVECTOR_NAME=WS30 cm_java_gc_ex -f /www/a/logs/AppServer/gc.log -o 2 -i 10000


Resolution:
I believe this is a bug in kernel 2.6.x and earlier: the uptime counter cannot handle an uptime of more than 497 days. The counter wraps to 0 after 497 days, and this can also cause some funny process start times in a ps display.

It is reported to have been fixed in kernel 2.6.14.3
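The arithmetic behind the 497-day figure: the uptime counter (jiffies) is a 32-bit value incremented HZ times per second. Assuming the classic HZ=100 tick rate, it wraps after roughly 497 days:

```python
# 32-bit jiffies counter, incremented HZ times per second.
HZ = 100  # classic tick rate; other kernels use 250 or 1000
wrap_seconds = 2**32 / HZ
wrap_days = wrap_seconds / 86400
print(round(wrap_days, 1))  # 497.1
```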

Wednesday, July 28, 2010

Data Stage internal error 39202

While accessing a project through the DataStage Director client, it throws internal error 39202.
Explanation of this internal error:

This is an internal error within the DataStage engine. It decodes as "slave failed to give the go-ahead message", which isn't very helpful for a newcomer. It usually means that the agent process that services the connection (dsapi_server) either hasn't started successfully or has failed to respond within a certain period.

This can happen if the DataStage server machine is very busy. You haven't installed DataStage on a PDC or even a BDC by any chance?

Maybe re-booting will remedy the situation. Because you got that far you can be sure that the network connectivity and authentication are all OK.

Read up on how the DataStage RPC mechanism handles requests to connect, and the agent processes (dsapi_server and dsapi_slave) that become involved. Check (using Task Manager on the DataStage server machine) that these processes do actually get created.


IBM docs regarding this error
http://www-01.ibm.com/support/docview.wss?uid=swg21431890&myns=swgimgmt&mynp=OCSSVSEF&mync=R

Solution:
The problem found was in the dsenv file; it had an incorrect LIBPATH:
LIBPATH=`dirname $DSHOME`/branded_odbc/lib:`dirname $DSHOME`/DSComponents/lib:`dirname $DSHOME`/DSComponents/bin:$DSHOME/lib:$DSHOME/uvdlls:$ASBHOME/apps/jre/bin:$ASBHOME/apps/jre/bin/classic:$ASBHOME/lib/cpp:$ASBHOME/apps/proxy/cpp/aix-all-ppc_64:/opt/odbc_progres/lib:$LIBPATH

$ASBHOME/apps/proxy/cpp/aix-all-ppc_64
instead of
$ASBHOME/apps/proxy/cpp/aix-all-ppc

Restart DataStage after changing the path.

Thursday, June 10, 2010

crontab locked error when user is trying to do crontab -l or -e


Solution:
Check for cron.lock under the user's home directory.
Check for /etc/cron.deny.
Check for a crontab temp file under /tmp.

syslogd not writing to /var/log/messages even after syslogd restarts sucessfully


Solution: /etc/syslog.conf had spaces instead of tabs.

Friday, May 14, 2010

How to kill zombie CLOSE_WAIT() DataStage processes

When you try to restart DataStage on our AIX 5.3 machine with
$DSHOME/bin/uv -admin -stop

then $DSHOME/bin/uv -admin -start

There are no error messages, but the daemon is not running:
Status code = 81016

ps -ef | grep dsrpc shows some existing CLOSE_WAIT connections which don't allow DS to start again.

Use the free 'lsof' utility to grep for the status, such as 'CLOSE_WAIT', and use that to identify the process ID (PID) and kill it.

Dell 710 (Broadcom NIC) running RH kernel 2.6 and MSI setting

One of our Dev teams first reported that they could not ssh to a server (Dell R710). We had to recycle the server, since we could only get to it through the DRAC. After that day they reported the same problem several times. This server is used for application profiling and load testing. The server's network subsystem intermittently stops responding.


After a first glance at troubleshooting, we noticed that the server is up and available; however, the network services are completely stalled. The server cannot route any packet outside itself.

When we ran tcpdump on the interface, it revealed only ARP broadcasts and no responses.
After digging through the logs and system configuration files for hours, we couldn't establish a pattern for when it loses connectivity. There was nothing in the logs to suggest a hardware error or any kernel-related problem.

Our first suspect was firmware; we upgraded it to the latest available on Dell's site. After the firmware upgrade, the user reported a very weird timeout. So, we upgraded the Red Hat kernel:
kernel-2.6.18-164.el5 -> kernel-2.6.18-194.el5

We opened a case with Dell, and they gave us another upgraded firmware to apply. Then we changed the server-side Ethernet port, changed the switch port, changed the CAT5e cable, and finally requested that Dell tech support replace the hardware.

Then we contacted the Net Ops team to review the switch configuration, to see if there was any setting on the switch side that could shut down the interface when it sees heavy traffic coming from it.

Brainstorming sessions identified that the problem lies at layer 3, the network layer of the OSI stack:

Layer 3 - Network <<------ problem is here
Layer 2 - Data Link
Layer 1 - Physical

We reviewed the Hyperic data and concluded that heavy traffic was not itself the cause, but the failures directly correlated with network throughput: the more traffic, the higher the probability that the interface would stop responding.

Server hardware and Redhat version info:
Hardware:
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R710
Version: Not Specified
Serial Number: 848KVH1

Base Board Information
Manufacturer: Dell Inc.
Product Name: 0YDJK3
Version: A09
Serial Number: ..CN1374003900LF.

BIOS Information
Vendor: Dell Inc.
Version: 2.0.11
Release Date: 02/26/2010
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 4096 kB

Redhat release info:
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)

Kernel info:
# uname -a
Linux md000ystls02 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Ethernet info is as below:

On Board Device 2 Information
Type: Ethernet
Status: Enabled
Description: Embedded Broadcom 5709C NIC 1
Ethernet driver version:

# ethtool -i eth0
driver: bnx2
version: 2.0.2
firmware-version: 5.0.11 NCSI 2.0.5
bus-info: 0000:01:00.0


Identified Cause:
This behavior is only found on Broadcom network cards running on kernel 2.6, and it is one of those problems that cannot be reproduced by a predefined set of steps.
MSI (Message Signaled Interrupts) is enabled by default on kernel 2.6 (it is not supported on 2.4), and it causes these intermittent network drops on Broadcom cards.

Disabling MSI on Broadcom bnx2 module resolves this problem.


Solution:
Here is the fix:
run modprobe bnx2 disable_msi=1

or, to make the setting permanent, edit /etc/modprobe.conf and add the line options bnx2 disable_msi=1:

# cat /etc/modprobe.conf
alias scsi_hostadapter megaraid_sas
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 usb-storage
alias eth0 bnx2
alias eth2 bnx2
alias eth1 bnx2
alias eth3 bnx2
options bnx2 disable_msi=1
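To confirm the module actually picked up the setting, you can check how the bnx2 interfaces appear in /proc/interrupts: with disable_msi=1 in effect, their lines should no longer read PCI-MSI. The excerpt below is a hypothetical sample used for illustration; on a live box you would run the awk directly against /proc/interrupts.

```shell
# Hypothetical /proc/interrupts excerpt: eth1 is still using MSI here.
sample=' 16:   1234   IO-APIC-fasteoi   eth0
 74:   5678   PCI-MSI-edge      eth1'

# List interfaces still running in MSI mode. On a live system:
#   awk '/PCI-MSI/ {print $NF}' /proc/interrupts
msi_ifaces=$(echo "$sample" | awk '/PCI-MSI/ {print $NF}')
echo "interfaces still on MSI: $msi_ifaces"
```

If the bnx2 interfaces still show up as PCI-MSI after the change, the module was likely not reloaded; unload and reload it (or reboot) so the option takes effect.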

IBM InfoSphere Information Server 8.1 installation hiccups

Recently, while installing IBM IIS, I came across some silly errors that are worth documenting. For some reason I didn't see these problems with 8.0.1. I think most of the problems arise if you have an old installation of IIS on the machine.


So, here we go
#1. The installation wizard gives a weird, clueless error message and terminates, suggesting you fix the issue before proceeding.

The error is a GUI dialog, and you see the same thing using the console installation.

The log will not make much sense either:

Apr 29, 2010 2:22:05 PM , INFO: com.ascential.acs.installer.utils.uservalidation.UserValidationBuildAction execute
Apr 29, 2010 2:22:05 PM , SEVERE: com.ascential.acs.installer.utils.InstalledProductBeanWizardBeanCondition
ServiceException: (error code = 200; severity = 0; exception = [java.lang.NullPointerException])
at com.installshield.wizard.service.LocalImplementorProxy.invoke(Unknown Source)
at com.installshield.wizard.service.AbstractService.invokeImpl(Unknown Source)
at com.installshield.product.service.registry.GenericRegistryService.getSoftwareObject(Unknown Source)
at com.ascential.acs.installer.utils.ProductBeanUtil.isInstalled(ProductBeanUtil.java:106)

Solution:
Comment out the +@sys_admin entry in your /etc/passwd file, as below:

[ibmpg:/opt/IBM] # cat /etc/passwd
root:!:0:0::/:/usr/bin/ksh
daemon:!:1:1::/etc:
bin:!:2:2::/bin:
sys:!:3:3::/usr/sys:
adm:!:4:4::/var/adm:
uucp:!:5:5::/usr/lib/uucp:
guest:!:100:100::/home/guest:
nobody:!:4294967294:4294967294::/:
lpd:!:9:4294967294::/:
lp:*:11:11::/var/spool/lp:/bin/false
invscout:*:6:12::/var/adm/invscout:/usr/bin/ksh
snapp:*:200:13:snapp login user:/usr/sbin/snapp:/usr/sbin/snappd
nuucp:*:7:5:uucp login user:/var/spool/uucppublic:/usr/sbin/uucp/uucico
ipsec:*:201:1::/etc/ipsec:/usr/bin/ksh
sshd:*:202:201::/var/empty:/usr/bin/ksh
#+@sys_admin::::::
esaadmin:*:811:0::/home/esaadmin:/usr/bin/ksh
isadmin:!:5474:5087:IBM IIS Admin:/home/isadmin:/usr/bin/ksh
wasadmin:!:5475:5088:IBM WebSphere Admin :/home/wasadmin:/usr/bin/ksh
db2as:!:209:5089::/db2home/db2as:/usr/bin/ksh
xmeta81:!:204:1:XMETA 81 Admin:/home/xmeta81:/usr/bin/ksh
x81inst1:!:206:1:XMETA81 Instance Owner:/home/x81inst1:/usr/bin/ksh
x81fenc1:!:207:1:XMETA81 Fence user:/home/x81fenc1:/usr/bin/ksh
x81das1:!:208:1:XMETA81 DAS user:/home/x81das1:/usr/bin/ksh
#2. Second error
When proceeding with the installation, it won't go beyond the fenced user prompt ....

Fenced user [db2fenc2] x81fenc1
Press 1 for Next, 2 for Previous, 3 to Cancel or 5 to Redisplay [1]
-------------------------------------------------------------------------------
IBM Information Server - InstallShield Wizard

Errors occurred during the installation.
- null
- null
The following warnings were generated:
- null
Press 2 for Previous, 3 to Cancel or 5 to Redisplay [2]
-------------------------------------------------------------------------------
DB2 - InstallShield Wizard

Solution:
It seems the installer does not like the old VPD files. After talking to IBM, they suggested hiding these VPD files, as well as the .dshome file. Here are the steps:

We need to move (rename) the following directory and file:
$ cd /usr/lib/objrepos/InstallShield/Universal/IBM/
$ mv InformationServer InformationServer801
$ cd /
$ mv .dshome .dshome.801

#3. Third error
Operating system information: AIX 5.3
ERROR: The DB2 Administration Server is already configured for this computer.

Solution:
This one was simple compared to the first two.
This error is also due to the fact that there was already a DB2 installation on the server.

===== Part 1 : make sure its stopped: =====
To stop the DB2 administration server:
1. Log in as the DB2 administration server owner.
2. Stop the DB2 administration server by entering the db2admin stop command.

===== Part 2 : remove DAS =====
To remove the DB2 administration server:
1. Log in as a user with root user authority.
2. Stop the DB2 administration server.
3. Remove the DB2 administration server by entering the following command:

DB2DIR/instance/dasdrop
where DB2DIR is the location you specified during the DB2 Version 9 installation.

#4. Fourth error
Finally, there was a WebSphere error complaining about vpd files from the old installation.
I don't have the exact error for this, but the Engine installation cannot proceed:

[ibmpg:] #
cd /usr/lib/objrepos
mv vpd.properties vpd.properties.8.0.1