my tech scribbling: 2009

Friday, August 14, 2009

Dynamic DNS update for unix servers

I think this is an amazing way to create DNS entries for Unix servers if you don't have access to the DNS servers. I believe it's possible because of the dynamic DNS update feature in MS DNS.

Create hostname.txt

server 113.167.14.63
zone ted.com
prereq nxdomain pankaj.ted.com
update add pankaj.ted.com 86400 A 211.216.153.900
show
send

# nsupdate -v /home/scripts/dns/hosts/hostname.txt
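If you have several hosts to register, the command file shown above can be generated instead of typed by hand. This is only a minimal sketch: the function name make_nsupdate_add and the example.com / 192.0.2.x values are made up for illustration and are not from my environment.

```shell
#!/bin/sh
# Print an nsupdate command file that adds a single A record.
# Arguments (hypothetical example values): DNS server, zone, FQDN, TTL, IPv4 address.
make_nsupdate_add() {
    server=$1
    zone=$2
    fqdn=$3
    ttl=$4
    ip=$5
    printf 'server %s\n' "$server"
    printf 'zone %s\n' "$zone"
    # Only add the record if the name does not exist yet
    printf 'prereq nxdomain %s\n' "$fqdn"
    printf 'update add %s %s A %s\n' "$fqdn" "$ttl" "$ip"
    printf 'show\nsend\n'
}

# Usage (assumed values):
# make_nsupdate_add 192.0.2.53 example.com web1.example.com 86400 192.0.2.10 > hostname.txt
# nsupdate -v hostname.txt
```

The function only writes text, so you can review the generated file before handing it to nsupdate.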

You can similarly delete or modify resource records interactively:
# nsupdate
> update delete pankaj.ted.com 86400 A 211.216.153.900
> send

nsupdate is used to submit Dynamic DNS Update requests as defined in
RFC2136 to a name server. This allows resource records to be added or
removed from a zone without manually editing the zone file. A single
update request can contain requests to add or remove more than one
resource record.

Thursday, August 6, 2009

All about ntp

After a grilling, fear-provoking session with one of the consultants, who pointed out that NTP is not properly configured on a few of our boxes and that this is one of the primary reasons nothing works in our environment :-), I had to get this straight in my head...

Installing ntp on Linux/Solaris/AIX and OSX

A few facts about NTP:
NTP is OS independent
NTP uses UTC as its reference time
Even when a network connection is temporarily unavailable,
NTP can use past measurements to estimate the current time and error

Stratum 0 clock ->> Reference clocks (cesium clock, GPS)
Stratum 1 clock ->> Top-level NTP servers, directly connected to a stratum 0 clock
Stratum 2 clock ->> Clients of stratum 1 servers
Stratum 3 clock ->> Clients of stratum 2 servers
---
---
Stratum 15 clock ->> Lowest valid level
Stratum 16 ->> Unsynchronized (not usable as a time source)

Peers: Servers at the same stratum level that synchronize with each other are
called peers. They compare their quality of time and can then synchronise to
the most accurate one.

NTP configuration models:
-Client-server model
-Peer-to-peer model
-Broadcast/multicast model: a server broadcasts time to a broadcast or multicast
IP address, and clients are configured to synchronise to these broadcast time signals.

A few ntp commands:
#ntpq -p <-- show all configured peers together with a summary of their state and performance data.

bash-3.00# ntpq -p
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
+pg913xs01.fe    pg000xscrp01.fe  5 u  876 1024  377     0.31  -15.702    9.32
*pg913xs02.fe    pg000xscrp02.fe  4 u  845 1024  377     0.23    5.291    4.26

Summary information includes the address of the remote peer,
the reference ID, the stratum of the remote peer,
the type of the peer (local, unicast, multicast or broadcast),
when the last packet was received, the polling interval, in seconds,
the reachability register, in octal, and the current estimated delay,
offset and dispersion of the peer, all in milliseconds.
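The reach column is the part people trip over most: it is an 8-bit shift register printed in octal, so 377 means the last eight polls all got an answer. Here is a rough sketch of how to decode it and how to pick out the peer marked with * (the one currently selected for synchronization). The function names reach_ok_count and sys_peer are my own, not part of ntpq.

```shell
#!/bin/sh
# Count how many of the last 8 polls succeeded, given the octal "reach" value.
reach_ok_count() {
    value=$(( 0$1 ))   # the leading 0 makes the shell read the digits as octal
    count=0
    while [ "$value" -gt 0 ]; do
        count=$(( count + (value & 1) ))   # add the lowest bit
        value=$(( value >> 1 ))            # move to the next poll
    done
    echo "$count"
}

# Read `ntpq -p` output on stdin and print the peer whose line starts with '*'.
sys_peer() {
    awk 'substr($0, 1, 1) == "*" { print substr($1, 2); exit }'
}

# Usage:
# reach_ok_count 377
# ntpq -p | sys_peer
```

A reach value that stays at 0 means no replies are arriving from that peer at all, which is worth checking before looking at offsets.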

#ntpdc
ntpdc> peers

#ntpdate -d 134.126.23.62 <--- Debug mode: queries the ntp server and shows the offset without changing the clock (drop -d to actually set the time)

Setting up and troubleshooting on AIX:

#1. Edit /etc/ntp.conf
#broadcastclient
server timeserver1
server timeserver2
server timeserver3
server timeserver4
driftfile /etc/ntp.drift
tracefile /etc/ntp.trace

#2. ntpdate 134.126.23.62 ( this is only required if you are way off )
9 Jul 21:27:48 ntpdate[299236]: step time server 11.16.4.62 offset -6059.104933

The offset must be less than 1000 seconds for xntpd to sync.
If the offset is greater than 1000 seconds, change the time manually on the client and run ntpdate -d again to check.
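The 1000-second check is easy to script against the ntpdate output line. A small sketch follows; offset_within_limit is a name I made up, and it assumes the word "offset" is followed by the value in seconds, as in the sample line above.

```shell
#!/bin/sh
# $1: one line of ntpdate output, $2: limit in seconds (1000 in the text above).
# Prints "ok" if the absolute offset is below the limit, otherwise "too far".
offset_within_limit() {
    printf '%s\n' "$1" | awk -v limit="$2" '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "offset") {
                    off = $(i + 1) + 0          # value right after the word "offset"
                    if (off < 0) off = -off     # only the size of the offset matters
                    result = (off < limit) ? "ok" : "too far"
                    print result
                    exit
                }
            }
        }'
}

# Usage:
# offset_within_limit "$(ntpdate -d 134.126.23.62 | tail -1)" 1000
```

With the sample line above (offset -6059.104933) this reports "too far", which is why the clock had to be stepped before starting xntpd.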

#3. start xntpd
# startsrc -s xntpd
0513-059 The xntpd Subsystem has been started. Subsystem PID is 438386.

To start xntpd automatically at boot, also uncomment this line in /etc/rc.tcpip:

start /usr/sbin/xntpd -x "$src_running"

#4. Wait at least 6 minutes, then run:
lssrc -ls xntpd

Look at the Sys stratum value in the two outputs listed below.

bash-3.00# lssrc -ls xntpd
Program name: /usr/sbin/xntpd
Version: 3
Leap indicator: 00 (No leap second today.)
Sys peer: pg913xsfed02.ted.org
Sys stratum: 5 <------- this is good
Sys precision: -18
Debug/Tracing: DISABLED
Root distance: 0.152100
Root dispersion: 1.015091
Reference ID: 11.16.4.87
Reference time: ce014349.d0b4f000 Thu, Jul 9 2009 21:34:17.815
Broadcast delay: 0.003906 (sec)
Auth delay: 0.000122 (sec)
System flags: pll monitor filegen
System uptime: 279 (sec)
Clock stability: 0.000000 (sec)
Clock frequency: 0.000000 (sec)
Peer: time4.apple.com
flags: (configured)
stratum: 2, version: 3
our mode: client, his mode: server
Peer: pg913xsfed02.ted.org
flags: (configured)(sys peer)
stratum: 4, version: 3
our mode: client, his mode: server
Peer: pg913xsfed01.ted.org
flags: (configured)(sys peer)
stratum: 5, version: 3
our mode: client, his mode: server
Subsystem Group PID Status
xntpd tcpip 438386 active

bash-3.00# lssrc -ls xntpd
Program name: /usr/sbin/xntpd
Version: 3
Leap indicator: 11 (Leap indicator is insane.)
Sys peer: no peer, system is insane
Sys stratum: 16 <------- this is not good
Sys precision: -18
Debug/Tracing: DISABLED
Root distance: 0.000000
Root dispersion: 0.000000
Reference ID: no refid, system is insane
Reference time: no reftime, system is insane
Broadcast delay: 0.003906 (sec)
Auth delay: 0.000122 (sec)
System flags: pll monitor filegen
System uptime: 10 (sec)
Clock stability: 0.000000 (sec)
Clock frequency: 0.000000 (sec)
Peer: time4.apple.com
flags: (configured)
stratum: 16, version: 3
our mode: client, his mode: unspecified
Peer: pg913xsfed02.ted.org
flags: (configured)
stratum: 4, version: 3
our mode: client, his mode: server
Peer: pg913xsfed01.ted.org
flags: (configured)
stratum: 5, version: 3
our mode: client, his mode: server
Subsystem Group PID Status
xntpd tcpip 438386 active
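The stratum check can be automated for monitoring. This is just a sketch that reads the lssrc output shown above; the function names sys_stratum and sync_state are my own, and it assumes the "Sys stratum:" line looks like the ones in these two listings.

```shell
#!/bin/sh
# Read `lssrc -ls xntpd` output on stdin and print the Sys stratum number.
sys_stratum() {
    awk '$1 == "Sys" && $2 == "stratum:" { print $3; exit }'
}

# Read the same output on stdin and report whether the daemon is synchronized.
# Stratum 16 (or no stratum line at all) is treated as not synchronized.
sync_state() {
    stratum=$(sys_stratum)
    if [ -n "$stratum" ] && [ "$stratum" -lt 16 ]; then
        echo "synchronized (stratum $stratum)"
    else
        echo "not synchronized"
    fi
}

# Usage on AIX:
# lssrc -ls xntpd | sync_state
```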

Setting up on Linux:
#1. Edit /etc/ntp.conf
server timehost1
server timehost2
server timehost3
server timehost4
driftfile /var/lib/ntp/drift

#2. /etc/init.d/ntpd start

Setting up on Solaris:
#1. Edit /etc/inet/ntp.conf
server timehost1
server timehost2
server timehost3
server timehost4
driftfile /var/lib/ntp/drift

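#2. /etc/init.d/xntpd start (Solaris 9 and earlier)
#3. svcadm refresh svc:/network/ntp (Solaris 10 with SMF; if the service is not enabled yet, run svcadm enable svc:/network/ntp first)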

Setting up on OSX
#1. Edit /etc/ntp.conf
driftfile /var/lib/ntp/drift
server timehost1
server timehost2
server timehost3
server timehost4

#2. sudo /System/Library/StartupItems/NetworkTime/NetworkTime restart

---------------------------------------------------------------------------------------------------------

Problem: The NTP daemon starts OK but dies after a few minutes
Solutions:
1. Check the date on the machine. If it shows a strange date, the machine could be missing /unix or /vmunix.
2. Check the TZ variable. A timezone variable on the client that differs from the server's can often cause this problem.
3. Make sure the "broadcastclient" line is commented out in /etc/ntp.conf.
4. How far off is the time? If it is more than 1000 seconds, NTP won't stay active. To correct this, run ntpdate serveripaddress.

Problem: No server suitable for synchronization found.
Solution:
If you start xntpd on a server and run ntpdate on a client to set the client's time with that of the server,
it will not update the client unless the xntpd daemon has been active for 6 minutes or longer.

Thursday, June 18, 2009

Remove a file with a dash as first character or with special characters

I recently came across a file with a dash as the first character of its name and wasn't able to delete it with a simple rm or with rm and quotes.

Files with special characters in their names can be deleted by quoting the name, if you know or can see all the special characters. For example, pankaj%76$:

rm 'pankaj%76$'

However, if the file is named something like -pankaj%76$ (with a dash in front), use
rm ./-pankaj%76$
or
rm -- -pankaj%76$

With ./ the argument no longer starts with a dash, and -- tells rm that everything after it is a filename, so in both cases the dash is not interpreted as an rm option.
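If you want to try both forms safely, here is a throwaway sketch that works in a scratch directory. The function name demo_remove_dash_file and the second file name are invented for the demo, and it assumes mktemp -d is available (it is on Linux and OS X; check your own platform).

```shell
#!/bin/sh
# Create two awkwardly named files in a scratch directory, remove them with
# the two methods described above, and print how many files are left.
demo_remove_dash_file() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        # ./ keeps touch from reading the leading dash as an option
        touch './-pankaj%76$' './-second file'
        rm './-pankaj%76$'         # method 1: path prefix
        rm -- '-second file'       # method 2: end-of-options marker
        ls -A | wc -l | tr -d ' '  # number of files remaining
    )
    rmdir "$dir"
}

# Usage:
# demo_remove_dash_file
```

The final rmdir only succeeds if the directory is empty, so it doubles as a check that both files really were removed.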

Thursday, April 23, 2009

How to identify the device(lun/scsi id) on Solaris presented from IBM DS4000?


Snapshot of DS4000:
Example: We are creating a LUN of 107GB and making it available only for hostname app1. The scsi id is 29

Array -> Logical Array - Drives

Solaris command:
root@dam-app1 # luxadm display /dev/rdsk/c4t600A0B80004715700000080F47FE4DEEd0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c4t600A0B80004715700000080F47FE4DEEd0s2
Vendor: IBM
Product ID: 1815 FAStT
Revision: 0914
Serial Num: SP74850349
Unformatted capacity: 110160.000 MBytes
Write Cache: Enabled
Read Cache: Enabled
Minimum prefetch: 0x1
Maximum prefetch: 0x0
Device Type: Disk device
Path(s):

/dev/rdsk/c4t600A0B80004715700000080F47FE4DEEd0s2
/devices/scsi_vhci/ssd@g600a0b80004715700000080f47fe4dee:c,raw
Controller /devices/pci@0/pci@0/pci@8/pci@0/pci@2/SUNW,emlxs@0/fp@0,0
Device Address 201400a0b8471570,1d
Host controller port WWN 10000000c9722faa
Class primary
State ONLINE
Controller /devices/pci@0/pci@0/pci@8/pci@0/pci@8/SUNW,emlxs@0/fp@0,0
Device Address 202500a0b8471570,1d
Host controller port WWN 10000000c96307cc
Class secondary
State STANDBY

Device Address: the value after the comma, 1d, is in hexadecimal; converted to decimal it is 29. This is the same LUN ID that was allotted to host app1 on the DS4000.
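The hex-to-decimal step can be done in the shell. A tiny sketch, assuming the Device Address has the form shown by luxadm above (port WWN, comma, hex LUN); lun_from_device_address is just a name I picked.

```shell
#!/bin/sh
# Print the decimal LUN for a luxadm "Device Address" value such as
# 201400a0b8471570,1d
lun_from_device_address() {
    hex=${1##*,}             # keep the part after the last comma, e.g. 1d
    printf '%d\n' "0x$hex"   # printf reads the 0x constant and prints decimal
}

# Usage:
# lun_from_device_address 201400a0b8471570,1d
```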

Monday, March 2, 2009

Changing ftp default umask

Change ftp default umask

Step #1. Edit /etc/inetd.conf
from
ftp stream tcp6 nowait root /usr/sbin/ftpd ftpd
to
ftp stream tcp6 nowait root /usr/sbin/ftpd ftpd -u 007

Step #2. refresh -s inetd (AIX), or send a HUP signal to inetd on other systems (kill -HUP <inetd pid>)
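To see what a umask of 007 actually does to uploaded files, you can work out the resulting permissions with a bit of shell arithmetic. This is only an illustration; mode_after_umask is a made-up helper, and 666 / 777 are the usual requested modes for new files and directories.

```shell
#!/bin/sh
# $1: requested mode in octal (666 for files, 777 for directories)
# $2: umask in octal
# Prints the permissions that remain once the umask bits are cleared.
mode_after_umask() {
    printf '%03o\n' $(( 0$1 & ~0$2 ))
}

# Usage:
# mode_after_umask 666 007
# mode_after_umask 777 007
```

With 007, files come out as 660 and directories as 770, so owner and group keep full access and everyone else gets none.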

Thursday, January 22, 2009

Linux memory utilization and disk caching

People often come to me and say "we are running out of memory" or "I'm not sure who is using all the memory on this box", especially on a Linux box.

I think it confuses people even more when they run the top command and see that 90% of the system memory is in use.

Here are some common misconceptions and misinterpretations of the results we see in the top command.
Snapshot of top command:
top - 10:20:43 up 12 days, 23:08, 2 users, load average: 0.06, 0.21, 0.18
Tasks: 90 total, 1 running, 89 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.8% us, 0.1% sy, 0.0% ni, 98.8% id, 0.1% wa, 0.1% hi, 0.1% si
Mem: 16414396k total, 10941732k used, 5472664k free, 261608k buffers
Swap: 8388600k total, 0k used, 8388600k free, 7092524k cached
21900 app_user 16 0 5976m 2.0g 80m S 22 12.5 931:56.60 WebSphere
21964 app_user 16 0 785m 675m 4236 S 16 4.2 107:04.09 mysqld

Total memory: 16GB
Used memory: 10GB+
Free memory: 5GB+

In the above example, not all of the 10GB in use is used by applications.
It also includes the memory that the kernel uses for buffers and block caching.

Here's an explanation of caching and how it works, by one Linux guru:
Suppose that you run "ls /". What the kernel does then (after some processing), is that it requests the hard disk sectors that make up the root directory, and then it makes the ls process sleep until those sectors are received. While the hard drive is looking up those sectors, the system does some other things that don't require hard drive access. The hard drive then puts the sectors that it found in your RAM, and reports back to the kernel that the operation completed. Then the kernel continues with the ls process, which outputs what you wanted to see. After that, the kernel has two choices:

1. It could discard the sectors loaded from the hard drive.
2. It could keep the sectors in memory.

Now, if we look carefully at option number 1, which at the first glance might seem like a good thing, since it apparently makes that memory usable again, we see that it would require those sectors to be loaded once again whenever a process wants to look in the root directory. Since many files that are opened are opened with absolute path names, there are many root directory accesses in the system. If the sectors would have to be loaded each time a process accesses the root directory, the system would be slow, since the hard drive is slow. Therefore, the kernel keeps the sectors in memory. The memory used this way is what you see in the "cached" entry. That allows for extremely fast disk access, since all sectors that have once been accessed don't have to be reloaded from the physical disk.

Now, you might think that that's an extreme waste of memory, but if you do, you're very wrong. You see, the block caches are low-priority pages, which essentially means that they are free to be used by processes, should they need the memory. Every time a process needs an extra memory page and there isn't already one free, a page used for block buffering is freed by the kernel and put into the process' addressing space for the process to use in whatever way it sees fit. Together with some intelligently designed algorithms for choosing which cache page to free, this all allows for really fast disk access.

Now we have figured out that the 10GB in use also includes kernel caching.
The question still remains: what is the actual application memory usage?

Try the free command:
free -m -t
free -m -s 5
free -g -t
             total       used       free     shared    buffers     cached
Mem:            15         10          5          0          0          6
-/+ buffers/cache:          3         12
Swap:            7          0          7
Total:          23         10         13

We still see a total of 15/16GB, with 10GB in use.

You will notice that the top-row 'used' value (10) will almost always come close to the top-row 'total' value (15), since Linux likes to use any spare memory to cache disk blocks; that cache is shown in the top-row 'cached' value (6).

The key figure to look at is the 'used' value in the -/+ buffers/cache row (3). This is how much memory your applications are currently using. For best performance, this number should be less than your total memory (15). To prevent out-of-memory errors, it needs to be less than total memory plus swap space (15 + 7 = 22).

If you wish to quickly see how much memory is free, look at the 'free' value in the -/+ buffers/cache row (12):
Total memory: 15
Actually used: 3
Free memory: 15 - 3 = 12
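The same arithmetic can be done on the exact kB figures from the top snapshot earlier in this post, which avoids the rounding you get with free -g. A quick sketch; app_used_kb is a name I made up, and the three inputs are the used, buffers and cached values.

```shell
#!/bin/sh
# $1: used, $2: buffers, $3: cached (all in kB, as printed by top or free -k)
# Prints the memory actually held by applications, in kB.
app_used_kb() {
    echo $(( $1 - $2 - $3 ))
}

# Usage with the numbers from the top snapshot:
# app_used_kb 10941732 261608 7092524
```

For the snapshot values this gives 3587600 kB, roughly 3.4GB, which lines up with the 3 shown in the -/+ buffers/cache row.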

OK, now we also know that the applications are actually using only 3GB out of 16GB, rather than the 10GB reported by top. Next we need to find out who is using that 3GB of memory.

Try the ps command:
ps aux
ps aux | awk '{print $4"\t"$1"\t"$11}' | sort -nr

snapshot of ps:
%MEM USER COMMAND
12.6 app_user startServer.sh
4.2 app_user /data/App_Server/mysql/bin/mysqld

To see memory usage grouped by command, such as Apache httpd, MySQL mysqld or Java, use the following command. It prints each %MEM value, the number of processes that share that value and command, and the command name:

ps aux | awk '{print $4"\t"$11}' | sort | uniq -c | awk '{print $2" "$1" "$3}' | sort -nr
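If you would rather have one total per command (for example all httpd children added together), awk can sum the %MEM column. This is a sketch of my own and not part of the one-liner above; sum_mem_by_command is a made-up name, and it assumes the standard ps aux layout where %MEM is column 4 and the command is column 11.

```shell
#!/bin/sh
# Read `ps aux` output on stdin and print the total %MEM per command,
# largest first. The first line (the ps header) is skipped.
sum_mem_by_command() {
    awk 'NR > 1 { mem[$11] += $4 }
         END { for (cmd in mem) printf "%.1f\t%s\n", mem[cmd], cmd }' | sort -nr
}

# Usage:
# ps aux | sum_mem_by_command | head
```

Keep in mind that %MEM is based on resident memory, so shared pages are counted once per process and the totals can overstate real usage.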

Calculating again: startServer.sh uses 12.6% of 16GB, which is about 2.0GB.
Similarly, mysqld uses about 4.2% of 16GB, which is roughly 670MB. Both match the RES values shown by the top command above.

Thursday, January 8, 2009

System monitoring using sar


SAR (system activity report) is very useful for troubleshooting and pinpointing issues related to system performance.

Although it gathers useful data about system performance, the sar command itself adds load to the system, and with a high sampling frequency it can exacerbate a pre-existing performance problem.

Step 1 - Collecting data using sar:
sar comes with almost all Unix systems; however, you have to enable it.

#crontab -e

# Collect measurements at 10-minute intervals
0,10,20,30,40,50 * * * * /usr/lib/sa/sa1
# Create daily reports and purge old files
0 0 * * * /usr/lib/sa/sa2 -A

The cron job above runs two commands.
The first, sa1, calls sadc to collect the performance data into a binary log file; it runs every 10 minutes.

The second, sa2, dumps the data from the binary log file into a text file and also deletes files older than 7 days.

Step 2 - Extracting useful information:

Now that data is being collected, we have to extract the information that tells us how the system was performing at the time we are interested in.

sar data are collected in /var/adm/sa ( you can verify this path in /usr/lib/sa/sa1 script )

You will see two kinds of files in that directory, saXX and sarXX:
saXX - binary file
sarXX - text file

XX represents the day of the month.

You can also modify /usr/lib/sa/sa1 to change this format.

When you run just sar, it reads the latest saXX file created.

To look at a particular day, point sar at that day's file with the -f option.

for example:
sar -f /var/adm/sa/sa08 -P ALL <-- All processors utilization
sar -f /var/adm/sa/sa08 -u -P 0,1 <-- Only 0 and 1 Processors
sar -f /var/adm/sa/sa08 -k <-- kernel activity
sar -f /var/adm/sa/sa08 -u <-- cpu utilization

Output:
AIX ibm66p2 3 5 00C9B0704C00 01/08/09

System configuration: lcpu=2 mode=Capped

00:01:09    %usr    %sys    %wio   %idle   physc
00:11:09       1       1       0      98    1.00
00:21:09       1       1       0      99    1.00
00:31:09       1       1       0      99    1.00
00:41:09       1       0       0      99    1.00
00:51:09       1       0       0      99    1.00
01:05:54       1       1       0      99    1.00
01:15:54       1       1       0      98    1.00

Output column meanings are at the bottom of this page.
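When a day's file has a lot of samples, it helps to boil the CPU report down to a single number. Here is a rough sketch that averages the %idle column; avg_idle is my own name, and it assumes the AIX layout shown above (a timestamp in column 1 and a whole-number %idle in column 5), so other platforms may need different column numbers.

```shell
#!/bin/sh
# Read `sar -u` output on stdin and print the average %idle over all samples.
# Only lines that start with HH:MM:SS and have a numeric 5th column are used,
# so the banner, the column header and any "Average" line are skipped.
avg_idle() {
    awk '$1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $5 ~ /^[0-9]+$/ {
             sum += $5
             n++
         }
         END { if (n) printf "%.2f\n", sum / n }'
}

# Usage:
# sar -f /var/adm/sa/sa08 -u | avg_idle
```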

sar -f /var/adm/sa/sa08 -d <-- disk IO utilization

Output:
AIX ibm66p2 3 5 00C9B0704C00 01/08/09

System configuration: lcpu=2 drives=4 mode=Capped

00:01:09   device   %busy   avque   r+w/s   Kbs/s   avwait   avserv
00:11:09   hdisk0       0     0.0       1       4      1.5     11.0
           hdisk1       0     0.0       1       4      1.6     11.2
           hdisk2       0     0.0       0       1      0.3      9.2
           hdisk3       0     0.0       0       0      0.0      6.3
00:21:09   hdisk0       0     0.0       1       4      1.1     10.9
           hdisk1       0     0.0       1       4      1.0     10.9
           hdisk2       0     0.0       0       1      0.0      7.4
           hdisk3       0     0.0       0       0      0.0      0.0

Also, I have seen that there is an option on Red Hat/Solaris to monitor memory usage:
sar -f /var/adm/sa/sa08 -r

which is not available on AIX 5.3.

However, you can use the AIX topasout command, which I have found very useful in most cases.

Try reading the recording files in the /etc/perf/daily/ directory using topasout:
topasout -R detailed -mem -i 5 -b 1000 -e 2300 /etc/perf/daily/xmwlm.090108
topasout -R summary -mem -i 5 -b 1000 -e 2300 /etc/perf/daily/xmwlm.090108

topasout post-processes xmwlm recordings, which are also called local recordings,
and topas recordings, which are also called CEC (central electronic complex) recordings.

------------------------------------------------------------------------------------
Output column meanings:
%usr: The percentage of time the CPU is spending on user processes, such as applications, shell scripts, or interacting with the user.

%sys: The percentage of time the CPU is spending executing kernel tasks.

%wio: The percentage of time the CPU is waiting for input or output from a block device, such as a disk.

%idle: The percentage of time the CPU isn't doing anything useful.

kexit/s
Reports the number of kernel processes terminating per second.

kproc-ov/s
Reports the number of times kernel processes could not be created because of enforcement of process threshold limit.

ksched/s
Reports the number of kernel processes assigned to tasks per second.
