Monday, March 21, 2011

Tune Interrupt and Process CPU affinity

When an interrupt arrives, the CPU must stop what it is currently doing and switch to handling it. Interrupts consume CPU time, and too many of them can cause high system CPU usage, as observed with mpstat. On a Symmetric Multiprocessing (SMP) system, balancing IRQs by hand is usually NOT needed, because /proc/irq/default_smp_affinity or the irqbalance daemon distributes IRQs among CPUs automatically.
If the output of "cat /proc/interrupts" shows that a particular IRQ is unevenly distributed, you can try setting smp_affinity manually.
Tuning IRQ affinity with the smp_affinity mask
It is pointless to set smp_affinity manually while irqbalance is running, because the daemon can overwrite the setting.

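##Check current smp_affinity setting
##(the pattern "." matches any non-empty line; an unquoted * would be expanded by the shell)
$ grep -H . /proc/irq/default_smp_affinity /proc/irq/*/smp_affinity
/proc/irq/default_smp_affinity:00000000,00000000,ffffffff,ffffffff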
/proc/irq/0/smp_affinity:00000000,00000000,00000000,00000020
..
/proc/irq/24/smp_affinity:00000000,00000000,00000000,00000010
##The value of smp_affinity is in hex format and is 128 bits wide on my host, shown as comma-separated 32-bit groups. (It might be 32 bits or 256 bits depending on the kernel.) It is a bit mask: each of the 128 bits represents one CPU, with CPU 0 as the rightmost bit. If a bit is "1", that CPU is allowed to handle the IRQ.
##e.g. to allow CPU 0,1,3, the binary representation is "1011", which is "b" in hex.
$echo -e "obase=16 \n ibase=2 \n 1011 \n" | bc
B
##Changing IRQ 24 from CPU 4 to CPU 0,1,3
$cat /proc/irq/24/smp_affinity 
00000000,00000000,00000000,00000010
$ echo 0b >/proc/irq/24/smp_affinity
$ cat /proc/irq/24/smp_affinity 
00000000,00000000,00000000,0000000b
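The mask arithmetic above can be sketched as a small shell function. This is only an illustration: the name cpus_to_mask is made up for this example, and it assumes CPU numbers below 63 because it relies on the shell's integer arithmetic.

```shell
# Convert a comma-separated CPU list (e.g. "0,1,3") into a hex affinity mask.
# cpus_to_mask is a hypothetical helper, not part of any package.
cpus_to_mask() {
    local mask=0 cpu
    local IFS=','
    for cpu in $1; do
        # Set the bit whose position equals the CPU number
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 0,1,3   # prints b
cpus_to_mask 4       # prints 10
```

The result can then be written to /proc/irq/<IRQ number>/smp_affinity as root.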
Tuning process affinity with taskset
Pin a process to particular CPUs to improve the cache hit rate
$taskset -p $$
pid 21698's current affinity mask: ff
$taskset -cp 0,1,3  $$
pid 21698's current affinity list: 0-7
pid 21698's new affinity list: 0,1,3
$taskset -p $$
pid 21698's current affinity mask: b
##you can check which CPU a process is running on with "psr" parameter
$ ps axo  psr,pid,cmd | grep $$
  0 21698 -bash
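To read a mask such as "ff" or "b" back as a CPU list, the reverse conversion can be sketched as follows. The function name mask_to_cpus is an assumption for illustration, and it only handles masks that fit in the shell's integer type. Unlike taskset, it prints a plain list rather than ranges such as 0-7.

```shell
# Convert a hex affinity mask (e.g. "b") into a comma-separated CPU list.
# mask_to_cpus is a hypothetical helper, not part of taskset.
mask_to_cpus() {
    local mask=$(( 0x$1 )) cpu=0 list=''
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            # Append the CPU number, adding a comma only after the first entry
            list="${list:+$list,}$cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    printf '%s\n' "$list"
}

mask_to_cpus b    # prints 0,1,3
mask_to_cpus ff   # prints 0,1,2,3,4,5,6,7
```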
Isolate CPUs to run particular processes only.
The kernel parameter "isolcpus" isolates particular CPUs from the general scheduler, so no other tasks are placed on them. Together with taskset, you can dedicate particular CPUs to designated tasks.
E.g. adding "isolcpus=2,3" to the kernel line in grub.conf isolates CPU 2 and 3.

Saturday, March 19, 2011

Calculate chunk size for RAID device

Chunk size is a term often used in Linux software RAID. In hardware RAID, different vendors use different names, e.g. EMC calls it element size.
Chunk size is the minimum amount of data written to each member before moving to the next. So it is only significant in striped (round-robin) RAID types: RAID 0, RAID 5, RAID 6, etc. The purpose of tuning chunk size is to distribute requests evenly across the members of the RAID.
Chunk Size = avgrq-sz / number of data disks
avgrq-sz:
The average size (in 512-byte sectors) of the requests that were issued to the device.
number of data disks: data disks only, excluding parity disks in RAID 5/6
#Get avgrq-sz for a device since host is up
$iostat -x /dev/sdc
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               1.96     1.68    1.33    0.54   274.67   273.13   293.56     0.02    8.61   0.90   0.17
#Size in KB (quote the expression so the shell does not expand the *)
$ echo "293.56*512/1024" | bc -l
146.78
#RAID 0 with 2 disks
Chunk Size(KB)=146.78/2=73.39
#Chunk size should be a power of two (2^n), so round to the nearest one
Chunk Size (KB)=73.39 ~= 64
#Create RAID 0 with chunk size=64 (KB)
$ mdadm -C /dev/md0 -l 0 -n 2 --chunk=64 /dev/sda1 /dev/sdb1
#Create file system with optimal stride
#stride and chunk have the same meaning but different units.
#Stride is at the file system level: it is the number of file system blocks written to one disk before moving to the next.
stride=chunk size / block size
#If I choose block size =4096, then stride is 64/4=16
$mkfs.ext3 -b 4096 -E stride=16 /dev/md0

Stride is irrelevant for hardware RAID, which is presented to the host as a single hard disk.
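The chunk size and stride arithmetic from this post can be sketched as shell functions. The function names are made up for this example, awk is assumed to be available for the decimal arithmetic, and the rounding rule (nearest power of two, ties rounding up) is one reasonable reading of "chunk size should be 2^n".

```shell
# Round a whole number of KB to the nearest power of two (ties round up).
nearest_pow2() {
    local target=$1 p=1
    # Find the largest power of two that is <= target
    while [ $(( p * 2 )) -le "$target" ]; do
        p=$(( p * 2 ))
    done
    # Move up one step if the next power of two is at least as close
    if [ $(( target - p )) -ge $(( p * 2 - target )) ]; then
        p=$(( p * 2 ))
    fi
    echo "$p"
}

# chunk_kb AVGRQ_SZ_IN_SECTORS NUMBER_OF_DATA_DISKS
chunk_kb() {
    local kb
    # sectors * 512 bytes / 1024 = KB per request, then split across data disks
    kb=$(awk -v s="$1" -v d="$2" 'BEGIN { printf "%d", s * 512 / 1024 / d }')
    nearest_pow2 "$kb"
}

# stride_blocks CHUNK_KB BLOCK_SIZE_BYTES
stride_blocks() {
    echo $(( $1 * 1024 / $2 ))
}

chunk_kb 293.56 2       # prints 64
stride_blocks 64 4096   # prints 16
```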

Wednesday, March 9, 2011

Understanding Linux CPU scheduling priority

Scheduling priority depends on scheduling class.
scheduling classes
- SCHED_FIFO: A First-In, First-Out real-time process
- SCHED_RR: A Round Robin real-time process
- SCHED_NORMAL: A conventional, time-shared process
Most  processes are SCHED_NORMAL
How to find out the scheduling class of a process.
# ps  command with “class” flag
#   TS  SCHED_OTHER (SCHED_NORMAL)
#   FF  SCHED_FIFO
#   RR  SCHED_RR
$  ps -e -o class,cmd | grep sshd
TS  /usr/sbin/sshd
#chrt command
$ chrt -p 1836
pid 1836's current scheduling policy: SCHED_OTHER
pid 1836's current scheduling priority: 0

Scheduling priorities
- Real-time process (SCHED_FIFO/SCHED_RR): real-time priority, ranging from 1 (lowest priority) to 99 (highest priority).
- Conventional process (SCHED_NORMAL): static priority, ranging from 100 (highest priority) to 139 (lowest priority).
Nice value and static priority
Conventional process's  static priority = (120 + Nice value)
So a user can change a conventional process's priority by changing its nice value with the nice/renice commands.
By default, a conventional process starts with a nice value of 0, which equals static priority 120.
Checking  Real-time/Conventional process priority.
$ ps -e -o class,rtprio,pri,nice,cmd
CLS RTPRIO PRI  NI CMD
TS       -  21   0 init [3]
FF      99 139   - [watchdog/0]
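Watchdog is a real-time process (CLASS=SCHED_FIFO) whose real-time priority is 99. For real-time processes the PRI column in this output is simply 40 + RTPRIO (139 here), so it adds no new information.
init is a conventional process (CLASS=SCHED_OTHER) whose nice value is 0. The PRI column of ps is inverted relative to the kernel's numbering: a larger PRI means a more favoured process, and for conventional processes PRI = 139 - kernel priority, which is consistent with every row shown in this post. So PRI 21 corresponds to a dynamic priority of 118 (139 - 21). The RTPRIO column does not apply to it.
Why is init's priority 118 and not 120? Please note that I used the term dynamic priority, not static priority.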
dynamic priority = max (100, min (  static priority - bonus + 5, 139))
The bonus ranges from 0 to 10 and is set by the scheduler based on the past history of the process; more precisely, it is related to the average sleep time of the process.
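The two formulas above can be sketched in shell as follows. The function names are made up for this example, and the bonus values passed in are illustrative only, since the real bonus is chosen by the scheduler.

```shell
# static priority = 120 + nice value
static_prio() {
    echo $(( 120 + $1 ))
}

# dynamic priority = max(100, min(static priority - bonus + 5, 139))
# usage: dynamic_prio STATIC_PRIO BONUS   (BONUS is 0..10)
dynamic_prio() {
    local prio=$(( $1 - $2 + 5 ))
    # Clamp the result to the conventional priority range 100..139
    if [ "$prio" -gt 139 ]; then prio=139; fi
    if [ "$prio" -lt 100 ]; then prio=100; fi
    echo "$prio"
}

static_prio 0         # prints 120
static_prio 10        # prints 130
dynamic_prio 120 5    # prints 120 (a bonus of 5 leaves the priority unchanged)
dynamic_prio 120 10   # prints 115 (maximum bonus)
dynamic_prio 139 0    # prints 139 (144 clamped to the upper bound)
```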
Changing  Real-time/Conventional process priority.
#Real-time process
$chrt 80  ps -e -o class,rtprio,pri,nice,cmd
..
FF      80 120   - ps -e -o class,rtprio,pri,nice,cmd
# Conventional  process
$nice -n 10  ps -e -o class,rtprio,pri,nice,cmd
...
TS       -  12  10 ps -e -o class,rtprio,pri,nice,cmd

Monday, March 7, 2011

Proactive monitoring by snmptrap

Most monitoring solutions pull (poll) SNMP information, but an agent can also push information to the monitoring host with snmptrap.
This post demonstrates how to email alarms that an SNMP agent pushes to the receiver, snmptrapd.
Tested on Centos 5.5 + NET-SNMP 5.3.2.2
Install the email daemon and net-snmp
$yum install postfix net-snmp net-snmp-utils
$cat /etc/snmp/snmptrapd.conf
#authCommunity   TYPES COMMUNITY  [SOURCE [OID | -v VIEW ]]
authCommunity  execute public  default  .1
traphandle  default /usr/bin/traptoemail -s localhost -f snmp@localhost root@localhost
Start snmptrapd and postfix
Test with the snmptrap tool
Send an email if the operational status of eth0 is up (1)
(IF-MIB::linkUp is a notification object defined in the MIB file IF-MIB.txt)
$snmptrap -v 2c -c public 127.0.0.1 "" IF-MIB::linkUp  .iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.ifEntry.ifOperStatus.1 i 1
Sample email received
$mail
>N 86 snmp@localhost.local  Mon Mar  7 15:59  18/747   "trap received from localhost: IF-MIB::linkUp"
& 86
Message 86:
From snmp@localhost.local.net  Mon Mar  7 15:59:13 2011
X-Original-To: root@localhost
Delivered-To: root@localhost.local.net
To: root@localhost.local.net
From: snmp@localhost.local.net
Subject: trap received from localhost: IF-MIB::linkUp
Date: Mon,  7 Mar 2011 15:59:13 +1100 (EST)
Host: localhost (UDP: [127.0.0.1]:35453)
DISMAN-EVENT-MIB::sysUpTimeInstance  0:6:08:29.87
SNMPv2-MIB::snmpTrapOID.0  IF-MIB::linkUp
IF-MIB::ifOperStatus.1  up
The above configuration makes snmptrapd ready to receive traps. The following steps configure the SNMP agent to send traps.
An SNMP v3 USM user needs to be created, even if the trap is intended for SNMP v1/v2c only.
Check my previous post for creating and managing SNMP v3 USM users
$ cat /etc/snmp/snmpd.conf
#authuser    read,write [-s secmodel] user [noauth|auth|priv [oid|-V view]]
authuser   read -s v2c guest_user noauth  .1
authuser   read -s usm guest_user noauth  .1
authcommunity read  public  default .1
trap2sink 127.0.0.1 public
iquerySecName guest_user
agentSecName  guest_user
monitor   -u guest_user  -r 60 "interface down" -o ifDescr ifOperStatus != 1
If you shut down any interface and restart snmpd, the following email notification should appear
$mail 
..
N 87 snmp@localhost.local  Mon Mar  7 16:24  23/1030  "trap received from localhost: DISMAN-EVENT-MIB::mteTriggerFired"
& 87
Message 87:
From snmp@localhost.local.net  Mon Mar  7 16:24:28 2011
X-Original-To: root@localhost
Delivered-To: root@localhost.local.net
To: root@localhost.local.net
From: snmp@localhost.local.net
Subject: trap received from localhost: DISMAN-EVENT-MIB::mteTriggerFired
Date: Mon,  7 Mar 2011 16:24:28 +1100 (EST)
Host: localhost (UDP: [127.0.0.1]:46356)
DISMAN-EVENT-MIB::sysUpTimeInstance  0:0:00:00.84
SNMPv2-MIB::snmpTrapOID.0  DISMAN-EVENT-MIB::mteTriggerFired
DISMAN-EVENT-MIB::mteHotTrigger.0  interface down
DISMAN-EVENT-MIB::mteHotTargetName.0
DISMAN-EVENT-MIB::mteHotContextName.0
DISMAN-EVENT-MIB::mteHotOID.0  IF-MIB::ifOperStatus.4
DISMAN-EVENT-MIB::mteHotValue.0  2
IF-MIB::ifDescr.4  eth2
You can enable "linkUpDownNotifications yes" to track interface status, but I found that this type of notification did not include the interface name.
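If the stock traptoemail output is not what you want, a custom traphandle program is an option. snmptrapd passes the trap to the handler on standard input: the first line is the host name, the second line is the transport address, and each following line is one varbind as "OID VALUE". A minimal sketch of such a handler's parsing step, assuming that input format; the function name summarize_trap and the summary layout are made up for this example, and the exact OID and value formatting depends on the snmptrapd output options.

```shell
# Read one trap in traphandle stdin format and print a one-line summary.
# summarize_trap is a hypothetical helper, not part of net-snmp.
summarize_trap() {
    local host addr oid value summary=''
    read -r host          # line 1: host name of the sender
    read -r addr          # line 2: transport address (not used in the summary)
    while read -r oid value; do
        case $oid in
            *mteHotTrigger*) summary="$summary trigger=$value" ;;
            *ifDescr*)       summary="$summary ifDescr=$value" ;;
        esac
    done
    echo "$host:$summary"
}

# Example input modelled on the mteTriggerFired trap shown above
printf '%s\n' \
    'localhost' \
    'UDP: [127.0.0.1]:46356' \
    'DISMAN-EVENT-MIB::mteHotTrigger.0 interface down' \
    'IF-MIB::ifDescr.4 eth2' | summarize_trap
# prints localhost: trigger=interface down ifDescr=eth2
```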
Troubleshooting
1. "failed to run mteTrigger query" error
- make sure the user also has permission in the usm security model: "authuser   read -s usm guest_user noauth  .1"
- explicitly set the user with "-u guest_user" in the monitor command
2. Start snmpd in debugging mode for disman (Distributed Management)
/usr/sbin/snmpd -Ddisman -Lsd -Lf /var/log/snmpd.log -p /var/run/snmpd.pid -a

Saturday, March 5, 2011

Setup SNMP V3 USM with encryption.

SNMP v3 introduces advanced security: it supports USM (User-based Security Model) and data encryption. SNMPv1 and SNMPv2c only support access control based on a community string and send data in clear text. SNMP v3 no longer has the term community string, nor (it seems) the ability to control access based on source network.
The following instructions are based on Centos 5.5 + NET-SNMP   5.3.2.2
Create user
Create the user guest_user, whose authentication password is "Pass0001" and whose privacy (encryption) passphrase is "sharedkey001"
Put the createUser command into the file /var/net-snmp/snmpd.conf. Once snmpd is restarted, the line is deleted for security reasons and the user is created in usmUserTable.
$cat /var/net-snmp/snmpd.conf
createUser guest_user     MD5 "Pass0001" DES "sharedkey001"
Grant user permission to all OIDs (.1) 
$ cat /etc/snmp/snmpd.conf
##authuser    read,write [-s secmodel] user [noauth|auth|priv [oid|-V view]]
#auth = authentication without privacy (encryption)
#priv = authentication plus privacy (encryption)
authuser   read -s usm  guest_user priv  .1
Restart snmpd
service snmpd restart
Test  by snmpget
$snmpget -v 3 -u guest_user -l Priv -a MD5 -A Pass0001 -x DES -X sharedkey001 192.168.56.31 sysName.0
SNMPv2-MIB::sysName.0 = STRING: centos64.local.net
List users
$ snmptable -v 3 -u guest_user   -l Priv  -a MD5 -A Pass0001 -x DES -X sharedkey001  192.168.56.31 usmUsertable
SNMP table: SNMP-USER-BASED-SM-MIB::usmUserTable
guest_user
Add  user
#add  user guest_user2  by cloning guest_user
#The connecting user must be given write access (authuser read,write …. )  in order to add/delete users
$snmpusm -v 3 -u guest_user   -l Priv  -a MD5 -A Pass0001 -x DES -X sharedkey001  192.168.56.31 create  guest_user2  guest_user
User successfully created
Delete user
$snmpusm -v 3 -u guest_user   -l Priv  -a MD5 -A Pass0001 -x DES -X sharedkey001  192.168.56.31 delete  guest_user2
Client configuration file snmp.conf
You can put most command options in the client config file: /etc/snmp/snmp.conf or ~/.snmp/snmp.conf
$cat ~/.snmp/snmp.conf
defVersion 3
defSecurityName guest_user
defAuthType MD5
defSecurityLevel authPriv
defAuthPassphrase Pass0001
defPrivType  DES
defPrivPassphrase sharedkey001
#the long command can be simplified to
$snmpget  192.168.56.31 sysName.0
SNMPv2-MIB::sysName.0 = STRING: centos64.local.net

Tuesday, March 1, 2011

When Centos hung on starting up boot services, how to get to shell without rescue CD

Centos 5.5 hung while starting the udev service. My first instinct was to use interactive startup mode to skip udev, as the boot message hints: "press i to enter interactive startup".
I later discovered that "interactive startup mode" is almost useless. Firstly, it is hard to activate this mode by pressing the "I" key. Secondly, not all services honour this mode; the network service seems to be the only one.
A flag file, /var/run/confirm, is created when the "I" key (case insensitive) is pressed. It seems only the network service checks this file.
[root@centos64 init.d]# grep -C 2 /var/run/confirm /etc/init.d/*
/etc/init.d/network-            fi
/etc/init.d/network-            # If we're in confirmation mode, get user confirmation.
/etc/init.d/network:            if [ -f /var/run/confirm ]; then
/etc/init.d/network-                    confirm $i
/etc/init.d/network-                    test $? = 1 && continue

So how can you gain shell access without a rescue CD? The answer is to append "init=/bin/sh" to the kernel line in the grub boot loader.
Let's review the Linux boot order
The BIOS ->MBR->Boot Loader->Kernel->/sbin/init->
/etc/inittab->
/etc/rc.d/rc.sysinit->
/etc/rc.d/rcX.d/ #where X is run level in /etc/inittab
run the scripts starting with K (kill), then the scripts starting with S (start)
By default "init=/sbin/init", which transfers control in the above order.
If you set "init=/bin/sh", the boot stops there and gives you a root shell without a login prompt.
Booting to single user mode won't fix the udev startup issue, because udev starts before single user mode (udev is started from /etc/rc.d/rc.sysinit, single user mode is in /etc/rc.d/rc1.d)
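To confirm which program the kernel was told to start, you can look for "init=" on the kernel command line. A minimal sketch, assuming the command line is passed in as a single string; the function name init_from_cmdline and the root device in the examples are made up for illustration.

```shell
# Print the program named by init= on a kernel command line,
# falling back to /sbin/init when the option is absent.
# init_from_cmdline is a hypothetical helper.
init_from_cmdline() {
    local word init=/sbin/init
    # Split the command line on whitespace and look for an init= option
    for word in $1; do
        case $word in
            init=*) init=${word#init=} ;;
        esac
    done
    echo "$init"
}

init_from_cmdline "ro root=/dev/sda1 rhgb quiet"     # prints /sbin/init
init_from_cmdline "ro root=/dev/sda1 init=/bin/sh"   # prints /bin/sh
```

On a running system the current command line can be passed in with init_from_cmdline "$(cat /proc/cmdline)".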
Instructions:
In the Grub menu, select the kernel, press "a" to edit the boot options, append "init=/bin/sh", then press Enter to boot
After gaining the shell, the root file system is most likely mounted read-only.
Remount it in read-write mode with "mount -o rw,remount /"