Wednesday, January 27, 2010

Benchmarking disk IO accurately by avoiding file cache

Linux tends to allocate as much memory as possible to the file cache. That boosts file system performance in general, but it is undesirable when measuring pure disk IO performance. The general practice for popular benchmarking tools, e.g. Iozone/Bonnie, is to create a test file twice the size of memory. That can't eliminate the file cache effect completely, and it is inconvenient if the server has a huge amount of memory.
The trick is to use ramfs to "pin" data into the file cache and consume almost all free memory, so no additional free memory is left for the file cache during benchmarking. I don't use tmpfs because it uses virtual memory (physical memory + swap), so its data can be swapped out.
####Current free memory
[root@centos-5.2]$ free -m
             total       used       free     shared    buffers     cached
Mem:           249         30        218          0          2          4
-/+ buffers/cache:         23        226
Swap:          509          4        505

####mount as ramfs

$ mkdir -p /mnt/ram
$ mount -t ramfs ramfs /mnt/ram

####Use the -a option to show all mounts (df hides zero-size file systems such as ramfs by default)

$ df -ah | grep ramfs
ramfs 0 0 0 - /mnt/ram

####Create a dummy file to consume almost all free memory. I leave about 20M of free memory to run the benchmarking program

$ dd if=/dev/zero of=/mnt/ram/ramfs.file bs=1024k count=195
195+0 records in
195+0 records out
204472320 bytes (204 MB) copied, 0.834913 seconds, 245 MB/s
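
The count passed to dd can also be derived from the current free memory instead of being picked by hand. This is a minimal sketch, not part of the original procedure: the helper name ramfs_fill_mb is hypothetical, and it assumes that MemFree in /proc/meminfo (reported in kB) is a good enough estimate of what can be pinned.

```shell
# ramfs_fill_mb RESERVE_MB [MEMINFO_FILE]
# Hypothetical helper: prints how many 1 MB blocks to write into ramfs
# so that roughly RESERVE_MB of memory stays free for the benchmark.
ramfs_fill_mb() {
    reserve_mb=$1
    meminfo=${2:-/proc/meminfo}
    awk -v reserve="$reserve_mb" '
        /^MemFree:/ {
            fill = int($2 / 1024) - reserve   # MemFree is reported in kB
            if (fill < 0) fill = 0            # never return a negative count
            print fill
        }' "$meminfo"
}

# Usage (as root, with /mnt/ram already mounted as ramfs):
#   dd if=/dev/zero of=/mnt/ram/ramfs.file bs=1024k count="$(ramfs_fill_mb 20)"
```

With 218 MB free this yields a count slightly above the 195 used here; rounding down a little further is harmless.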

####The 195M of data is pinned in memory, as shown in the cached column

$ free -m
             total       used       free     shared    buffers     cached
Mem:           249        226         23          0          2        200
-/+ buffers/cache:         23        226
Swap:          509          4        505

Friday, January 22, 2010

Double network throughput by tuning network parameters on Solaris and Linux

The default network buffer parameters in Solaris are too conservative; Linux from kernel 2.6.x is OK. The tuning applies to network environments where high throughput is needed, e.g. iSCSI/NFS/CIFS storage servers. It is not wise to raise the network buffers on a firewall: the parameters apply per connection, so that just wastes a lot of memory on small data flows.

ENV

VirtualBox 3.1 + CentOS 5.3 VM + OpenSolaris-2009.06 VM + Intel Pro/1000 desktop NIC for each VM + 1G RAM for each VM. I use iperf to test throughput by transferring 900 MB of data (memory to memory, no disk IO involved).

First, raise the MTU to 9000 on the Gigabit Ethernet interface (there is no point in changing the default network buffers for Fast Ethernet).

Before Tuning

Start iperf on Solaris as the server; the detected TCP window size is 48 KB.

root@opensolaris:~# /usr/local/bin/iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 48.0 KByte (default)
------------------------------------------------------------
[ 4] local 172.16.1.12 port 5001 connected with 172.16.1.11 port 39019
[ ID] Interval Transfer Bandwidth

Start iperf on Linux to transfer 900 MB of data. The detected TCP window size is 27.5 KB, and the MSS is 8948 as expected. Bandwidth is 390 Mbits/sec.
[root@centos1 ~]# iperf   -n 900M -mc 172.16.1.12
------------------------------------------------------------
Client connecting to 172.16.1.12, TCP port 5001
TCP window size: 27.5 KByte (default)
------------------------------------------------------------
[ 3] local 172.16.1.11 port 39019 connected with 172.16.1.12 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-19.3 sec 900 MBytes 390 Mbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
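
The numbers in this run can be cross-checked by hand. The sketch below is only an illustration with hypothetical helper names. It assumes iperf counts 1 MByte as 1024*1024 bytes and 1 Mbit as 10^6 bits, and that TCP timestamps are enabled (the Linux default), which takes 12 bytes of each segment on top of the 20-byte IPv4 and 20-byte TCP headers.

```shell
# mbits_per_sec MBYTES SECONDS
# Converts an iperf transfer size (MBytes) and duration (seconds)
# into Mbits/sec, rounded to the nearest integer.
mbits_per_sec() {
    awk -v mb="$1" -v s="$2" \
        'BEGIN { printf "%.0f\n", mb * 1048576 * 8 / s / 1000000 }'
}

# tcp_mss MTU
# MSS = MTU - 20 (IPv4 header) - 20 (TCP header) - 12 (timestamp option).
tcp_mss() {
    echo $(( $1 - 20 - 20 - 12 ))
}

# Usage:
#   mbits_per_sec 900 19.3   # close to the 390 Mbits/sec reported by iperf
#   tcp_mss 9000             # the MSS for a 9000-byte MTU
```

The small difference from the figure iperf prints comes from the rounded interval (19.3 sec). The "MTU 8988" in the iperf output is simply the MSS plus 40 bytes of headers.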

Tuning commands


Solaris tuning:

ndd -set /dev/tcp tcp_xmit_hiwat 983040
ndd -set /dev/tcp tcp_recv_hiwat 983040
ndd -set /dev/tcp tcp_max_buf 4194304

Save the commands in a startup script, e.g. /etc/rc3.d/S99local, so they survive a reboot.


Linux Tuning:

sysctl -w net.ipv4.tcp_rmem="40960 1048560 4194304"
sysctl -w net.ipv4.tcp_wmem="40960 196608 4194304"
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=4194304
All values are in bytes and apply per connection. net.core.rmem_max/net.core.wmem_max apply to all protocols. There is no need to change net.core.[rw]mem_default, because for TCP sockets the second value of net.ipv4.tcp_[rw]mem overrides it.

You don't need to change net.ipv4.tcp_mem; the default values are just fine. It is measured in pages (normally 4 KB) and is the overall cap for all connections. There are other advanced parameters, e.g. net.ipv4.tcp_sack/net.ipv4.tcp_timestamps. I don't change them, because the effect seems unpredictable in a complex network environment. Save the settings without the leading "sysctl -w" in /etc/sysctl.conf so they survive a reboot.
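
Stripping the "sysctl -w" prefix can be scripted. A minimal sketch with a hypothetical helper name, to_sysctl_conf: it reads the commands on stdin, drops the prefix and the shell quotes, collapses repeated spaces, and prints "key = value" lines in the form /etc/sysctl.conf accepts.

```shell
# to_sysctl_conf
# Reads lines such as:  sysctl -w net.core.rmem_max=4194304
# Prints lines such as: net.core.rmem_max = 4194304
to_sysctl_conf() {
    sed -e 's/^[[:space:]]*sysctl[[:space:]]*-w[[:space:]]*//' \
        -e 's/"//g' \
        -e 's/[[:space:]][[:space:]]*/ /g' \
        -e 's/=/ = /'
}

# Usage (tuning.sh is a hypothetical file holding the sysctl -w lines):
#   to_sysctl_conf < tuning.sh >> /etc/sysctl.conf
#   sysctl -p        # reload /etc/sysctl.conf without rebooting
```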

Run "man tcp" for more information.

After Tuning


After tuning, the bandwidth more than doubled, to 955 Mbits/sec. Please note that the default window size changed on both Solaris and Linux.

root@opensolaris:~# /usr/local/bin/iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 960 KByte (default)

[root@centos1 ~]# iperf   -n 900M -mc 172.16.1.12
------------------------------------------------------------
Client connecting to 172.16.1.12, TCP port 5001
TCP window size: 192 KByte (default)
------------------------------------------------------------
[ 3] local 172.16.1.11 port 36932 connected with 172.16.1.12 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 7.9 sec 900 MBytes 955 Mbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)

Reference:


Solaris TCP Tunable Parameters


Change MTU for Solaris


Download iperf for Solaris 10

Thursday, January 21, 2010

Change MTU for Solaris on e1000g interface

I have Linux and OpenSolaris installed on VirtualBox with Intel Pro/1000 network interfaces. Changing the MTU on Linux worked fine, but it failed on Solaris:
$ ifconfig e1000g1 mtu 9000
ifconfig: setifmtu: SIOCSLIFMTU: e1000g1: Invalid argument

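It turns out that the Solaris driver doesn't have jumbo frames enabled by default; you have to enable them manually. The following setting in /kernel/drv/e1000g.conf enables jumbo frames on e1000g1 only (each position in the list corresponds to one driver instance, starting with e1000g0).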
MaxFrameSize=0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0;
# 0 is for normal ethernet frames.
# 1 is for upto 4k size frames.
# 2 is for upto 8k size frames.
# 3 is for upto 16k size frames.
# These are maximum frame limits, not the actual ethernet frame
# size. Your actual ethernet frame size would be determined by
# protocol stack configuration (please refer to ndd command man pages)
# For Jumbo Frame Support (9k ethernet packet)
# use 3 (upto 16k size frames)

Now set the MTU to 9000 instead of 16K by editing /etc/hostname.e1000g1, the interface configuration file. The first entry is your IP address or a name; the name must be resolvable in /etc/hosts.
ip/name mtu 9000

Run shutdown -i6 to reboot so the change takes effect.

dladm is supposed to be the new method, but it didn't work.
# dladm show-linkprop -p mtu e1000g1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
e1000g1      mtu             rw   16298          1500           --
# dladm set-linkprop -p mtu=9000 e1000g1
dladm: warning: cannot set link property 'mtu' on 'e1000g1': try again later

Wednesday, January 13, 2010

Troubleshooting a high system CPU usage issue on Linux/Solaris

A Linux server has high %system CPU usage. The following steps find the root cause of the issue and resolve it.
vmstat shows that %system CPU usage (the sy column) is high.
# vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1  0      0 191420   8688  35780    0     0     0     0 1006   31  1  4 96  0  0
1  0      0 124468   9208  98020    0    0 15626  2074 1195  188  0 76  0 24  0
0  1      0 110716   9316 110996    0    0  3268  4144 1366   84  0 94  0  7  0
0  3      0  97048   9416 122272    0    0  2818 11855 1314  109  1 80  0 20  0
0  4      0  80476   9544 137888    0    0  3908  2786 1272  172  0 54  0 46  0
2  1      0  72860   9612 145848    0    0  1930     0 1193  141  0 42  0 58  0
0  1      0  74300   9620 145860    0    0     0     6 1208   67  0 38  0 62  0
0  0      0  75680   9620 145860    0    0     0  6929 1364  101  0 70  6 24  0
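
On a long capture it helps to filter for the busy samples only. A minimal sketch, assuming the vmstat column layout shown above (sy is the 14th column); the helper name high_sys is hypothetical.

```shell
# high_sys THRESHOLD
# Reads vmstat output on stdin and prints only the sample lines whose
# sy (system CPU %) column is at or above THRESHOLD.
high_sys() {
    awk -v t="$1" '
        # skip the two header lines: data lines start with a number
        $1 ~ /^[0-9]+$/ && NF >= 16 && $14 + 0 >= t + 0 { print }'
}

# Usage:
#   vmstat 2 10 | high_sys 50
```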

Let's run mpstat to show more detailed CPU usage. It shows the CPU was busy handling interrupts (%irq and %soft).

# mpstat 2
Linux 2.6.18-92.el5 (centos-ks)         01/14/2010

02:03:50 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
02:04:04 AM  all    1.33    0.00   41.78    0.00    0.44    3.56    0.00   52.89   1015.56
02:04:06 AM  all    0.00    0.00    8.04   38.69   29.65   23.62    0.00    0.00   1326.63
02:04:08 AM  all    0.00    0.00    8.70   30.43   27.54   28.50    0.00    4.83   1327.54
02:04:10 AM  all    0.00    0.00    5.47   46.77   27.36   20.40    0.00    0.00   1280.10
02:04:12 AM  all    0.50    0.00    6.47   63.18   19.40   10.45    0.00    0.00   1183.08
02:04:14 AM  all    1.01    0.00    6.53   62.31   21.11    9.05    0.00    0.00   1190.95
02:04:16 AM  all    0.00    0.00    8.04   26.63   43.72   21.61    0.00    0.00   1365.83
02:04:18 AM  all    0.00    0.00    1.50    0.00    0.00    0.50    0.00   98.00   1006.50

Use sar to find out which interrupt number was the culprit. #9 was the highest, excluding the system timer interrupt #0.
# sar -I XALL 2 10
02:07:10 AM      INTR    intr/s
02:07:12 AM         0    992.57
02:07:12 AM         1      0.00
02:07:12 AM         2      0.00
02:07:12 AM         3      0.00
02:07:12 AM         4      0.00
02:07:12 AM         5      0.00
02:07:12 AM         6      0.00
02:07:12 AM         7      0.00
02:07:12 AM         8      0.00
02:07:12 AM         9    350.50

[ Solaris equivalent command]
Solaris# intrstat 2 

device |      cpu0 %tim      cpu1 %tim 
-------------+------------------------------ 
bge#0 |         0  0.0       128  0.6 
cpqary3#0 |         0  0.0        14  0.0
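
Look up interrupt 9 in /proc/interrupts to see which device raises it. Here it is shared by acpi and eth2, so the network card is the suspect.
# cat /proc/interrupts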
CPU0
0:     702980          XT-PIC  timer
1:        439          XT-PIC  i8042
2:          0          XT-PIC  cascade
6:          2          XT-PIC  floppy
8:          1          XT-PIC  rtc
  9:      14464          XT-PIC  acpi, eth2
11:         12          XT-PIC  eth0
12:        400          XT-PIC  i8042
14:       6091          XT-PIC  ide0
15:         22          XT-PIC  ide1
NMI:          0
LOC:     700623
ERR:          0
MIS:          0
[ OpenSolaris equivalent command ]
Solaris# echo ::interrupts | mdb -k
On native Solaris you have to search for the interrupt in the output of prtconf -v.
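
If sar is not installed, the same ranking can be obtained from two snapshots of /proc/interrupts. A minimal sketch with a hypothetical helper name, irq_delta: it sums the per-CPU counters of each numbered interrupt and prints the increase between the snapshots, largest first, along with the device names.

```shell
# irq_delta BEFORE AFTER
# BEFORE and AFTER are two saved copies of /proc/interrupts.
# Output lines: "<increase> <irq> <controller> <devices>", busiest first.
irq_delta() {
    awk '
        FNR == 1 { ncpu = NF; next }            # header row: one column per CPU
        $1 ~ /^[0-9]+:$/ {
            irq = $1; sub(/:$/, "", irq)
            sum = 0
            for (i = 2; i <= ncpu + 1; i++) sum += $i
            if (NR == FNR) { base[irq] = sum; next }   # first file: remember counts
            desc = ""
            for (i = ncpu + 2; i <= NF; i++) desc = desc " " $i
            printf "%d %s%s\n", sum - base[irq], irq, desc
        }' "$1" "$2" | sort -rn
}

# Usage:
#   cat /proc/interrupts > /tmp/irq.1; sleep 2; cat /proc/interrupts > /tmp/irq.2
#   irq_delta /tmp/irq.1 /tmp/irq.2 | head
```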
Solution:
When the card transmits or receives a frame, the system must be notified of the event. If the card interrupts the system for each transmitted and received frame, the result is a high degree of processor overhead. To prevent that, Gigabit Ethernet provides a feature called Interrupt Coalescence. Effective use of this feature can reduce system overhead and improve performance.

Interrupt Coalescence essentially means that the card interrupts the system only after sending or receiving a batch of frames.

You can enable adaptive moderation (the "Adaptive RX: off  TX: off" line in the ethtool output) to let the system choose values automatically, or set individual values manually.

An interrupt is generated by the card to the host when either the frame counter or the timer counter is reached. A value of 0 means disabled.

RX for example:
Timer counter in microseconds: rx-usecs/rx-usecs-irq
Frame counter: rx-frames/rx-frames-irq

# A sample output with default values.
# ethtool -c eth1
Coalesce parameters for eth1:
Adaptive RX: off  TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 18
rx-frames: 6
rx-usecs-irq: 18
rx-frames-irq: 6

tx-usecs: 80
tx-frames: 20
tx-usecs-irq: 80
tx-frames-irq: 20

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
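
Changing the values is done with ethtool -C. The sketch below only builds the command line so it can be reviewed before running it as root. The helper name coalesce_cmd is hypothetical and the values 100 and 32 are illustrative assumptions, not recommendations; which parameters a NIC accepts depends on its driver.

```shell
# coalesce_cmd IFACE RX_USECS RX_FRAMES
# Prints (does not run) the ethtool command that raises an RX interrupt
# after RX_USECS microseconds or RX_FRAMES frames, whichever comes first.
coalesce_cmd() {
    echo "ethtool -C $1 rx-usecs $2 rx-frames $3"
}

# Usage:
#   coalesce_cmd eth1 100 32             # review the command
#   eval "$(coalesce_cmd eth1 100 32)"   # apply it (needs root)
#   ethtool -C eth1 adaptive-rx on       # or let the driver adapt, if supported
```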
[ Solaris equivalent command]
This varies by driver. Find out the driver's capabilities:
Solaris# ndd -get /dev/e1000g0 \? | egrep 'interrupt |intr'
The value should be set in the driver conf file:
/platform/`uname -m`/kernel/drv/*.conf
Alternative Workaround:
I couldn't configure Interrupt Coalescence because the virtual machine NIC didn't support it. As a workaround, increasing the MTU also decreases interrupts, because the same amount of data is carried in fewer frames; ifconfig eth2 mtu 9000 resolved the issue. It needs to be set on both peers, and if they are not directly connected, make sure the switch supports jumbo frames.
You don't need to care about Interrupt Coalescence if CPU resources are abundant, but for heavily loaded NFS/CIFS/iSCSI/NAS servers it is very useful.
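
To see why a larger MTU helps, estimate how many full-size frames per second a given throughput needs. This is a rough sketch with a hypothetical helper name. It ignores protocol headers and assumes every frame is full size, so it only gives an order of magnitude for the worst case of one interrupt per frame.

```shell
# frames_per_sec MBITS MTU
# Approximate number of MTU-sized frames per second needed to carry
# MBITS megabits (10^6 bits) per second.
frames_per_sec() {
    awk -v mbits="$1" -v mtu="$2" \
        'BEGIN { printf "%.0f\n", mbits * 1000000 / (8 * mtu) }'
}

# Usage:
#   frames_per_sec 390 1500   # standard Ethernet MTU
#   frames_per_sec 390 9000   # jumbo frames: one sixth as many frames
```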