Showing posts with label Linux Internals.

Tuesday, April 26, 2011

Do we really need to set the partition type to fd (Linux raid autodetect) for Linux software RAID?

Almost all Linux RAID documents mandate that the partition type be set to fd (Linux raid autodetect) before building a Linux software RAID. Actually, this step is optional; it only helps a little, and only if your RAID device is /dev/md0 on CentOS.
What is fd (Linux raid autodetect)?
As the name implies, it enables auto-detection of RAID arrays when the OS boots. If you have created /dev/md0 but did not put it in the configuration file /etc/mdadm.conf, the OS can still detect the partitions and assemble /dev/md0.
However, this way of assembling a RAID device only works for /dev/md0 on CentOS by default.
That is because CentOS only enables raidautorun for /dev/md0 by default; any other md device is assembled by reading /etc/mdadm.conf:
[Centos 5 ] $grep -A 3 raidautorun  /etc/rc.sysinit 
[ -x /sbin/nash ] && echo "raidautorun /dev/md0" | nash --quiet
if [ -f /etc/mdadm.conf ]; then
/sbin/mdadm -A -s
fi
#The auto-detection behavior is logged in the kernel buffer
$ dmesg | grep -i auto
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
fd vs. RAID superblock
Don't confuse fd with the RAID superblock. fd is an optional flag recognized by nash's raidautorun command, whereas the RAID superblock is an essential piece of information stored on every RAID member device, containing the RAID level, state, and parent MD device UUID (man 4 md).
#Examining the superblock on the logical device fails,
#as expected, because the superblock only exists on RAID member devices
 $ mdadm --examine /dev/md0
mdadm: No md superblock detected on /dev/md0.

#Examine the superblock on a RAID member device
$ mdadm --examine /dev/sdb2
/dev/sdb2:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : a31e6699:4360a3b7:38c544fa:f4e6faa9
  Creation Time : Wed Apr 27 11:19:34 2011
     Raid Level : raid1
  Used Dev Size : 104320 (101.89 MiB 106.82 MB)
     Array Size : 104320 (101.89 MiB 106.82 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Wed Apr 27 12:51:58 2011
          State : clean
Internal Bitmap : present
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 58c72673 - correct
         Events : 20

#Scan partition superblocks to find existing RAID devices.
$ mdadm --examine --brief --scan --config=partitions
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=da55e1e2:c781a461:73d6dfa6:8c7cf6d6
##The above output can be saved to /etc/mdadm.conf; then "mdadm -A -s" will activate the RAID device.
##A DEVICE member list is optional, because the default is "DEVICE partitions".
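For reference, a complete minimal /etc/mdadm.conf built from that scan might look like this (the UUID below is copied from the example output above; yours will differ):

```
# /etc/mdadm.conf
DEVICE partitions
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=da55e1e2:c781a461:73d6dfa6:8c7cf6d6
```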
Conclusion
Partition type fd is a way of assembling RAID used by nash's raidautorun command, and it only works for /dev/md0 on CentOS by default.
If you use /etc/mdadm.conf to assemble the RAID, the fd flag is optional, but setting it does make RAID members easy to recognize in the output of "fdisk -l".

Monday, March 21, 2011

Tune Interrupt and Process CPU affinity

When an interrupt signal arrives, the CPU must stop what it is currently doing and switch to a new activity. Interrupts consume CPU time, and too many interrupts can cause high system CPU usage, as observed with mpstat. In the Symmetric Multiprocessing (SMP) model, balancing IRQs manually is usually NOT needed, because /proc/irq/default_smp_affinity or the irqbalance daemon distributes IRQ signals among CPUs automatically.
If you find that a particular IRQ is unevenly distributed in the output of "cat /proc/interrupts", you can try setting smp_affinity manually.
Tuning IRQ affinity with the smp_affinity flag
Note that it is pointless to set smp_affinity manually while irqbalance is running.

##Check the current smp_affinity settings
$ grep . /proc/irq/default_smp_affinity /proc/irq/*/smp_affinity
/proc/irq/default_smp_affinity:00000000,00000000,ffffffff,ffffffff
/proc/irq/0/smp_affinity:00000000,00000000,00000000,00000020
..
/proc/irq/24/smp_affinity:00000000,00000000,00000000,00000010
##The value of smp_affinity is a hexadecimal bit mask, 128 bits wide on my host (it may be 32 or 256 bits depending on the kernel version). Each bit position represents one CPU; if the bit at a position is "1", that CPU is allowed to accept the IRQ.
##e.g. to allow CPUs 0, 1, and 3, the binary representation is "1011", which is "b" in hex.
$echo -e "obase=16 \n ibase=2 \n 1011 \n" | bc
B
##Change IRQ 24 from CPU 4 to CPUs 0, 1, and 3
$cat /proc/irq/24/smp_affinity 
00000000,00000000,00000000,00000010
$ echo 0b >/proc/irq/24/smp_affinity
$ cat /proc/irq/24/smp_affinity 
00000000,00000000,00000000,0000000b
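To avoid doing the binary-to-hex conversion by hand, the mask can be computed with a small helper. This is just a sketch: cpus_to_mask is a hypothetical function name, not a standard tool, and it only covers CPUs 0-62 because of shell integer width.

```shell
# Sketch: build a hex smp_affinity mask from a comma-separated CPU list.
# cpus_to_mask is a made-up helper; it handles CPUs 0-62 only
# (limited by 64-bit shell arithmetic).
cpus_to_mask() {
    local mask=0 cpu
    for cpu in $(echo "$1" | tr ',' ' '); do
        mask=$(( mask | (1 << cpu) ))   # set the bit for this CPU
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 0,1,3   # prints "b", matching the bc result above
```

The result can then be written to /proc/irq/&lt;N&gt;/smp_affinity as shown above.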
Tuning process affinity with taskset
Pin a process to particular CPUs to improve cache hit rates:
$taskset -p $$
pid 21698's current affinity mask: ff
$taskset -cp 0,1,3  $$
pid 21698's current affinity list: 0-7
pid 21698's new affinity list: 0,1,3
$taskset -p $$
pid 21698's current affinity mask: b
##You can check which CPU a process is running on with the "psr" output field
$ ps axo  psr,pid,cmd | grep $$
  0 21698 -bash
Isolate CPUs to run particular processes only
The kernel parameter "isolcpus" excludes particular CPUs from the general scheduler. Together with taskset, this lets you dedicate particular CPUs to designated tasks only.
E.g. putting "isolcpus=2,3" on the kernel line in grub.conf isolates CPUs 2 and 3.
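A hypothetical grub.conf kernel line (the kernel version and root device here are made up; only the isolcpus=2,3 part is the point):

```
title CentOS
    root (hd0,0)
    kernel /vmlinuz-2.6.18-194.el5 ro root=/dev/sda2 isolcpus=2,3
    initrd /initrd-2.6.18-194.el5.img
```

After rebooting, something like "taskset -c 2 &lt;command&gt;" is needed to place work on the isolated CPUs, since the scheduler no longer uses them for ordinary tasks.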

Wednesday, March 9, 2011

Understanding Linux CPU scheduling priority

Scheduling priority depends on scheduling class.
scheduling classes
- SCHED_FIFO: A First-In, First-Out real-time process
- SCHED_RR: A Round Robin real-time process
- SCHED_NORMAL: A conventional, time-shared process
Most processes are SCHED_NORMAL.
How to find out the scheduling class of a process:
# ps  command with “class” flag
#   TS  SCHED_OTHER (SCHED_NORMAL)
#   FF  SCHED_FIFO
#   RR  SCHED_RR
$  ps -e -o class,cmd | grep sshd
TS  /usr/sbin/sshd
#chrt command
$ chrt -p 1836
pid 1836's current scheduling policy: SCHED_OTHER
pid 1836's current scheduling priority: 0

Scheduling priorities
- Real-time processes (SCHED_FIFO/SCHED_RR) have a real-time priority, ranging from 1 (lowest priority) to 99 (highest priority).
- Conventional processes (SCHED_NORMAL) have a static priority, ranging from 100 (highest priority) to 139 (lowest priority).
Nice value and static priority
A conventional process's static priority = 120 + nice value.
So users can use the nice/renice commands to change the nice value, and thereby a conventional process's priority.
By default, a conventional process starts with a nice value of 0, which equals a static priority of 120.
Checking  Real-time/Conventional process priority.
$ ps -e -o class,rtprio,pri,nice,cmd
CLS RTPRIO PRI  NI CMD
TS       -  21   0 init [3]
FF      99 139   - [watchdog/0]
watchdog is a real-time process (CLS=FF, i.e. SCHED_FIFO) whose real-time priority is 99 (I think the PRI column is irrelevant for it).
init is a conventional process (CLS=TS, i.e. SCHED_OTHER) whose nice value is 0 and whose dynamic priority is 121 (100+21) (I think the RTPRIO column is irrelevant for it).
Why is init's priority 121, not 120? Note that I used the term dynamic priority, not static priority:
dynamic priority = max(100, min(static priority - bonus + 5, 139))
The bonus ranges from 0 to 10 and is set by the scheduler depending on the past history of the process; more precisely, it is related to the process's average sleep time.
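The formula can be checked with a bit of shell arithmetic. This is a sketch: dynamic_priority is a made-up helper name, and the bonus of 4 for init is my guess to reproduce the 121 observed above.

```shell
# Sketch of the dynamic-priority formula:
#   dynamic = max(100, min(static - bonus + 5, 139))
dynamic_priority() {
    local static=$1 bonus=$2
    local p=$(( static - bonus + 5 ))
    [ "$p" -gt 139 ] && p=139   # clamp to the lowest priority
    [ "$p" -lt 100 ] && p=100   # clamp to the highest priority
    echo "$p"
}

dynamic_priority 120 4   # prints 121: a bonus of 4 on the default static 120
```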
Changing  Real-time/Conventional process priority.
#Real-time process
$chrt 80  ps -e -o class,rtprio,pri,nice,cmd
..
FF      80 120   - ps -e -o class,rtprio,pri,nice,cmd
# Conventional  process
$nice -n 10  ps -e -o class,rtprio,pri,nice,cmd
...
TS       -  12  10 ps -e -o class,rtprio,pri,nice,cmd

Monday, July 26, 2010

Dissect Linux memory cache

The "free" command has buffers and cached columns. What is the difference? And how can we dig further to find the sizes of the dentry cache, inode cache, and page cache within the cached column?

Difference between Buffers and cached


$free -m
total       used       free     shared    buffers     cached
Mem:          3777       3746         31          0        160        954
-/+ buffers/cache:       2631       1145
Swap:          753          1        751

Buffer Pages
Whenever the kernel must individually address a block, it refers to the buffer page that holds the block buffer and checks the corresponding buffer head.
Here are two common cases in which the kernel creates buffer pages:
· When reading or writing pages of a file that are not stored in contiguous disk blocks. This happens either because the filesystem has allocated noncontiguous blocks to the file, or because the file contains "holes"
· When accessing a single disk block (for instance, when reading a superblock or an inode block).


Raw disk operations, such as dd reading a block device directly, use buffers.
Read 10 MB of raw disk blocks:
$dd if=/dev/sda6 of=/dev/zero bs=1024k count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.209051 seconds, 50.2 MB/s

The buffers size increased by 10 MB (170M - 160M):
$free -m
             total       used       free     shared    buffers     cached
Mem:          3777       3754         23          0        170        952
-/+ buffers/cache:       2631       1145
Swap:          753          1        751
Dentry cache, inode cache and page cache in cached column.
Dentry cache: Directory Entry Cache, pathname (filename) lookup cache.
Inode cache: Cache for inode, not actual data block.
Page cache: Cache for actual data block
[FROM: http://www.mjmwired.net/kernel/Documentation/filesystems/vfs.txt ]
The combined size of the dentry and inode caches cannot be bigger than the whole slab size:
$grep -i slab /proc/meminfo 
Slab: 183896 kB

Examine in detail by checking /proc/slabinfo.
$ awk '/dentry|inode/ { print $1,$2,$3,$4}' /proc/slabinfo 
nfs_inode_cache 122787 123312 984
rpc_inode_cache 24 25 768
ext3_inode_cache 9767 9770 776
mqueue_inode_cache 1 4 896
isofs_inode_cache 0 0 624
minix_inode_cache 0 0 640
hugetlbfs_inode_cache 1 7 576
ext2_inode_cache 0 0 728
shmem_inode_cache 441 455 776
sock_inode_cache 231 235 704
proc_inode_cache 670 756 608
inode_cache 2415 2415 576
dentry_cache 99060 110162 200
#Sum them up (in bytes)
$awk '/dentry|inode/ { x=x+$3*$4} END {print x }' /proc/slabinfo 
153348952

View live stats with slabtop, sorted by cache size:
$slabtop -s c
Active / Total Objects (% used) : 332028 / 361849 (91.8%)
Active / Total Slabs (% used) : 45630 / 45631 (100.0%)
Active / Total Caches (% used) : 102 / 139 (73.4%)
Active / Total Size (% used) : 171443.16K / 175338.12K (97.8%)
Minimum / Average / Maximum Object : 0.02K / 0.48K / 128.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
125948 125864 99% 0.96K 31487 4 125948K nfs_inode_cache
113962 105262 92% 0.20K 5998 19 23992K dentry_cache
14609 14460 98% 0.52K 2087 7 8348K radix_tree_node
9770 9765 99% 0.76K 1954 5 7816K ext3_inode_cache
59048 44057 74% 0.09K 1342 44 5368K buffer_head
2415 2410 99% 0.56K 345 7 1380K inode_cache

I haven't found a way to get the page cache size directly; it takes a bit of calculation: page_cache ≈ cached − inode cache − dentry cache.
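That estimate can be scripted as below. This is a sketch: on recent kernels /proc/slabinfo is readable only by root, so the slab term falls back to 0 for ordinary users.

```shell
# Rough page-cache estimate: Cached (from /proc/meminfo) minus the
# dentry/inode slab usage summed from /proc/slabinfo, in kB.
cached_kb=$(awk '/^Cached:/ {print $2}' /proc/meminfo 2>/dev/null)
cached_kb=${cached_kb:-0}           # fallback if /proc is unavailable
if [ -r /proc/slabinfo ]; then
    # slabinfo fields: name, active_objs, num_objs, objsize, ...
    slab_kb=$(awk '/dentry|inode/ {x += $3*$4} END {print int(x/1024)}' /proc/slabinfo)
else
    slab_kb=0                       # /proc/slabinfo is root-only on many kernels
fi
approx_kb=$(( cached_kb - slab_kb ))
echo "approx page cache: ${approx_kb} kB"
```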
Alternatively, observe the value change by releasing the page cache:

#write dirty pages to disk
sync
#To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
#To free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
#To free pagecache, dentries and inodes:
echo 3 > /proc/sys/vm/drop_caches

Although this is a non-destructive operation, dirty objects are not freeable, so it is highly recommended to run "sync" first.
#Prefer shrinking the page cache over swapping.
sysctl -w vm.swappiness=0

vm.swappiness  # value range: 0-100. A lower value makes the kernel prefer shrinking the page cache to get free memory; a higher value makes it prefer swapping.
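To make the value persistent across reboots, it can be put in /etc/sysctl.conf:

```
# /etc/sysctl.conf
vm.swappiness = 0
```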

Tuesday, March 2, 2010

Linux memory management study notes

32-bit architectures can reference 4 GB (2^32) of physical memory.
 
#Virtual Memory: user-space virtual addresses occupy the lower 3 GB; kernel virtual addresses occupy the upper 1 GB.

#Physical Memory: three zones:
ZONE_DMA = 0-16 MB, ZONE_NORMAL = 16-896 MB, and ZONE_HIGHMEM = above 896 MB (the top 128 MB of the kernel's 1 GB virtual space is reserved for dynamic mappings)

32-bit architectures
The kernel uses page tables to map virtual addresses to physical addresses.
The kernel virtual area maps the upper 1 GB of virtual space (3 GB-4 GB) onto the low 1 GB of physical RAM.

- RAM size is less than 896 MB
Linear mapping is possible from the 1 GB kernel address space to physical RAM, covering ZONE_DMA and ZONE_NORMAL (not including the 128 MB reserved space).
Kernel page tables simply transform linear addresses starting at 0xc0000000 (3 GB) into physical addresses starting at 0.
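This linear mapping is just a fixed offset, which a one-line calculation illustrates. A sketch: virt_to_phys here is a made-up shell helper mirroring the kernel macro of the same name.

```shell
# Sketch: on 32-bit x86, the direct mapping subtracts PAGE_OFFSET
# (0xc0000000) from a kernel virtual address to get the physical address.
virt_to_phys() {
    printf '0x%x\n' $(( $1 - 0xc0000000 ))
}

virt_to_phys 0xc0100000   # prints 0x100000, i.e. the 1 MB physical mark
```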

- RAM size is between 896 MB and 4096 MB
Dynamic remapping is done in the 128 MB reserved space, because ZONE_HIGHMEM contains page frames that cannot be accessed by the kernel through the direct linear mapping.

- RAM size is more than 4096 MB
Dynamic remapping is used with a three-level paging model.
(With PAE-capable hardware and a hugemem Linux kernel, 32-bit Linux can support up to 64 GB of memory.)

64-bit architectures
ZONE_HIGHMEM is empty; everything is in ZONE_NORMAL, so no remapping is needed.
#/proc/meminfo displays physical memory info
#32bit Kernel has both low and high memory

[ 32bit Kernel]$ cat /proc/meminfo 
MemTotal:      3897500 kB
MemFree:       3280456 kB
..
HighTotal:     3014592 kB
HighFree:      2685548 kB
LowTotal:       882908 kB
LowFree:        594908 kB

#On a 64-bit kernel ZONE_NORMAL is huge; all memory fits in LowTotal, so HighTotal is zero

[64bit Kernel]$ cat /proc/meminfo  
MemTotal:     37025752 kB
MemFree:      17509720 kB
..
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     37025752 kB
LowFree:      17423152 kB
..

#From the kernel documentation
HighTotal, HighFree: Highmem is all memory above ~860MB of physical memory. Highmem areas are for use by userspace programs, or for the pagecache. The kernel must use tricks to access this memory, making it slower to access than lowmem.

LowTotal, LowFree: Lowmem is memory which can be used for everything that highmem can be used for, but it is also available for the kernel's use for its own data structures. Among many other things, it is where everything from the Slab is allocated. Bad things happen when you're out of lowmem.

#Links
BOOK:Understanding the Linux Kernel By Daniel Pierre Bovet, Marco Cesatí
http://books.google.com.au/books?id=h0lltXyJ8aIC&dq=Understanding+the+Linux+Kernel+By+Daniel+Pierre+Bovet,+Marco+Cesat%C3%AD&source=gbs_navlinks_s
High Memory In The Linux Kernel
http://kerneltrap.org/node/2450
Kernel document
http://www.mjmwired.net/kernel/Documentation/filesystems/proc.txt