Tuesday, May 17, 2011

GFS (Global File System) quickstart

What is GFS?
GFS allow all nodes to have direct CONCURRENT write access to the same shared BLOCK storage.
For local file system e.g ext3, A shared BLOCK storage can be mounted in multiple nodes, but CONCURRENT write access is not allowed
For NFS, the CONCURRENT write access is allowed, but it is not direct BLOCK device, which introduce delay and another layer of failure.
GFS requirements:
- A shared block storage (iSCSI, FC SAN etc.. )
- RHCS (Red hat Cluster suite) (although GFS can be mounted in standalone server without cluster, it is primarily used for testing purpose or recovering data when cluster fails)
- RHEl 3.x onwards (RHEL derivatives: Centos/Fedora), it should work in other Linux distributions, since GFS and RHCS have been open sourced.
GFS specifications:
- RHEL 5.3 onwards use GFS2
- RHEl 5/6.1 supports maximum 16 nodes
- RHEL 5/6.1 64 bit supports maximum file system size of 100TB (8 EB in theory)
- Supports: data and metadata journaling, quota, acl, Direct I/O, growing file system online, dynamic inodes (convert inode block to data block) 
- LVM snapshot of CLVM under GFS  is NOT yet supported.
GFS components:
RHCS components: OpenAIS, CCS, fenced, CMAN and CLVMD (Clustered LVM)
GFS specific component: Distributed Lock Manager (DLM)
Install RHCS and GFS  rpms
Luci (Conga project) is the easiest way to install and configure RHCs and GFS.
#GFS specific packages:
#RHEL 5.2 or lower versions 
$yum install gfs-utils    kmod-gfs 
#RHEL 5.3 onwards, gfs2 module is part of kernel 
$yum install gfs2-utils   
Create GFS on LVM
You can create GFS on raw device, but LVM is recommended for consistent device names and the ability to extend device
#Assume you have setup and tested a working RHCS
#Edit cluster lock type in /etc/lvm/lvm.conf on ALL nodes

#Create PV/VG/LV as if in standalone system ONCE in any ONE of the nodes

#Start Cluster and clvmd on ALL nodes 
#Better use luci GUI interface to start whole cluster 
$Service cman start
$Service rgmanager start
$servcie clvmd start

#Create GFS ONCE in any ONE of the nodes
# -p lock_dlm is required in cluster mode. Lock_nolock is for standalone system
# -t cluster1:gfslv      ( Real cluster-name: arbitrary  GFS name )
# Above information is stored in GFS superblock, which can be changed with “gfs_tool sb” without re-initializing GFS e.g change lock type: "gfs_tool sb /device proto lock_nolock" 
#-j 2: the number of journals, minimum 1 for each node. The default journal size is 128Mib, can be overridden by -J
#additional journal can be added with gfs_jadd
gfs_mkfs -p lock_dlm -t cluser1:gfslv -j 2 /dev/vg01/lv01

#Mount GFS in cluster member by /etc/fstab
#put GFS mount in /etc/fstab in ALL nodes
#Cluster service can mount GFS without /etc/fstab after adding GFS as resource, but It can only mount on one node (the active node).  Since GFS is supposed to be mounted on all nodes at the same time. /etc/fstab is a must, GFS resource is optional.
#GFS mount options: lockproto, locktable are optional, mount can obtain the information from superblock automatically
$cat /etc/fstab
/dev/vg01/lv01          /mnt/gfs                gfs     defaults 0 0

#Mount all GFS mounts
service gfs start
GFS command lines
####Check GFS super block 
#some values can be changed by “gfs_tool sb”
$gfs_tool sb /dev/vg01/lv01 all
sb_bsize = 4096
sb_lockproto = lock_dlm
sb_locktable = cluster1:gfslv01
####GFS tunable parameters 
#view parameters
gfs_tool gettune <mountpoint>
#set parameters 
#The parameters don’t persist after re-mount, You can customize /etc/init.d/gfs to set tunable parameters on mounting
gfs_tool settune <mountpoint>

####Performance related parameters
#like other file system, you can disable access time update by mount option “noatime”
#GFS can also allow you to control how often to update access time
$gfs_tool gettune /mnt/gfs | grep atime_quantum   
atime_quantum=3660          #in secs

#Disable quota, if not needed
#GFS2 remove the parameter and implement it in mount option “quota=off”
$gfs_tool settune /mnt/gfs quota_enforce 0

#GFS direct I/O
#Enable directI/O for database files, if DB has its own buffering mechanism to avoid “double” buffering 
$gfs_tool setflag directio /mnt/gfs/test.1     #file attribute
$gfs_tool setflag inherit_directio /mnt/gfs/db/     #DIR attribute
$gfs_tool clearflag directio /mnt/gfs/test.1              #remove attribute
$gfs_tool stat  inherit_directio /mnt/gfs/file     # view attribute

#enable data journal for very small files
#disable data journal for large files
$gfs_tool setflag inherit_jdata  /mnt/gfs/db/     #Enable  data journal (only metadata  has journal  by default) on a dir. (if operate on a file, the file must be zero size)

###GFS backup, CLVM doesn't support snapshot
$gfs_tool freeze /mnt/gfs          #change GFS to read-only (done once in any one of the nodes)
$gfs_tool unfreeze /mnt/gfs

###GFS repair 
#after unmount GFS on all nodes
$gfs_fsck  -v /dev/vg01/lv01         # gfs_fsck -v -n /dev/vg01/lv01 : -n answer no to all questions, inspect gfs only without making changes

GFS implementation scenarios:
GFS’s strength is the ability to do concurrent write to the same block device, It make it possible for Active-Active cluster nodes to write to the same block device, but there are few such cases in real life.
In Active-Active cluster nodes (all nodes perform the same task), RHCS can’t do load balancing itself, it requires external load balancer
 - Database server cluster: In theory, all nodes can write to the same DB file concurrently, However, the performance will be degraded, because all nodes try to lock the file via Distributed Lock Manager.  You can assign different task to cluster nodes to write to different DB file, e.g. node-A run DB-A and node-B run DB-B, but this can be done, without GFS, by mounting  ext3 on individual iSCSI/FC disk.
GFS doesn’t lose to ext3 in above scenario, but its lack of LVM snapshot of in GFS‘s CLVMD kills my inspiration of using DB on GFS
 - Application server cluster: e.g. Apache, Jboss server cluster. It is the true that GFS can simply application package deployment because all nodes can share the same application package binaries. But if you only use two nodes cluster, deploying application twice is not big hassle. Maintaining single copy of application binaries is convenient, but at risk of single point of failure.
 - NFS Cluster: Because NFS is I/O bound, Why would you run Active-Active NFS cluster with CPU/memory resource in nodes are not being fully utilized? 

No comments:

Post a Comment