TECH NOTES JOURNEY THROUGH A DECADE: Scripting


Wednesday, July 18, 2018

Python script to generate Ansible ini inventory file from csv file

An in-memory Ansible inventory created with add_host is often used in AWS EC2 provisioning. It is easy to generate, but it has a drawback: because it exists only in memory, all post-build tasks for the servers have to live in one big playbook. That makes it hard to re-run only the failed tasks after a failure, and existing post-build playbooks can't be reused.

I created a Python script that generates a temporary inventory file from the CSV file used in EC2 provisioning. The inventory file can then be used by multiple post-build playbooks. The file name is static, but the file will not be overwritten by another build if you set the concurrent build limit to 1 on the CI/CD server.
Some AWS EC2 instances in my company need a static hostname. The ip field is filled in automatically with the EC2 private IP returned right after provisioning, and there is a playbook to create the host record in Infoblox. The group field lists the Ansible groups (used for group vars); multiple groups are separated by semicolons and the order is important, because vars in the last group take precedence.
The csv file

name,ip,group,zone,env
awselk1,,elasticsearch;elasticsearch-master,2a,prod
awselk2,,elasticsearch;elasticsearch-data,2a,prod

The script

#!/usr/bin/python
# Takes a CSV file "xxx.csv" and writes the Ansible host inventory to hosts-aws-tmp.ini next to this script
import csv
import sys
import os
 
if len(sys.argv) <= 1:
   print("Usage: " + sys.argv[0] + " input-filename")
   sys.exit(1)
net_dn = {'prod':'prod.example.com', 'preprod':'preprod.example.com',
          'test':'test.example.com', 'dev':'dev.example.com'}
groups = []
envs = set()
hosts_ini = {}

csvname = sys.argv[1]
scriptpath = os.path.dirname(sys.argv[0])

ansible_ini = os.path.join(scriptpath, 'hosts-aws-tmp.ini')

lines = []
hosts_text = ''
with open(csvname) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        domain = net_dn[row['env'].strip()]
        line = row['name'].strip()+'.'+domain
        #lines.append(line)
        envs.add(row['env'].strip())
        # support multiple groups separated by ;
        for g in row['group'].strip().split(';'):
          g = g.strip()
          if g not in groups:
            groups.append(g)
          hosts_ini.setdefault(g, []).append(line)

#groups=set(groups)
if len(envs) != 1:
   print("ERROR: only a single environment is supported!")
   sys.exit(1)
env = list(envs)[0]
env_text = "["+env+":children]"+"\n"+"\n".join(groups)   
vars_text = "\n\n["+env+":vars]"
vars_text += """
ansible_user=ansible
ansible_ssh_private_key_file=~/.ssh/id_rsa
ansible_become=true
ansible_become_user=root
ansible_become_method=sudo
ansible_gather_facts=no
"""
vars_text+="aws_env=aws-"+env+'\n'
#generate groups in order as input
for g in groups:
   hosts_text+='\n['+g+']\n'
   hosts_text+='\n'.join(hosts_ini[g])
   hosts_text+='\n'
 
all_text = env_text+vars_text+hosts_text
print(all_text)
with open(ansible_ini,'w') as new_ini_file:
    new_ini_file.write(all_text)
print("INFO: Generated Ansible host inventory file: " + ansible_ini)

The Ansible inventory file generated

[prod:children]
elasticsearch
elasticsearch-master
elasticsearch-data

[prod:vars]
ansible_user=ansible
ansible_ssh_private_key_file=~/.ssh/id_rsa
ansible_become=true
ansible_become_user=root
ansible_become_method=sudo
ansible_gather_facts=no
aws_env=aws-prod

[elasticsearch]
awselk1.prod.example.com
awselk2.prod.example.com

[elasticsearch-master]
awselk1.prod.example.com

[elasticsearch-data]
awselk2.prod.example.com
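
For reuse or unit testing, the grouping step of the script could be isolated as a function. The following is a minimal Python 3 sketch of that idea, not the script itself: the function name group_hosts and its signature are made up for illustration, and the sample CSV is the one shown above.

```python
import csv
import io


def group_hosts(csv_text, domains):
    """Return (env, group names in input order, {group: [fqdn, ...]})."""
    groups = []    # group names, kept in first-seen order
    members = {}   # group name -> list of fully qualified host names
    envs = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        env = row["env"].strip()
        envs.add(env)
        fqdn = row["name"].strip() + "." + domains[env]
        # multiple groups are separated by ';'
        for group in row["group"].split(";"):
            group = group.strip()
            if group not in groups:
                groups.append(group)
            members.setdefault(group, []).append(fqdn)
    if len(envs) != 1:
        raise ValueError("only a single environment is supported")
    return envs.pop(), groups, members


SAMPLE = """name,ip,group,zone,env
awselk1,,elasticsearch;elasticsearch-master,2a,prod
awselk2,,elasticsearch;elasticsearch-data,2a,prod
"""

env, groups, members = group_hosts(SAMPLE, {"prod": "prod.example.com"})
print(env, groups)
```

Keeping the group names in a list rather than a set is what preserves the input order in the generated file.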

Thursday, October 30, 2014

Python script to run remote SSH commands with sudo permission

I created a Python script to run remote SSH commands with sudo permission. The Linux ssh command doesn’t accept a password as a command-line option, so you have to use an expect script to connect to multiple servers for automation. The plink tool on Windows does accept a password as a command-line option.
The trick to passing the sudo password is the ‘-S’ option of sudo, which reads the password from stdin. It seems to be safe: I turned on debug logging and couldn’t see the password recorded in the secure or messages logs.
There are two versions of the script: a command-line one and a class/module one.

The command line version.

If the clear-text password is a concern, you can wrap the script with Python's getpass module, which prompts for the password without echoing it. Read the password once and apply it to multiple servers.

[root@~]# ./pyssh.py  -s server1 -u admin -p Passwd123 date
Thu Oct 30 15:36:27 EST 2014

# The 'service sshd status' command ran successfully with sudo enabled ('-t')
[root@~]# ./pyssh.py  -t -s server1 -u admin -p Passwd123  'service sshd status'
openssh-daemon (pid  15686) is running...

#!/usr/bin/env python
import sys
import paramiko
import argparse
import socket
parser = argparse.ArgumentParser()
parser.add_argument("-s", "--servername", help="hostname or IP", required=True)
parser.add_argument("-P", "--port", help="ssh port default=22", type=int, default=22)
parser.add_argument("-t", "--sudo", help="enable sudo,sudo password will use the value of --password",action='store_true')
parser.add_argument("-u","--username",help="username",required=True)
parser.add_argument("-p","--password",help="password",required=True)
parser.add_argument("cmd",help="command to run")
args=parser.parse_args()

host = args.servername
port = args.port
user = args.username 
password = args.password
cmd = args.cmd
if args.sudo:
    fullcmd="echo " + password + " |   sudo -S -p '' " + cmd
else:
    fullcmd=cmd

#if __name__ == "__main__":
client = paramiko.SSHClient()
#Don't use host key auto add policy for production servers
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.load_system_host_keys()
try: 
    client.connect(host,port,user,password)
    transport=client.get_transport()
except (socket.error,paramiko.AuthenticationException) as message:
    print("ERROR: SSH connection to "+host+" failed: " +str(message))
    sys.exit(1)
session=transport.open_session()
session.set_combine_stderr(True)
if args.sudo: 
    session.get_pty()
session.exec_command(fullcmd)
stdout = session.makefile('rb', -1)
print(stdout.read().decode('utf-8', 'replace'))
transport.close()
client.close()
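
The getpass wrapper mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not part of pyssh.py: the names build_sudo_command and run_everywhere are made up, and run_remote stands for whatever function actually opens the SSH session (for example a thin wrapper around the script above). The sketch also quotes the password with shlex.quote, so that shell metacharacters in it are not interpreted by the remote shell.

```python
import getpass
import shlex


def build_sudo_command(password, cmd):
    """Build the remote command line that pipes the password into 'sudo -S'."""
    # shlex.quote protects spaces, '$', quotes etc. inside the password
    return "echo {} | sudo -S -p '' {}".format(shlex.quote(password), cmd)


def run_everywhere(servers, cmd, run_remote, read_password=getpass.getpass):
    """Prompt for the password once, then run cmd with sudo on every server.

    run_remote(server, password, full_cmd) is a hypothetical callable that
    performs the SSH connection and returns the command output.
    """
    password = read_password("SSH/sudo password: ")
    full_cmd = build_sudo_command(password, cmd)
    return {server: run_remote(server, password, full_cmd) for server in servers}
```

The password is read a single time and never appears on the local command line; it is still part of the remote command string, as in the original script.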

The class version

The class version allows multiple commands to run over an existing SSH transport, which is more efficient. To use the class, copy pyssh.py to a folder and create a new script that imports the class with 'from pyssh import PySSH', then reuse the code in the MAIN section without the if statement.

#!/usr/bin/env python
import sys
import socket
import paramiko
#=================================
# Class: PySSH
#=================================
class PySSH(object):
  
  
    def __init__ (self):
        self.ssh = None
        self.transport = None  

    def disconnect (self):
        if self.transport is not None:
           self.transport.close()
        if self.ssh is not None:
           self.ssh.close()

    def connect(self,hostname,username,password,port=22):
        self.hostname = hostname
        self.username = username
        self.password = password

        self.ssh = paramiko.SSHClient()
        #Don't use host key auto add policy for production servers
        self.ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self.ssh.load_system_host_keys()
        try:
            self.ssh.connect(hostname,port,username,password)
            self.transport=self.ssh.get_transport()
        except (socket.error,paramiko.AuthenticationException) as message:
            print("ERROR: SSH connection to "+self.hostname+" failed: " +str(message))
            sys.exit(1)
        return  self.transport is not None

    def runcmd(self,cmd,sudoenabled=False):
        if sudoenabled:
            fullcmd="echo " + self.password + " |   sudo -S -p '' " + cmd
        else:
            fullcmd=cmd
        if self.transport is None:
            return "ERROR: connection was not established"
        session=self.transport.open_session()
        session.set_combine_stderr(True)
        #print "fullcmd ==== "+fullcmd
        if sudoenabled:
            session.get_pty()
        session.exec_command(fullcmd)
        stdout = session.makefile('rb', -1)
        #print stdout.read()
        output=stdout.read().decode('utf-8', 'replace')
        session.close()
        return output

#===========================================
# MAIN
#===========================================        
if __name__ == '__main__':
    hostname = 'server1'
    username = 'admin'
    password = 'password123'
    ssh = PySSH()
    ssh.connect(hostname,username,password)
    output=ssh.runcmd('date')
    print(output)
    output=ssh.runcmd('service sshd status',True)
    print(output)
    ssh.disconnect()

Friday, September 5, 2014

Build Puppet module to use Hiera lookup

Puppet can use Hiera to look up data. This helps you disentangle site-specific data from Puppet code, for easier code re-use and easier management of data that needs to differ across your node population.

One typical example is IP addresses and NTP/DNS servers: the IP address is unique to each server, while the NTP/DNS servers are global. I built a linux-network test module to demonstrate the usage of Hiera.
Hiera supports YAML and JSON backends by default, but you can write a custom backend using the Hiera API.

Define datadir in hiera.yaml

[root@server1 modules]# cat /etc/puppetlabs/puppet/hiera.yaml 
---
:backends:
  - yaml

:hierarchy:
  - defaults
  - "%{clientcert}"
  - "%{environment}"
  - global

:yaml:
# datadir is empty here, so hiera uses its defaults:
# - /var/lib/hiera on *nix
# - %CommonAppData%\PuppetLabs\hiera\var on Windows
# When specifying a datadir, make sure the directory exists.
  :datadir: /etc/puppetlabs/puppet/hieradata

Set all values in YAML files instead of manifest files.
You can also add class names to a YAML file, then assign the classes to a node with hiera_include.

[root@server1 modules]# cat /etc/puppetlabs/puppet/hieradata/global.yaml 
---
#
# ntp.conf
ntpservers: [10.1.1.11, 10.1.1.12]

#resolv.conf
domainname: example.com
searchdomain: [example1.com, example2.com]
nameservers: [10.1.1.13, 10.1.1.14]

[root@server1 modules]# cat /etc/puppetlabs/puppet/hieradata/server1.example.com.yaml 
eth1:
   device: eth1
   ipaddr: 172.16.1.2
   netmask: 255.255.255.0
   routes: ['192.168.1.0/24 via 172.16.1.254', '192.168.2.0/24 via 172.16.1.254']
   gateway: 172.16.1.254
eth3:
   device: eth3
   ipaddr: 172.16.1.3
   netmask: 255.255.255.0
   #routes: ['192.168.1.0/24 via 172.16.1.254', '192.168.2.0/24 via 172.16.1.254']
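
How the two lookup functions used by the module walk the hierarchy can be sketched in a few lines of Python. This is only a model of the behaviour, not Hiera's implementation: hiera() returns the value from the first hierarchy level that defines the key, while hiera_array() collects the values from every level into one flattened list without duplicates. The dictionary layout and the environment name "production" are assumptions made for the example; the data values are taken from the YAML files above.

```python
def hiera(key, hierarchy, data):
    """Priority lookup: the first level that defines the key wins."""
    for level in hierarchy:
        if key in data.get(level, {}):
            return data[level][key]
    raise KeyError(key)


def hiera_array(key, hierarchy, data):
    """Array merge lookup: gather values from all levels, de-duplicated."""
    merged = []
    for level in hierarchy:
        value = data.get(level, {}).get(key)
        if value is None:
            continue
        for item in (value if isinstance(value, list) else [value]):
            if item not in merged:
                merged.append(item)
    return merged


# One dict per data file; levels without a file are simply missing
data = {
    "server1.example.com": {
        "eth1": {"device": "eth1", "ipaddr": "172.16.1.2", "gateway": "172.16.1.254"},
    },
    "global": {
        "ntpservers": ["10.1.1.11", "10.1.1.12"],
        "domainname": "example.com",
    },
}
# defaults, %{clientcert}, %{environment}, global after interpolation
hierarchy = ["defaults", "server1.example.com", "production", "global"]

print(hiera("eth1", hierarchy, data)["ipaddr"])
print(hiera_array("ntpservers", hierarchy, data))
```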

Declare the whole class, or just one of its defined types, in site.pp. The site.pp manifest file then contains only generic code.

[root@server1 modules]# cat /etc/puppetlabs/puppet/manifests/site.pp

node "server1" {
include "linux-network"
}

node "server2" {
linux-network::setinterface { 'eth1': }
}

linux-network module manifest files

[root@server1 modules]# cat ./linux-network/manifests/init.pp 
class linux-network {
 linux-network::setinterface { 'eth1': ; 'eth3': }
 linux-network::setroute { 'eth1': ; 'eth3':}

 linux-network::setconf_ntp {'ntp.conf':}
 linux-network::setconf_resolv {'resolv.conf':}
}

[root@server1 modules]# cat ./linux-network/manifests/setconf_ntp.pp 
define linux-network::setconf_ntp  ( ) {

$ntpservers=hiera_array('ntpservers')

file {"/etc/ntp.conf":
 ensure => present,
 owner => root,
 mode => '0644',
 content => template("${module_name}/ntp.conf.erb")
 }
}

[root@server1 modules]# cat ./linux-network/manifests/setconf_resolv.pp 
define linux-network::setconf_resolv  ( ) {

$domainname=hiera('domainname')
$searchdomain=hiera_array('searchdomain')
$nameservers=hiera_array('nameservers')

file {"/etc/resolv.conf":
ensure => present,
owner => root,
mode => '0644',
content => template("${module_name}/resolv.conf.erb")
 }
}

[root@server1 modules]# cat ./linux-network/manifests/setinterface.pp 
define linux-network::setinterface  ( ) {

$device=$title
$eth=hiera($device)
$ipaddr=$eth['ipaddr']
$netmask=$eth['netmask']
$gateway=$eth['gateway']

file {"/etc/sysconfig/network-scripts/ifcfg-$device":
 ensure => present,
 owner => root,
 mode => '0644',
 content => template("${module_name}/ifcfg.erb")
 }

}

[root@server1 modules]# cat ./linux-network/manifests/setroute.pp 
define linux-network::setroute  ( ) {

$device=$title
$eth=hiera($device)
$routes=$eth['routes']

file {"/etc/sysconfig/network-scripts/route-$device":
ensure => present,
owner => root,
mode => '0644',
content => template("${module_name}/route.erb")
 }
}

linux-network module template files

[root@server1 modules]# cat ./linux-network/templates/ifcfg.erb 
DEVICE=<%=@device %>
BOOTPROTO=static
ONBOOT=yes
USERCTL=no
IPADDR=<%=@ipaddr%>
NETMASK=<%=@netmask%>
<%- if @gateway =~ /(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/   -%>
GATEWAY=<%=@gateway %>
<%- end -%>

[root@server1 modules]# cat ./linux-network/templates/ntp.conf.erb 
 tinker panic 0
 restrict default kod nomodify notrap nopeer noquery
 restrict -6 default kod nomodify notrap nopeer noquery
 restrict 127.0.0.1 
 restrict -6 ::1
#
 <%- @ntpservers.each do |x| -%>
 server <%= x %>
<%- end  -%>
 driftfile /var/lib/ntp/drift

[root@server1 modules]# cat ./linux-network/templates/resolv.conf.erb
#
# resolver configuration file...
#
options         timeout:1 attempts:8 rotate
domain       <%=@domainname %>
<%-  if !@searchdomain.empty?   -%>
search <%=@domainname  -%> <%=  @searchdomain.join(' ') %>
<%- end -%>
<%-  @nameservers.each do |  x | -%>
nameserver <%= x %>
<%- end -%>

[root@server1 modules]# cat ./linux-network/templates/route.erb 
<%- if defined?(@routes)   -%>
<%- @routes.each do | x | -%>
<%=x %>
<%- end -%>
<%- end -%>

Wednesday, March 12, 2014

Automate Server deployment with Ansible

There are many server automation applications on the market: Puppet, Chef, CFEngine and Salt. Ansible is relatively new, but I think it is better than Puppet for server deployment automation tasks.
1. Dependency packages
Ansible depends on Python, which is installed by default, at least on Red Hat-like distributions.
Puppet depends on Ruby, which is not installed by default.
2. Agent
Ansible is agentless; it relies on SSH.
Puppet needs an agent running on the target server as a daemon.
3. Security
Ansible uses SSH as its transport, so a username and password are required for each connection. (Ansible is smart enough to cache the SSH and sudo passwords, so you are prompted only once, for the first server.)
Puppet: the agent is controlled by the master server; if the master server is compromised, all hosts can be brought down easily.
4. Setup
Ansible is easy to set up, as there is no agent. The Ansible server is easy to set up too, as it is just Python scripts. You can even run it without installing it.
Puppet needs packages installed on both the agent hosts and the server, and the agent certificate needs to be signed before the server can talk to the agent.
Ansible uses SSH on TCP port 22, a standard port that is already open on the firewalls of most infrastructures.
Puppet uses a custom TCP port, typically 8140 on the master (8139 if the agent runs in listen mode).
5. Command line mode
Ansible supports a command-line mode for ad-hoc tasks, so you don’t need to write task definitions; just pass the command to ansible, for example to return the date from a number of servers.
ansible myservers -k -K -u admin -m raw -a "date"

The following example shows a typical server deployment.

[root@centos1 post]# cat setup.yml 
---        #ansible playbook use YAML syntax http://en.wikipedia.org/wiki/YAML
- hosts: server1          #a server or server group as defined in /etc/ansible/hosts
  user: admin 
  sudo: yes
  vars_files:
    - vars/settings.yml   #global variables
    - vars/{{ ansible_hostname }}.yml             #server specific variable . ansible_hostname is variable, it is server1.yml for server1
  tasks:

  - name: yum
    action: yum name={{ item }}  state=present      #install yum packages
    with_items:
      - kernel-devel-{{ ansible_kernel }}
      - ed
      - ksh
      - ntp
  - script: ./scripts/sshd.sh        #the script inserts 'UseDNS no'; "- script:" is shorthand for "- name: XX" plus "action: YY"

  - name: users | Delete users       #delete users; delusers is a list of users defined in settings.yml
    action: user name={{ item }} state=absent
    with_items: "{{ delusers }}"

  - name: ifcfg-eth0 | Configuration file      #ansible template engine is Jinja2 http://jinja.pocoo.org/docs/
    action: template src=./templates/ifcfg-eth0.j2 dest=/etc/sysconfig/network-scripts/ifcfg-eth0 owner=root group=root

  - name: route-eth0 | Configuration file, /etc/sysconfig/network-scripts/route-eth0
    action: template src=templates/route-eth0.j2 dest=/etc/sysconfig/network-scripts/route-eth0

  - name: resolv.conf | Configuration file, /etc/resolv.conf
    action: template src=templates/resolv.conf.j2 dest=/etc/resolv.conf

  - name: ntpd | Configuration file, /etc/ntp.conf
    action: template src=templates/ntp.conf.j2 dest=/etc/ntp.conf
    notify:
    - restart ntpd

  - name: snmpd | Configuration file, /etc/snmp/snmpd.conf
    action: copy src=files/snmpd.conf dest=/etc/snmp/snmpd.conf owner=root group=root mode=0644
    notify:
    - restart snmpd


  - copy: src=files/clock dest=/etc/sysconfig/clock owner=root group=root mode=0644
  - command: ln -fs /usr/share/zoneinfo/Australia/Sydney /etc/localtime


  handlers:
  - name: restart sshd
    action: service name=sshd enabled=yes state=restarted
  - name: restart ntpd
    action: service name=ntpd enabled=yes state=restarted
  - name: restart snmpd
    action: service name=snmpd enabled=yes state=restarted

####----Global variables 
[root@centos1 post]# cat vars/settings.yml 
#
# ntp.conf
ntpservers: [10.1.1.1, 10.1.1.2]

#users to delete
delusers: [user1, user2]

#resolv.conf
domainname: .example.com
searchdomain: [example.com]
nameservers: [10.1.1.1, 10.1.1.2]

####----Server specific  variable
[root@centos1 post]# cat vars/server1.yml 
eth1: 
   device: eth1
   ipaddr: 172.16.1.2
   netmask: 255.255.255.0
   routes: ['192.168.1.0/24 via 172.16.1.254', '192.168.2.0/24 via 172.16.1.254']
 
####----How the templates reference the variables
[root@centos1 post]# cat templates/resolv.conf.j2 
#
# resolver configuration file...
#
options         timeout:1 attempts:8 rotate
domain          {{domainname}}
search          {{domainname}} {{ searchdomain | join (' ') }}

{% for host in nameservers %}
nameserver {{host}}
{% endfor %}

[root@centos1 post]# cat templates/ifcfg-eth1.j2 
DEVICE={{eth1.device}}
BOOTPROTO=static
ONBOOT=yes
USERCTL=no
IPADDR={{eth1.ipaddr}}
NETMASK={{eth1.netmask}}
{% if eth1.gateway is defined  %} 
GATEWAY={{eth1.gateway}}
{%endif%}
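
To make the conditional in the ifcfg template concrete, the sketch below reproduces in plain Python roughly what the template renders for one interface. It is an approximation for illustration only: it does not use Jinja2, it ignores the blank lines that Jinja2 block tags leave behind by default, and the function name render_ifcfg is made up. The input is the eth1 entry from vars/server1.yml, which has no gateway, so no GATEWAY line is produced.

```python
def render_ifcfg(eth):
    """Approximate the rendered ifcfg file for one interface dictionary."""
    lines = [
        "DEVICE=" + eth["device"],
        "BOOTPROTO=static",
        "ONBOOT=yes",
        "USERCTL=no",
        "IPADDR=" + eth["ipaddr"],
        "NETMASK=" + eth["netmask"],
    ]
    # mirrors "{% if eth1.gateway is defined %}" in the template
    if "gateway" in eth:
        lines.append("GATEWAY=" + eth["gateway"])
    return "\n".join(lines) + "\n"


eth1 = {"device": "eth1", "ipaddr": "172.16.1.2", "netmask": "255.255.255.0"}
print(render_ifcfg(eth1))
```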



####----a separate playbook to create LVM and file system 
[root@centos1 post]# cat setup-lvm.yml 
---
- hosts: server1
  user: admin
  sudo: yes
  gather_facts: no
  vars:
    mntp:  /opt
    vgname: vg01
    pvname: /dev/sdb1
    lv1: opt
 
  tasks:

  - script: ./scripts/disks.sh {{ pvname }}       #a script to create the LVM partition and the physical volume
  - name: filesystem | Create pv,vg,lv and file systems
    action: lvg  vg={{ vgname }} pvs={{ pvname }}

  #- name: filesystem | create lv
  - lvol: vg={{ vgname }} lv={{ lv1 }} size=51196

 # - name: filesystem | create fs
  - filesystem: fstype=ext4 dev=/dev/{{ vgname }}/{{ lv1 }}

  #- name: filesystem | mount dir
  - mount: name={{ mntp }} src=/dev/{{ vgname }}/{{ lv1 }} dump=1 passno=2 fstype=ext4 state=mounted

How to run the playbook?

[root@centos1 post]# ansible-playbook -k -K setup.yml

  -k, --ask-pass        ask for SSH password
  -K, --ask-sudo-pass   ask for sudo password

Download all the files
https://drive.google.com/file/d/0B-RHmV4ubtk8Y2wyazhZRS1pSVk/edit?usp=sharing

Thursday, June 20, 2013

VMware PowerCLI: Map datastore name to LUN devicename.

Mapping a datastore name to its LUN device name is not as straightforward in native PowerCLI code as you might expect.
The esxcli interface exposed to PowerCLI makes it very easy. (Tested on ESXi 5.0)

PowerCLI>$esxcli=get-esxcli -vmhost esx01
PowerCLI>$esxcli.storage.vmfs.extent.list() | ft devicename,volumename -autosize

DeviceName                           VolumeName
----------                           ----------
naa.600601605bc02e00007fb97cacbee211 datastore01
naa.600601605bc02e00fac8cd88acbee211 datastore02

Script to automatically partition a new disk and create LVM PV

Creating a single partition across a whole disk and creating an LVM PV on it is a very common task. How can it be automated?

fdisk doesn't support creating partitions in script mode. sfdisk does, but it is not as good as the powerful parted tool, which can also optimize partition alignment automatically (parted -a optimal).

#!/bin/ksh
#Create a single primary partition spanning the whole disk and create an LVM PV on it
disk=$1
partno=1
if [[ -z $disk ]]; then
 echo "Usage: $0 <disk device name>, e.g. $0 /dev/sdb"
 exit 1
fi


if [[ -e ${disk}${partno} ]]; then
 echo "==> ${disk}${partno} already exists"
 exit
fi

echo "==> Create MBR label"
parted -s $disk  mklabel msdos
ncyl=$(parted $disk unit cyl print  | sed -n 's/.*: \([0-9]*\)cyl/\1/p')

if [[ $ncyl != [0-9]* ]]; then
 echo "disk $disk has an invalid number of cylinders: $ncyl"
 exit 1
fi

echo "==> create primary partition $partno with $ncyl cylinders"
parted -a optimal $disk mkpart primary 0cyl ${ncyl}cyl
echo "==> set partition $partno to type: lvm "
parted $disk set $partno lvm on
partprobe > /dev/null 2>&1
echo "==> create PV ${disk}${partno} "
pvcreate ${disk}${partno}
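
The sed expression that extracts the cylinder count can be hard to read. The Python sketch below does an equivalent extraction on a captured output string. The sample text is an assumption about what "parted <disk> unit cyl print" prints (a "Disk /dev/sdb: NNNcyl" line), and the function name parse_cylinders is made up for the example.

```python
import re


def parse_cylinders(parted_output):
    """Return the cylinder count from 'unit cyl print' output, or None."""
    for line in parted_output.splitlines():
        # same idea as the sed pattern: a colon, a space, digits, then 'cyl'
        match = re.search(r": (\d+)cyl\s*$", line)
        if match:
            return int(match.group(1))
    return None


# Illustrative sample only; real output contains more lines
sample = "Model: Example disk (scsi)\nDisk /dev/sdb: 1305cyl\nPartition Table: msdos\n"
print(parse_cylinders(sample))
```

Returning None when nothing matches corresponds to the script's check for an invalid cylinder number.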

Friday, April 19, 2013

Understanding SysV-style Initscripts in Red Hat Linux

You often need to write your own init start/stop script; the following are the minimum requirements for the script to behave as expected. The discussion is based on the Red Hat Linux family; other distributions such as Debian use LSB (Linux Standard Base) init scripts.

Location of the script

/etc/init.d is the well-known location, but /etc/rc.d/init.d is the real, original location. Since /etc/init.d is a symbolic link to /etc/rc.d/init.d, it makes no difference.

Header of the script

It needs at least three lines: the shell interpreter (/bin/sh, /bin/bash, etc.), the chkconfig header and the script description.

#!/bin/sh
#   chkconfig: 345 56 10
#   description: Startup/shutdown script for the Common UNIX
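
The three fields of the chkconfig line are the runlevels in which the service is enabled by default (345 means runlevels 3, 4 and 5), the start priority (56) and the stop priority (10); a "-" in the first field means no default runlevels. As a small illustration, the Python sketch below parses such a header. The function name parse_chkconfig is made up for the example and this is not how chkconfig itself is implemented.

```python
import re


def parse_chkconfig(script_text):
    """Return (runlevels, start_priority, stop_priority) or None if absent."""
    match = re.search(r"^#\s*chkconfig:\s*(\S+)\s+(\d+)\s+(\d+)",
                      script_text, re.MULTILINE)
    if match is None:
        return None
    levels, start, stop = match.groups()
    # '-' means the service is not enabled in any runlevel by default
    runlevels = [] if levels == "-" else [int(ch) for ch in levels]
    return runlevels, int(start), int(stop)


header = "#!/bin/sh\n#   chkconfig: 345 56 10\n#   description: example\n"
print(parse_chkconfig(header))  # ([3, 4, 5], 56, 10)
```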

Body of the script

Obviously, the script needs to accept the parameter “start”, which the /etc/rc3.d/S* links pass on OS startup, and the parameter “stop”, which the /etc/rc0.d/K* links pass on OS shutdown.
The lockfile is often overlooked. It is used to check whether the daemon is running on OS shutdown; without it the stop action won’t be called. If you find that a script starts on OS startup but never stops properly on shutdown, you need to create the lockfile. Note: the lockfile is not the pidfile, which contains the PID of the process; the lockfile is usually an empty file.

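prog=$(basename $0)
# when called through an rc?.d symlink, $0 carries the S??/K?? prefix; strip it
prog=${prog#[SK][0-9][0-9]}
lockfile=/var/lock/subsys/$prog
case $1 in
 start)
  start
  [ $? = 0 ] && touch $lockfile
 ;;
 stop)
  stop
  [ $? = 0 ] && rm -f $lockfile
 ;;
esac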
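
Why the lockfile matters on shutdown can be modelled with a short Python sketch. It mimics, as far as I understand it, the check that the Red Hat /etc/rc.d/rc script performs before running each K* link: the stop action is called only if a lock file named after the service (without the K and the two-digit priority) exists under /var/lock/subsys. The function name kill_links_to_run and the service names are made up for the example, and a temporary directory stands in for /var/lock/subsys.

```python
import os
import tempfile


def kill_links_to_run(kill_links, subsys_dir="/var/lock/subsys"):
    """Return the K* links whose service has a lock file in subsys_dir."""
    selected = []
    for link in kill_links:
        subsys = os.path.basename(link)[3:]   # drop 'K' plus two-digit priority
        candidates = (subsys, subsys + ".init")
        if any(os.path.isfile(os.path.join(subsys_dir, name)) for name in candidates):
            selected.append(link)
    return selected


with tempfile.TemporaryDirectory() as lockdir:
    # only "myapp" touched its lock file when it started
    open(os.path.join(lockdir, "myapp"), "w").close()
    links = ["/etc/rc0.d/K10myapp", "/etc/rc0.d/K20otherapp"]
    demo_result = kill_links_to_run(links, lockdir)

print(demo_result)
```

In this model "otherapp" is never stopped on shutdown, which is exactly the symptom described above.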

Others

It is recommended to source the functions in /etc/rc.d/init.d/functions and use ‘daemon’ to start your application and killproc to shut it down, instead of reinventing the wheel.

LSB headers

You may see something like this in an init script.

 # Provides: boot_facility_1 [ boot_facility_2 ...]
 # Required-Start: boot_facility_1 [ boot_facility_2 ...]
 # Required-Stop: boot_facility_1 [ boot_facility_2 ...]
 # Should-Start: boot_facility_1 [ boot_facility_2 ...]
 # Should-Stop: boot_facility_1 [ boot_facility_2 ...]
 # Default-Start: run_level_1 [ run_level_2 ...]
 # Default-Stop: run_level_1 [ run_level_2 ...]
 # Short-Description: short_description
 # Description: multiline_description

These are LSB (Linux Standard Base) headers, which are supported by default in Debian and SUSE Linux.
Red Hat Linux supports them through the additional package “redhat-lsb”, which is not installed by default. Be warned: 50+ dependencies need to be installed as well.

Reference

http://refspecs.linuxfoundation.org/LSB_2.1.0/LSB-generic/LSB-generic/initscrcomconv.html

https://fedoraproject.org/wiki/Packaging:SysVInitScript?rd=Packaging/SysVInitScript

Friday, February 8, 2013

Monitor customized application in Windows by SNMP

The native SNMP service in Windows provides basic metrics such as CPU, memory and disk, but it lacks the “extend” feature of net-snmp, which lets you run a script for application monitoring. Net-snmp can’t simply replace the Windows SNMP service, because some SNMP extension agents rely on the native service and there are known issues, such as the HOST-RESOURCES MIB not working in net-snmp.

The good news is that net-snmp can co-exist with Windows SNMP: you get nice features such as extend, while the other functions are passed to the native Windows SNMP service.

As of Net-SNMP 5.4, the Net-SNMP agent is able to load the Windows SNMP service extension DLLs by using the Net-SNMP winExtDLL extension. The extension requires the net-snmp binary to match the OS architecture (a 32-bit net-snmp build won’t work on 64-bit Windows).

A 64-bit net-snmp binary is hard to find; it seems only net-snmp-5.5.0-2 has a pre-compiled 64-bit binary, so you might need to compile other versions yourself.

Install net-snmp

Run the net-snmp binary installer and select “with Windows Extension” instead of the standard agent; unselect “net-snmp trap service” and “Perl SNMP modules”. The default path is c:\usr.

Configure net-snmp

Register net-snmp as Windows service

Edit c:\usr\registeragent.bat to disable the modules that conflict with Windows by adding this parameter:

"-I-udp,udpTable,tcp,tcpTable,icmp,ip,interfaces,snmp_mib"

(Note: if system_mib is also disabled, SNMPv2-MIB::sysuptime won’t report the correct time.)

Run c:\usr\registeragent.bat

Edit C:\usr\etc\snmp\snmpd.conf

rocommunity public 192.168.1.10
#Test the extend feature by executing a script; the script path must use Unix-style '/'
extend userscript c:/temp/test1.bat

Start the Windows service “net-snmp agent” (the native SNMP service must be stopped).

Test

#Test standard SNMP metrics, the HOST-RESOURCES-MIB is provided by native SNMP service, not net-snmp
[root@zabbix]#/usr/bin/snmpwalk -v 2c  -c public 192.168.1.20   HOST-RESOURCES-MIB::hrSystemUptime
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (640892116) 74 days, 4:15:21.16

#The extend feature is provided by net-snmp, Execute the script by snmpwalk
[root@zabbix]#/usr/bin/snmpwalk -v 2c -Ov -c public 192.168.1.20 'NET-SNMP-EXTEND-MIB::nsExtendOutLine."userscript"'
STRING: web-time=80
STRING: web-status=[ok]
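
On the monitoring server, the key=value lines returned by the extend script still have to be split into individual metrics. A minimal Python sketch of that step is shown below; the function name parse_extend_output is made up for the example, and the input is the snmpwalk output shown above (surrounding double quotes, if snmpwalk prints them, are stripped as well).

```python
def parse_extend_output(snmpwalk_text):
    """Turn 'STRING: key=value' lines from snmpwalk -Ov into a dictionary."""
    metrics = {}
    for line in snmpwalk_text.splitlines():
        _, sep, payload = line.partition("STRING: ")
        if not sep:
            continue                      # not a string value line
        key, eq, value = payload.strip().strip('"').partition("=")
        if eq:
            metrics[key.strip()] = value.strip()
    return metrics


sample = "STRING: web-time=80\nSTRING: web-status=[ok]\n"
metrics = parse_extend_output(sample)
print(metrics)
```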

Troubleshooting

C:\usr\log\snmpd.log

To check which Windows modules are loaded, start snmpd from the command line with “winExtDLL” debugging enabled:

Snmpd.exe -I-udp,udpTable,tcp,tcpTable,icmp,ip,interfaces,snmp_mib -DwinExtDLL

Reference:

http://net-snmp.sourceforge.net/docs/README.win32.html

Thursday, February 7, 2013

Shell script to check Oracle Tablespace usage

I searched for a shell script to check Oracle tablespace usage. Most of the scripts I found use complex SQL statements and don’t report usage accurately, because auto-extend and multiple data files are not taken into account in the calculation. In fact, there is a built-in view, “dba_tablespace_usage_metrics”, for this purpose, available since Oracle 10g.
The following script checks Oracle database availability and tablespace usage and measures the response time. The script outputs “key=value” pairs, which can easily be discovered by LLD (low-level discovery) in Zabbix. (With LLD, Zabbix can dynamically discover any number of items to monitor without you adding the items manually.)

Script sample output

db-time= 71
db-status=[OK]: Name:SYSAUX SizeMB:1024 Used%: 73 ; Name:SYSTEM SizeMB:1024 Used%: 72 ; Name:USERS SizeMB:5 Used%: 20 ; Name:TEMP SizeMB:2048 Used%: 2 ; Name:UNDOTBS1 SizeMB:2048 Used%: 1 ;  8 rows selected.

The Oracle login used in the script must have permission to read the view or have the “select_catalog_role” role granted.

Script detail

#!/bin/ksh
function checkdb {
TNSNAME=$1
OUSER=$2
OPASS=$3

ORACLE_HOME=${ORACLE_HOME:=/u01/app/oracle/product/11.2.0/client_1}
export ORACLE_HOME
t1="$(date +%s%N)"

rt=$($ORACLE_HOME/bin/sqlplus -S ${OUSER}/${OPASS}@${TNSNAME}<< _END
set heading off
set linesize 200
select
   'Name:'|| tablespace_name,
   'SizeMB:'||round(TABLESPACE_SIZE*8/1024)||' Used%:',
   round(used_percent),
   ';'
from
   dba_tablespace_usage_metrics
order by 3 desc;
exit;
_END)

t2="$(date +%s%N)"
echo "db-time= $((($t2 - $t1)/1000000))"
#remove blank lines,ignore UNDOTBS,get the numeric value by removing tab and spaces
tbpct=$(echo "$rt" | egrep -v '^$|UNDOTBS' | head -1 | sed 's/.*Used%:\(.*\);/\1/'  |  sed 's/[ \t]*//g')
#Critical condition: non-numeric value returned or usage above the 95% threshold
if [[ "$tbpct" != +([0-9]) ]] || [ "$tbpct" -gt 95 ] ; then
 echo "db-status=[CRITICAL]:" $rt
else
 echo "db-status=[OK]:" $rt
fi
}
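
The threshold logic at the end of the function can also be expressed in Python, which makes the parsing of the sqlplus output easier to test. The sketch below is an illustration, not a replacement for the script: the function name tablespace_status and the regular expression are assumptions based on the sample output shown earlier, and tablespaces whose names start with UNDOTBS are ignored, as in the shell version.

```python
import re

# matches e.g. "Name:SYSAUX SizeMB:1024 Used%: 73 ;"
ROW = re.compile(r"Name:(\S+)\s+SizeMB:(\d+)\s+Used%:\s*(\d+)")


def tablespace_status(sqlplus_output, threshold=95, ignore=("UNDOTBS",)):
    """Return (status, worst_row); worst_row is (name, size_mb, used_pct)."""
    rows = [
        (name, int(size), int(pct))
        for name, size, pct in ROW.findall(sqlplus_output)
        if not name.startswith(ignore)
    ]
    if not rows:
        # nothing parsable came back, e.g. a login error: treat as critical
        return "CRITICAL", None
    worst = max(rows, key=lambda row: row[2])
    return ("CRITICAL" if worst[2] > threshold else "OK"), worst


sample = ("Name:SYSAUX SizeMB:1024 Used%: 73 ; Name:SYSTEM SizeMB:1024 Used%: 72 ; "
          "Name:USERS SizeMB:5 Used%: 20 ; Name:TEMP SizeMB:2048 Used%: 2 ; "
          "Name:UNDOTBS1 SizeMB:2048 Used%: 1 ;")
print(tablespace_status(sample))
```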