I use Puppet to configure most of my machines. Unfortunately I am not perfect and sometimes introduce errors in my modules. Of course I only ever test such modules on machines that are not affected. On an affected machine, Puppet starts running, applies some modules, hits the error, and stops. So sometimes I have a happily running Puppet that does only half of the tasks it should. Using stages in Puppet, I can hopefully detect such situations.
First I define stages in my manifest/nodes.pp:
stage { 'start':
  before => Stage['main'],
}
stage { 'last': }
Stage['main'] -> Stage['last']
class { 'createstamp':
  stage => 'last',
}
class { 'resolv_conf':
  stage => 'start',
}
There is one stage, start, that runs first, and one stage, last, that runs once everything else is done. Everything else runs in the default stage main.
At the moment the only module in the start stage is resolv_conf, because DNS should always work as expected. The only module in the last stage is createstamp, which just creates a temporary file containing a time stamp.
class createstamp {
  file { 'stamp':
    path   => '/usr/local/nagios/createStamp',
    ensure => file,
    mode   => '0644',
    owner  => 'root',
    group  => 'root',
    source => [
      'puppet:///modules/createstamp/stamp',
    ],
  }
}
The source file for this module is generated on the puppetmaster by a cron job that runs every two hours:
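#!/bin/bash
STAMPFILE=/etc/puppet/code/environments/production/modules/createstamp/files/stamp
# seconds between the Unix epoch and Jan 1 2000 (UTC)
s2000=$(date +%s --date="Jan 1 00:00:00 UTC 2000")
now=$(date +%s)
# store the seconds elapsed since Jan 1 2000
echo $((now-s2000)) > "$STAMPFILE"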
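When debugging, it can help to turn a stamp value back into a readable date. The following is only a minimal sketch, assuming GNU date (which the script above already relies on); the helper name stamp_to_date is my own invention and not part of the setup described here.

```shell
#!/bin/bash
# Seconds between the Unix epoch and Jan 1 00:00:00 UTC 2000.
S2000=$(date +%s --date="Jan 1 00:00:00 UTC 2000")

# stamp_to_date (hypothetical helper): convert a stamp value
# (seconds since 2000-01-01 UTC) into a readable UTC date.
stamp_to_date() {
    date -u --date="@$(( $1 + S2000 ))" '+%Y-%m-%d %H:%M:%S UTC'
}

# Example: show when the stamp on a client was generated
# stamp_to_date "$(cat /usr/local/nagios/createStamp)"
```

Running the commented example on a client shows roughly when the puppetmaster last regenerated the stamp that this client received.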
Now I just have to check this file with Nagios and a custom NRPE check like:
#!/bin/sh
STAMPFILE=/usr/local/nagios/createStamp
# seconds between the Unix epoch and Jan 1 2000 (UTC)
s2000=$(date +%s --date="Jan 1 00:00:00 UTC 2000")
if [ ! -f "$STAMPFILE" ]; then
    echo "CRITICAL - no stampfile available here"
    exit 2
fi
now=$(date +%s)
stampTime=$(cat "$STAMPFILE")
# age of the stamp in seconds
diff=$((now-s2000-stampTime))
if [ "$diff" -gt 60000 ]; then
    echo "CRITICAL - stamp too old: $now / $((now-s2000)) $stampTime"
    exit 2
else
    echo "OK - stamp ok $now / $((now-s2000)) $stampTime"
fi
exit 0
In this case I wait 60000 s (a bit more than 16 h) before Nagios complains. This is because some external machines run Nagios only every 8 h, so I wait roughly 16 h before everything goes red.
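The threshold logic can be tried out in isolation, without touching a real stamp file. This is just a sketch under my own assumptions: the function name check_age and the variable MAX_AGE are hypothetical, and only the 60000 s limit and the exit codes 0 and 2 come from the check above.

```shell
#!/bin/bash
MAX_AGE=60000   # seconds, same limit as in the NRPE check

# check_age (hypothetical helper): print a Nagios-style status line for
# a stamp age in seconds and return the matching plugin exit code.
check_age() {
    local age=$1
    if [ "$age" -gt "$MAX_AGE" ]; then
        echo "CRITICAL - stamp too old: ${age}s"
        return 2
    fi
    echo "OK - stamp ok: ${age}s"
    return 0
}

# The limit expressed in whole hours and remaining minutes
echo "$((MAX_AGE / 3600))h $((MAX_AGE % 3600 / 60))min"   # prints "16h 40min"
```

Keeping the limit in one variable makes it easy to adjust should the slowest machines ever change their interval.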