Contents:

I. Taming Nagios
  A. Stuff that should have been included
    1. check_generic_wrapper
    2. check_multi_tcp
  B. Making life easier
    1. Alert subscriptions
    2. Pager acknowledgments
    3. Conditional qmail aliases
    4. Failsafe changes
  C. Capturing business logic
    1. check_cluster2
    2. nagvis

Summary:

I. Taming Nagios

Nagios is good stuff. It's simple, highly configurable, and rock solid. I've been working with it for a few years now, and I've noticed that people starting out tend to hit the same problems, and people who've been using it for a while share the same headaches. This article will provide some tips to help them both.

A. Stuff that should have been included

Although the plugins tarball has almost everything one could ask for, there are two "missing" plugins that newcomers invariably require.

1. Generic performance data wrapper

Nagios has hooks for exporting performance data returned by plugins to other applications. These hooks are most often used to provide statistics to graphing programs like rrdtool and mrtg. Unfortunately, the plugins themselves must support performance data output for the hooks to be useful, and most plugins do not include this support. This shell script solves that problem by wrapping around any other plugin, via a simple symlink, thus providing any plugin with the missing support for Nagios performance data commands.

2. Multi-port checker

check_tcp and check_udp allow Nagios to query the status of any single TCP or UDP port. Many services, however, span multiple ports. Since it feels crufty to have multiple Nagios services configured to watch a single logical entity, this shell script wraps around the existing check_(tcp|udp) plugin so that an arbitrary number of space-separated ports can be checked in a single service definition.

B. Making life easier

Admins who love Nagios love it despite some quirks. Many admins, for example, consider its complex config files par for the course.
Many more realize that the flexibility introduced by that complexity is a blessing in the face of end users who, as a result of actually getting what they want for once, tend toward a never-ending stream of increasingly complex requests. Most admins, however, are slow to realize that Nagios' flexibility makes it possible to offload some of that configuration headache to scripts, or even back onto those beloved end users. Here are a few examples.

1. Alert subscriptions

There are aspects of systems monitoring I like very much, but managing lists of people and contact groups is not one of them. The argument could be made, however, that contacting people when things break is the whole point of this endeavor. By integrating mailing list software with Nagios, we give users the ability to subscribe themselves to whatever combination of alerts they desire, thereby bringing you one step closer to never editing contacts.cfg again. Scripts for auto-detecting new services and bootstrapping the list creation are included.

2. Pager acknowledgments

Nagios allows alert recipients to 'acknowledge' problems, thereby stopping the recurring problem notifications while optionally providing a helpful comment to your fellow admins. Doing this, however, requires you to log in to the web page, find the broken service, open its detail screen, and fill out the acknowledge form. Wouldn't it be nice if you could ack the problem by simply replying to the page Nagios sent you? Now you can with this script.

3. Conditional qmail aliases

"I'd like notification A to be sent to my pager during the day and to my email at night, and notification B sent to my email, but only on weekends, and never on the 15th of the month, unless it's the second Wednesday in May." Sound familiar? With Nagios, complex configurations like this are possible, but it's going to take some text editing.
You'll need to add several custom time periods to timeperiods.cfg, and multiple instantiations of the same service with different time periods and contacts to services.cfg. Some contacts will also need to be defined. Perhaps what you need is an easier way to capture conditional logic. This section will show you how to use and, more importantly, stack special qmail aliases like '.qmail-nighttime-default' and '.qmail-secondwednsdayinmay-default' to get very weird behaviors while saving some keystrokes.

4. Failsafe changes

Nagios' clear-text config files make it easy to do things like use CVS to track configuration changes. I do this, and it's saved me so many times that I enthusiastically recommend it to every potential Nagios user I meet. Since all configuration changes require a service HUP, you'd be almost insane not to have some sort of quick rollback capability. But even if you don't use CVS, you can do some things to ensure that bugs in recently changed config files don't stop your production Nagios daemon. This script, for example, checks for errors in the configs before it politely restarts the Nagios daemon for you. If you use the alert subscriptions tip above, this script will also check for new hosts and services, and create mailing lists for them.

C. Capturing business logic

Chances are you deal with some type of management on a fairly regular basis (if you don't, are you hiring?). In my experience, management cares less about the Exchange-IS service on Mail4, and more about "email" in general. Red light or green light: is "email" working right now?

1. check_cluster2

To answer that question you need a service aggregation capability. You need to know the status of 27 different services on 5 different hosts, in a single instantiated entity. Although this is not a documented feature in Nagios, there's a plugin in contrib that will get you close to where you want to be. Here are its capabilities, and its limitations.

2. Nagvis

Simply put, nagvis allows you to easily "animate" Visio-type diagrams with real-time information from Nagios. It's like catnip for managers, but at the price of yet another config file. This will have a short description and some pretty pictures.
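
Returning to the aggregation question in C.1: the core idea of collapsing many service states into one red or green light can be sketched in a few lines. The Python below is an illustration only, not the plugin from contrib. The function names and the warn_at and crit_at thresholds are assumptions invented for this example; only the 0/1/2/3 return codes follow the standard Nagios plugin convention.

```python
from typing import Iterable

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def aggregate(states: Iterable[int], warn_at: int, crit_at: int) -> int:
    """Collapse many member states into one cluster state.

    warn_at and crit_at are hypothetical thresholds: the number of
    non-OK members at which the cluster turns WARNING or CRITICAL.
    """
    members = list(states)
    if not members:
        return UNKNOWN  # nothing to aggregate
    failed = sum(1 for s in members if s != OK)
    if failed >= crit_at:
        return CRITICAL
    if failed >= warn_at:
        return WARNING
    return OK


def status_line(label: str, states: Iterable[int], warn_at: int, crit_at: int) -> str:
    """Build a one-line, plugin-style summary for the aggregated entity."""
    members = list(states)
    state = aggregate(members, warn_at, crit_at)
    failed = sum(1 for s in members if s != OK)
    return f"{label} {STATE_NAMES[state]}: {failed} of {len(members)} members not OK"
```

For instance, status_line("email", [0] * 25 + [2, 1], warn_at=1, crit_at=5) describes 27 member services with two of them broken, which this sketch reports as a WARNING for "email" as a whole. A real plugin would also exit with the aggregated state as its exit code so that Nagios can act on it.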