A Counter-Rant

I was a little dismayed to find James Turnbull on the Nagios-bashing bandwagon. Honestly, what IS the world on about these days? Anyway, here’s an 8-months-late counter-rant for you, James (love you man).

But before I get to all that, I want to rant a little bit about #monitoringsucks in general, now that we have a few years of hindsight.

#monitoringsucks kind of… sucked.

To be sure, we had some good discussion and built some new tools, but as someone who has spent years implementing, gluing together, and working to improve monitoring systems and infrastructure, #monitoringsucks was almost immediately boring. At the outset I thought “ok infratalk, let’s get together and talk about furthering the state of the art in systems monitoring”, but from the start it was a movement determined to define itself by throwing the baby out with the bathwater. In proclaiming the uselessness of all that preceded it, and claiming for its own everything that came after, the movement didn’t just refuse to acknowledge the giants on whose shoulders it stood; it danced around on their shoulders with its pants down, openly mocking them like a middle-school brat and making a lot of very necessary conversation impossible.

So, while I freely admit that we’ve gotten some great tools out of this, I’ll go ahead and be the one to point out that #monitoringsucks has spent the better part of two years acting like an angsty teenage jerk who, despite whatever valid points and great ideas he had, completely alienated everyone around him with his snide, fart-in-your-general-direction snobbery, and I can’t stand to hang out with him anymore.

You want examples? They’re myriad, but for now here’s a pretty good example of what I’m talking about. A smart guy doing good work who (probably as a direct result of #monitoringsucks) can’t help but preface it with “psh Nagios. Iknowright?”, as if there’s a tacit, universal understanding that:

  1. Polling for CPU load is silly
  2. Nagios can only do silly things like poll CPU load
  3. Nagios magically sends you pages that you didn’t ask for.

He then goes on to describe his work which, when it eventually becomes useful for problem detection, will be made into a Nagios check about 4 minutes after it hits github.

I’m not sure what people think Nagios is anymore, but it isn’t a CPU-polling program. If you think polling and capturing CPU load is silly, then don’t do it – Nagios certainly doesn’t demand it of you. So while no one is arguing that we can’t do better, it is a mistake to assume we can’t do better with existing tools; the very suggestion that we can’t implies either shallow comprehension or an ulterior motive. Baron, you gave an awesome talk, bro; drop the non-sequitur Nagios bashing and let your work speak for itself.

#monitoringsucks gave us more tools, but not better ways to use them

In the same way the ZOMG PYTHON crowd gave us Shinken, Psymon, Graphite, and a gmond rewrite, and the ZOMG RUBY crowd begat Bluepill, GOD, Amon, Sensu, and so on, #monitoringsucks spewed out a bunch of new tools, some of which are great and most of which are reinventions of the wheel in the ZOMG language of the week. And ironically, along with the mind-boggling preponderance of new tools came this constant side-channel complaining about the Nagios + collectd + graphite + cacti + pnp4nagios antipattern. Are we to understand that Bluepill + Psymon + Ganglia + graphite is not only fundamentally different but objectively superior? Are we to throw away everything we already have and replace it with only Sensu? Are we to do that every time the next great monitoring system comes along?

The weird thing to me is that although NONE of us use a single monitoring tool anymore in real life (and for most of us this feels quite natural), we seem to be fixated on making some new UBERTOOL to replace all of the pieces we already have. So I think a really important thing the movement failed to provide is a good way to use together whatever combination of tools makes sense for us. Indeed, we couldn’t even have this conversation, because anyone who mentioned pre-existing tools was lolbuts laughed off the stage.

In spite of all that, I think that #monitoringsucks was healthy. It forced us to recognize that the infrastructure had changed and the monitoring hadn’t, and it focused our attention on exploring scale, and concurrency, and distributed systems, and better metrics and visualization, but it’s past time to stop bellyaching about the tools we don’t like and either make them better or adopt/build new ones. Everyone gets it. Monitoring sucks. Let’s move on.

OK James, let’s do this

With that said, I want to talk a little (a lot) about the points made in this Kartar rant, because they typify much of the Nagios bashing that’s gone on since #monitoringsucks was born. But before I get into it in earnest, I want the record to reflect that I think James Turnbull is awesome, I’ve followed his work for years and continue to, and this is in no way a personal attack or flame against James. I just want to have the conversation, because I think it’s time, and because it’ll be a healthy conversation for us to have.

1 Nagios doesn’t scale

James says:

It doesn’t scale. Despite being written in C and reasonably fast, a lot of
Nagios works in series, especially checks. With thousands of hosts and tens of
thousands of checks Nagios simply can’t cope.

First I need to nitpick the statement that checks work in serial. In point of fact, service checks are forked from Nagios core, and their results are injected into a queue where they are eventually collected by a reaper process. It’s arguable whether fork was a great decision from a scalability standpoint, but really I want to take exception to the more general sentiment that no thought was put into scale in the context of the Nagios event loop, because it’s an oft-repeated fallacy, and one that is almost always accompanied by some incorrect factoid about how Nagios works internally.

More generally, in recent years it’s become in vogue to sort of wave one’s hand in the general direction of Nagios and proclaim that it is poorly designed. This is actually a great litmus test for detecting someone who doesn’t know Nagios very well, because in fact it’s a well-engineered piece of software. Studying the 1.x source as a young sysadmin taught me most of what I know about function pointers and callbacks in C. If you want to learn how real C programs work, you could do a lot worse than studying the Nagios Core internals.

That said, I don’t think James is guilty of misunderstanding the internals; I suspect he meant that check results are serialized by the reaper, which is a valid point. There are better ways to do it, but Ethan didn’t exactly have libevent in 2001, and we can’t fault him for not inventing it. Andreas et al. have been busy at work on new and improved concurrency models for Nagios 4.

And that brings me to my second point (and I’m sure James knows this): A centralized, polling-based monitoring system is only going to scale so far no matter what concurrency hacks you employ. At some point, if you want to stay with a centralized polling strategy, you’re going to need to look at distributing the load, and Nagios is ahead of the curve here compared with its direct competition in my opinion. There are eleventybillion ways to run Nagios in real life, several of which involve the use of event broker modules that make service checks (and other internal operations) distributed. These include Merlin, mod gearman, and DNX.

It is entirely possible today to create a distributed Nagios infrastructure that scales to tens of thousands of hosts and hundreds of thousands of services. This is not hackish, bleeding-edge stuff, and there are documented real-world examples.

2 Configuration is Hard

James says:

It requires complex and verbose text-based configuration files. Despite
configuration management tools like Puppet and Chef the Nagios DSL is not
easily parseable. Additionally, the service requires a restart
to recognize added, changed or removed configuration. In a virtualized or cloud
world that could mean Nagios is being restarted tens or hundreds of times in a
day. It also means Nagios can’t readily auto-discover nodes or services you want
it to monitor.

I agree that Nagios configuration syntax doesn’t lend itself to being machine parsable, but I disagree that it’s incompatible with configuration management engines. This kind of thing is going on right now all over the place, and isn’t considered a big deal.

Further, there are Nagios configuration parsing libs in just about every language out there, so even should you decide to roll your own, it’s not like you need to write the parser.
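
For illustration, here’s a minimal sketch of what those parsers are doing under the hood. This is a hand-rolled toy, not any particular library, and it deliberately ignores templates, inheritance, and `cfg_dir` recursion:

```python
import re

# Toy parser for Nagios object definitions ("define <type> { key value ... }").
# Real parsing libraries also resolve templates ("use"), inheritance,
# and included config files; this sketch does none of that.
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{([^}]*)\}")

def parse_objects(text):
    """Return a list of (object_type, attributes) tuples."""
    objects = []
    for obj_type, body in DEFINE_RE.findall(text):
        attrs = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop inline comments
            if not line:
                continue
            key, _, value = line.partition(" ")
            attrs[key] = value.strip()
        objects.append((obj_type, attrs))
    return objects

sample = """
define service {
    use                 generic-service
    host_name           web01
    service_description HTTP
    check_command       check_http
}
"""
services = parse_objects(sample)
```

Twenty-odd lines covers the easy 80% of the format; the hairy parts (template resolution, `!`-separated command arguments) are exactly why you’d grab an existing lib instead of rolling your own.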

3 Binary views?

James says:

It has a very binary view of the world. This means it's not a useful tool for
decision support. Whilst it supports thresholds it really can only see a
resource as in a good state or in a bad state and it usually lacks any context
around that state. A commonly cited example is disk usage. Nagios can trigger
on a percentage threshold, for example the disk is 90% full. But it doesn’t have
any context: 90% full on a 5Gb disk might be very different from 90% full on
1Tb drive. It also doesn’t tell you the most critical piece of information you
need to make decisions: how fast is the disk growing. This lack of context and
no conception of time series data or trending means you have to investigate
every alert rather than being able to make a decision based on the data you
have. This creates inefficiency and cost.

Nagios is not the sum of its plug-ins tarball, and thresholds are very much a thing built into plug-ins, not Nagios Core. If you have something smarter than a percentage threshold detecting disk trouble, Nagios will happily execute it for you and report the result. So, by all means, write a plug-in that uses a chi-squared Bayesian computation, or Holt-Winters forecasting instead of a threshold. Nagios really doesn’t care (and it should be mentioned that the built-in disk plug-in can use thresholds other than percent). We have oodles of clever tests that Nagios runs for us at my day job, and it can hardly be argued that the cucumber plug-in or webinject have “binary” views of the world.
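
The plug-in contract is nothing more than an exit code and a line of output, which is why “smarter than a threshold” is entirely up to you. Here’s a sketch, with made-up numbers, of a check that alerts on projected time-to-full instead of percent used:

```python
# The actual Nagios plug-in contract: exit 0/1/2/3 for
# OK/WARNING/CRITICAL/UNKNOWN, plus one line of output on stdout.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def evaluate(hours_until_full):
    """Alert on projected time-to-full rather than a raw percentage.
    The 24h/72h thresholds here are illustrative, not recommendations."""
    if hours_until_full is None:
        return UNKNOWN, "UNKNOWN - no growth data available"
    if hours_until_full < 24:
        return CRITICAL, f"CRITICAL - disk full in ~{hours_until_full:.0f}h"
    if hours_until_full < 72:
        return WARNING, f"WARNING - disk full in ~{hours_until_full:.0f}h"
    return OK, f"OK - ~{hours_until_full:.0f}h of headroom"

code, message = evaluate(hours_until_full=120.0)  # stand-in value
print(message)  # Nagios reads this line of stdout...
# ...and a real plug-in would now hand back the status with sys.exit(code)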

If I wanted to throw notifications on rate of disk growth I’d run Nagios checks against that metric in either Ganglia or Graphite. Again, nobody uses one monitoring tool anymore and I don’t think that’s a bad thing. I do not expect Nagios to be a metrics collection engine or decision support system any more than I wrestle knife-wielding men. Nor do I rant about the extensibility of my jiujitsu on bullshido.com, because, although I love my jiujitsu, it is the wrong tool for that job.
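
As a sketch of that glue: the check below pulls a metric out of Graphite’s render API and computes the growth rate itself. The metric path is hypothetical (it’s whatever your collector writes), and the rate math is a naive first-to-last delta rather than anything clever:

```python
import json
from urllib.request import urlopen

def growth_rate(datapoints):
    """Growth per second from Graphite [value, timestamp] pairs,
    using a naive first-to-last delta; None samples are skipped."""
    points = [(v, t) for v, t in datapoints if v is not None]
    if len(points) < 2:
        return None
    (v0, t0), (v1, t1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)

def check_disk_growth(graphite_url, metric="servers.web01.disk.root.used"):
    # Graphite's render API returns JSON when asked; the default metric
    # path above is a made-up example, not anything standard.
    url = f"{graphite_url}/render?target={metric}&from=-1h&format=json"
    series = json.load(urlopen(url))
    return growth_rate(series[0]["datapoints"])
```

Wrap the returned rate in the usual exit-code logic and you have a trending-aware Nagios check without asking Nagios itself to store a single datapoint.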

4 It’s not very stateful

James says:

It is not very stateful. Unless you add additional components Nagios only
retains recent state or maintains state in esoterically formatted files.
Adding an event broker to Nagios, which is the recommended way to make it
more stateful, requires considerable configuration and still does not ensure
the data is readily accessible or usable.

Getting data out of Nagios is an age-old dilemma that is really only solved today via event broker modules. If you refuse to use NEB modules, then yes, this problem is not solved for you. The thing is, nobody in the Nagios community can agree on what the ideal solution looks like. REST interfaces, MySQL databases, and even DSLs to interrogate the current state of the Nagios process in RAM have all been imagined and built as NEB modules. These solutions are robust, production-ready, and widely used today. So while I might have agreed that this was a problem 3 or 4 years ago, given the preponderance of right answers today it seems silly to enforce something in core that only a subset of the users are going to be happy with. If you want that kind of enforcement, Nagios XI uses MySQL and Postgres. Personally, I prefer MK Livestatus and abhor the notion of a MySQL database, but most people disagree with me on that, and that’s perfectly fine.

As an OPS in real life, I consider the NEB modules that I use part of my Nagios installation, and I don’t think about them all that much. They deploy as Nagios deploys. I don’t think they’re particularly difficult to configure, and I’m thankful that they’re there.

5 It isn’t easily extensible

James says:

It isn’t easily extensible. Nagios has a series of fixed interface points and it
lacks a full API. It’s also written in C, which isn’t approachable for a lot of
SysAdmins who are its principal users. It also lacks a strong community
contributing to its development.

I’m not sure what we’re comparing it to, but Nagios can be made into an endlessly scalable distributed monitoring infrastructure. It can take input from transient entities in 7 different ways, remotely execute checks on every operating system in existence, and runs in the cloud, against Cygwin, and on dd-wrt and Linux wristwatches. I met an unfortunate Windows sysadmin at NagiosCon last year who was literally only allowed to run SNMP on his corporate network and had constructed a several-thousand-host, trap-only passive monitoring architecture on Nagios. It lets you define what it means to check a thing, what it means to notify when that thing breaks, and what it means to escalate that notification. It literally lets you make up your own configuration parameters, and has a hook everywhere it has a definition. The Broker API lets you inspect, interrupt, and override every action the event loop takes internally. I struggle with what extensible means if Nagios isn’t it, and would certainly like to get my hands on a “full” API, if the NEB API isn’t one.

The Nagios community is certainly non-standard, but I don’t think it’s fair to say it isn’t strong. Only a handful of guys have commit access to core, so most of the contributions are in the form of plug-ins and add-ons. The Icinga fork happened specifically because of the frustration surrounding the fact that Nagios doesn’t really “get” open source development, but forking Nagios to a more open development model hasn’t made Icinga more ‘easily extensible’, and although Nagios has been rewritten in several languages, none of those ports are what I’d call more easily extended.

IMO, this is because Nagios is about as extensible as a centralized poller can be. There is an underlying design here, a real and physical limit, and there just isn’t much more we can demand from this design (which makes complaining about it unproductive). If Nagios doesn’t extend to what you want, it isn’t what you want.

6 It is not modular

James says:

It is not modular. The core product contains monitoring, alerting, event
scheduling and event processing. It’s an all or nothing proposition.

Agreed. Nagios is a special-purpose scheduling and notification engine. It does its thing and that’s it. To believe otherwise is to confuse a deficiency with a design goal.

Summary stuff

I won’t quote James here, but he goes on to bash add-ons, and in general question the long-term viability of centralized pollers like Nagios before hoping for a paradigm shift that will deliver the next big thing in monitoring.

At this point in my career the word “monitoring” is so laden with connotation and nuance that it is effectively meaningless. It is, however, difficult for me to imagine a world where centralized polling as a monitoring strategy is wholesale replaced with an uber-technique that optimally meets literally everyone’s needs. This is not just an engineering observation but a business-needs one. Centralized polling makes more sense than the alternatives in a lot of situations, and as long as the technique has a place alongside the myriad other techniques we employ to monitor things, Nagios will be around, because it’s a good, free, centralized poller with a massive support community, a commercial version, and an annual conference of its own.

More than that: with the litany of monitoring systems out there that can execute Nagios checks out of the box, Nagios, like it or not, has become a specification language for prototyping systems monitoring solutions. Once you understand Nagios and the various ways it’s been extended, you pretty much understand the problem domain, by which I mean you know what humanity knows about how to centrally monitor computational entities. You also have a good mental model of the data structures required in the field, and how much and what kind of metric and availability data need to be transmitted, parsed, and stored. So even if you don’t use Nagios in your environment, learning how it works makes you a better OPS – one who is adept at designing and communicating monitoring solutions to other engineers in an implementation-agnostic way.

But Nagios, as far as I know, has neither claimed nor aspired to be the final, ultimate solution to the monitoring problem, so please stop flogging it. I remember a time not too long ago when we could talk about new and exciting ideas in this field without having to slander the ideas from which they were derived, and I welcome a return to that time.