Monitoring, post3, tools

Alright, it's time to write this. I had intended to write this much, much sooner, but then the "Group By" realization hit me, and I got stick for a while trying to prove out some features in the system I'm actually using.

Our goal is a system that gathers whitebox and blackbox data from applications, and whitebox data from machines. Lets us easily query, view, graph, and alert on that data with the feature set from my previous post: .

There are also some soft requirements in practice for really using a DB:
  • Can run at some scale
  • We either need someone to host it, or we need to run it ourselves
  • Tools tend to only go so far, and it's nice not to have to throw it all away if you hit the edge. This makes free software (FSF definition) nice
  • It's nice to use other pre-existing tools with it without much modification

Nagios et al.

Nagios is the industry standard solution. Nagios is what many many systems are trying to emulate, and as a result it's flaws are endemic to the world of monitoring systems.

First, let me say that I have not used Nagios, but I have used some of it's competitors
The main problem is that fundamentally these systems mostly are not meant to do the type of things I discussed in my previous post. It's meant mostly to alert on directly monitored values (e.g. is memory usage too high?) They then give you some small finite set of "aggregations" so you can see, for example, how much memory *all* your machines are using together. This works fine until it doesn't. Values like that tend to be scale dependant and take constant tweaking.

A popular meme right now is "automated thresholding", that's when your system figures out what "normal" is, and alerts when a value is some statistically significant distance outside that "normal". This is an attempt to solve the problem of scale independence. But it has 2 problems. The first is that sometimes your "normal" drifts, and that's indicative of e.g. your users slowly outgrowing how far your system can scale. After a year of slow drift "normal" could be 5% of queries being dropped... that's not okay.

Note that I said 5% of queries being dropped. Ratios are a much simpler solution to this problem. In practice this does not entirely solve scale independence. If you don't already know why, Google for the "law of large numbers". That said, while it's not perfect, it gets you 90% of the way there, and I would argue that the fallability and complexity of AI-type approaches are not worth the tiny bit of scale independence it buys you.

On top of this it's designed to be "easy to use", which means everything is based on clicky-button interfaces, no backup or versioning for your configs, no idempotent setup, etc. It's a nightmare from the perspective of a reliability engineer as soon as you view that system as something you have to manage and not just use, despite that you're paying someone else a ton of money to run it for you.
Basically, don't believe a monitoring tool will ever solve a problem for you, or allow you to think less about your system. I'm just going to leave it there, and let you generalize out to all of the other systems in this same family.

Collection -> DB -> alerting solutions

Okay, so if you get frustrated with the tools in the above model and start poking around what you quickly find is that folks in the know are using systems with 3 seperate components. Data collection, a timeseries database, and an alerting system that queries the DB.

This has some nice advantages. You can use lots of different collection engines for collecting different types of data (e.g. statsd for application level, various services for whitebox probing, etc.) Yet all the data ends up in the same DB, making building UIs and integrating data across those collection engines a breeze.

There is one notable downside, which is that your alerting is fundamentally polling based. It has to do expensive queries against the DB every N seconds so it can alert you if something goes bad. If you're doing polling you can usually assume you're doing something wrong. The *right* answer would let alerting trigger at the collection level. BUT we still want to incorporate old data in some cases, so it also needs to go back to the DB to get data, or keep a local store, or something.

There is one system out there that strives to do this a better way called Prometheus. Unfortunately, it's not ready for production yet, but keep an eye on it.


This part of the equation is boring, honestly. We need something to gather stats from machines and applications and feed those stats in to the database. There are a hundred ways to do this that are fine. As long as it doesn't block the application, we can get application and machine level data, and the data gets to it's destination, it'll work.


Before we dive too deep we need to pair down the field. So, looking at my earlier requirements list it's pretty clear that we need a rich query interface with some understanding of timeseries. Based on that I'm tossing out options like Postgres, MySQL, etc.

There are also time-series plugins, modes, etc. for some databases, but all of the databases like this that I found store time-series in the wrong way. To do the calculations I've discussed earlier we need to be able to query for a timeseries, and then compute on a subset of that timeseries. The "plugin" approach tends to store a *set* of timeseries as something you can query for, which really doesn't help us much.

Here's a nice list of open source timeseries databases.

Here's the ones I've looked at
Now, I'm certain that I'm going to get something wrong in this post. There's simply too many details. My goal is to let others share some of the realizations and research I've had and save a little time. So, please bear with me, and if you find mistakes drop a comment and I'll try and fix it.

Looks like a promising system, but everyone says it's a total bear to actually run. It's based on zookeeper, and it uses a mysql instance for it's metadata. That's already some interesting requirements, but not terrible. Then you start looking at it's pieces, it has a controller node, a broker node, a historical node. A minimal system is just very complex... too much for me.

Seems to be the established "common-sense" answer among the options. It's open source. It has a relatively rich query language. As of version 0.10.0 it has "map" and "reduce" which give a generalized Group By semantic like I talked about in previous posts. It's not terrible to run yourself, though it does have some scaling limitations. There are hosted options where they've already worked out the scaling issues for the most part - sadly hostedgraphite doesn't support 0.10.0 yet, but they are working on it (I've been talking with them :D).

The biggest win of graphite though is that it's just got tons of community support. Everything integrates with it, all the collection tools, and many of the front-end alerting tools out there.

InfluxDB is the shiny new guy in town. I was really excited to read through their website. It's open source, the developers are also doing a hosted option. The language is very rich, and unlike graphite features like "group by" were built in from the beginning, so it's supported properly. Unfortunately, it's rather new, so still a bit too new for my blood. If you want the new hotness though, give it a try.

From what I understand this is basically a re-write of OpenTSDB. Like OpenTSDB it runs on hbase (translation: it's impossible to administer unless you're already a hadoop guru). Unlike OpenTSDB it can also run Cassandra Cassandra is the open-source bigtable written in Java, except that unlike bigtable it's also a storage-stack. The problem is, it's not that reliable. They still haven't hammered out a lot of bugs. Is it usable? Yes. Do you want to deal with it if you don't have to? No. It also has nice query semantics supporting most of the features you might want.
This might be a decent option if you can find a good hosting provider. Non-hosted though, it's probably a no-go for a small organization. For a big organization it might be just the ticket.

See Kairos. OpenTSDB is a true standard, it's all open source, all that great stuff. But, like kairos it's nasty to actually run it. Unlike kairos it only runs on hbase, that is, the hadoop stack. If you already run a hadoop stack that's all well and good, no big deal. If you don't it's a heck of an understaking just for your monitoring database. That said, I understand that it scales quite well. Again it has support for rich queries and all the shinies.
From IBM, it's closed source. It's been around a pretty long time, so it's really optimized for a different set of use-cases from what I gather. From my perspective I'm putting a lot of time and energy into a system, and closed-source scares me because I can't jump ship if I want to. Systems like graphite have standard interfaces that are supported by lots of systems. Informix is like the exact inverse of this, being 100% proprietary and no-one wants to go within a 100 yards of something owned by IBM (for good reason).

Of these, I ended up picking graphite. I state that here to explain the next section


This part is harder. After scouring the fields for systems that alert based on data in graphite. They are all seriously flawed.

The biggest flaw is that they all use configuration backed by databases. This means that my data about what to alert on, which is fundamentally configuration, is instead live-data in production. If I lose that database in production I'm going to be stuck rebuilding all of my monitoring from scratch. If I make a mistake and bump the wrong button, or someone changes something and we decide it was wrong, we have no rollback, no tracking. Any features related to versioning have to be built in. Also, I can't do code-reviews and such on changes, review them branch them, and do everything else source-control lets me do.

For me, this shot down every single system I could find. My conclusion? Build one myself.