Monitoring, post3, tools

Alright, it's time to write this. I had intended to write this much, much sooner, but then the "Group By" realization hit me, and I got stick for a while trying to prove out some features in the system I'm actually using.

Our goal is a system that gathers whitebox and blackbox data from applications, and whitebox data from machines. Lets us easily query, view, graph, and alert on that data with the feature set from my previous post: .

There are also some soft requirements in practice for really using a DB:
  • Can run at some scale
  • We either need someone to host it, or we need to run it ourselves
  • Tools tend to only go so far, and it's nice not to have to throw it all away if you hit the edge. This makes free software (FSF definition) nice
  • It's nice to use other pre-existing tools with it without much modification

Nagios et al.

Nagios is the industry standard solution. Nagios is what many many systems are trying to emulate, and as a result it's flaws are endemic to the world of monitoring systems.

First, let me say that I have not used Nagios, but I have used some of it's competitors
The main problem is that fundamentally these systems mostly are not meant to do the type of things I discussed in my previous post. It's meant mostly to alert on directly monitored values (e.g. is memory usage too high?) They then give you some small finite set of "aggregations" so you can see, for example, how much memory *all* your machines are using together. This works fine until it doesn't. Values like that tend to be scale dependant and take constant tweaking.

A popular meme right now is "automated thresholding", that's when your system figures out what "normal" is, and alerts when a value is some statistically significant distance outside that "normal". This is an attempt to solve the problem of scale independence. But it has 2 problems. The first is that sometimes your "normal" drifts, and that's indicative of e.g. your users slowly outgrowing how far your system can scale. After a year of slow drift "normal" could be 5% of queries being dropped... that's not okay.

Note that I said 5% of queries being dropped. Ratios are a much simpler solution to this problem. In practice this does not entirely solve scale independence. If you don't already know why, Google for the "law of large numbers". That said, while it's not perfect, it gets you 90% of the way there, and I would argue that the fallability and complexity of AI-type approaches are not worth the tiny bit of scale independence it buys you.

On top of this it's designed to be "easy to use", which means everything is based on clicky-button interfaces, no backup or versioning for your configs, no idempotent setup, etc. It's a nightmare from the perspective of a reliability engineer as soon as you view that system as something you have to manage and not just use, despite that you're paying someone else a ton of money to run it for you.
Basically, don't believe a monitoring tool will ever solve a problem for you, or allow you to think less about your system. I'm just going to leave it there, and let you generalize out to all of the other systems in this same family.

Collection -> DB -> alerting solutions

Okay, so if you get frustrated with the tools in the above model and start poking around what you quickly find is that folks in the know are using systems with 3 seperate components. Data collection, a timeseries database, and an alerting system that queries the DB.

This has some nice advantages. You can use lots of different collection engines for collecting different types of data (e.g. statsd for application level, various services for whitebox probing, etc.) Yet all the data ends up in the same DB, making building UIs and integrating data across those collection engines a breeze.

There is one notable downside, which is that your alerting is fundamentally polling based. It has to do expensive queries against the DB every N seconds so it can alert you if something goes bad. If you're doing polling you can usually assume you're doing something wrong. The *right* answer would let alerting trigger at the collection level. BUT we still want to incorporate old data in some cases, so it also needs to go back to the DB to get data, or keep a local store, or something.

There is one system out there that strives to do this a better way called Prometheus. Unfortunately, it's not ready for production yet, but keep an eye on it.


This part of the equation is boring, honestly. We need something to gather stats from machines and applications and feed those stats in to the database. There are a hundred ways to do this that are fine. As long as it doesn't block the application, we can get application and machine level data, and the data gets to it's destination, it'll work.


Before we dive too deep we need to pair down the field. So, looking at my earlier requirements list it's pretty clear that we need a rich query interface with some understanding of timeseries. Based on that I'm tossing out options like Postgres, MySQL, etc.

There are also time-series plugins, modes, etc. for some databases, but all of the databases like this that I found store time-series in the wrong way. To do the calculations I've discussed earlier we need to be able to query for a timeseries, and then compute on a subset of that timeseries. The "plugin" approach tends to store a *set* of timeseries as something you can query for, which really doesn't help us much.

Here's a nice list of open source timeseries databases.

Here's the ones I've looked at
Now, I'm certain that I'm going to get something wrong in this post. There's simply too many details. My goal is to let others share some of the realizations and research I've had and save a little time. So, please bear with me, and if you find mistakes drop a comment and I'll try and fix it.

Looks like a promising system, but everyone says it's a total bear to actually run. It's based on zookeeper, and it uses a mysql instance for it's metadata. That's already some interesting requirements, but not terrible. Then you start looking at it's pieces, it has a controller node, a broker node, a historical node. A minimal system is just very complex... too much for me.

Seems to be the established "common-sense" answer among the options. It's open source. It has a relatively rich query language. As of version 0.10.0 it has "map" and "reduce" which give a generalized Group By semantic like I talked about in previous posts. It's not terrible to run yourself, though it does have some scaling limitations. There are hosted options where they've already worked out the scaling issues for the most part - sadly hostedgraphite doesn't support 0.10.0 yet, but they are working on it (I've been talking with them :D).

The biggest win of graphite though is that it's just got tons of community support. Everything integrates with it, all the collection tools, and many of the front-end alerting tools out there.

InfluxDB is the shiny new guy in town. I was really excited to read through their website. It's open source, the developers are also doing a hosted option. The language is very rich, and unlike graphite features like "group by" were built in from the beginning, so it's supported properly. Unfortunately, it's rather new, so still a bit too new for my blood. If you want the new hotness though, give it a try.

From what I understand this is basically a re-write of OpenTSDB. Like OpenTSDB it runs on hbase (translation: it's impossible to administer unless you're already a hadoop guru). Unlike OpenTSDB it can also run Cassandra Cassandra is the open-source bigtable written in Java, except that unlike bigtable it's also a storage-stack. The problem is, it's not that reliable. They still haven't hammered out a lot of bugs. Is it usable? Yes. Do you want to deal with it if you don't have to? No. It also has nice query semantics supporting most of the features you might want.
This might be a decent option if you can find a good hosting provider. Non-hosted though, it's probably a no-go for a small organization. For a big organization it might be just the ticket.

See Kairos. OpenTSDB is a true standard, it's all open source, all that great stuff. But, like kairos it's nasty to actually run it. Unlike kairos it only runs on hbase, that is, the hadoop stack. If you already run a hadoop stack that's all well and good, no big deal. If you don't it's a heck of an understaking just for your monitoring database. That said, I understand that it scales quite well. Again it has support for rich queries and all the shinies.
From IBM, it's closed source. It's been around a pretty long time, so it's really optimized for a different set of use-cases from what I gather. From my perspective I'm putting a lot of time and energy into a system, and closed-source scares me because I can't jump ship if I want to. Systems like graphite have standard interfaces that are supported by lots of systems. Informix is like the exact inverse of this, being 100% proprietary and no-one wants to go within a 100 yards of something owned by IBM (for good reason).

Of these, I ended up picking graphite. I state that here to explain the next section


This part is harder. After scouring the fields for systems that alert based on data in graphite. They are all seriously flawed.

The biggest flaw is that they all use configuration backed by databases. This means that my data about what to alert on, which is fundamentally configuration, is instead live-data in production. If I lose that database in production I'm going to be stuck rebuilding all of my monitoring from scratch. If I make a mistake and bump the wrong button, or someone changes something and we decide it was wrong, we have no rollback, no tracking. Any features related to versioning have to be built in. Also, I can't do code-reviews and such on changes, review them branch them, and do everything else source-control lets me do.

For me, this shot down every single system I could find. My conclusion? Build one myself.


Monitoring, post2, what we want

In my last post I talked about what we're trying to accomplish with monitoring http://www.blog.computersarehard.net/2014/06/monitoring.html
I think it's hard to see what we need in a tool without some hard examples of what we want to compute. So, here are some examples of the types of things we want to compute. The discussion in that previous post should be sufficient to motivate why each of these metrics would be interesting and useful.
  • Percent errors returned to users, overall and broken down by error type, across the entire service and per server
  • 99'th 95'th and 50'th percentiles, of total system latency, per component latencies, and per server latency.
  • Average and 95'th percentile CPU/disk/memory utilization, over the last quarter.

Rates and sums

Lets start with percent errors returned to users:

This is basically error_rate / responses_rate. What we have though are discrete query response events that we're counting. Chances are we can't afford to send or record metrics for every response, so instead they are going to get bundled somehow on a periodic basis, another discrete event. This means we don't really have a continuous function that we can simply take a differential of, instead we need to take a period of time that includes several samples and compute a differential over that - so we have a few points to work with.

This means our rate isn't just a rate it's a rate with a period parameter. In general you want to make as many decisions as possible in your monitoring system, rather than your application, so you can easily change them without re-releasing your software. So we really want to set this period parameter in our monitoring.

This means that we don't want to export a rate, instead we want to export a constantly incrementing counter. We can then compute a differential over an arbitrary period post-facto in the monitoring system. Thus getting a 1 minute, 5 minute or 20 minute rate as we prefer. This period acts like a low-pass filter, the larger it is the more it "smoothes" the jumps in your rates. For the sake of example lets say we have a pretty large system with high query rates, and we want fairly low resolution and high sensitivity to quick changes, so lets go with a 5 minute rate. So now we have:

percent_errors = rate(errors, 10m) / rate(responses, 10m)
Now we want to compute this over our entire service, which looks like:

rate(sum_over_servers(errors), 10m) / rate(sum_over_servers(responses), 10m).

Dimensionality of data

We also want  

percent_errors = rate(errors, 10m) / rate(responses, 10m)

for each server as well. That way once we see that the error rate shot up, we can tell if it's a particular machine causing our problems.

And we want

percent_errors = rate(specific_error, 10m) / rate(responses, 10m)

So we can break down what problem is being passed back to the user.

So, there's several interesting things going on here. We basically have 2 dimensions. We have servers and error types. We *could* write out every one of these equations across both dimenisions, but that would be a LOT of equations, one for every error type, and one for every server... and wait! we probably want one for every error type and server combination! Even worse, if we add or remove a server our rules change. This doesn't sound at all like how a properly lazy software engineer approaches a problem.

Instead of describing every calculation we do, we want to describe each category of calculation. To compute the error rates for each I basically want to say "do this calculation over every error type". In haskell terms this is something like a list monad. If you're used to matlab it's like operating on matrices. I'm going to describe a bit of a formalism here, not because it's the only one that would work, but to try and clarify the problem. To accommodate this new "parallel computations" model we can think of every timeseries as being described by an unordered list of labels. That is a dictionary, struct, or record, depending on your favorite terminology. So for example 404 errors on server 10 might look like this:

{response_type: error_http404, server: myhost10, property: response}

Using this model we can now drop a key... say

{server: myhost10, property: response}

To request everything that matches the two keys we do supply. This is like an array of timeseries, a 1 dimensional matrix. Thus this gets us all of the response_types for myhost10. If we drop 2 keys we'll get a 2 dimensional matrix, etc. Great, so now we can do something like

percent_errors = rate({server: myhost10, property: response}, 10m) / rate(sum({server: myhost10, property: response)}, 10m)

But we still have to write this for every server. If we try and write:

rate({property: response}, 10m) / rate(sum({property: response)}, 10m))

It all falls apart. We end up summing over all servers and all errors for our divisor, while our quotiant is calculated per server. Dividing these makes no sense at all! To solve this, we need to tell "sum" what it should sum over. As it turns out, our result is now going to be arrays on both sides. So, we also probably need to give "/" some clue how to match up those two arrays, so it knows what to do if they aren't identically sized or something.

Group By

I've actually never used SQL, but it seems this is by far the best terminology out there for I'm describing here. I finalized realized the connection a couple of days ago while talking with Jess, my girlfriend, about the problem I was trying to find a monitoring system to solve. It turns out the solution to the problem describe above is something SQL folks call a "Group By" clause. The idea is to say "retain this set of keys", while you collapse over all others. So for example:

sum({property: response}) Group By server

Would calculate the sum of all response_types, but wouldn't sum across servers and would thus return a 1 dimensional matrix, an array, or results, one for each server.

Group by isn't usually used in this context, but we might as well use it in our pretend formalism since we already have it as an operator. For binary operators lets just say that it uses the key you define as the variable it matches on the two sides. So to fix our calculation above we get:

rate({property: response}, 10m) / Group By server rate(sum Group By server ({property: response)}, 10m))

Obviously this is a bit messy with both infix and prefix operators, and our Group By clause as an addendum to each, but I didn't want to change from the initial syntax to much, and wanted to leverage people's understanding of the SQL concept...

So, what have we found so far?

We've noticed so far a few things that we really need for our monitoring system
  • We need multiple levels of highely flexible aggregations... in other words, we need full mathematical formula support
  • We need to compute on sets of data all at the same time, so we don't write hundreds of rules that change when we turn up new servers
  • We need this "group by" concept, somehow, built into our language again so we don't have to write rules over and over again
I'd like to add one more note which is that our dictionary syntax is cute, but misses one a point. What if we wrote down these tags in an ordered list:


Then, we could use regular expressions to parse out our tags. A query for all response types on server 10 would look like this:


Note that this syntax is actually *more* general than our previous syntax, since we can also match on just parts of labels, so for example we could do this:


Now we're selecting only for response_types with an error code. Before we would've had to change the export of our variables to get this data into a separate dimension, now we can get new dimensions on the fly whenever we want them. This isn't great of course because the syntax is obnoxious for general use (and probably for the implementation as well), but it's definitely a useful property of a system to be able to pick out pieces of a name add new dimensions on the fly like this.

histograms and percentiles

Percentiles and percentiles.

Most systems give you the ability to compute a percentile over a set of variables, so for example:

percentile({property:response, response_type:error_http404})
That's great and all, but it's not usually what you want. This is a percentile of variables, but frequently you want a percentile of *events*. That is something like a latency percentile. You can't represent every query as a variable or everything goes heywire and your index space explodes far too large to store. In fact, frequently you can't even afford to write data down for every query. Instead you're probably going to bucket your query latencies into a histogram. This only gives you an approximation, but it can be a good one if you're careful about your histogram selection. E.g. even sized buckets are probably not what you want, since latency is theoretically unbounded upwards. Instead you probably want exponential bucket sizes, so your resolution is relative to magnitude.

Bucketing into a histogram gives you another advantage. Given a percentile latency for each machine, you can't aggregate these into a percentile latency for all the machines combined. Percentiles just don't work that way. To be able to do this calculation you need a lot more information about the distribution. With histograms you can sum your histograms across all your servers, then compute an approximate percentile for all your servers. Simple if histograms are reasonable to work with in your system.

There's one more trick about percentiles. This doesn't relate directly to monitoring tools, but I would be remiss not to mention it while on the topic. For a two stage pipeline ->A->B->, you cannot use a histogram of the latency of A and a histogram of the latency of B to compute the histogram of the latency of A->B. The reason is that the latencies of these systems aren't guaranteed to be uncorrelated. In fact, since latencies frequently depend on the exact query, they are far more likely to be correlated than not to be. This fact itself is the type of thing your trying to see with your monitoring. To properly measure the latency of this system, and be able to break it down and look at each piece, you need to measure the latency of A, the latency of B, and the latench of A->B seperately, export *each* as a histogram, and compute your percentiles across each. It sucks, but it's mathematical reality.

Calculations on history

If I want to know how many resources my system is using there are a lot of things I want to look at. But I'm probably interested in our peak (or some percentile thereof), our average, and how much it varies. Maybe if I spend some engineering effort I could flatten out my usage. Is it worth it? How much money would that save in resources (or rather, how much growth-space would that give us, without having to buy more).

Another great example of this is computing quarterly SLO rollups. An SLO is frequently measured in "9's" of uptime. Well, systems aren't really either up or down. They can be in-between. In fact this is what our metric we were discussing earlier measures, the error rate of our system when it is serving users. Given this we probably have an SLO that looks like "5 9's 4 9's of the time". Meaning that we should have 99.999% availability 99.99% of the time.

At the end of a quarter we want to see how well we've been doing. So we're going to take our original key-metric, threshold it at 99.999, and then ask how often it was above or below that threshold.

There are almost certainly other ways to do this, but by far the most obvious is to compute on historical data. We want to be able to graph history, but we also want to be able to look at history numerically to help us pull out trends, and examine the past to help predict the future. There's a whole chunk of monitoring that fundamentally is all about modeling, and modeling is all about looking at the past.

Alerts come from same system as graphs

This is just common sense. When I get paged, I want to go look at the data that paged me. I want to see a history of it and look for the event that tripped the alert. Remember that by the time I'm looking the event is likely already over, so history is all I've got. It may happen again, but I want to fix it *before* that happens, after all, that's my job.

Data that is kindof similar to the data that paged me doesn't cut it. I want exactly the data that paged me, so I can be absolutely certain of what's going on.

When debugging a system you often have hunches, and one of the major purposes of monitoring is to give hard data to either confirm or deny those hunches. Monitoring systems are complex and often need debugging in their own right. Keep it simple and easy to examine.

Configs are stored in config files

I would've thought this was obvious, but looking at the extant monitoring systems, it apparently isn't. This is a general principle, but I'm going to bring it up here anyway.

When trying to build stable systems, the simplest solution that does the job is the best. Config files are simple. If something goes wrong in production and data gets lost, I've got the config file right here in a versioning system. If I notice wonky behavior and want to look for recent changes, again I have a versioning system. If I want to generate a config, generating a flat-file is easy and automatically idempotent.

Could you get these properties from a database? Sure you could, but now things are complex. Your synchronization could go wrong, and then your monitoring is not doing what you think it is.

Realtime alert and graph data latency

We need soft realtime constraints around say... two minutes at worst, for data to make it to alerting. Data making it to graphs can be delayed by another couple of minutes without causing issues, I'd probably put that maximum acceptable delay for hitting graphs at around 5 minutes. Smaller would clearly be nicer.


Lets review the properties discussed we've said we need:
  • Multiple levels of highly flexible aggregations... in other words, we need full mathematical formula support
  • Computation on sets of data all at the same time, so we don't write hundreds of rules that change when we turn up new server. This also requires the GroupBy concept, or an analog.
  • Usable percentile and histogram support (or ability to write it)
  • Alerts coming from the same system as graphs
  • Configs stored in config files
  • Realtime alert and graph data latency
I thought I would get to the tools this time... but describing what we're looking for took an entire post. So in the 3'rd post in this series I'll cover tools.

I'm still trying to understand some of the computational models (specifically graphite's) sufficiently to write that post, but hopefully I'll lock it down relatively soon.



I have some datastructures exerpiments I really want to finish, but I've been busy starting a new job (and moving) and haven't a had a chance.

So, in the meantime, lets talk about distributed system or cluster system monitoring.

The goal of "monitoring" is:

  • 1) To have insight into the health of the system.
  • 2) To have insight into what's impacting the health of the system.
  • 3) To have insight into how health and usage numbers compare to historical values.
  • 4) To receive notification when your system is, or is about to be, unhealthy.

So, how does one accomplish this? I have found through operating services myself, and watching other folks to do, that you basically always want the same things. Front-end, back-end, large or small. We have two tools at our disposal to acheive these goals. Whitebox and blackbox monitoring. Whitebox monitoring is when we get the data directly from the system, we're treating the system as an open box that we understand, and asking it to tell us what it thinks is going on. Logs are a good example of whitebox monitoring (though not one I like to use much). Whitebox monitoring is in contrast to blackbox monitoring, where you have no idea about the internals of the system. Instead you have a system completely external to your systems "probing" that system. Often this means it's acting like a user or client, pinging your servers, and watching things like latency and error rates.

So, Given our 2 tools lets go through the 4 goals above and talk about how to accomplish each.

1) Insight into the health of the system.

This is very straightforward. Your goal is to have 3 to 5 metrics that tell you if your system is working. If none of those metrics is out of threshold, then (in lieu of a known issue), you can generally assume the system is stable. I'm going to call these metrics your "key metrics". Some groups use the term "SLOs", though that term conflates this concept with client communication which I'd rather not do here.

You want these key metrics to cover as much of the system's behavior as possible, but you have a competing goal of them being be simple and easy to understand. The first goal is important because you don't want to make sure something on that front-page of graphs gets wonky when your system gets wonky, if not you'll miss problems. The second is equally important though because a single red light saying "IT'S BAD!" doesn't tell you much, and thresholds are hard to set, if you can't reason about what that metric means. When that metric is out of threshold you should be able to easily understand the impact it has, so you can decide how to proceed.

Note that getting these statistics is complex. Basically all systems have 2 relevant properties that you want to know about. 1) is it doing it's job 2) is it doing it fast enough. We can get each of these as whitebox or blackbox. Well, whitebox is better, since it reflects what the user sees right? So lets just use whitebox for both! Well, no. Here's why that's a bad idea. In general your probing is going to be a very small portion of your traffic. You usually can't afford to probe every interesting query all the time, so your users may be making different queries than your prober is. Between these two this means whitebox monitoring is usually more noisy, less granular, and likely to miss special cases. Blackbox of course suffers from not representing your users, or networking and such connecting you to your users. As a result you usually want both whitebox AND blackbox monitoring... optimally you probably want both for BOTH metrics.

For all your key metrics you want to avoid choices that cause them to change dramatically as your system scales, or as load scales. For example saying "we want to make sure users are always hitting our site" and alerting if your hits per second drops below a constant is guaranteed to alert every new years eve, and every world cup. You may need something like that, but keep these to a minimum, instead if you can use things like error rates as a fraction of total queries. Ratios are great.

Latency is weird. If you have 10 million queries per second flowing through your system, you don't want to alert because one was too slow. On the other hand you really care if some are extremely slow, or most are kindof slow. Because of this you often want to pick a couple of percentiles, maybe 99'th and 50'th (unfortunately this does depend on the scale of your system, but only very loosely), and alert on their latency being high. I'm going to cheat a bit and count those as one metric, since you can easily put them on one graph. And I'm making the rules anyway :).

Think about what your system does, and make sure your metrics are representative. If 90% of your queries are of one type that's super-cheap and fast, and 5% are expensive and incredibly slow, maybe you want to break those out into seperate metrics. You really can't do this process blind, you have to look at your system, what metrics you can get reasonably, and what they will tell you.

2) Insight into what's impacting the health of your system

Key metrics tell us whether our system is healthy, and give a 10,000 foot view of how it's unhealthy, but not a clue at all as to why or where to look. This is what the rest of your metrics are for. Here is where you go crazy, the more metrics the merrier. That said, piles and piles of metrics don't really help you if you can't find them. Think about what you want to know while debugging a given problem, and what metric you would want to help dig down and see what's wrong, or prove that something is or is not the issue.

For example. I wake up at 2am to a page saying that the 99'th percentile latency is high on my webservice. This webservice is backed by OpenTSDB sitting on HSpace on HFS and the whole thing is running on EBS backed EC2 instances. What do I do? If I've set up measurement of latency and compute percentiles *per component*. I click on a link by my latency graph and it takes me to a breakdown of latency per component and per query type. I look and I see that operations with *'s on the first parameter are crazy slow, that is whole-table scans at the HSPace layer. Every layer is saying things are stupid slow, so the other breakdown is useless today. Well, I think maybe it's HSpace so I look click on a link for that and I see that one tablet is slow. Huh, now I log into the EC2 instance backing that tablet and find that the EC2 instance takes 30 seconds to authenticate my ssh connection... well shit, my EC2 instance is probably getting hosed by a competing workload, I can dig around on the machine and maybe I'll find the competing workload from someone else is blowing all my cachelines... so I go buy a larger nicer machine, and tomorrow I'll see if I can get dedicated machines or something. I file a bug to get on that and go back to bed. (To be clear I've never run OpenTSDB, HSpace, or HFS, I just wanted an example with a relatively deep stack behind it)

That's ideally how you want debugging to look. It takes a LOT of work and a LOT of metrics to make things that smooth - and most of the time it won't be nearly that nice, but the closer you get the nicer it'll be.

3) To have insight into how these compare to historical values.

It's third quarter. My manager comes to me and asks what I need for budget next year. I ask around and find out that we're taking on a new large client that's half again the size of our largest client. After getting numbers it turns out that it's actually about half the size of our largest client... but that still means their query load is 20% of our total on average. Additionally I dig into their use-case and find out that it matches that of another client... okay.

So, I dig up the client that their use-case matches and look at their historical query load on various parts of our system. I take the peaks over the last 2 months and compare that to the average to get an idea of spikiness. Peaks are about 300% of normal, and occur at 12 noon. I compare that to the system as a whole and find that the spike is the same as that in the system overall. Damn, that sucks.

So I take our system as a whole, match it to a growth function, and based on that predict our traffic 4'th quarter next year due to organic growth. Then I check that function against our representative client and find that it matches. So I scale the represenative client up to the scale of our new client, add that to the other curve, and I've got our required capacity. From there we backsolve to how much of each resource we need, maybe adding a few percent slop here and there for systems that don't scale linearly, a little extra headroom, and the like. In short, you spitball it, but not until you get some backed numbers.

This sort of solving requires history, and it requires being able to query and analyze that history. I've repeatedly tried to build a generalized tool for capacity planning and have yet to succeed, in fact I've yet to succeed in even building a specialized tool for a specific task. If anyone knows of any I'd be very interested, for now I do it with ad-hoc queries similar to the process described above. Again the above is not a real scenario, but it maps closely to the process I have used, and will use again, when capacity planning.

4) To receive notification if your system is, or is about to be, unhealthy.

And now to the last bullet point. The one that all operations engineers hold near and dear to their hearts, and yet hate with a passion... Alerting.

We have out key metrics, so obviously we want to get notified when those are out of whack. One could argue that the key metrics are all we need for monitoring, and ideally this is the case. That said, the world we live in is never ideal, and the reality is that our key metrics are almost certainly going to miss some cases. Also, key metrics tend to be designed to tell you about user impacting issues. What about issues that you know are going to be user impacting? You can surface these sooner if you alert on them directly, rather than waiting for them to impact the key metrics enough. Examples of these are, part or all of the service is simply absent to our monitoring, our monitoring itself is noticably broken somehow, we are missing capacity that we're supposed to have and are just lucky that we're not in a peak load. All of these are clearly interesting pagable events, even though the key metrics are looking healthy.

There's another interesting set of cases as well. Our key metrics going out of whack are almost certainly pagable events. What about little niggling things? Things that are wrong, but we really don't want to get paged for. Things that may not show up in the key metrics until several things go wrong at once. For example, lets say that once we lose over 10% of our machines things start going south because we'll be out of capacity if we lose a few more. Or our system is supposed to be N+2 but we're at 95% of capacity before we become only N+1. Systems are constantly segfaulting, but never enough to actually cause a user-impacting problem. These are *interesting* events, and we want to hear about them, but don't want to get woken up in the middle of the night. For these events you want some sort of notification, or some kind.


So, in my view, that's what monitoring is for. That's what we're trying to accomplish. With that in mind my next post is going to be about tools. Since starting at Meteor I've been researching all of the tools available outside of Google, and I have to say that I'm a bit disappointed. I expected awkward kludgy tools, but I expected the tools to be able to do the things I needed. I'll go into first describing what we need to accomplish the goals listed here, and then talk about some of the extant systems and how they do or do not fall short.


Sound multiplexing in Linux

Linux's default sound system is pretty bare-bones. These days the go-to solution for sound drivers is Alsa. Alsa is just enough to get you the ability to get sound in and out of the machine, and adjust whatever the hardware supports. On thing it does not do by default is multiplexing. That is, only one application can make sound at a time.

This is a problem, and a problem many people have solved many ways. The result has been a plethora of different sound daemons and standards to go with them, jack, esd, pulseaudio, etc. The idea is you run a daemon that talks to the hardware (via the kernel) and everyone else sends it's sound stream through the daemon, which mixes it for you.

Now, when I set up my laptop with ubuntu-server, what I got by default was pulse-audio. My laptop has an interesting sound card feature, the sound-card can play out of both the headphone jack, and the speakers, at the same time. Often plugging in headphones cuts out the speaker, but not on this machine. Instead there are seperate volume controls for the two outputs.

That's all well and good, but pulse-audio is designed to need minimal configuration, and similar to so much software of it's ilk, that means it's not configurable. As a result it has different ideas about what should happen when I plug in my headphones than I do (in particular, it likes to crank the volume WAY up). It does mute the speakers, but I find the changing of my volume to ear-bleeding to me unacceptable. Since I couldn't change this, I removed pulse-audio.

I use my laptop to listen to music, as well as sometimes to receive phone-calls via gmail, skype, etc. These applications require sound, and without multiplexing I can't hear the phone ring if I'm listening to music. So, I needed a new multiplexer.

In comes dmix. dmix is alsa's own multiplexing solution. It's built right in, no need to use some other protocol, or some daemon that's too smart for it's own good. It's not shiny, it's not featureful, but it's simple and works. To make it the default you just edit the file "/etc/asound.conf". I'm not going to go into details on how, there's plenty of pages out there on how, but should you want this basic feature, without some heavyweight solution, give it a try.


Someone requested some references, I hadn't bother as I had little trouble finding them but I may have gotten lucky. So here we go


Complexity vs. Constant Factors

If you're formally trained, you've seen lots and lots of asymptotic analysis and the like in Computer Science.

In an earlier post about trees I noted one way you can be led astray by asymptotic analysis when I compared AVL trees and Red-Black trees. The question there is whether the part of the work people analyze is the part that makes up the most work.

There's another simpler way it often goes wrong though. Constant factors *matter*. A computer only has so much memory, and so much hardware. Every problem that can be solved on a given computer can be solved in 2**number_of_bits where number_of_bits is all storage on the machine. After that the machine must cycle to an already met state. That number is *extremely* large, (and in point of fact likely takes all the energy in the universe to cycle through) but it is finite.

It turns out that small local operations on arrays can still be VERY fast, particularly because the general locality helps with caching. A BTree is a way of taking advantage of these speedups, but within the small finite space of your node size.. that is, your arity. This makes BTrees MUCH faster than either AVL or RedBlack trees, despite the asymptotic analysis being identical to RedBlack trees.

I was curious about exactly where this tradeoff lies for BTrees. A btree stores many elements in each node, and usually keeps the elements sorted within that node in an array. This means inserts and removals from a node require shifting everything over in the node - a linear time operation in the size of the array, but B-tree nodes are of finite size, so still O(1). Anyway the number of elements in a BTree corrosponds to the number of children it has (-1 actually), so we can call it the "arity". So, I tried running my benchmark (same one used in the earlier tree experiments) at varying arities to see where our efficiency peaked. Note that these results are for 32 bit integers. The hardware and other dependencies are: Dell XPS 12 convertable laptop/tablet with: Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz, ubuntu desktop, gcc 4-8-1-10ubuntu9. Linux kernel 3-11.0-19-generic.

It appears to peak around 70 elements. The peak is pretty wide though so it's hard to be sure given the noise in the data. This was done with a single trial on a laptop that wasn't necessarilly perfectly isolated. For details on the exact experiement, and ability to run it yourself see

This isn't a surprising result at all, it's about what I expected. My guess had been between 100 and 200, which isn't far off.

Now if you read my post about real-time http://www.blog.computersarehard.net/2014/05/worst-case-vs-average-case.html You probably just stopped to wonder about this algorithm.... Yup, you can bet that the worst-case goes up a bit as we add more elements even while the average might be dropping. So, it's not clear what the best ARITY to use would be. Given no information about use-case there's a pretty clear cliff where we get a huge win by increasing the arity. So, not knowing I'd be tempted to set it at maybe 55. This will get us basically all of the gains of increasing arity, but keep the worst-case and variance comparitively small.

Consider the minimum there, ~25 seconds or so. The best run I've seen out of AVL is ~49 seconds. That's nearly a 2x difference. Even at Arity 5 the btree destroys the AVL tree at only ~39 seconds.

Conclusion: Constant factors matter, and can matter a lot. What's neat is that we still keep our asymptotic bounds TOO, so on both big and small problems we can be fairly sure BTree will keep beating the AVL tree by the same factors. I think that's pretty neat.


Worst case vs. Average case

For years and years everyone's been focused on the average case and amortized analysis. Quicksort is most programmers favorite sort. Splay trees are popular in some groups due to their amortized performance, etc.
I'd like to propose an alternative. I would like to propose that for most problems this view is incorrect, and that we should be focused on the worst case performance rather than the average. The reason for this is that ALL code is real-time.

The definition of realtime is:
A program is realtime if it must complete it's task within a specified real world time bound T.

 If I'm playing a videogame and 1 out of every 100 frames takes longer than a 100'th of a second to render I can visually see a flicker/stall in the video. As a result the videogame looks crappy and I hate playing it. This is a clear case where our software is realtime. If it takes longer than a certain amount of time, then it's not performing correctly.

Now lets say I go to visit a website, I sit there for 5 minutes waiting for it to load and it never does. So I give up on the website close the browser and never visit it again. Clearly this websites behavior was incorrect, and as a result it caused the website to lose a user. So, part of the correctness criteria for that website is a realtime bound. That websites server code is real-time!

Worst case vs. Average case

If I need my code to meet a real-time criteria, always, then I need to make sure that it's worst-case is less than that bound. Thus, we want to choose algorithms with low variability in time, and possibly a worst average case, in exchange for ensuring that the worst-case falls within our limitations.

This concept is especially true in distributed systems. In a normal system if our worst-case is highly unlikely it won't impact us often at all. On the other hand if rendering the website the user wanted requires hitting 100,000 servers, then it's pretty likely that that worst-case that only happens 0.0001% of the time will trigger on ONE of those servers. Examples of this include search-engines which fan out the request to servers each of which is responsible for part of the space of all data being searched.

Surprisingly this even comes up on back-end systems. Lets say I'm building an enormous dataset on a large cluster. I do this once a day.  If I'm doing a mapreduce and 1 out of every 1000 shards of my mapreduce is running really slowly, then my total mapreduce time is bound on those worst case shards, not the average. And I have a realtime bound of 1 day (probably less, since it'd be nice if I had a little leeway).

On top of that, while waiting for those couple of shards to finish their work we're probably waisting the rest of the cluster's computational ability - it's just idle. We could try and cram in other workloads, but this gets really complicated. It's much easier if you can flatten every workload as much as possible.

Lies and Deceit

I've just told a lie. In practice on really large systems it's often actually the 99.99% latency (or some number of 9's anyway), and you can may be able to just throw-out/time-out on the other cases. You probably can't do this at every layer though! So, as you distribute and add more layers to a system you still need to worry more and more about the wost-case performance.

Note for wankers, I'm using O not big theta because it's easier to type, and while the distinction would be formally useful here it's not widely used.

The point

When we start thinking of algorithms this way, the cost of mis-sizing a hashtable is an O(1) operation suddenly jumping to an O(n) operation so we can resize. Our quicksort looks pretty bad at O(n). Balanced trees, and the extra memory needed for mergesort or the extra moves for heapsort start to look a lot more tempting.

That is not to say average case, or even better amortized analysis, is not useful. If you do 100,000 operations on your dict, and all you care about is the time bound of that total set of operations, but this is only true if no-one else depends on the intermediate results.

So think about the real bounds of the code you are writing, and whenever someone says they want something to be "faster" stop for a second ask yourself if worst case or average is really what needs to improve.

AVL trees or RedBlack Trees?

Before launching into this. I'm talking about Sorted Binary Tree balancing algorithms. If you don't know what those are, you'll want to before bothering to read onward.


For this post I want to focus on AVL vs RedBlack trees, but first a little background. In a paper called "An Algorithm for the Organization of Information" G. M. Adel'son-Vel'skii and E. M. Landis layed out the first datastructure algorithm invented. It's a great paper too, very approachable, give it a read. You may want to skim Wikipedia's explanation as well though. Papers back then were written a bit differently, but more on that later.

Anyway. BTrees were invented later. RedBlack trees were then invented as an isomorph to BTrees where data never had to be shifted, one node was one piece of data. This can prove useful, for example if you're using pointers to the data stored in the tree.

There are tons of other balancing algorithms, but RedBlack trees and AVL trees have the distinction of  O(log(n)) for all operations, without having to amortize the results. This means that EVERY operation is O(log(n)) it doesn't just average out to that. That property is great if you have realtime bounds for your code, rather than just wanting it to run "fast". They do this because they guarantee that they are approximately "balanced" meaning every subtree has about the same number of children (transitively) to it's left and to it's right.

AVL trees have a tigher bound on how balanced they are than do RedBlack trees. This makes them a touch faster for lookup, but it means there are more balancing operations on average as data goes in and out. In particular RedBlack trees are guaranteed O(1) rotations for all operations, where AVL trees are not.

The Experiment

For fun I wrote myself a small forest of trees. The experiments aren't complete, but I got some slightly surprising results. I wrote them up on Google+, but I wanted to put them somewhere more permanent, so here it is.

All of the code for what I'm talking about is here:
Check PERFORMANCE for details on the experiments.

What surprised me is that AVL outperformed the RedBlack tree, it's not by a lot, but it's by some. I ran several further experiments making sure that it wasn't the depth difference impacting lookups. I ran a ton of extra lookups to see what would happen, and the gap didn't widen. This implies the lookups are too small a factor in my workload to matter.

I then spent quite a bit of time trying to figure out what else it could be. I already have 2 redblack implementations, and I'm running against the faster of the two. Thus I'm fairly confident that the algorithm is doing the right thing. I also have extensive consistancy checking which is turned off for the benchmark, but which is used in the unittest, so I'm fairly sure the algorithm is correct.

After much testing my best theory is that since RedBlack trees were written and declared faster the world of hardware has changed. See, when inserting or removing from an AVL tree you can tell whether to rotate by looking only at the nodes in the path to the one you modified. In a RedBlack tree you need to look at some other neighbors. Additionally note that as machines have sped up, the gap between cache layers has widened at every layer. L1 is many times faster than L2 than it used to be, same L2 to L3 and L3 to main memory.

My theory is that pulling in the cache-lines to look at the neighbors is in fact so expensive that it overrides having to do more rotations. Thus while theoretically RedBlack trees only do O(1) rotations per insert, this doesn't matter because the main expense is actually *looking* at nodes.

If anyone else comes up with another theory I'd be very interested. That's what I've got for now.

First Post

I've been keeping a blog about my outdoor adventures for some time over at blog.smalladventures.net.

I worked at Google for over 5 years as an SRE. In that time most of the computer work I was doing was fairly not public. Also, working 5 days a week, and some weekends, I really didn't want to spend much time outside of that playing with computers, certainly not enough to sustain a job.

I quit last year though, and have been doing some interesting side projects. I actually just accepted a job offer with Meteor that I'll be starting soon, but in contrast that is open source, and I convinced them to let me work only 4 days a week. As a result, hopefully I'll want to do at least some computer work for fun, and I'll probably want to post more as I learn about open source types of projects and the like.

So, here's my blog.

I did a lot of robotics in highschool and college and went to Carnegie Mellon with the intention of doing AI. I then got into operating systems, where I TA'd a fairly intense operating systems class. I realized the languages I was working in were awful and thus got interested in type theory and programming language design, taking several classes down that track such as category theory and a class on building a compiler for an SML type language. Somewhere along the line I also got interested in algorithms and took a graduate course in complexity theory.

Then, as mentioned earlier, I was an SRE for 5 years. I also run Linux on all my own machines. My preferences and needs for personal machines are strongly driven by my desire to be outdoors.

Suffice to say my interests are rather eclectic, and so this blog is likely to follow the same pattern.