Worst case vs. Average case

2014-05-28

For years and years everyone's been focused on the average case and amortized analysis. Quicksort is most programmers' favorite sort. Splay trees are popular in some groups due to their amortized performance, etc.
I'd like to propose an alternative: for most problems this view is incorrect, and we should focus on worst-case performance rather than the average. The reason is that ALL code is real-time.

The definition of realtime is:

A program is realtime if it must complete its task within a specified real-world time bound T.

If I'm playing a videogame and 1 out of every 100 frames takes longer than a hundredth of a second to render, I can visually see a flicker/stall in the video. As a result the videogame looks crappy and I hate playing it. This is a clear case where our software is realtime: if it takes longer than a certain amount of time, it's not performing correctly.
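As a rough sketch of what that bound means in code (the function names and the 10ms budget are my own made-up example, not from any real engine), a frame loop with a budget check might look like:

    import time

    FRAME_BUDGET_S = 0.01  # hypothetical hundredth-of-a-second budget from the example above

    def run_frames(render_frame, n_frames):
        """Render n_frames and count how many blow the real-time budget."""
        missed = 0
        for _ in range(n_frames):
            start = time.perf_counter()
            render_frame()
            elapsed = time.perf_counter() - start
            if elapsed > FRAME_BUDGET_S:
                missed += 1  # this is the flicker/stall the player actually sees
        return missed

Every missed frame is a correctness failure, regardless of how fast the average frame was.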

Now let's say I go to visit a website. I sit there for 5 minutes waiting for it to load and it never does, so I give up on the website, close the browser, and never visit it again. Clearly this website's behavior was incorrect, and as a result it cost the site a user. So part of the correctness criteria for that website is a realtime bound. That website's server code is real-time!

Worst case vs. Average case

If I need my code to always meet a real-time criterion, then I need to make sure that its worst case is less than that bound. Thus we want to choose algorithms with low variability in running time, possibly accepting a worse average case, in exchange for ensuring that the worst case falls within our limitations.
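One way to make that concrete (a sketch, with made-up names; the operation being measured is whatever you care about): instead of reporting only the mean, measure the spread of your operation's latency and look at the tail and the max.

    import statistics
    import time

    def latency_profile(op, runs=100_000):
        """Time `op` many times and report mean vs. tail vs. worst case."""
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            op()
            samples.append(time.perf_counter() - start)
        samples.sort()
        return {
            "mean": statistics.mean(samples),
            "p99.9": samples[int(0.999 * len(samples))],
            "worst": samples[-1],  # this is the number a real-time bound cares about
        }

If "worst" is over your budget, it doesn't matter how good "mean" looks.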

This concept is especially true in distributed systems. In a normal system, if our worst case is highly unlikely it won't impact us often at all. On the other hand, if rendering the page the user wanted requires hitting 100,000 servers, then it's pretty likely that the worst case that only happens 0.0001% of the time will trigger on at least ONE of those servers. Search engines are a classic example: they fan each request out to many servers, each responsible for part of the space of all data being searched.
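The arithmetic behind that claim, using the 0.0001% and 100,000 figures from above: the chance that at least one of n independent servers hits a worst case of probability p is 1 - (1 - p)^n.

    def p_any_worst_case(p, n):
        """Probability that at least one of n independent servers hits its worst case."""
        return 1 - (1 - p) ** n

    # 0.0001% per server, fanned out to 100,000 servers:
    print(p_any_worst_case(1e-6, 100_000))  # ~0.095, i.e. roughly 1 request in 10

So an event that's one-in-a-million on a single machine bites roughly one in every ten fanned-out requests.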

Surprisingly, this even comes up in back-end systems. Let's say I'm building an enormous dataset on a large cluster, and I do this once a day. If I'm running a mapreduce and 1 out of every 1000 shards is running really slowly, then my total mapreduce time is bounded by those worst-case shards, not the average. And I have a realtime bound of 1 day (probably less, since it'd be nice to have a little leeway).
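Here's a toy simulation of that effect (the shard count and timings are invented for illustration): since the shards run in parallel, the job finishes when the slowest shard finishes, so wall-clock time is driven by the max, not the mean.

    import random

    def simulate_job(n_shards=1000, typical_s=60.0, slow_s=3600.0, p_slow=0.001):
        """Each shard usually takes ~typical_s, but about 1 in 1000 is a straggler."""
        shard_times = [
            slow_s if random.random() < p_slow else random.uniform(0.5, 1.5) * typical_s
            for _ in range(n_shards)
        ]
        # Shards run in parallel: the job is done when the *slowest* one is done.
        return max(shard_times), sum(shard_times) / len(shard_times)

    worst, average = simulate_job()
    print(f"average shard: {average:.0f}s, job wall-clock: {worst:.0f}s")

A single one-hour straggler turns a one-minute-per-shard job into an hour-long job.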

On top of that, while waiting for those few shards to finish their work we're probably wasting the rest of the cluster's computational capacity; it's just sitting idle. We could try to cram in other workloads, but that gets really complicated. It's much easier if you can flatten every workload as much as possible.

Lies and Deceit

I've just told a lie. In practice, on really large systems it's often actually the 99.99th-percentile latency that matters (or some number of 9's anyway), and you may be able to just throw out or time out the other cases. You probably can't do this at every layer, though! So as you distribute a system and add more layers, you still need to worry more and more about the worst-case performance.
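A minimal sketch of the "just time out" trick (the fetch function, shard IDs, and the 50ms budget are assumptions for the example): give each backend call a deadline, and serve a slightly incomplete result rather than a slow one.

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    TIMEOUT_S = 0.05  # hypothetical per-backend budget

    def fetch_with_timeout(fetch, shard_id, executor):
        """Ask one backend for its piece; drop it if it blows the budget."""
        future = executor.submit(fetch, shard_id)
        try:
            return future.result(timeout=TIMEOUT_S)
        except TimeoutError:
            return None  # caller merges whatever came back in time

    # usage sketch:
    # with ThreadPoolExecutor(max_workers=32) as pool:
    #     results = [fetch_with_timeout(my_fetch, s, pool) for s in shards]

This works at the top layer, but the layer below still has to actually answer within its own bound, which is why the worry compounds as you add layers.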

Note for wankers: I'm using O rather than big theta because it's easier to type, and while the distinction would be formally useful here, it's not widely used.

The point

When we start thinking of algorithms this way, the cost of mis-sizing a hashtable is an O(1) operation suddenly jumping to O(n) so we can resize. Our quicksort looks pretty bad at its O(n^2) worst case. Balanced trees, the extra memory needed for mergesort, or the extra moves of heapsort start to look a lot more tempting.
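You can watch the hashtable effect directly (a sketch; I'm not promising anything about any particular interpreter's dict internals, but the spikes show up in practice): time each individual insert and compare the average to the worst single insert.

    import time

    def insert_timings(n=1_000_000):
        """Time each individual dict insert; resizes show up as rare, large spikes."""
        d = {}
        worst = 0.0
        total = 0.0
        for i in range(n):
            start = time.perf_counter()
            d[i] = i
            elapsed = time.perf_counter() - start
            total += elapsed
            worst = max(worst, elapsed)
        return total / n, worst

    avg, worst = insert_timings()
    print(f"average insert: {avg*1e9:.0f}ns, worst insert: {worst*1e9:.0f}ns")

The average stays tiny, but the worst single insert, the one that triggered a resize, is orders of magnitude larger.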

That is not to say average-case, or better yet amortized, analysis is not useful. If you do 100,000 operations on your dict and all you care about is the time bound on that total set of operations, amortized analysis tells you exactly what you need to know. But this is only true if no one depends on the intermediate results.

So think about the real bounds of the code you are writing, and whenever someone says they want something to be "faster", stop for a second and ask yourself whether it's the worst case or the average that really needs to improve.