Complexity vs. Constant Factors

If you're formally trained in Computer Science, you've seen plenty of asymptotic analysis and the like.

In an earlier post about trees I noted one way you can be led astray by asymptotic analysis, when I compared AVL trees and red-black trees. The question there was whether the part of the work people analyze is actually the part that makes up most of the work.

There's another, simpler way it often goes wrong though: constant factors *matter*. A computer only has so much memory and so much hardware. Every problem that can be solved on a given computer can be solved in at most 2**number_of_bits steps, where number_of_bits counts all storage on the machine; after that the machine must cycle back to a state it has already visited. That number is *extremely* large (in point of fact, cycling through it would likely take all the energy in the universe), but it is finite.

It turns out that small, local operations on arrays can still be VERY fast, particularly because their locality helps with caching. A B-tree is a way of taking advantage of these speedups within the small, finite space of your node size, that is, your arity. This makes B-trees MUCH faster than either AVL or red-black trees, despite the asymptotic analysis being identical to red-black trees.

I was curious about exactly where this tradeoff lies for B-trees. A B-tree stores many elements in each node, and usually keeps the elements sorted within that node in an array. This means inserts and removals from a node require shifting everything over in the node: a linear-time operation in the size of the array, but since B-tree nodes have a fixed maximum size, still O(1). The number of elements a node can hold corresponds to the number of children it can have (minus one, actually), so we can call it the "arity". So, I tried running my benchmark (the same one used in the earlier tree experiments) at varying arities to see where our efficiency peaked. Note that these results are for 32-bit integers. The hardware and other dependencies: a Dell XPS 12 convertible laptop/tablet with an Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz, Ubuntu desktop, gcc 4.8.1-10ubuntu9, Linux kernel 3.11.0-19-generic.
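To make the in-node shifting concrete, here is a minimal sketch of inserting into a single node's sorted key array. The `ARITY` value and all names here are illustrative, not the ones from the benchmark:

```c
#include <assert.h>

/* Illustrative only: a node with ARITY children holds ARITY - 1 keys. */
#define ARITY 8

typedef struct {
    int keys[ARITY - 1];
    int count;
} node_t;

/* Insert `key` into the node's sorted array, shifting later elements
   right. Linear in the node size, but the node size is a fixed
   constant, so this is still O(1). */
void node_insert(node_t *n, int key) {
    assert(n->count < ARITY - 1);  /* caller splits full nodes first */
    int i = n->count;
    while (i > 0 && n->keys[i - 1] > key) {
        n->keys[i] = n->keys[i - 1];  /* shift one slot to the right */
        i--;
    }
    n->keys[i] = key;
    n->count++;
}
```

The whole loop walks one small contiguous array, which is exactly the cache-friendly access pattern discussed above.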

It appears to peak around 70 elements. The peak is pretty wide, though, so it's hard to be sure given the noise in the data. This was done with a single trial on a laptop that wasn't necessarily perfectly isolated. For details on the exact experiment, and the ability to run it yourself, see

This isn't a surprising result at all; it's about what I expected. My guess had been between 100 and 200, which isn't far off.

Now, if you read my post about real-time (http://www.blog.computersarehard.net/2014/05/worst-case-vs-average-case.html), you probably just stopped to wonder about this algorithm. Yup: you can bet that the worst case goes up a bit as we add more elements, even while the average might be dropping. So it's not clear what the best arity to use would be. Given no information about the use case, there's a pretty clear cliff where we get a huge win by increasing the arity. So, not knowing more, I'd be tempted to set it at maybe 55. This gets us basically all of the gains of increasing arity, but keeps the worst case and variance comparatively small.

Consider the minimum there, ~25 seconds or so. The best run I've seen out of AVL is ~49 seconds. That's nearly a 2x difference. Even at arity 5 the B-tree destroys the AVL tree at only ~39 seconds.

Conclusion: constant factors matter, and can matter a lot. What's neat is that we still keep our asymptotic bounds too, so on both big and small problems we can be fairly sure the B-tree will keep beating the AVL tree by about the same factor. I think that's pretty neat.


  1. I assume you've seen the Stroustrup talk on Linked Lists vs. Vectors? It's in a similar vein.

  2. No, I hadn't seen that. That's a great talk. His results seem closely related to my red-black tree experiment, where it seems that caching effects dominate completely over the actual modifications.

    I wondered earlier whether a linear search would actually perform better on the B-tree for related reasons. At 55-80 elements here, each node is ~200 to ~350 bytes long.

  3. BTW, sorry to anyone who had trouble posting - I just switched to Google accounts only. This is not to limit posting, but because it looks like the non-Google posting was causing comment posts to fail quite a bit. So now, hopefully, if you think you can post, it'll work.

    Please contact me if you do have trouble again, I can try more things.

  4. Other considerations with memory locality performance:
    i. Are the arrays aligned to pages or cache lines?
    ii. Are there opportunities to explicitly prefetch early (e.g. `__builtin_prefetch`)?
    iii. Are we "false-sharing" blocks with other CPUs?

    The third point comes up a bunch in console gamedev when debugging hot loops on PPC systems, and it's the worst -- the caching hierarchy kind of acts like a version control system in this case, and has to basically rebase memory writes and resolve conflicts. Preshing from Ubisoft had a great post on this: http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
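    One common fix for that third point is padding per-CPU data out to separate cache lines so one core's writes never invalidate a line another core is reading. A minimal sketch, assuming a 64-byte cache line (a typical but not universal size):

    ```c
    #include <assert.h>

    #define CACHE_LINE 64  /* assumption: 64-byte lines, common on x86 */

    /* Force each counter onto its own cache line. The alignment makes
       the compiler round the struct size up to CACHE_LINE bytes, so
       adjacent array elements can't share a line. */
    struct padded_counter {
        _Alignas(CACHE_LINE) long value;
    };

    /* e.g. one counter per worker thread, no false sharing between them */
    struct padded_counter counters[4];
    ```

    The cost is wasted memory per slot, which is usually a fine trade in a hot loop.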

    1. ia32-e (otherwise known as AMD64) now actually has operations for transaction-like behavior, which I believe under the hood is just taking advantage of MOESI to delay writebacks.

      That would be a little similar to thinking of it like version control.