2014-05-31

Sound multiplexing in Linux

Linux's default sound system is pretty bare-bones. These days the go-to solution for sound drivers is ALSA. ALSA gives you just enough to get sound in and out of the machine and to adjust whatever the hardware supports. One thing it does not do by default is multiplexing. That is, only one application can make sound at a time.

This is a problem, and one that many people have solved in many ways. The result has been a plethora of different sound daemons and the standards to go with them: JACK, ESD, PulseAudio, etc. The idea is that you run a daemon that talks to the hardware (via the kernel), and every other application sends its sound stream through the daemon, which mixes it all for you.

Now, when I set up my laptop with ubuntu-server, what I got by default was PulseAudio. My laptop has an interesting sound card feature: it can play out of both the headphone jack and the speakers at the same time. Often plugging in headphones cuts out the speaker, but not on this machine. Instead there are separate volume controls for the two outputs.

That's all well and good, but PulseAudio is designed to need minimal configuration, and like so much software of its ilk, that means it's barely configurable. As a result it has different ideas than I do about what should happen when I plug in my headphones (in particular, it likes to crank the volume WAY up). It does mute the speakers, but I find having my volume jumped to ear-bleeding levels unacceptable. Since I couldn't change this, I removed PulseAudio.

I use my laptop to listen to music, as well as sometimes to receive phone calls via Gmail, Skype, etc. These applications require sound, and without multiplexing I can't hear the phone ring if I'm listening to music. So, I needed a new multiplexer.

Enter dmix, ALSA's own multiplexing solution. It's built right in: no need for some other protocol, or some daemon that's too smart for its own good. It's not shiny and it's not featureful, but it's simple and it works. To make it the default you just edit the file "/etc/asound.conf". I'm not going to go into full details; there are plenty of pages out there on how, but should you want this basic feature without some heavyweight solution, give it a try.
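
For orientation, a minimal /etc/asound.conf along those lines looks roughly like the sketch below. This is an illustrative example rather than my exact file; in particular the card and device numbers are assumptions, so check yours with "aplay -l":

    # Route the default PCM through dmix so multiple applications get mixed.
    pcm.!default {
        type plug
        slave.pcm "dmixer"
    }

    pcm.dmixer {
        type dmix
        ipc_key 1024          # any unique integer
        slave {
            pcm "hw:0,0"      # card 0, device 0 -- adjust for your hardware
            period_size 1024
            buffer_size 4096
            rate 44100
        }
    }

    ctl.!default {
        type hw
        card 0
    }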

Update

Someone requested some references. I hadn't bothered, as I had little trouble finding them, but I may have gotten lucky. So here we go:

2014-05-30

Complexity vs. Constant Factors

If you're formally trained, you've seen lots and lots of asymptotic analysis and the like in Computer Science.

In an earlier post about trees I noted one way you can be led astray by asymptotic analysis when I compared AVL trees and Red-Black trees. The question there was whether the part of the work people analyze is actually the part that makes up most of the work.

There's another, simpler way it often goes wrong though. Constant factors *matter*. A computer only has so much memory and so much hardware. Every problem that can be solved on a given computer can be solved in at most 2**number_of_bits steps, where number_of_bits is all the storage on the machine; after that the machine must cycle back to a state it has already been in. That number is *extremely* large (in point of fact it would likely take all the energy in the universe to cycle through), but it is finite.

It turns out that small local operations on arrays can still be VERY fast, particularly because the locality helps with caching. A BTree is a way of taking advantage of these speedups, but within the small finite space of your node size, that is, your arity. This makes BTrees MUCH faster than either AVL or RedBlack trees, despite the asymptotic analysis being identical to RedBlack trees.

I was curious about exactly where this tradeoff lies for BTrees. A BTree stores many elements in each node, and usually keeps the elements sorted within that node in an array. This means inserts and removals from a node require shifting everything over in the node: a linear-time operation in the size of the array, but BTree nodes are of bounded size, so it's still O(1). The number of elements in a BTree node corresponds to the number of children it has (minus 1, actually), so we can call it the "arity". So, I tried running my benchmark (the same one used in the earlier tree experiments) at varying arities to see where our efficiency peaked. Note that these results are for 32-bit integers. The hardware and other dependencies are: a Dell XPS 12 convertible laptop/tablet with an Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz, Ubuntu desktop, gcc 4.8.1-10ubuntu9, Linux kernel 3.11.0-19-generic.
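
To make the "shifting" cost concrete, here is a rough sketch of what inserting into a single sorted BTree node looks like. This is illustrative only; the names and fixed capacity are mine, not the benchmarked code:

    #include <algorithm>
    #include <cstddef>

    // Illustrative sketch: one BTree node holding up to ARITY-1 sorted keys.
    template <typename T, std::size_t ARITY>
    struct BTreeNode {
      T keys[ARITY - 1];
      std::size_t used = 0;

      // Find the slot with a binary search, then shift everything after it
      // right by one. The shift is linear in the arity, but the arity is a
      // fixed constant, so the whole operation is still O(1).
      bool insert(const T& key) {
        if (used == ARITY - 1) return false;  // full: a real BTree would split here
        T* pos = std::upper_bound(keys, keys + used, key);
        std::move_backward(pos, keys + used, keys + used + 1);
        *pos = key;
        ++used;
        return true;
      }
    };

The interesting tradeoff is entirely in that shift: a bigger arity means better locality and fewer levels in the tree, but more elements moved per insert.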

It appears to peak around 70 elements. The peak is pretty wide, though, so it's hard to be sure given the noise in the data. This was done with a single trial on a laptop that wasn't necessarily perfectly isolated. For details on the exact experiment, and the ability to run it yourself, see https://github.com/multilinear/datastructures_C--11 (the same repository used for the earlier tree experiments).



This isn't a surprising result at all, it's about what I expected. My guess had been between 100 and 200, which isn't far off.

Now, if you read my post about real-time (http://www.blog.computersarehard.net/2014/05/worst-case-vs-average-case.html), you probably just stopped to wonder about this algorithm... Yup, you can bet that the worst case goes up a bit as we add more elements, even while the average might be dropping. So, it's not clear what the best arity to use would be. Given no information about the use-case, there's a pretty clear cliff where we get a huge win by increasing the arity. So, not knowing, I'd be tempted to set it at maybe 55. This will get us basically all of the gains of increasing arity, but keep the worst case and variance comparatively small.

Consider the minimum there, ~25 seconds or so. The best run I've seen out of AVL is ~49 seconds. That's nearly a 2x difference. Even at arity 5 the BTree destroys the AVL tree, at only ~39 seconds.

Conclusion: Constant factors matter, and can matter a lot. What's neat is that we still keep our asymptotic bounds TOO, so on both big and small problems we can be fairly sure the BTree will keep beating the AVL tree by similar factors. I think that's pretty neat.

2014-05-28

Worst case vs. Average case

For years and years everyone's been focused on the average case and amortized analysis. Quicksort is most programmers' favorite sort, splay trees are popular in some groups due to their amortized performance, etc.
I'd like to propose an alternative. I would like to propose that for most problems this view is incorrect, and that we should be focused on the worst-case performance rather than the average. The reason for this is that ALL code is real-time.

The definition of realtime is:
A program is realtime if it must complete its task within a specified real-world time bound T.

If I'm playing a videogame and 1 out of every 100 frames takes longer than a 100th of a second to render, I can see a visible flicker/stall in the video. As a result the videogame looks crappy and I hate playing it. This is a clear case where our software is realtime: if it takes longer than a certain amount of time, then it's not performing correctly.

Now let's say I go to visit a website. I sit there for 5 minutes waiting for it to load and it never does. So I give up on the website, close the browser, and never visit it again. Clearly this website's behavior was incorrect, and as a result it caused the website to lose a user. So, part of the correctness criteria for that website is a realtime bound. That website's server code is real-time!

Worst case vs. Average case

If I need my code to meet a real-time criterion, always, then I need to make sure that its worst case is less than that bound. Thus, we want to choose algorithms with low variability in time, possibly even with a worse average case, in exchange for ensuring that the worst case falls within our limitations.

This concept is especially true in distributed systems. In a normal system, if our worst case is highly unlikely it won't impact us often at all. On the other hand, if rendering the website the user wanted requires hitting 100,000 servers, then it's pretty likely that a worst case which only happens 0.0001% of the time will trigger on at least ONE of those servers. Examples of this include search engines, which fan the request out to servers each of which is responsible for part of the space of all data being searched.
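
To put rough (purely illustrative) numbers on that: if each server independently hits its worst case with probability p, a request fanned out to n servers sees at least one worst case with probability 1 - (1 - p)^n.

    #include <cmath>
    #include <cstdio>

    int main() {
      double p = 1e-6;   // 0.0001% chance of the worst case on any one server (illustrative)
      int n = 100000;    // servers touched by a single request (illustrative)
      // Probability that at least one of the n servers hits its worst case.
      double at_least_one = 1.0 - std::pow(1.0 - p, n);
      std::printf("chance of seeing a worst case per request: %.1f%%\n",
                  at_least_one * 100.0);
      return 0;
    }

With those numbers it's roughly a 1-in-10 chance per request, so over thousands of requests some user is hitting that worst case essentially all the time.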

Surprisingly this even comes up on back-end systems. Let's say I'm building an enormous dataset on a large cluster, and I do this once a day. If I'm doing a mapreduce and 1 out of every 1000 shards of my mapreduce is running really slowly, then my total mapreduce time is bounded by those worst-case shards, not the average. And I have a realtime bound of 1 day (probably less, since it'd be nice if I had a little leeway).

On top of that, while waiting for those couple of shards to finish their work we're probably wasting the rest of the cluster's computational ability - it's just idle. We could try and cram in other workloads, but this gets really complicated. It's much easier if you can flatten every workload as much as possible.

Lies and Deceit

I've just told a lie. In practice on really large systems it's often actually the 99.99% latency (or some number of 9's, anyway) that matters, and you may be able to just throw out or time out the remaining cases. You probably can't do this at every layer though! So, as you distribute and add more layers to a system, you still need to worry more and more about the worst-case performance.

Note for wankers: I'm using O, not big Theta, because it's easier to type, and while the distinction would be formally useful here, it's not widely used.

The point

When we start thinking of algorithms this way, the cost of mis-sizing a hashtable is an O(1) operation suddenly jumping to an O(n) operation so we can resize. Our quicksort looks pretty bad at its worst case of O(n^2). Balanced trees, the extra memory needed for mergesort, or the extra moves for heapsort start to look a lot more tempting.
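
As a quick (unscientific) illustration of the hashtable point, you can watch the worst single insert spike every time std::unordered_map has to rehash, even though the average insert stays cheap:

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <unordered_map>

    int main() {
      std::unordered_map<int, int> m;
      long long worst_ns = 0;
      std::size_t buckets = m.bucket_count();
      for (int i = 0; i < 1000000; ++i) {
        auto start = std::chrono::steady_clock::now();
        m[i] = i;  // usually O(1), but occasionally triggers a full rehash
        auto stop = std::chrono::steady_clock::now();
        long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (ns > worst_ns) worst_ns = ns;
        if (m.bucket_count() != buckets) {  // a rehash just happened
          buckets = m.bucket_count();
          std::printf("rehash at %d elements, slowest insert so far: %lld ns\n",
                      i, worst_ns);
        }
      }
      return 0;
    }

If all you look at is the total (amortized) time this looks great; if you have a per-operation deadline, those spikes are exactly the problem.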

That is not to say average-case, or even better amortized, analysis is not useful. If you do 100,000 operations on your dict and all you care about is the time bound on that total set of operations, then the amortized bound tells you what you need; but this is only true if no one depends on the intermediate results.

So think about the real bounds of the code you are writing, and whenever someone says they want something to be "faster", stop for a second and ask yourself whether the worst case or the average case is really what needs to improve.




AVL trees or RedBlack Trees?

Before launching into this: I'm talking about sorted binary tree balancing algorithms. If you don't know what those are, you'll want to learn before bothering to read onward.

Background

For this post I want to focus on AVL vs. RedBlack trees, but first a little background. In a paper called "An Algorithm for the Organization of Information", G. M. Adel'son-Vel'skii and E. M. Landis laid out the first self-balancing tree datastructure. It's a great paper too, very approachable; give it a read. You may want to skim Wikipedia's explanation as well, though. Papers back then were written a bit differently, but more on that later.

Anyway, BTrees were invented later. RedBlack trees were then invented as an isomorph of BTrees in which data never has to be shifted: one node holds one piece of data. This can prove useful, for example if you're keeping pointers to the data stored in the tree.

There are tons of other balancing algorithms, but RedBlack trees and AVL trees have the distinction of being O(log(n)) for all operations, without having to amortize the results. This means that EVERY operation is O(log(n)); it doesn't just average out to that. That property is great if you have realtime bounds for your code, rather than just wanting it to run "fast". They achieve this by guaranteeing that they stay approximately "balanced", meaning every subtree has about the same number of nodes (transitively) to its left and to its right.

AVL trees have a tighter bound on how balanced they are than RedBlack trees do. This makes them a touch faster for lookup, but it means there are more balancing operations on average as data goes in and out. In particular, RedBlack trees are guaranteed O(1) rotations per operation, whereas AVL trees are not.

The Experiment

For fun I wrote myself a small forest of trees. The experiments aren't complete, but I got some slightly surprising results. I wrote them up on Google+, but I wanted to put them somewhere more permanent, so here it is.

All of the code for what I'm talking about is here:
https://github.com/multilinear/datastructures_C--11
Check PERFORMANCE for details on the experiments.

What surprised me is that the AVL tree outperformed the RedBlack tree; not by a lot, but by some. I ran several further experiments to make sure it wasn't the depth difference impacting lookups. I ran a ton of extra lookups to see what would happen, and the gap didn't widen. This implies the lookups are too small a factor in my workload to matter.

I then spent quite a bit of time trying to figure out what else it could be. I already have two RedBlack implementations, and I'm running against the faster of the two, so I'm fairly confident the algorithm is doing the right thing. I also have extensive consistency checking, which is turned off for the benchmark but used in the unittest, so I'm fairly sure the algorithm is correct.




After much testing, my best theory is that the world of hardware has changed since RedBlack trees were written up and declared faster. See, when inserting into or removing from an AVL tree, you can tell whether to rotate by looking only at the nodes on the path to the one you modified. In a RedBlack tree you need to look at some other neighbors as well. Additionally, note that as machines have sped up, the gap between cache layers has widened at every level: L1 is many times faster than L2 than it used to be, and the same goes for L2 vs. L3 and L3 vs. main memory.

My theory is that pulling in the cache lines to look at those neighbors is in fact so expensive that it overrides having to do more rotations. Thus, while theoretically RedBlack trees only do O(1) rotations per insert, this doesn't matter much, because the main expense is actually *looking* at nodes.
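
To make the access-pattern difference concrete, here is a sketch of the two decision points. This is illustrative code, not my actual implementation; the field and function names are made up:

    // AVL: whether to rotate after an insert is decided from balance factors
    // stored in nodes we already walked through on the way down, so those
    // cache lines are hot.
    //
    // RedBlack: the classic insert fixup has to read the "uncle" node, which
    // is off the search path entirely, so it's typically a cold cache line.

    struct Node {
      Node *parent, *left, *right;
      int balance;  // AVL: height(right) - height(left)
      bool red;     // RedBlack: node color
    };

    bool avl_needs_rotation(const Node* n) {
      return n->balance < -1 || n->balance > 1;  // only reads the node itself
    }

    bool redblack_uncle_is_red(const Node* n) {
      const Node* g = n->parent->parent;
      const Node* uncle = (n->parent == g->left) ? g->right : g->left;
      return uncle != nullptr && uncle->red;     // touches a node off the path
    }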

If anyone else comes up with another theory I'd be very interested. That's what I've got for now.








First Post

I've been keeping a blog about my outdoor adventures for some time over at blog.smalladventures.net.

I worked at Google for over 5 years as an SRE. In that time most of the computer work I was doing was not very public. Also, working 5 days a week, and some weekends, I really didn't want to spend much time outside of that playing with computers, certainly not enough to sustain a blog.

I quit last year though, and have been doing some interesting side projects. I actually just accepted a job offer with Meteor that I'll be starting soon, but in contrast that work is open source, and I convinced them to let me work only 4 days a week. As a result, hopefully I'll want to do at least some computer work for fun, and I'll probably want to post more as I learn about open-source types of projects and the like.

So, here's my blog.

I did a lot of robotics in high school and college, and went to Carnegie Mellon with the intention of doing AI. I then got into operating systems, where I TA'd a fairly intense operating systems class. I realized the languages I was working in were awful, and thus got interested in type theory and programming language design, taking several classes down that track, such as category theory and a class on building a compiler for an SML-like language. Somewhere along the line I also got interested in algorithms and took a graduate course in complexity theory.

Then, as mentioned earlier, I was an SRE for 5 years. I also run Linux on all my own machines. My preferences and needs for personal machines are strongly driven by my desire to be outdoors.

Suffice it to say my interests are rather eclectic, and so this blog is likely to follow the same pattern.