libanimation for everyone

This blog post got reposted on reddit with the title “Compiz effects are coming back to GNOME-Shell”. That is not what is happening here at all. Don’t expect any of this stuff to make it into any upcoming Shell releases, because it hasn’t been discussed with anyone in the Shell team.

Instead, what I was trying to say was that I was looking into whether it was possible to do so and created libanimation in the process. Libanimation is a math library. Someone still has to write a binding on the compositor side to consume its outputs and make it work. I can say that at least for GNOME-Shell it is certainly possible to do so, but someone else would need to write it and upstream it.

End big important edit: original blog post upcoming.

Something I worked on when I first started at Endless was the rather interesting task of “making Wobbly Windows from Compiz work on GNOME-Shell”.

This task is interesting in the sense that GNOME-Shell doesn’t really work in the way that the wobbly plugin for Compiz wanted it to. There is the “Wobbly Windows” extension for the shell, but that was sort of out for Endless, since we didn’t want to rely on the extensions mechanism and that extension also didn’t quite work the way I had come to know and love.

What I really wanted to do with this project was replicate the same smooth wobbly windows from Compiz in the shell itself.

Where do you even start on such a project? The first hard part is trying to get the math to work and knowing that you have the math working. Thus, libwobbly was born – a library that reimplements the same math that Compiz itself used, allowing callers to create a 3×3 mesh of springs representing the physical effect of “wobbly” windows and providing functions to interpolate into that 3×3 mesh. Then, we used libwobbly in our fork of GNOME-Shell along with ClutterDeformEffect and a bridge to get the same, buttery smooth wobbly windows in GNOME-Shell.

However, this approach had its shortcomings. Up until very recently it only exported a C++ interface, because I very foolishly was obstinate about keeping things in C++ at the time. This made integration into the shell a pain for a variety of reasons, among them frequent ABI breaks due to compiler or boost changes, needing to have a whole separate GObject-Introspection typelib just to handle the C++ stuff, separate build rules, etc etc. Of course, this state of affairs wasn’t well appreciated by others who work on our fork of the shell, with the running joke that whenever the shell breaks, it was always libwobbly’s fault.

For a few reasons I won’t get into, window animations became a priority at Endless again and we wanted to see if we could do a better take on libwobbly. I didn’t want to throw the whole thing out, since it is quite well tested and a lot of thought went into its design.

The result is an improved library, libanimation designed to be used from C++ programs with a C++ interface as well as C and language bindings with a GObject interface wrapping the C++ interface. In future, we should also be able to create emscripten bindings to the C++ interface, allowing it to be used directly in web programs too.

The name change represents a broadening of the project scope. For a few reasons, we wanted to preserve not only wobbly windows, but also the other (decent) effects from Compiz too, among them zoom, bounce, glide and magic lamp. The library does not insist on taking over any of the rendering or scene graph management. Instead, libanimation is designed to provide “the math” and can be seen as a black box, where you feed it a requested animation configuration, a step function (linear, reverse, quadratic, sigmoid) and can “get out” a series of affine transformations or deformed co-ordinates given a pair of unit co-ordinates.

Over time, more animations will be added. I hope that the library will be useful to authors of other compositors or applications and help to preserve some of the more magical parts of Compiz as technology itself matches forward.

Serverless React Boilerplate

I’ve been keeping my eye on serverless as a way to do hosting and deployments for my next react project when I get around to it and this week I finally got a chance to look into it.

React Serverless.png

The tl;dr is that I’ve forked react-boilerplate as react-boilerplate-serverless if you just want to get started with serverless deployments. You should be able to immediately use npm run serverless-deploy.

For the uninitiated, the main difference brought with so-called “serverless” applications is that they don’t run on a dedicated virtual machine or container that is running 24/7. Instead, your code is abstracted behind infinitely scaleable architecture and you only pay for the execution time of your code as opposed to the server itself. This is the sort of automatic scaleability that people like to talk about with cloud computing.

There’s a few important distinctions with serverless computing that comes with its architecture. First, we call serverless deployments “functions” because they really are just that – the infrastructure hosting the serverless service takes care of all the network IO and just calls your “function” when a particular endpoint is accessed. In that sense, you lose control over how network IO is handled and the process which is handling that network IO. The “function” you are deploying in this case is really a loadable library which can be a collection of scripts and data but with no associated running “process”. Instead the library is loaded at runtime, which means that the serverless computing provider can bill you only for the executing time your library actually uses in handling requests.

In practice, this distinction tends not to matter so much. It turns out to be relatively trivial (with some caveats) to turn any express-like application into a serverless application thanks to tools like serverless-http. If you wrap your express application with the serverless wrapper from that package, infrastructure like AWS lambda will just call your route handlers directly after being invoked from a frontend server like AWS API Gateway. To your application, this is essentially no different than the route handlers being invoked asynchronously by libuv and node’s nextTick function.

There’s some other practical annoyances to be aware of as well, though they can all be mitigated.

Support for compression and binary data is still sketchy at best

Unfortunately, AWS API Gateway handles binary content in a particularly awkward way, expecting it to be serialised to base64 by the request handler (e.g., your serverless function) before it goes back to the API Gateway (which should, in theory, unserialise it back into binary). For a long time, API Gateway didn’t even support binary responses and only gained support for it last year making it possible to actually host a server that serves image data directly with Lambda. But you still need to manually specify the MIME types that should be treated as binary on the API Gateway configuration. I’ve found that this works well enough for things like images but it completely broke down when it came to compressed text.

Most node servers, including the server set up in react-boilerplate use compression-middleware which allows for the client and server to agree for responses to be gzipped. Unfortunately, API Gateway chokes on the binary data if you give it the compressed response directly and specifying that the binary data be encoded in base64 causes the client to choke. I haven’t found any way to get around this problem other than just disabling compression. Maybe the story will improve in time to come and we won’t have to worry about messing about with settings to support binary content.

Execution time

Since popular servers like node are single-threaded and event based you probably shouldn’t be performing any too-long-running or expensive tasks from within your server anyway. But you really need to watch out with in-function asynchronous tasks that could take longer than 10 minutes, since the execution time is limited to 300 seconds. This could be a problem if you need to upload large files with the help of the function. One workaround for that is to just have the client upload the files directly to an S3 bucket, though that might require a little more stuffing about than you would like.

Everything is stateless

Again, if you designed your web server in line with RESTful principles this shouldn’t be a problem in theory, though it could be a problem if you have logic that runs on first start.

You don’t own the infrastructure

If you’re used to Platform-as-a-Service this isn’t a huge deal, but its worth keeping in mind that you don’t control the runtime that your application is “running” on. In fact, you don’t even control whether your application is running at a given moment, other than when it is handling requests.

 

Check it out

So far I’ve had enough luck to be able to get a render of a page including images and associated scripts and styles, which proves that this definitely can be done. Hopefully this will become a more affordable and simpler way to deploy simple servers in the future, without having to worry about provisioning entire virtual machines or containers running 24/7 for the sake of hosting.

Exploring faster blurs

This week I a saw blog post on Hacker News entitled “Fastest Gaussian Blur” (by Ivan Kutskir). That caught my interest, because one thing that always irked me back when I worked on Unity was the fact that our UI blurs felt quite sluggish and there didn’t seem to be much we could do to speed them up given the size of the blur radii that we were using. I had a Sunday afternoon to kill and recently broke my ankle again (no gym or rock climbing), so I figured why not try implementing some of them.

Gaussian Blur

The post goes over four different algorithms for implementing gaussian blur, the first one being the standard approach of just doing an R-squared pass over every pixel and calculating its normalised gaussian distribution weights inline. As you can imagine, this is terribly slow, because you need to sample R-squared pixels for the entire image (and needlessly recompute the gaussian weights!). As is typically the case, its also the most trivial to implement in a fragment shader – you just take the input image and do two nested loops for the radius, sampling each neighbouring pixel and applying its calculated gaussian weight.

As a simple blur shader, its pretty satisfying to implement, since it is probably the most accurate in terms of what we consider a “blur” to be. But alas, the performance sucks. So can we do better?

Box Blur

Well, the article says yes, we can. Algorithms 2 and 3 in the article go over a technique called “box blur”, which is basically a way to approximate gaussian blur without having to compute weights. The trick here is to do several passes over the image each with different radii that you work out using some calculus. The article goes into more detail about why the math behind this works. As it says, the nice thing about box blurs is that you don’t need to worry about computing weights – the application of different blur radii tends to converge on the same result.

The only difference Algorithms 2 and 3 is that Algorithm 3 explicitly separates out the horizontal and vertical passes (thus, you have an O(2(nr)) operation as opposed to an O(n*r^2) operation. Most blurring approaches I’ve seen tend to do this because the complexity saving alone outweighs the cost of doing multiple framebuffer passes. In practice, it does turn out to be more performant, which is nice.

Precomputed Weights with a Data Texture

I was tempted to go back and try implement this in GNOME-Shell, though I saw that someone had already beaten me to it. Initially I was a little defeated, but then I became curious looking at the approach. It seems as though this extension uses a bunch of hardcoded weights separated into “bands”, though it initialises those weights on each pixel. I thought that was a little odd, so I dug a bit deeper into what it was doing by reimplementing it.

What was neat about this approach is that it is essentially a 2-pass gaussian blur, O(2(n * r)), except without the cost of having to recompute all those weights all the time. That’s handy, though since the weights are precomputed, its easy to run into a situation where you “run out” of weights if you want a bigger blur kernel.

Luckily, the author included a tool to compute weights up to a given blur radius. What was interesting is that the weights are put into “bands” for each stepping of 5 units on the radius. I presume that this was done to save on space inside the fragment shader since they are all hardcoded.

Now, the size of the computed weights array changes as you increase the radius, since that gaussian distribution function kind of “flattens out” and more area along the x-axis becomes significant. The original author dealt with this problem by hardcoding a size-31 array that only fills up once we get to the last “band” and assigning the each hardcoded weights array to it depending on which “band” your radius was in. Unfortunately, the notation the author uses for directly assigning the array isn’t implemented in OpenGL ES 2 (which is what WebGL uses). First I tried just brute-forcing the problem and using a regex to generate code that assigned the weights directly, but I realised that there is a much nicer alternative.

You can’t pass variable length arrays to shaders through uniforms in GLSL – I believe that doing so would probably violate the linker’s expectations as to memory layout etc. But you can pass handles to textures to sample instead!

Textures used to be this special datatype in OpenGL, but thanks to OES_texture_float and hacks where we treat textures as “memory”, they’re now useful as general purpose variable-length storage. So much so its how GPGPU in WebGL works  – the threejs people even have an example on how to use it. I once did a project where I used the depth buffer value to dynamically scale a blur radius – maybe that can be the subject of a future blog post.

So of course, what you can do to pass these gaussian weights down to the shader without having to hardcode or precompute them is to create a texture using GL_FLOAT as the storage type and set the red channel to the weight value. Then just sample the texture using texture2D inside the fragment shader to get the weight value for a given radius offset.

In my testing, such an implementation seemed to have pretty good performance even for very large blur kernels, however, it appears as though the sampling mechanism diverges from what I’m used to with gaussian blur  and we end up giving more weight to things outside the focal point of the pixel very quickly. Perhaps something to be looked into for next time.

 

Explaining Boyer-Moore

I had to implement the Boyer-Moore string search algorithm from scratch for an advanced algorithms class in 2015. It remember it took ages to get my head around the various concepts and I was so happy when I finally got it. So it was unsurprisingly disappointing that fast-forward to today, I’ve forgotten how it works.

Luckily, I’m on a bit of a mission to learn Rust and algorithms are a pretty good testing bed. So this article is partly for my benefit to summarise my own findings and partly for the benefit of the reader who might be scratching their head about how it works.

Naive Substring Search

To understand why such a big deal is made about Boyer-Moore, you need to appreciate why “naive” substring search scales so poorly.  “naive” substring search is basically exactly how you’d expect to implement a substring search:

char needle[] = "some word";
char haystack[] = "Why oh why must I always find the word, some word, any word";

size_t needle_len = strlen(needle);
size_t haystack_len = strlen(haystack);

for (size_t i = 0; i < (haystack_len - needle_len); ++i) {
    bool match = true;
    for (size_t j = 0; j < needle_len; ++j) {
        if (needle[i] != haystack[i + j]) {
            match = false;
            break;
        }
    }

    if (match) {
        return i;
    }
}

return -1;

This implementation of substring search is dutiful – it will try and match, from front to back, the needle at every position in the haystack. The problem is exactly that, it will try and match the needle at every single position in the haystack. In the worst case, it needs to check the length of the needle against the length of the haystack, haystack times. We say that this is O(nm).

If you’re just matching one or two characters against a very long body of text, its probably not too bad. But if you’re thinking of checking if a paragraph is contained a book, you’re gonna have a bad time.

If you visualize what is going on in the naive search, you’ll probably notice all the unnecessary work:

Why oh why must I always find the word, some word, any word
some word
 some word
  some word
   some word
    some word
     some word
      some word

The astute reader might see this and realize that in every case above, we’re just matching the letter “s” over and over again against cases that are futile anyway, because for the entire string to match, the last letter must also necessarily match and … it doesn’t. So lets handle that case now:

char needle[] = "some word";
char haystack[] = "Why oh why must I always find the word, some word, any word";

size_t needle_len = strlen(needle);
size_t haystack_len = strlen(haystack);

for (size_t i = 0; i < (haystack_len - needle_len); ++i) {
    bool match = true;
    for (size_t j = 0; j < needle_len; ++j) {
        /* Compare back to front */
        if (needle[(needle_len - 1) - j] != haystack[i + ((needle_len - 1) - j)]) {
            match = false;
            break;
        }
    }

    if (match) {
        return i;
    }
}

return -1;

Main difference here is that we compare back to front – so at least we’re catching all those futile mismatches early. While this is a little better, it is still technically O(nm) – the same problem could just happen in reverse, all the characters at the back could match, except for the first one. Now you’re back to square one.

Skipping Ahead

The impatient reader might wonder why we’re bothering to re-examine all those sub-patterns within the pattern we just failed. That alignment didn’t match, so why not check the next one instead? That would make the whole thing O(n), right?

Why oh why must I always find the word, some word, any word
some word
         some word
                  some word
                           some word
                                    some word
                                             some word
                                                  some word

Damn! We missed the match! The problem with this approach is that in skipping words you might partly overlap the match, then partly overlap another part of the match, but never actually see whole match. So this substring search approach is actually incorrect. What we’ll need to do “intelligently” figure out how many characters to skip ahead so that we’ll eventually align with the match in the haystack. And that is exactly what Boyer-Moore is about!

The Boyer-Moore algorithm defines two lookup tables, the bad character table and the good suffix table. The tl;dr of both of these tables is as follows:

  1. The bad character table tells you how many characters to shift the pattern forward if you want to align the character in the haystack string where the mismatch occurred with the closest character in the needle string such that the two will match. The theory goes that if there’s at least a character matching in the middle of the pattern, there’s some likelihood that the new alignment itself is a match.
  2. The good suffix table tells you how many characters to shift the pattern forward such that the substring into the pattern you just matched (eg, all the characters that did match up until the mismatch) occurs again. Generally speaking the good suffix table doesn’t come into play very much, but it is relevant in cases where you have repeating patterns within the needle string and you need to check the repeated pattern against the input string. Usually this comes up in the context of trying to match DNA strings, where you only have an alphabet of GATC.

That probably didn’t mean very much without examples, so lets start with the bad character rule. Look again at what we were doing when we tried to just blindly skip ahead by the text length:

Why oh why must I always find the word, some word, any word
some word
         some word
                  some word
                           some word
                                    some word
                                             some word
                                                  some word

If you’re observant, you’ll see that we got close, but we just missed out, specifically, here:

ord, some 
some word

We mismatched on the last character (remember, starting from the back!), d != e. But we know that the character ‘e’ occurs in our pattern! If we just shift the pattern forward such that the two ‘e’ characters align….

some word 
some word

Ah-hah! We have our match! So by using the bad character rule we were able to keep skipping words(*) all the way until we got close to the match, and then we just aligned with the match! That’s very very close to linear performance for something that was previously quadratic!

So where does the good suffix rule come in to it? Well, to illustrate this, I’ll have to use another example:

ababa abcbb abcab
abcab

Yes, this is slightly more contrived, but it demonstrates that the bad character rule on its own is not sufficient to ensure correct matching:

ababa abcbb abcab abccb
abcab
 abcab
      abcab
        abcab
         abcab
              abcab
                  abcab

Uh-oh, we missed it! So why did that happen? Well, we got close, but then this happened:

ababa abcbb abcab abccb
         abcab

“ab” matched “ab” at the end of the needle, which was a good start. But then “c” didn’t match “a”. Since ” ” does not occurr within our string, we skip forward by five characters (eg, the length of the needle) and completely miss the fact that we could have aligned “ab” at the start of the string with the “ab” that we just matched! The good suffix rule accounts for this. We kept a record of how far back in the string the same subpattern “ab” occurs again. In this case, it is three characters away, because we matched two characters, we shift forward by three:

ababa abcbb abcab abccb
            abcab

And we’re done!

Computing the tables

By far the most annoying part of implementing the algorithm the algorithm is computing the tables for the bad character rule and good suffix rule. Most of the content I’ve seen around this has literally been as useful as code snippits with cryptic variable names or just snapshots of the already-computed tables. I’m going to walk you through how the tables actually work, step by step.

Starting with the bad character rule table, recall what the bad character rule actually did:

Why oh why must I always find the word, some word, any word
some word
         some word
                  some word
                           some word        5 <- (shift for e from d)
                                    some word
                                         some word (shift by only 5)

We saw that we mismatched on ‘e’ at the last character in the needle. We know from there that the character ‘e’ is five characters before that last character, so we shift forward only five characters the next time. So it stands to reason that we’ll need to compute a 2D table – for every character in the alphabet and every position in the needle, how far back do we need to look in order to the character in the alphabet in the needle?Again, it isn’t the character in the needle at the given index that we’re interested in, its the character in the alphabet that we’re looking at (and we do it for each character in the alphabet). So yes, this is quite a large table:

    a b c d e f g h i j k l m n o p q r s t u v ... w
0 s
1 o                                     1
2 m                             1       2
3 e         1               1   2       3
4           2               2   3       4
5 w         3               3   4       5
6 o         4               4   5       6           1
7 r         5               5   6       7           2
8 d         6               6   7     1 8           3

Notice a convenient pattern? All we’re doing is figuring out how many characters backwards into the needle we need to go in order to reach that character in the alphabet. If we can’t find it, then no worries, we just don’t put an entry  there (or typically, we’d put some sentinel value like 0 or -1). In rust, you might have code that looks like this:

fn find_pending_character_index(chars: &Vec<char>, start: usize, needle: &char) -> usize {
    let len = chars.len();

    for i in (start +1)..len {
        if chars[i] == *needle {
            return i - start;
        }
    }

    return 0;
}
fn bad_character_table(pattern: &str) -> Vec<Vec<usize>> {
    /* Go backwards through the string for each character and work out how many characters
     * away that character is (backwards) */
    let mut table =vec![vec![(0asusize); pattern.len()]; ASCII_LOWER.len()];

    /* String gets reversed here */
    let chars: Vec<char>= pattern.chars().rev().collect();
    let len = chars.len();

    /* Enumerate through the characters, back to front (idx here starts
     * at zero, but think of it as (len - 1) - idx */
    for (idx, _) in chars.iter().enumerate() {
        for (alphabet_index, c) in ASCII_LOWER.iter().enumerate() {
            /* How many characters from this one until we reach the end
             * or see this character again */
            let res = find_pending_character_index(&chars, idx, c);
            /* Since idx was zero-index, we invert it here, insert
             * number of steps backwards we have to go to find this
             * character in the alphabet */
            table[alphabet_index][(len - 1) - idx] = res;
        }
    }

    return table;
}

Okay, now for the good suffix table. Thankfully, it is much smaller than the bad character table, but it is a little more complicated to generate. Recall what the good suffix rule was doing:

ababa abcbb abcab abccb
abcab
 abcab
      abcab
        abcab
         abcab <- (shift forward by 3 to align subpattern)
            abcab (shifted only by 3, not by 5)

So essentially we need to keep track of, for each suffix how many characters we can look backwards in the needle to find the same suffix again. Simple enough to say it, but how do we actually implement something like that?

Well, the best way to explain it is that for each suffix, we have a sliding window starting from one character before which is the same size of the suffix. We scan both the substring pointed to by the window and the suffix to see if there is a match, and if so, we take note of how many characters away our window was:

abcab
    b (suffix)
   _ (window)
  _
 b (suffix appeared again, 4 characters away)

Same thing with the “ab” suffix…

abcab
   ab (suffix)
  __
 __
ab (suffix appeared again, 3 characters away)

So our overall good suffix table will look like this:

0 1 2 3 4
a b c a b
---------
      3 4

What does that look like in implementation? Well, something like this:

fn good_suffix_table(pattern: &str) -> Vec<usize> {
    /* For each character in the string, take the subpattern from char[i]..char[len]
     * and see if such a subpattern exists elsewhere in the string, moving the window
     * back one by one. This is an O(N^2) operation over the pattern, since you have
     * to check the subpattern over each window from the current position. */
    let chars: Vec<char>= pattern.chars().rev().collect();
    let mut table =vec![(0asusize); chars.len()];

    for idx in 0..chars.len() {
        /* Add one here since the outer loop was exclusive, we do not want the inner
         * loop to be exclusive of the last index */
       for candidate in 1..((chars.len() +1) - idx) {
           /* Checking two empty subpatterns against each other doesn't make
            * much sense - we want a full shift in that case */
           if idx > 0 && matching_subpatterns(&chars, 0, candidate, idx) {
               table[(chars.len() - 1) - idx] = candidate;
               break;
           }
       }
   }

    return table;
}

Putting it all together

So now that we went to all that effort to construct our good suffix and bad character tables, how do we make use of them in the overall algorithm? Well, it is remarkably simple:

fn boyer_moore(input: &str,
               pattern: &str,
               bad_character: &Vec<Vec<usize>>,
               good_suffix: &Vec<usize>) -> Option<usize> {
    /* Some obvious cases */
    let pattern_chars: Vec<char>= pattern.chars().collect();
    let input_chars: Vec<char>= input.chars().collect();

    if pattern_chars.len() > input_chars.len() {
        return None;
    }

    /* Align at the first character */
    let mut alignment = 0;
    let max_alignment = input.len() - pattern.len();

    loop {
        let mut mismatch =false;

        for (pattern_index, i) in (alignment..(alignment + pattern_chars.len())).enumerate().rev() {
            if input_chars[i] != pattern_chars[pattern_index] {
               mismatch = true;

               /* If we're on the last alignment, return now,
                * as there are no matches */
               if alignment == (input.len() - pattern.len()) {
                  return None;
               }

               /* Mismatch occurred here, look up this character/index pair
                * in the good suffix and bad character tables to see how far
                * forward to shift. If the shift amount is zero, shift
                * forward by the entire pattern size */
               let charidx = (input_chars[i] as usize) - ('a' as usize);
               /* Shift amount is the higher of either the bad character
                * rule or the good suffix rule */
               let shift = std::cmp::max(bad_character[charidx][pattern_index],
                                         good_suffix[pattern_index]);

               if shift == 0 {
                   /* No shift found, shift as far as we can, either to
                    * the end or by the length of the needle */
                   alignment = std::cmp::min(max_alignment,
                                             alignment + pattern_chars.len());
               } else {
                   alignment = std::cmp::min(max_alignment, alignment + shift);
               }

               break;
            }
        }
    }

    /* No mismatches, return the current alignment */

    if !mismatch {
        return Some(alignment);
    }
}

Essentially what we’re doing is starting at the beginning and matching from back to front. If we find a mismatch, we record which character we found a mismatch on and look up both the input character and the mismatch character index in the pattern in the good suffix and bad character tables. If those tell us how far to shift, we shift by that amount. Otherwise, we shift forward by the full pattern length.

And that, is how you get super fast linear time string matching!

A better way to think about Quicksort

Recently I’ve been learning haskell, a purely functional programming language. I still don’t really understand why I’m doing that; I think functional style is great and it makes a lot of sense to divide programs up into mostly atomic units, composable at the domain level. In a functional style, you generally try to put the majority of your code into imdepotent functions whose outputs are strictly defined by their inputs. The function itself is kind of a black box and checks the “functional style” checkbox as long as there’s no observable side effects. Imperative shell, functional core. I find that programs that use a functional style is a lot easier to reason about, to test and to re-use. Purely functional programming on the other hand always seemed like overkill to me, sacrificing comprehensibility for the sake of functional programming principles.

So what does this have to do with Quicksort? Well, it turns out that Chapter 5 of Learn You a Haskell has a very motivating example:

quicksort :: (Ord a) => [a] -> [a]  
quicksort [] = []  
quicksort (x:xs) =  
    let smallerSorted = quicksort [a | a <- xs, a <= x] 
        biggerSorted = quicksort [a | a <- xs, a > x] 
    in  smallerSorted ++ [x] ++ biggerSorted

Now I appreciate that haskell’s syntax is a little obtuse, so lets rephrase that in another more widely known language:

def quicksort(elements):
    if elements:
        pivot = elements[0]
        smaller = quicksort([e for e in elements[1:] if e =< pivot])
        larger = quicksort([e for e in elements[1:] if e > pivot])
        return smaller + [pivot] + larger

    return []

When I first saw this example it was like a lightbulb went off in my head. Quicksort finally made sense. It wasn’t this weird looking algorithm that just happened to work and have decent time complexity for some reason. See, if you ever took a Data Structures and Algorithms course at university, you were probably taught (in-place) Quicksort like this:

quicksort(elements, left, right) {
    if (left == right) {
        return
    }

    pivot = partition(elements, left, right)
    quicksort(elements, left, pivot)
    quicksort(elements, pivot, right)
}

Which probably lead to the first “WTF” – in my case – “WTF does a partition and a pivot have to do with sorting?”

partition(elements, left, right) {
    cmp = elements[left];
    pivot = left + 1;

    for (i = (left + 1); i < right; ++i) {
        if (elements[i] < cmp) {
            tmp = elements[pivot];
            elements[pivot] = elements[i];
            elements[i] = elements[pivot];

            pivot++;
        }
    }

    tmp = elements[pivot];
    elements[pivot] = cmp;
    elements[left] = tmp;

    return pivot - 1;
}

At which point 95% of people in the class would think be thinking “WTF????????”

And then the prof will usually open their mouth and say something like “so first you pick a pivot” and then everyone in the room will immediately switch off and just pray that they can rote-learn their way through the upcoming test and forget about it.

Really, the functional way of looking at it makes far more sense.

But hang on, you say, “my textbook said that Quicksort was O(n log n) in the best case and O(n^2) in the worst case – why would you want to use it over merge sort when merge sort is guaranteed to be O(n log n) and both methods require concatenating lists together?”

That’s a good point. The purely functional style doesn’t allow for inner mutability of the array, so you can’t exactly sort in place. But it is a good way to understand what on earth is going on with the in-place sort. But to understand that, first you need to understand why the recurrent version of quicksort works and how it is related to merge sort.

If you’ve seen the recurrent version of merge sort, it looks sort of similar to the recurrent version of quicksort:

def mergesort(elements):
    if len(elements) <= 1:
        return elements;

    midway = floor(len(elements) / 2)
    left = mergesort(elements[0:midway])
    right = mergesort(elements[midway:1])

    merged = []

    leftIdx = 0
    rightIdx = 0
    while leftIdx < len(left) or rightIdx < len(right):
        if leftIdx < len(left) and left[leftIdx] < right[rightIdx]:
            merged.append(left[leftIdx])
            leftIdx += 1
        else if rightIdx < len(right):
            merged.append(right[rightIdx])
            rightIdx += 1

    return merged

An important property of comparison sorts is that lower time complexity bound is n log n. That is to say that in order to provide some sort of operation which gets the array into a “more sorted” state than it was before, you need to evaluate all the elements and at best you’ll probably have to do that operation about log n times. So what is the incremental thing that merge-sort is doing here? Well, it looks at two chunks of an array and once it is done with its O(n) operation, it guarantees the inner ordering of that chunk, so if you were looking at [1, 5] and [4, 7], the result would be [1,4,5,7]. But it doesn’t say anything about the ordering of the resultant chunk in relation to other chunks in the array, nor does it guarantee that that chunk contains all the elements in within the range 1-7 that also appear in the array. All that said, the more you keep comparing two separate chunks and guaranteeing their inner ordering, the more “sorted” the array becomes. Because we kept halving the array, it stands to reason that if you do this about log n times, you’ll eventually end up with a sorted array.

If you compare the two recurrent equations, you’ll realise that quicksort is the same thing but in reverse. Quicksort guarantees that the outer ordering between two chunks is maintained – each number in the right hand side will always be greater than each number in the left hand side. But it doesn’t say anything about the inner ordering of those two chunks. All that said, if you keep going and doing the same thing on each of the two chunks, you’ll eventually get to the point where there is no outer ordering to be maintained and thus each chunk has an inner ordering. Again, in the best case, you’ll probably be halving the array, so you can recursively maintain outer-ordering between two chunks in log n time (although in the worst case, you’ll be splitting the array into two parts – 1 and n – 1 elements, at which point you have to maintain outer-ordering n times).

Okay, so what does this have to do with the overly-complicated looking in-place partition function? Well, imagine there was some atomic function that could take an array and make it outer ordered, such that all the elements in one part were always greater than all the elements in some other part just by looking at each element in the list once. That’s what partition does. All the elements less than a certain value get swapped over to the left hand side and all the elements greater than it get swapped over to the right hand side. Then you just put the centrepiece back in place and you’re done.

I still think the functional style is a lot easier to understand though. Maybe purely functional programming isn’t so bad after all.