In a previous entry I promised to talk about the cases where it makes sense to use your own implementation of a sorting algorithm. I will, but first I want to illustrate how badly twisted it can get.
When I say “fail horribly” I don’t mean the case of “not making it work” but the case of “making it work… until it happens to fail some day”. The latter is much worse.
Ignoring the fine print is a mistake. Grabbing a book on algorithms and implementing your own version of algorithm X without reading the full chapter can be a huge mistake.
Let’s review, as a practical case, the mistakes commonly made while implementing quicksort:
- To leave the pivot inside one of the parts (possible infinite recursion).
- To use always the first element as pivot (or always the last).
- To believe that using as pivot the median of 3 or a random element avoids completely the worst case.
- To make two recursive calls instead of using tail recursion.
- To use tail recursion, but always on the same side.
- To believe that using tail recursion solves all problems. It reduces the memory required in the worst case, but not the time.
- To leave all elements equal to the pivot on one side. If you do this, quicksort will show its worst behaviour also in the case where all elements (or many) are equal.
I’ll explain the previous issues one by one:
1. To leave the pivot inside one of the parts
While partitioning, it’s good to leave the pivot in its final place, and not inside one of the parts to be sorted. This might look like a small, unimportant optimization, but it isn’t.
If we leave the pivot inside one of the parts to be sorted, and this part fatally happens to be the whole array to be sorted (being the other part empty), then we provoke an infinite recursion.
Several details determine the impact of this issue, like the choice of the pivot or whether the elements equal to the pivot are distributed or left on one side. Though, the only way to completely solve this problem is to exclude the pivot from both parts, leaving it placed in its final position. This way we guarantee that every recursive call reduces the problem size by at least one element.
2. To use always the first element as pivot (or always the last)
This would be less important if all possible permutations happened with the same probability. But reality is different. The special cases of “already sorted data” (either in straight order or reverse) or “almost sorted data” are quite frequent in comparison with other permutations. And in these cases, the first and last elements are the worst candidates we can choose as pivot.
If we use always the first element (or always the last) as pivot, quicksort degenerates into its worst behaviour with already sorted data, no matter whether they are in ascending or descending order.
This problem is usually attacked in two different ways:
- To use as pivot the median of three elements: the fist, the last and the middle one
- To use as pivot the element of a position chosen at random
Both methods reduce the probability of the worst case, but they can’t eliminate it completely.
3. To believe that using as pivot the median of 3 or a random element avoids completely the worst case
The worst case can still happen. It might be very unlikely, but it’s still possible. This is the kind of problem that goes through testing undetected and shows up later, when the consequences are serious.
It’s not only a question of probability. An attacker might feed our system with data arranged exactly in the way that provokes quicksort’s worst behaviour.
We must take into account that the worst case can happen. It’s quite convenient that tests exercise it in order to evaluate the consequences regarding the required time and additional memory.
4. To make two recursive calls instead of using tail recursion
Tail recursion is implemented replacing one of the recursive calls with a loop. This way we slightly reduce the execution time.
It looks like a simple optimization, but in reality this is necessary to reduce the additional memory to O(log N), even in the worst case. Though, it is not enough to simply use tail recursion in any way. I’ll explain why.
5. To use tail recursion, but always on the same side
If tail recursion is not used, or if it is used always on the same side, the required additional space in the worst case will be in the order of O(N), that is, proportional to the number of elements to be sorted. This can provoke a stack overflow.
The reason is that, in the worst case, up to O(N) recursive calls will be nested, and every one of them takes some memory.
The only way to solve this is to use tail recursion in the recursive call that corresponds to the largest part. This way we ensure that only O(log N) recursive calls will be nested.
6. To believe that using tail recursion solves all problems
Tail recursion, used correctly, solves the space problem. But there are still other problems. The time of the worst case is still O(N2) . As I’ve already pointed out, the choice of the pivot can help in making this case very unlikely in practice. If “very unlikely” is not enough, then we must resort to some other algorithm that takes O(N log N) time also in the worst case, even though the average case takes several times quicksort’s average time.
But this is not the end. There’s still a subtle detail that can turn this “very unlikely” in “embarrassingly common”:
7. To leave all elements equal to the pivot on one side
While partitioning, what should we do with the elements that happen to be equal to the pivot? At first sight this looks like an unimportant detail. We can leave them in any of the two sides. The algorithm will sort properly. In addition, there won’t be too many elements equal to the pivot, will they? Oh! Wait a moment… Maybe there are lots of them!
When we talk about sorting algorithms we tend to think of arrays full of different values. But in real life if happens quite often that many elements have the same value. Suppose that we need to sort the data of 47 million citizens according to their marital status (single, married…). After a couple of recursion levels, chances are that all, or almost all the elements of the fragment to be sorted are equal regarding marital status. What will happen then.
The probability of choosing a pivot with that repeated value is very high. Therefore, many elements will be equal to the pivot. If we put them all at the same side we will have an inefficient partition. The worst partition indeed! Quicksort will exhibit its worst behaviour in a case that, in principle, seemed absurdly easy. This is most certainly the biggest problem because algorithms textbooks don’t emphasize it enough, if they ever mention it.
The solution is to distribute the elements equal to the pivot among both parts, even though it might seem at first sight that we are moving data around unnecessarily. You can find an implementation here.
However, I insist: the best option is almost always to use the sorting method provided by the programming language in its functions library ;-)