Select the k-th smallest item from an unordered array in O(n) in python

This blog contains two python implementations of the SELECT algorithm to select the k-th ordered item from an unsorted array of N items in O(N) time. They derive from the discussion of RANDOM-SELECT and SELECT in the CLRS “Introduction to Algorithms, 3rd Edition” on pages 215-222 and from Professor David Eppstein’s lecture notes entitled “ICS 161: Design and Analysis of Algorithms” from January 30th, 1996.

Contents

Overview

Both implementations use a divide and conquer approach to recursively partition the input array into smaller and smaller chunks around a pivot point where each chunk contains data that honors an ordering relation to the pivot value until the k-th item matches the pivot. There are a lot of great discussions about how this works on the web and in text books (like CLRS). My goal here is not to try to improve on those discussions (I can’t). Instead, it is to simply show two working implementations.

The first algorithm (select_random_pivot), uses a random number to select the pivot point. The second algorithm (select_median_of_medians_pivot) uses a more deterministic approach to find a better approximation of the median.

The select_random_pivot algorithm has an average run time of O(N) but a worst case run time of O(N^2) in vanishingly rare cases. The select_median_of_medians_pivot algorithm has an average and worst case run time of O(N).

These implementations were written to help me understand how the algorithms work. They were not designed for production use. For example, they do not do any bounds checking and they create sub arrays during recursion instead of operating on a single array in place which would probably be substantially faster but manipulating indices would, in my opinion, detract from the readability.

For a production environment you would want to create a several implementations, test them against characteristic datasets and choose the one that performs the best. When I tested these two implementations on a series of datasets, the select_random_pivot function won because it was at least 2X faster in every case.

To get a feel for how these algorithms might be used, I have included an example that shows how to get the top n ordered items from an unordered array in O(N) time using the get_top_n_items function.

I hope that you find this helpful.

Implementation of select_random_pivot

The first implementation randomly chooses the pivot point. It is an implementation of the random select algorithm as described in CLRS “Introduction to Algorithms, 3rd Edition” on page 216 modified to partition the items in separate arrays.

The average run time is O(N) and the worst case run time isO(N^2) where N is the number of entries in the array but the probability of the worst case occurring in practice is vanishingly small.

Implementation of select_median_of_medians_pivot

This implementation chooses a pivot point deterministically to reduce the worst case run time to O(N). It is an implementation of the Blum, Floyd, Pratt, Rivest, Tarjan SELECT algorithm as described by David Eppstein. This algorithm is also discussed in CLRS “Introduction to Algorithms, 3rd Edition” on page 220.

This algorithm has worst case run time of O(N) where N is the number of entries in the array.

Although this algorithm has better worst case performance than select_random_pivot, that algorithm is preferred because it is much faster in practice.

Implementation of get_top_n_items

This function shows how to get the top n ordered items from an unordered array using the select_random_pivot function. It demonstrates how to use the SELECT algorithms described here.

3 thoughts on “Select the k-th smallest item from an unordered array in O(n) in python”

  1. Updated today to change “select_fct” to the correct function name based on a private comment that noted the use of the undefined “select_fct”. It was a typo.

Leave a Reply