{"id":977,"date":"2013-09-10T15:33:46","date_gmt":"2013-09-10T15:33:46","guid":{"rendered":"http:\/\/joelinoff.com\/blog\/?p=977"},"modified":"2017-03-13T11:32:24","modified_gmt":"2017-03-13T18:32:24","slug":"select-the-k-th-smallest-item-from-an-unordered-array-in-on-in-python","status":"publish","type":"post","link":"https:\/\/joelinoff.com\/blog\/?p=977","title":{"rendered":"Select the k-th smallest item from an unordered array in O(n) in python"},"content":{"rendered":"<p>This blog contains two python implementations of the SELECT algorithm to select the k-th ordered item from an unsorted array of N items in O(N) time. They derive from the discussion of RANDOM-SELECT and SELECT in the CLRS &#8220;Introduction to Algorithms, 3rd Edition&#8221; on pages 215-222 and from Professor David Eppstein&#8217;s <a href=\"http:\/\/www.ics.uci.edu\/~eppstein\/161\/960130.html\">lecture notes<\/a> entitled &#8220;ICS 161: Design and Analysis of Algorithms&#8221; from January 30th, 1996.<br \/>\n<!--more--><\/p>\n<h1>Contents<\/h1>\n<ul>\n<li><a href=\"#overview\">Overview<\/a><\/li>\n<li><a href=\"#select_random_pivot\">Implementation of select_random_pivot<\/li>\n<li><a href=\"#select_median_of_medians_pivot\">Implementation of select_median_of_medians_pivot<\/i>\n<li><a href=\"#get_top_n_items\">Implementation of get_top_n_items<\/i>\n<\/ul>\n<p><a name=\"overview\"><\/a><\/p>\n<h1>Overview<\/h1>\n<p>Both implementations use a divide and conquer approach to recursively partition the input array into smaller and smaller chunks around a pivot point where each chunk contains data that honors an ordering relation to the pivot value until the k-th item matches the pivot. There are a lot of great discussions about how this works on the web and in text books (like CLRS). My goal here is not to try to improve on those discussions (I can&#8217;t). Instead, it is to simply show two working implementations.<\/p>\n<p>The first algorithm (<a href=\"#select_random_pivot\">select_random_pivot<\/a>), uses a random number to select the pivot point. The second algorithm (<a href=\"#select_median_of_medians_pivot\">select_median_of_medians_pivot<\/a>) uses a more deterministic approach to find a better approximation of the median.<\/p>\n<p>The select_random_pivot algorithm has an average run time of O(N) but a worst case run time of O(N^2) in vanishingly rare cases. The select_median_of_medians_pivot algorithm has an average and worst case run time of O(N).<\/p>\n<p>These implementations were written to help me understand how the algorithms work. They were not designed for production use. For example, they do not do any bounds checking and they create sub arrays during recursion instead of operating on a single array in place which would probably be substantially faster but manipulating indices would, in my opinion, detract from the readability.<\/p>\n<p>For a production environment you would want to create a several implementations, test them against characteristic datasets and choose the one that performs the best. When I tested these two implementations on a series of datasets, the select_random_pivot function won because it was at least 2X faster in every case.<\/p>\n<p>To get a feel for how these algorithms might be used, I have included an example that shows how to get the top n ordered items from an unordered array in O(N) time using the <a href=\"#get_top_n_items\">get_top_n_items<\/a> function.<\/p>\n<p>I hope that you find this helpful.<\/p>\n<p><a name=\"select_random_pivot\"><\/a><\/p>\n<h1>Implementation of select_random_pivot<\/h1>\n<p>The first implementation randomly chooses the pivot point. It is an implementation of the random select algorithm as described in CLRS &#8220;Introduction to Algorithms, 3rd Edition&#8221; on page 216 modified to partition the items in separate arrays.<\/p>\n<p>The average run time is O(N) and the worst case run time isO(N^2) where N is the number of entries in the array but the probability of the worst case occurring in practice is vanishingly small.<\/p>\n<p>[crayon lang=&#8221;python&#8221; toolbar=&#8221;always&#8221; title=&#8221;select_random_pivot&#8221;]<br \/>\ndef select_random_pivot(array, k):<br \/>\n    &#8221;&#8217;<br \/>\n    Implementation of the random select algorithm as described in CLRS<br \/>\n    &#8220;Introduction to Algorithms, 3rd Edition&#8221; on page 216 modified to<br \/>\n    partition the items in separate lists.<\/p>\n<p>    This algorithm has a worst case run time of O(N^2) where N is the<br \/>\n    number of entries in the array but the probability of that occuring<br \/>\n    is vanishingly small. The average case is O(N). It is normally much<br \/>\n    faster than select_median_of_medians_pivot.<\/p>\n<p>    Here is how you might use it:<\/p>\n<p>        # Create a list of pseudo random numbers.<br \/>\n        # Duplicates can occur.<br \/>\n        num = 10000<br \/>\n        array = [random.randint(1,1000) for i in range(num)]<br \/>\n        random.shuffle(array)<br \/>\n        random.shuffle(array)<\/p>\n<p>        # Get the value of the kth item.<br \/>\n        k = 7<br \/>\n        kval = select_random_pivot(array, k)<\/p>\n<p>        # Test it.<br \/>\n        sorted_array = sorted(array)<br \/>\n        assert sorted_array[k] == kval<\/p>\n<p>    @param array the list of values<br \/>\n    @param k     k-th item to select.<br \/>\n    @returns the k-th item<br \/>\n    &#8221;&#8217;<\/p>\n<p>    # If the array is short, terminate the recursion and return the<br \/>\n    # value without partitioning.<br \/>\n    if len(array) <= 10:\n        array.sort()\n        return array[k]\n\n    # Randomly choose a pivot point.\n    # In vanishingly rare cases this can result in O(N^2).\n    pivot_idx = random.randint(0, len(array) - 1)\n    pivot = array[pivot_idx]  # pivot point value (not index)\n\n    # Now select recursively using the pivot.\n    # At this point we have the pivot. Use it to partition the input\n    # array into 3 categories: items that are less than the pivot\n    # value (array_lt), items that are greater than the pivot value\n    # (array_gt) and items that exactly equal to the pivot value\n    # (equals_array).\n    array_lt = []\n    array_gt = []\n    array_eq = []\n    for item in array:\n        if item < pivot:\n            array_lt.append(item)\n        elif item > pivot:<br \/>\n            array_gt.append(item)<br \/>\n        else:<br \/>\n            array_eq.append(item)<\/p>\n<p>    # The array values have been partitioned according to their<br \/>\n    # relation to the pivot value. The partitions look like this:<br \/>\n    #<br \/>\n    #   +&#8212;+&#8212;+&#8212;+&#8230;+&#8212;+&#8212;+&#8212;+&#8230;+&#8212;+&#8212;+&#8212;+&#8230;<br \/>\n    #   | 0 | 1 | 2 |   | e |e+1|e+2|   | g |g+1|g+2|<br \/>\n    #   +&#8212;+&#8212;+&#8212;+&#8230;+&#8212;+&#8212;+&#8212;+&#8230;+&#8212;+&#8212;+&#8212;+&#8230;<br \/>\n    #      array_lt        array_eq       array_gt<br \/>\n    #<br \/>\n    # If the value of k is in the range [0..e) then we know that<br \/>\n    # the desired value is in array_lt so we need to recurse.<br \/>\n    #<br \/>\n    # If the value of k in the range [e..g) then we know that the<br \/>\n    # desired value is in array_eq and we are done.<br \/>\n    #<br \/>\n    # If the value of k is >= g then we the desired value is in<br \/>\n    # array_gt and we need to recurse but we also have to make sure<br \/>\n    # that k is normalized with respect to array_gt so that it has the<br \/>\n    # proper offset in the recursion. We normalize it by subtracting<br \/>\n    # len(array_lt) and len(array_eq).<br \/>\n    #<br \/>\n    if k < len(array_lt):\n        return select_random_pivot(array_lt, k)\n    elif k < len(array_lt) + len(array_eq):\n        return array_eq[0]\n    else:\n        normalized_k = k - (len(array_lt) + len(array_eq))\n        return select_random_pivot(array_gt, normalized_k)\n[\/crayon]\n\n<a name=\"select_median_of_medians_pivot\"><\/a><\/p>\n<h1>Implementation of select_median_of_medians_pivot<\/h1>\n<p>This implementation chooses a pivot point deterministically to reduce the worst case run time to O(N). It is an implementation of the Blum, Floyd, Pratt, Rivest, Tarjan SELECT algorithm as described by David Eppstein. This algorithm is also discussed in CLRS &#8220;Introduction to Algorithms, 3rd Edition&#8221; on page 220.<\/p>\n<p>This algorithm has worst case run time of O(N) where N is the number of entries in the array.<\/p>\n<p>Although this algorithm has better worst case performance than <a href=\"#select_random_pivot\">select_random_pivot<\/a>, that algorithm is preferred because it is much faster in practice.<\/p>\n<p>[crayon lang=&#8221;python&#8221; toolbar=&#8221;always&#8221; title=&#8221;select_median_of_medians_pivot&#8221;]<br \/>\ndef select_median_of_medians_pivot(array, k):<br \/>\n    &#8221;&#8217;<br \/>\n    Implementation of the Blum, Floyd, Pratt, Rivest, Tarjan SELECT<br \/>\n    algorithm as described by David Eppstein.<\/p>\n<p>    CITATION: http:\/\/www.ics.uci.edu\/~eppstein\/161\/960130.html<\/p>\n<p>    This algorithm has worst case run time of O(N) where N is the<br \/>\n    number of entries in the array.<\/p>\n<p>    Although this algorithm has better worst case performance than<br \/>\n    select_random_pivot(), that algorithm is preferred because it<br \/>\n    is much faster in practice.<\/p>\n<p>    Here is how you might use it:<\/p>\n<p>        # Create a list of pseudo random numbers.<br \/>\n        # Duplicates can occur.<br \/>\n        num = 10000<br \/>\n        array = [random.randint(1,1000) for i in range(num)]<br \/>\n        random.shuffle(array)<br \/>\n        random.shuffle(array)<\/p>\n<p>        # Get the value of the kth item.<br \/>\n        k = 7<br \/>\n        kval = select_median_of_medians_pivot(array, k)<\/p>\n<p>        # Test it.<br \/>\n        sorted_array = sorted(array)<br \/>\n        assert sorted_array[k] == kval<\/p>\n<p>    @param array the list of values<br \/>\n    @param k     k-th item to select.<br \/>\n    &#8221;&#8217;<\/p>\n<p>    # If the array is short, terminate the recursion and return the<br \/>\n    # value without partitioning.<br \/>\n    if len(array) <= 10:\n        array.sort()\n        return array[k]\n\n    # Partition the array into subsets with a maximum of 5 elements\n    # each.\n    subset_size = 5  # max items in a subset\n    subsets = []  # list of subsets\n    num_medians = len(array) \/ subset_size\n    if (len(array) % subset_size) > 0:<br \/>\n        num_medians += 1  # not divisible by 5<br \/>\n    for i in range(num_medians):<br \/>\n        beg = i * subset_size<br \/>\n        end = min(len(array), beg + subset_size)<br \/>\n        subset = array[beg:end]<br \/>\n        subsets.append(subset)<\/p>\n<p>    # Find the medians in each subset.<br \/>\n    # Note that it calls select_median_of_medians_pivot() recursively taking<br \/>\n    # advantage of the fact that for len(array) <= 10, the select\n    # operation simply sorts the array and returns the k-th item. This\n    # could be done here but since the termination condition is\n    # required to get an infinite loop we may as well use it.\n    medians = []  # list of medians\n    for subset in subsets:\n        median = select_median_of_medians_pivot(subset, len(subset)\/2)\n        medians.append(median)\n\n    # Now get the median of the medians recursively.\n    # Assign it to the local pivot variable because\n    # the pivot handling code is the same regardless\n    # of how it was generated. See select_random_pivot() for\n    # a different approach for generating the pivot.\n    median_of_medians = select_median_of_medians_pivot(medians, len(medians)\/2)\n    pivot = median_of_medians  # pivot point value (not index)\n\n    # Now select recursively using the pivot.\n    # At this point we have the pivot. Use it to partition the input\n    # array into 3 categories: items that are less than the pivot\n    # value (array_lt), items that are greater than the pivot value\n    # (array_gt) and items that exactly equal to the pivot value\n    # (equals_array).\n    array_lt = []\n    array_gt = []\n    array_eq = []\n    for item in array:\n        if item < pivot:\n            array_lt.append(item)\n        elif item > pivot:<br \/>\n            array_gt.append(item)<br \/>\n        else:<br \/>\n            array_eq.append(item)<\/p>\n<p>    # The array values have been partitioned according to their<br \/>\n    # relation to the pivot value. The partitions look like this:<br \/>\n    #<br \/>\n    #   +&#8212;+&#8212;+&#8212;+&#8230;+&#8212;+&#8212;+&#8212;+&#8230;+&#8212;+&#8212;+&#8212;+&#8230;<br \/>\n    #   | 0 | 1 | 2 |   | e |e+1|e+2|   | g |g+1|g+2|<br \/>\n    #   +&#8212;+&#8212;+&#8212;+&#8230;+&#8212;+&#8212;+&#8212;+&#8230;+&#8212;+&#8212;+&#8212;+&#8230;<br \/>\n    #      array_lt        array_eq       array_gt<br \/>\n    #<br \/>\n    # If the value of k is in the range [0..e) then we know that<br \/>\n    # the desired value is in array_lt so we need to recurse.<br \/>\n    #<br \/>\n    # If the value of k in the range [e..g) then we know that the<br \/>\n    # desired value is in array_eq and we are done.<br \/>\n    #<br \/>\n    # If the value of k is >= g then we the desired value is in<br \/>\n    # array_gt and we need to recurse but we also have to make sure<br \/>\n    # that k is normalized with respect to array_gt so that it has the<br \/>\n    # proper offset in the recursion. We normalize it by subtracting<br \/>\n    # len(array_lt) and len(array_eq).<br \/>\n    #<br \/>\n    if k < len(array_lt):\n        return select_median_of_medians_pivot(array_lt, k)\n    elif k < len(array_lt) + len(array_eq):\n        return array_eq[0]\n    else:\n        normalized_k = k - (len(array_lt) + len(array_eq))\n        return select_median_of_medians_pivot(array_gt, normalized_k)\n[\/crayon]\n\n<a name=\"get_top_n_items\"><\/a><\/p>\n<h1>Implementation of get_top_n_items<\/h1>\n<p>This function shows how to get the top n ordered items from an unordered array using the <a href=\"#select_random_pivot\">select_random_pivot<\/a> function. It demonstrates how to use the SELECT algorithms described here.<\/p>\n<p>[crayon lang=&#8221;python&#8221; toolbar=&#8221;always&#8221; title=&#8221;get_top_n_items&#8221;]<br \/>\ndef get_top_n_items(array, n):<br \/>\n    &#8220;&#8221;&#8221; Get the top n items. &#8220;&#8221;&#8221;<br \/>\n    nth = select_random_pivot(array, n)  # avg case: O(N), WC: O(N^2)<br \/>\n    top_items = []<\/p>\n<p>    # Start by getting all of the values less than the top value<br \/>\n    # (O(N)). This guarantees that we don&#8217;t miss some because there<br \/>\n    # are duplicate nth values.<br \/>\n    for value in array:<br \/>\n        if value < nth:\n            top_items.append(value)\n            if len(top_items) == n:\n                return top_items  # all done\n\n    # Now get the remaining values.\n    # This is where duplicates are handled.\n    for value in array:\n        if value == nth:\n            top_items.append(value)\n            if len(top_items) == n:\n                return top_items  # all done\n\n    # We should never reach this point.\n    assert len(top_items) == n\n[\/crayon]\n\n<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog contains two python implementations of the SELECT algorithm to select the k-th ordered item from an unsorted array of N items in O(N) time. They derive from the discussion of RANDOM-SELECT and SELECT in the CLRS &#8220;Introduction to Algorithms, 3rd Edition&#8221; on pages 215-222 and from Professor David Eppstein&#8217;s lecture notes entitled &#8220;ICS &hellip; <a href=\"https:\/\/joelinoff.com\/blog\/?p=977\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Select the k-th smallest item from an unordered array in O(n) in python<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[37,5,7],"tags":[],"class_list":["post-977","post","type-post","status-publish","format-standard","hentry","category-algorithms","category-programming","category-python"],"_links":{"self":[{"href":"https:\/\/joelinoff.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/977","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/joelinoff.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/joelinoff.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/joelinoff.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/joelinoff.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=977"}],"version-history":[{"count":18,"href":"https:\/\/joelinoff.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/977\/revisions"}],"predecessor-version":[{"id":1753,"href":"https:\/\/joelinoff.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/977\/revisions\/1753"}],"wp:attachment":[{"href":"https:\/\/joelinoff.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=977"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/joelinoff.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=977"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/joelinoff.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=977"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}