Efficient way to maintain a sorted list of access counts in Python

Let's say I have a list of objects. (All together now: "I have a list of objects.") In the web application I'm writing, each time a request comes in, I pick out up to one of these objects according to unspecified criteria and use it to handle the request. Basically like this:

def handle_request(req):
    for h in handlers:
        if h.handles(req):
            return h
    return None

Assuming the order of the objects in the list is unimportant, I can cut down on unnecessary iterations by keeping the list sorted such that the most frequently used (or perhaps most recently used) objects are at the front. I know this isn't something to be concerned about - it'll make only a minuscule, undetectable difference in the app's execution time - but debugging the rest of the code is driving me crazy and I need a distraction :) so I'm asking out of curiosity: what is the most efficient way to maintain the list in sorted order, descending, by the number of times each handler is chosen?

The obvious solution is to make handlers a list of [count, handler] pairs (lists rather than tuples, so the count can be incremented in place), and each time a handler is chosen, increment the count and re-sort the list.

def handle_request(req):
    for h in handlers[:]:
        if h[1].handles(req):
            h[0] += 1
            # sort on the count alone, so handler objects are never compared
            handlers.sort(key=lambda pair: pair[0], reverse=True)
            return h[1]
    return None

But since there's only ever going to be at most one element out of order, and I know which one it is, it seems like some sort of optimization should be possible. Is there something in the standard library, perhaps, that is especially well-suited to this task? Or some other data structure? (Even if it's not implemented in Python) Or should/could I be doing something completely different?


Python's sort algorithm, timsort, is pretty magical: if your list is sorted except for one element, it will intrinsically (discover and) use that fact, sorting in O(N) time. (Josh Bloch, the Java guru, was so impressed by a presentation about timsort's performance characteristics that he started coding it for Java on his laptop -- it's supposed to become Java's standard sort pretty soon.) I'd just do a sort after each locate-and-increment-count, and very much doubt that other approaches can beat timsort.
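If you want to observe this yourself, here's a quick timing sketch (the sizes and the out-of-place index are arbitrary) comparing a nearly-sorted list against the same data shuffled:

import random
import timeit

nearly_sorted = list(range(10000))
nearly_sorted[5000] += 100  # exactly one element out of place
shuffled = nearly_sorted[:]
random.shuffle(shuffled)

print(timeit.timeit(lambda: sorted(nearly_sorted), number=100))  # near-linear
print(timeit.timeit(lambda: sorted(shuffled), number=100))       # full N log N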

Edit: the first alternative that comes to mind, of course, is to "shift up" just the item whose count you've just incremented. But first, a little optimization to avoid copying handlers:

def handle_request(req):
    for h in handlers:
        if h[1].handles(req):
            h[0] += 1
            handlers.sort(key=lambda pair: pair[0], reverse=True)
            break
    else:
        return None
    return h[1]

Now, the "shift up" variant:

def handle_request(req):
    for i, h in enumerate(handlers):
        if h[1].handles(req):
            h[0] += 1
            # scan left past entries whose count is now smaller; the list
            # is sorted descending, so h belongs just after the first
            # entry whose count is >= its own
            j = i
            while j > 0 and handlers[j-1][0] < h[0]:
                j -= 1
            if j < i:
                # shift the intervening entries right and drop h in place
                handlers[j+1:i+1] = handlers[j:i]
                handlers[j] = h
            break
    else:
        return None
    return h[1]

I can imagine patterns of access where this approach might save a little time -- for example, if the distribution was so skewed that most hits were in handlers[0], this would do little work beyond one comparison (while sort needs about N of them even in the best case). Without representative samples of your access patterns, I can't confirm or disprove this!-)
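If you do want to measure it, a rough harness along these lines would exercise either variant under a skewed workload (FakeHandler and the Zipf-ish weights are made up purely for illustration):

import random

class FakeHandler:
    def __init__(self, name):
        self.name = name
    def handles(self, req):
        return req == self.name

handlers = [[0, FakeHandler(n)] for n in range(100)]

# skewed access pattern: handler 0 is requested far more often than handler 99
weights = [1.0 / (rank + 1) for rank in range(100)]
requests = random.choices(range(100), weights=weights, k=100000)

for req in requests:
    handle_request(req)  # time this loop once per variant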


Sounds like a job for a priority queue (a.k.a. a heap). Python has an implementation of a priority queue in the heapq module in the standard library. Basically, you keep a heap with the most-frequently-used or most-recently-used item on top.
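For concreteness, here is one minimal sketch of the idea (the names are illustrative). heapq is a min-heap, so counts are stored negated to keep the most-used handler at the root, and a sequence number breaks ties so handler objects themselves are never compared. Note how checking handlers in most-used-first order means popping and re-pushing entries:

import heapq
import itertools

seq = itertools.count()
heap = [[0, next(seq), h] for h in handlers]  # [negated count, tie-breaker, handler]
heapq.heapify(heap)

def handle_request(req):
    popped = []
    try:
        while heap:
            entry = heapq.heappop(heap)  # most-used remaining handler
            popped.append(entry)
            if entry[2].handles(req):
                entry[0] -= 1  # one more use (counts are negated)
                return entry[2]
        return None
    finally:
        for entry in popped:  # restore everything we popped
            heapq.heappush(heap, entry)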


Even though timsort is magical, using list.sort() is not a good idea because (at a minimum) it requires each adjacent pair of entries to be compared each time to ensure that the list is in sorted order.

Using a priority queue (a.k.a. Python's heapq module) is a good solution for many problems like this, but is not ideal for your application because it is expensive to traverse a heap in sorted order.

Surprisingly, the best approach for your situation is to use something like the much-maligned bubble sort. Since all entries are in order except for the one whose counter you just adjusted, all that can happen is that the one entry moves up a bit in the list. And since you are only incrementing by one, it shouldn't move far. So just compare it to the previous entry and if they are out of order, swap them. Something like:

def handle_request(req):
    for i, h in enumerate(handlers):
        if h[1].handles(req):
            h[0] += 1
            # bubble the entry up past any neighbors it now outranks;
            # a +1 increment can only carry it past equal counts, so
            # this loop rarely runs more than once
            while i > 0 and handlers[i][0] > handlers[i-1][0]:
                handlers[i-1], handlers[i] = handlers[i], handlers[i-1]
                i -= 1
            return h[1]
    return None

(Of course if multiple threads are accessing the handlers array, you have to do some kind of synchronization.)
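(A single coarse lock is the simplest form of that synchronization; a sketch, assuming the same list-of-lists layout as above:)

import threading

handlers_lock = threading.Lock()

def handle_request(req):
    # the critical section is tiny, so one coarse lock is cheap
    with handlers_lock:
        for i, h in enumerate(handlers):
            if h[1].handles(req):
                h[0] += 1
                while i > 0 and handlers[i][0] > handlers[i-1][0]:
                    handlers[i-1], handlers[i] = handlers[i], handlers[i-1]
                    i -= 1
                return h[1]
    return None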


I'm guessing that all those extra calls to sort() will slow you down more than they will speed you up. My suggestion would be to memoize handle_request() using a wrapper such as this (taken from here):

class Memoize:
    """Memoize(fn) - an instance which acts like fn but memoizes its arguments
    Will only work on functions with non-mutable arguments
    """
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args):
        if args not in self.memo:
            self.memo[args] = self.fn(*args)
        return self.memo[args]

You can use it like this:

handle_request = Memoize(handle_request)

That will cause the various return values of handle_request to be cached and could actually provide a noticeable speedup. I would suggest experimenting with when and where you wrap various functions with Memoize() in your app to see just how much memory it takes up and how much it speeds up (or doesn't) various functions. You could also memoize your .handles() method using a similar approach (for example, there's a memoizing decorator here).
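(Worth noting: on Python 3.2 and later, functools.lru_cache gives the same effect out of the box, with an optional size bound so the cache can't grow without limit. As with Memoize, req must be hashable:)

from functools import lru_cache

@lru_cache(maxsize=1024)  # bound the cache so it can't grow forever
def handle_request(req):
    for h in handlers:
        if h.handles(req):
            return h
    return None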


Here's some code I use to solve this problem (although now that I've read the other answers, I'm wondering if heapq would be better):

class MRUSortedIterable:

    def __init__(self, data):
        self._data = list(data)
        self._i = 0

    def __iter__(self):
        if self._i:  # the previous iteration stopped early at index _i (a success): move that item to the front
            self._data[0], self._data[1:self._i+1] = self._data[self._i], self._data[0:self._i]
        for self._i, value in enumerate(self._data):
            yield value
        self._i = 0  # reset on exhaustion (i.e. we failed to find what we wanted)

You use it like this (for example):

MY_DATA = MRUSortedIterable(a_list_of_objects)
...
def handler(criteria):
    for data in MY_DATA:
        if test(data, criteria):
            return data

and it automatically re-arranges the underlying data to have the most recently used items on top, as required (the re-arranging is actually done when handling the next request). The only requirement is that you stop iterating through the data on success (and consume all data on failure).

IMPORTANT: This is very much not thread-safe (which would probably have been a problem for your web server back when this question was asked two years ago). But it is, imho, very neat...

On reflection, this is MRU while a heapq with access counts would be ordered by total use. So they probably perform slightly differently (heapq is likely better if access patterns are constant).
