I need to search for the first, the last, any, or all occurrences of something in something else. To avoid repeating myself (DRY), I came up with the following solution.
Of interest are the methods search_revisions() and collect_one_occurence() of both Searcher classes. In SearcherYield I create a generator in search_revisions(), only to abandon it in collect_one_occurence() after collecting the first result. In SearcherCondition I put a condition in the loop, which has to be checked on every iteration.
I can't decide whether my (ab)use of yield and the subsequent abandoning of the generator is a stroke of genius or a hideous hack. What do you think? Do you have any other ideas for such a situation?
#!/usr/bin/python

class Revision:
    # a revision is something like a textfile.
    # the search() method will search the textfile
    # and return the lines which match the given pattern.
    # for demonstration purposes this class is simplified
    # to return predefined results
    def __init__(self, results):
        self.results = results
    def search(self, pattern):
        return self.results

class AbstractSearcher:
    def __init__(self, revisions):
        self.revisions = revisions
    def search_for_first_occurence(self, pattern):
        keys = sorted(self.revisions.iterkeys())
        return self.collect_one_occurence(keys, pattern)
    def search_for_last_occurence(self, pattern):
        keys = sorted(self.revisions.iterkeys(), reverse=True)
        return self.collect_one_occurence(keys, pattern)
    def search_for_any_occurence(self, pattern):
        keys = self.revisions.iterkeys()
        return self.collect_one_occurence(keys, pattern)
    def search_for_all_occurences(self, pattern):
        keys = self.revisions.iterkeys()
        return self.collect_all_occurences(keys, pattern)

class SearcherYield(AbstractSearcher):
    def search_revisions(self, keys, pattern):
        # create generator which yields the results one by one
        for key in keys:
            rev = self.revisions[key]
            result = rev.search(pattern)
            if result:
                yield result
    def collect_one_occurence(self, keys, pattern):
        # take the first result and then abandon the generator
        for result in self.search_revisions(keys, pattern):
            return result
        return []
    def collect_all_occurences(self, keys, pattern):
        # collect all results from the generator
        results = []
        for result in self.search_revisions(keys, pattern):
            results.extend(result)
        return results

class SearcherCondition(AbstractSearcher):
    def search_revisions(self, keys, pattern, just_one):
        # collect either all results from all revisions
        # or break the loop after the first result found
        results = []
        for key in keys:
            rev = self.revisions[key]
            result = rev.search(pattern)
            if result:
                results.extend(result)
                if just_one:
                    break
        return results
    def collect_one_occurence(self, keys, pattern):
        return self.search_revisions(keys, pattern, just_one=True)
    def collect_all_occurences(self, keys, pattern):
        return self.search_revisions(keys, pattern, just_one=False)

def demo(searcher):
    print searcher.__class__.__name__
    print 'first:', searcher.search_for_first_occurence('foo')
    print 'last: ', searcher.search_for_last_occurence('foo')
    print 'any: ', searcher.search_for_any_occurence('foo')
    print 'all: ', searcher.search_for_all_occurences('foo')

def main():
    revisions = {
        1: Revision([]),
        2: Revision(['a', 'b']),
        3: Revision(['c']),
        4: Revision(['d', 'e', 'f']),
        5: Revision([])}
    demo(SearcherYield(revisions))
    demo(SearcherCondition(revisions))

if __name__ == '__main__':
    main()
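As another idea (this is just a sketch of mine, not part of the original code, and the class name SearcherNext is made up): Python 2.6's next() built-in accepts a default value, so the generator from SearcherYield can be abandoned after the first item without the for/return construct:

class SearcherNext(SearcherYield):
    def collect_one_occurence(self, keys, pattern):
        # take the first non-empty result the generator yields,
        # or [] if it yields nothing; the abandoned generator is
        # simply garbage-collected
        return next(self.search_revisions(keys, pattern), [])

Whether that reads better than the explicit loop is a matter of taste, but it makes the "take one and stop" intent visible in a single line.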
Some context: revisions are basically text files. You can think of them as the revisions of a wiki page. Typically there are hundreds of revisions, sometimes thousands. Each revision contains up to thousands of lines of text. There are also cases where there are just a few revisions with a few lines each.
A search in a revision will search for a pattern in the text and return the matching lines. Sometimes there are thousands of results, sometimes there are no results.
Sometimes I just need to know whether there are any results in any revision (search for any). Sometimes I have to collect all the results for further processing (search for all). Sometimes I just need the first revision with a match, sometimes just the last revision (search for first and last).
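As a side note (my sketch, not code from the question; the helper name is hypothetical): for the "any occurrence" case, where only a yes/no answer is needed, the built-in any() over a generator expression short-circuits on the first match and never builds a result list:

def has_any_match(searcher, pattern):
    # searcher is an AbstractSearcher instance as defined above;
    # True as soon as one revision returns a non-empty result
    return any(rev.search(pattern) for rev in searcher.revisions.itervalues())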
I did a benchmark. Here are the results:
$ ./benchmark.py
benchmark with revcount: 1000 timeitcount: 1000
last, first, yield: 0.902059793472
last, first, cond: 0.897155046463
last, all, yield: 0.818709135056
last, all, cond: 0.818334102631
all, all, yield: 1.26602506638
all, all, cond: 1.17208003998
benchmark with revcount: 2000 timeitcount: 1000
last, first, yield: 1.80768609047
last, first, cond: 1.84234118462
last, all, yield: 1.64661192894
last, all, cond: 1.67588806152
all, all, yield: 2.55621600151
all, all, cond: 2.37582707405
benchmark with revcount: 10000 timeitcount: 1000
last, first, yield: 9.34304785728
last, first, cond: 9.33725094795
last, all, yield: 8.4673140049
last, all, cond: 8.49153590202
all, all, yield: 12.9636368752
all, all, cond: 11.780673027
The yield and the condition solutions show very similar times. I think this is because the generator (yield) also has a loop with a condition in it (the if result check), so I did not really avoid the condition in the loop, I just moved it out of sight.
Anyway, the numbers show that the performance is mostly equal, so the code should be judged by readability. I will stick with the condition in the loop; I like being explicit.
Here is the benchmark code:
#!/usr/bin/python

import functools
import timeit

class Revision:
    # a revision is something like a textfile.
    # the search() method will search the textfile
    # and return the lines which match the given pattern.
    # for demonstration purposes this class is simplified
    # to return predefined results
    def __init__(self, results):
        self.results = results
    def search(self, pattern):
        return self.results

class AbstractSearcher:
    def __init__(self, revisions):
        self.revisions = revisions
    def search_for_first_occurence(self, pattern):
        keys = sorted(self.revisions.iterkeys())
        return self.collect_one_occurence(keys, pattern)
    def search_for_last_occurence(self, pattern):
        keys = sorted(self.revisions.iterkeys(), reverse=True)
        return self.collect_one_occurence(keys, pattern)
    def search_for_any_occurence(self, pattern):
        keys = self.revisions.iterkeys()
        return self.collect_one_occurence(keys, pattern)
    def search_for_all_occurences(self, pattern):
        keys = self.revisions.iterkeys()
        return self.collect_all_occurences(keys, pattern)

class SearcherYield(AbstractSearcher):
    def search_revisions(self, keys, pattern):
        # create generator which yields the results one by one
        for key in keys:
            rev = self.revisions[key]
            result = rev.search(pattern)
            if result:
                yield result
    def collect_one_occurence(self, keys, pattern):
        # take the first result and then abandon the generator
        for result in self.search_revisions(keys, pattern):
            return result
        return []
    def collect_all_occurences(self, keys, pattern):
        # collect all results from the generator
        results = []
        for result in self.search_revisions(keys, pattern):
            results.extend(result)
        return results

class SearcherCondition(AbstractSearcher):
    def search_revisions(self, keys, pattern, just_one):
        # collect either all results from all revisions
        # or break the loop after the first result found
        results = []
        for key in keys:
            rev = self.revisions[key]
            result = rev.search(pattern)
            if result:
                results.extend(result)
                if just_one:
                    break
        return results
    def collect_one_occurence(self, keys, pattern):
        return self.search_revisions(keys, pattern, just_one=True)
    def collect_all_occurences(self, keys, pattern):
        return self.search_revisions(keys, pattern, just_one=False)

def benchmark(revcount, timeitcount):
    lastrev = {}
    for i in range(revcount):
        lastrev[i] = Revision([])
    lastrev[revcount] = Revision([1])
    allrevs = {}
    for i in range(revcount):
        allrevs[i] = Revision([1])
    last_yield = SearcherYield(lastrev)
    last_cond = SearcherCondition(lastrev)
    all_yield = SearcherYield(allrevs)
    all_cond = SearcherCondition(allrevs)
    lfy = functools.partial(last_yield.search_for_first_occurence, 'foo')
    lfc = functools.partial(last_cond.search_for_first_occurence, 'foo')
    lay = functools.partial(last_yield.search_for_all_occurences, 'foo')
    lac = functools.partial(last_cond.search_for_all_occurences, 'foo')
    aay = functools.partial(all_yield.search_for_all_occurences, 'foo')
    aac = functools.partial(all_cond.search_for_all_occurences, 'foo')
    print 'benchmark with revcount: %d timeitcount: %d' % (revcount, timeitcount)
    print 'last, first, yield:', timeit.timeit(lfy, number=timeitcount)
    print 'last, first, cond:', timeit.timeit(lfc, number=timeitcount)
    print 'last, all, yield:', timeit.timeit(lay, number=timeitcount)
    print 'last, all, cond:', timeit.timeit(lac, number=timeitcount)
    print ' all, all, yield:', timeit.timeit(aay, number=timeitcount)
    print ' all, all, cond:', timeit.timeit(aac, number=timeitcount)

def main():
    timeitcount = 1000
    benchmark(1000, timeitcount)
    benchmark(2000, timeitcount)
    benchmark(10000, timeitcount)

if __name__ == '__main__':
    main()
Some information about my system:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 10.04.1 LTS
Release: 10.04
Codename: lucid
$ uname -a
Linux lesmana-laptop 2.6.32-26-generic #46-Ubuntu SMP Tue Oct 26 16:46:46 UTC 2010 i686 GNU/Linux
$ python --version
Python 2.6.5
$ cat /proc/cpuinfo | grep name
model name : Intel(R) Pentium(R) M processor 1.60GHz
This will solve your problem, as long as lookup_item is immutable and collec is any ordered collection:
positions = [i for i, item in enumerate(collec) if item == lookup_item]
It returns all positions at which lookup_item occurs in collec.
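For example (my illustration, with made-up data):

collec = ['foo', 'bar', 'foo', 'baz']
lookup_item = 'foo'
positions = [i for i, item in enumerate(collec) if item == lookup_item]
print positions   # prints [0, 2]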
Personally I am in favor of the yield version as far as readability goes, but it's a close call. I don't have much of a reason other than that I think it is a great code construct and applies in many situations.
You probably already know this, but the code will also need to return the matched revisions to the caller. The least-code-change way to do that is to return a reference back to the Revision along with the results from the Revision search() method.
You can probably reduce your code by making use of the Python itertools module in conjunction with the yield. Readability arguably goes out the window, but it's so awesomely geek-sleek:
from itertools import chain, repeat, islice, ifilter

def collect_one_occurence(self, keys, pattern):
    return chain(ifilter(None, (rev.search(pattern) for rev in (self.revisions[key] for key in keys))), repeat([])).next()

def collect_all_occurences(self, keys, pattern):
    return list(chain(*[rev.search(pattern) for rev in (self.revisions[key] for key in keys)]))
Obviously you can expand the code to make it more readable, but I left it collapsed for benchmarking purposes. I wonder how much this improves your current results?
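For readers who prefer the expanded form, here is one way the one-liners above could be spelled out (my reconstruction, assuming Python 2.x itertools; chain.from_iterable avoids building the intermediate list that chain(*[...]) needs):

from itertools import chain, ifilter, repeat

def collect_one_occurence(self, keys, pattern):
    # one search result (a list of lines) per revision, computed lazily
    per_revision = (self.revisions[key].search(pattern) for key in keys)
    # keep only non-empty results, with an endless supply of [] as fallback
    non_empty = ifilter(None, per_revision)
    # the first item is either the first match or the [] fallback
    return chain(non_empty, repeat([])).next()

def collect_all_occurences(self, keys, pattern):
    # flatten the per-revision result lists into a single list
    per_revision = (self.revisions[key].search(pattern) for key in keys)
    return list(chain.from_iterable(per_revision))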