I need to search for the first, the last, any, or all occurrences of something in something else. To avoid repeating myself (DRY), I came up with the following solution.
Of interest are the methods search_revisions() and collect_one_occurence() of both Searcher classes. In SearcherYield I create a generator in search_revisions(), only to abandon it in collect_one_occurence() after collecting the first result. In SearcherCondition I put a condition in the loop, which has to be checked on every iteration.
I can't decide whether my (ab)use of yield and the subsequent abandoning of the generator is a stroke of genius or a hideous hack. What do you think? Do you have any other ideas for such a situation?
#!/usr/bin/python

class Revision:
    # a revision is something like a textfile.
    # the search() method will search the textfile
    # and return the lines which match the given pattern.
    # for demonstration purposes this class is simplified
    # to return predefined results
    def __init__(self, results):
        self.results = results
    def search(self, pattern):
        return self.results

class AbstractSearcher:
    def __init__(self, revisions):
        self.revisions = revisions
    def search_for_first_occurence(self, pattern):
        keys = sorted(self.revisions.iterkeys())
        return self.collect_one_occurence(keys, pattern)
    def search_for_last_occurence(self, pattern):
        keys = sorted(self.revisions.iterkeys(), reverse=True)
        return self.collect_one_occurence(keys, pattern)
    def search_for_any_occurence(self, pattern):
        keys = self.revisions.iterkeys()
        return self.collect_one_occurence(keys, pattern)
    def search_for_all_occurences(self, pattern):
        keys = self.revisions.iterkeys()
        return self.collect_all_occurences(keys, pattern)

class SearcherYield(AbstractSearcher):
    def search_revisions(self, keys, pattern):
        # create generator which yields the results one by one
        for key in keys:
            rev = self.revisions[key]
            result = rev.search(pattern)
            if result:
                yield result
    def collect_one_occurence(self, keys, pattern):
        # take the first result and then abandon the generator
        for result in self.search_revisions(keys, pattern):
            return result
        return []
    def collect_all_occurences(self, keys, pattern):
        # collect all results from the generator
        results = []
        for result in self.search_revisions(keys, pattern):
            results.extend(result)
        return results

class SearcherCondition(AbstractSearcher):
    def search_revisions(self, keys, pattern, just_one):
        # collect either all results from all revisions
        # or break the loop after the first result found
        results = []
        for key in keys:
            rev = self.revisions[key]
            result = rev.search(pattern)
            if result:
                results.extend(result)
                if just_one:
                    break
        return results
    def collect_one_occurence(self, keys, pattern):
        return self.search_revisions(keys, pattern, just_one=True)
    def collect_all_occurences(self, keys, pattern):
        return self.search_revisions(keys, pattern, just_one=False)

def demo(searcher):
    print searcher.__class__.__name__
    print 'first:', searcher.search_for_first_occurence('foo')
    print 'last: ', searcher.search_for_last_occurence('foo')
    print 'any: ', searcher.search_for_any_occurence('foo')
    print 'all: ', searcher.search_for_all_occurences('foo')

def main():
    revisions = {
        1: Revision([]),
        2: Revision(['a', 'b']),
        3: Revision(['c']),
        4: Revision(['d', 'e', 'f']),
        5: Revision([])}
    demo(SearcherYield(revisions))
    demo(SearcherCondition(revisions))

if __name__ == '__main__':
    main()
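As another idea (this is just a sketch of mine, not part of the original code, and the class name SearcherNext is made up): Python 2.6's next() built-in accepts a default value, so the generator from SearcherYield can be abandoned after the first item without the for/return construct:

class SearcherNext(SearcherYield):
    def collect_one_occurence(self, keys, pattern):
        # take the first non-empty result the generator yields,
        # or [] if it yields nothing; the abandoned generator is
        # simply garbage-collected
        return next(self.search_revisions(keys, pattern), [])

Whether that reads better than the explicit loop is a matter of taste, but it makes the "take one and stop" intent visible in a single line.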
Some context: revisions are basically text files. You can think of them as the revisions of a wiki page. Typically there are hundreds of revisions, sometimes thousands. Each revision contains up to thousands of lines of text. There are also cases where there are just a few revisions with a few lines each.
A search in a revision will search for a pattern in the text and return the matching lines. Sometimes there are thousands of results, sometimes there are no results.
Sometimes I just need to know whether there are any results in any revision (search for any). Sometimes I have to collect all the results for further processing (search for all). Sometimes I just need the first revision with a match, sometimes just the last revision (search for first and last).
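As a side note (my sketch, not code from the question; the helper name is hypothetical): for the "any occurrence" case, where only a yes/no answer is needed, the built-in any() over a generator expression short-circuits on the first match and never builds a result list:

def has_any_match(searcher, pattern):
    # searcher is an AbstractSearcher instance as defined above;
    # True as soon as one revision returns a non-empty result
    return any(rev.search(pattern) for rev in searcher.revisions.itervalues())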
I did a benchmark. Here are the results:
$ ./benchmark.py
benchmark with revcount: 1000 timeitcount: 1000
last, first, yield: 0.902059793472
last, first, cond: 0.897155046463
last, all, yield: 0.818709135056
last, all, cond: 0.818334102631
all, all, yield: 1.26602506638
all, all, cond: 1.17208003998
benchmark with revcount: 2000 timeitcount: 1000
last, first, yield: 1.80768609047
last, first, cond: 1.84234118462
last, all, yield: 1.64661192894
last, all, cond: 1.67588806152
all, all, yield: 2.55621600151
all, all, cond: 2.37582707405
benchmark with revcount: 10000 timeitcount: 1000
last, first, yield: 9.34304785728
last, first, cond: 9.33725094795
last, all, yield: 8.4673140049
last, all, cond: 8.49153590202
all, all, yield: 12.9636368752
all, all, cond: 11.780673027
The yield and the condition solutions show very similar times. I think this is because the generator (yield) also has a loop with a condition in it (the if result check), so I did not really avoid the condition in the loop, I just moved it out of sight.
Anyway, the numbers show that the performance is mostly equal, so the code should be judged by readability. I will stick with the condition in the loop; I like being explicit.
Here is the benchmark code:
#!/usr/bin/python

import functools
import timeit

class Revision:
    # a revision is something like a textfile.
    # the search() method will search the textfile
    # and return the lines which match the given pattern.
    # for demonstration purposes this class is simplified
    # to return predefined results
    def __init__(self, results):
        self.results = results
    def search(self, pattern):
        return self.results

class AbstractSearcher:
    def __init__(self, revisions):
        self.revisions = revisions
    def search_for_first_occurence(self, pattern):
        keys = sorted(self.revisions.iterkeys())
        return self.collect_one_occurence(keys, pattern)
    def search_for_last_occurence(self, pattern):
        keys = sorted(self.revisions.iterkeys(), reverse=True)
        return self.collect_one_occurence(keys, pattern)
    def search_for_any_occurence(self, pattern):
        keys = self.revisions.iterkeys()
        return self.collect_one_occurence(keys, pattern)
    def search_for_all_occurences(self, pattern):
        keys = self.revisions.iterkeys()
        return self.collect_all_occurences(keys, pattern)

class SearcherYield(AbstractSearcher):
    def search_revisions(self, keys, pattern):
        # create generator which yields the results one by one
        for key in keys:
            rev = self.revisions[key]
            result = rev.search(pattern)
            if result:
                yield result
    def collect_one_occurence(self, keys, pattern):
        # take the first result and then abandon the generator
        for result in self.search_revisions(keys, pattern):
            return result
        return []
    def collect_all_occurences(self, keys, pattern):
        # collect all results from the generator
        results = []
        for result in self.search_revisions(keys, pattern):
            results.extend(result)
        return results

class SearcherCondition(AbstractSearcher):
    def search_revisions(self, keys, pattern, just_one):
        # collect either all results from all revisions
        # or break the loop after the first result found
        results = []
        for key in keys:
            rev = self.revisions[key]
            result = rev.search(pattern)
            if result:
                results.extend(result)
                if just_one:
                    break
        return results
    def collect_one_occurence(self, keys, pattern):
        return self.search_revisions(keys, pattern, just_one=True)
    def collect_all_occurences(self, keys, pattern):
        return self.search_revisions(keys, pattern, just_one=False)

def benchmark(revcount, timeitcount):
    lastrev = {}
    for i in range(revcount):
        lastrev[i] = Revision([])
    lastrev[revcount] = Revision([1])
    allrevs = {}
    for i in range(revcount):
        allrevs[i] = Revision([1])
    last_yield = SearcherYield(lastrev)
    last_cond = SearcherCondition(lastrev)
    all_yield = SearcherYield(allrevs)
    all_cond = SearcherCondition(allrevs)
    lfy = functools.partial(last_yield.search_for_first_occurence, 'foo')
    lfc = functools.partial(last_cond.search_for_first_occurence, 'foo')
    lay = functools.partial(last_yield.search_for_all_occurences, 'foo')
    lac = functools.partial(last_cond.search_for_all_occurences, 'foo')
    aay = functools.partial(all_yield.search_for_all_occurences, 'foo')
    aac = functools.partial(all_cond.search_for_all_occurences, 'foo')
    print 'benchmark with revcount: %d timeitcount: %d' % (revcount, timeitcount)
    print 'last, first, yield:', timeit.timeit(lfy, number=timeitcount)
    print 'last, first, cond:', timeit.timeit(lfc, number=timeitcount)
    print 'last, all, yield:', timeit.timeit(lay, number=timeitcount)
    print 'last, all, cond:', timeit.timeit(lac, number=timeitcount)
    print ' all, all, yield:', timeit.timeit(aay, number=timeitcount)
    print ' all, all, cond:', timeit.timeit(aac, number=timeitcount)

def main():
    timeitcount = 1000
    benchmark(1000, timeitcount)
    benchmark(2000, timeitcount)
    benchmark(10000, timeitcount)

if __name__ == '__main__':
    main()
Some information about my system:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 10.04.1 LTS
Release: 10.04
Codename: lucid
$ uname -a
Linux lesmana-laptop 2.6.32-26-generic #46-Ubuntu SMP Tue Oct 26 16:46:46 UTC 2010 i686 GNU/Linux
$ python --version
Python 2.6.5
$ cat /proc/cpuinfo | grep name
model name : Intel(R) Pentium(R) M processor 1.60GHz
This will solve your problem, as long as lookup_item is immutable and collec is any ordered collection:
positions = [i for i, item in enumerate(collec) if item == lookup_item]
It returns all positions at which lookup_item occurs in collec.
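For example (my illustration, with made-up data):

collec = ['foo', 'bar', 'foo', 'baz']
lookup_item = 'foo'
positions = [i for i, item in enumerate(collec) if item == lookup_item]
print positions   # prints [0, 2]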
Personally I am in favor of the yield version as far as readability goes, but it's a close call. I don't have much of a reason other than that I think it is a great code construct and applies in many situations.
You probably already know this, but the code will also need to return the matched revisions to the caller. The least-code-change way to do that is to return a reference back to the Revision along with the results from the Revision search() method.
You can probably reduce your code by making use of the Python itertools module in conjunction with the yield. Readability arguably goes out the window, but it's so awesomely geek-sleek:
from itertools import chain, repeat, islice, ifilter

def collect_one_occurence(self, keys, pattern):
    return chain(ifilter(None, (rev.search(pattern) for rev in (self.revisions[key] for key in keys))), repeat([])).next()

def collect_all_occurences(self, keys, pattern):
    return list(chain(*[rev.search(pattern) for rev in (self.revisions[key] for key in keys)]))
Obviously you can expand the code to make it more readable, but I left it collapsed for benchmarking purposes. I wonder how much this improves your current results?
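For readers who prefer the expanded form, here is one way the one-liners above could be spelled out (my reconstruction, assuming Python 2.x itertools; chain.from_iterable avoids building the intermediate list that chain(*[...]) needs):

from itertools import chain, ifilter, repeat

def collect_one_occurence(self, keys, pattern):
    # one search result (a list of lines) per revision, computed lazily
    per_revision = (self.revisions[key].search(pattern) for key in keys)
    # keep only non-empty results, with an endless supply of [] as fallback
    non_empty = ifilter(None, per_revision)
    # the first item is either the first match or the [] fallback
    return chain(non_empty, repeat([])).next()

def collect_all_occurences(self, keys, pattern):
    # flatten the per-revision result lists into a single list
    per_revision = (self.revisions[key].search(pattern) for key in keys)
    return list(chain.from_iterable(per_revision))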