def boolean_search_and(self, text):
    results = []
    and_tokens = self.tokenize(text)
    tokencount = len(and_tokens)
    term1 = and_tokens[0]
    print ' term 1:', term1
    term2 = and_tokens[1]
    print ' term 2:', term2
    #for term in and_tokens:
    if term1 in self._inverted_index.keys():
        resultlist1 = self._inverted_index[term1]
        print resultlist1
    if term2 in self._inverted_index.keys():
        resultlist2 = self._inverted_index[term2]
        print resultlist2
    #intersection of two sets casted into a list
    results = list(set(resultlist1) & set(resultlist2))
    print 'results:', results
    return str(results)
This code works great for two tokens, e.g. text = "Hello World", so tokens = ['hello', 'world']. I want to generalize it for multiple tokens, so the text can be a sentence or an entire text file.
self._inverted_index is a dictionary that stores the tokens as keys; the values are the DocIDs in which the tokens occur:

hello -> [1,2,5,6]
world -> [1,3,5,7,8]

result: hello AND world -> [1,5]

I want to achieve the result for, say, (((hello AND computer) AND science) AND world).
I am working on making this work for multiple words, not just two. I started working in Python this morning, so I'm unaware of a lot of the features it has to offer.
Any ideas?
I want to generalize it for multiple tokens
def boolean_search_and_multi(self, text):
    and_tokens = self.tokenize(text)
    # seed with the postings of the first term...
    results = set(self._inverted_index[and_tokens[0]])
    # ...then narrow by each remaining term's postings
    for tok in and_tokens[1:]:
        results.intersection_update(self._inverted_index[tok])
    return list(results)
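One thing to watch: if any token is missing from the index, the dictionary lookup above raises a KeyError. Here is a sketch of a guarded variant (the name boolean_search_and_multi_safe is made up; it reuses your tokenize and _inverted_index) that returns an empty result instead:

def boolean_search_and_multi_safe(self, text):
    # hypothetical guarded variant of the function above
    and_tokens = self.tokenize(text)
    results = None
    for tok in and_tokens:
        if tok not in self._inverted_index:
            return []  # an unknown term makes the AND result empty
        postings = set(self._inverted_index[tok])
        # the first term seeds the result set; later terms narrow it
        results = postings if results is None else results & postings
    return list(results) if results is not None else []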
Would the built-in set type work for you?
$ python
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01)
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> hello = set([1,2,5,6])
>>> world = set([1,3,5,7,8])
>>> hello & world
set([1, 5])
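set.intersection also accepts several iterables at once (Python 2.6+), so for many terms you can intersect all the posting lists in a single call. A quick sketch, with a made-up third posting list added to the two above:

>>> postings = [[1,2,5,6], [1,3,5,7,8], [1,5,9]]
>>> set(postings[0]).intersection(*postings[1:])
set([1, 5])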