I have a full inverted index in the form of a nested Python dictionary. Its structure is:
{ word : { doc_name : [location_list] } }
For example, if the dictionary is called index, then for the word "spam" the entry would look like:
{ 'spam' : { 'doc1.txt' : [102, 300, 399], 'doc5.txt' : [200, 587] } }
so that the documents containing any word are given by index[word].keys(), and the frequency of the word in a document by len(index[word][document]).
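For concreteness, here is a minimal sketch of those two lookups, using just the example entry above:

# The example entry from above, written out as a Python dict.
index = {'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]}}

print(index['spam'].keys())            # documents containing 'spam'
print(len(index['spam']['doc1.txt']))  # frequency of 'spam' in doc1.txt -> 3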
Now my question is: how do I implement a normal query search over this index? That is, given a query containing, say, 4 words, find the documents containing all four words (ranked by total frequency of occurrence), then the documents containing 3 of the words, and so on.
**Added this code, using S. Lott's answer.** This is the code I have written. It's working exactly as I want (just some formatting of the output is needed), but I know it could be improved.
from collections import defaultdict
from operator import itemgetter

# Take input
query = input(" Enter the query : ")

# Some preprocessing
query = query.lower()
query = query.strip()

# Now the real work
wordlist = query.split()
search_words = [x for x in wordlist if x in index]  # words that are present in the index

print("\nsearching for words ... : ", search_words, "\n")

# For each query word, the documents that contain it
doc_has_word = [(index[word].keys(), word) for word in search_words]

# Invert that into: document -> list of matched query words
doc_words = defaultdict(list)
for docs, word in doc_has_word:
    for doc in docs:
        doc_words[doc].append(word)

# Create a dictionary identifying matches for each document:
# doc -> (number of query words matched, total occurrences of those words)
result_set = {}
for doc in doc_words:
    count = 0
    matches = len(doc_words[doc])       # number of distinct query words matched
    for word in doc_words[doc]:
        count += len(index[word][doc])  # count total occurrences
    result_set[doc] = (matches, count)

# Now print in sorted order: most words matched first, then highest total frequency
print(" Document \t\t Words matched \t\t Total Frequency ")
print('-' * 40)
for doc, (matches, count) in sorted(result_set.items(), key=itemgetter(1), reverse=True):
    print(doc, "\t", doc_words[doc], "\t", count)
Please comment. Thanks.
Here's a start:
doc_has_word = [ (index[word].keys(),word) for word in wordlist ]
This builds a list of (documents, word) pairs. You can't easily turn that directly into a dictionary, since each document occurs many times.
But something like this might be helpful:

from collections import defaultdict

doc_words = defaultdict(list)
for docs, word in doc_has_word:
    for doc in docs:
        doc_words[doc].append(word)   # document -> list of query words it contains
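Putting the two fragments together on the example index from the question, a rough sketch of the whole pipeline could look like this ('eggs' and its documents are hypothetical additions, just so there is something to rank):

from collections import defaultdict

index = {
    'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]},
    'eggs': {'doc1.txt': [45], 'doc3.txt': [7, 91]},   # hypothetical second word
}
wordlist = ['spam', 'eggs']

doc_has_word = [(index[word].keys(), word) for word in wordlist]

doc_words = defaultdict(list)
for docs, word in doc_has_word:
    for doc in docs:
        doc_words[doc].append(word)

# Rank by (number of query words matched, total frequency), best first.
ranking = sorted(
    doc_words,
    key=lambda doc: (len(doc_words[doc]),
                     sum(len(index[w][doc]) for w in doc_words[doc])),
    reverse=True,
)
print(ranking)   # doc1.txt comes first: it matches both query words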
import itertools

index = {...}  # the inverted index described in the question

def query(*args):
    result = []
    # (document, frequency of the word in that document) for every query word / document pair
    doc_count = [(doc, len(index[word][doc])) for word in args for doc in index[word]]
    # groupby only groups consecutive items, so sort by document name first
    doc_count.sort(key=lambda pair: pair[0])
    doc_group = itertools.groupby(doc_count, key=lambda pair: pair[0])
    for doc, group in doc_group:
        result.append((doc, sum(count for _, count in group)))
    # highest total frequency first
    return sorted(result, key=lambda x: x[1], reverse=True)
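A hypothetical call, assuming index has been bound to the example index from the question rather than the {...} placeholder:

index = {'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]}}
print(query('spam'))   # [('doc1.txt', 3), ('doc5.txt', 2)]

Note that this ranks by total frequency only, not by the number of query words matched.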
Here is a solution for finding the similar documents (the hardest part):
from functools import reduce  # reduce is not a builtin in Python 3

wordList = ['spam', 'eggs', 'toast']  # our list of words to query for
wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])
wordMatches is a list where each element is a dictionary of the document matches for one of the words being queried. similarDocs is a set of the documents that contain all of the words being queried for. It is found by taking just the document names out of each of the dictionaries in the wordMatches list, representing those lists of document names as sets, and then intersecting the sets to find the common document names.
Once you have found the documents that are similar, you should be able to use a defaultdict (as shown in S. Lott's answer) to append all of the lists of matches together for each word and each document.
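A sketch of that last step, combining the intersection above with the defaultdict idea from S. Lott's answer (the 'eggs' entry is a hypothetical addition so the intersection is non-trivial):

from collections import defaultdict
from functools import reduce

index = {
    'spam': {'doc1.txt': [102, 300, 399], 'doc5.txt': [200, 587]},
    'eggs': {'doc1.txt': [45], 'doc3.txt': [7, 91]},   # hypothetical extra word
}
wordList = ['spam', 'eggs']

wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])

# For each document that contains every query word, collect the location lists of all the words.
docLocations = defaultdict(list)
for word, docMatch in zip(wordList, wordMatches):
    for doc in similarDocs:
        docLocations[doc].append((word, docMatch[doc]))

print(dict(docLocations))
# {'doc1.txt': [('spam', [102, 300, 399]), ('eggs', [45])]}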
Related links:
- This answer demonstrates defaultdict(int). defaultdict(list) works pretty much the same way.
- set.intersection example