Inverted Index Evaluation Order

Developer https://www.devze.com 2023-02-25 00:32 Source: Internet

I read somewhere that when you have an inverted index (for instance, a sorted list of pages for brutus, a sorted list of pages for caesar, and a sorted list of pages for calpurnia) and you evaluate caesar AND brutus AND calpurnia, then if the page lists for calpurnia and brutus are each shorter than the page list for caesar, you should evaluate caesar AND (brutus AND calpurnia), meaning you should evaluate the latter AND first. In general, whenever you have a series of ANDs, you always evaluate the pair with the lowest number of pages first. What is the reasoning behind this? Why is this efficient?

This is not true for every kind of inverted index. If you need to scan the inverted lists sequentially in full, then it would not matter which postings list intersection you do first.

But assume a scenario in which the inverted lists are stored in an indexed relation. Then evaluating the pair with the smaller number of document occurrences first is equivalent to joining the most selective relations first, which makes the evaluation more efficient.

Intuitively, when we intersect smaller lists, we create a stronger filter which is used as a feed to the index to find the matches.
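To make the "stronger filter" idea concrete, here is a minimal sketch. It assumes the index is just a Python dict mapping each term to a set of document ids (a stand-in for a real indexed relation), and that every membership test counts as one index probe. The function name and_query and the sample data are made up for illustration.

```python
def and_query(index, terms):
    """Evaluate term1 AND term2 AND ... starting from the rarest term.

    index maps each term to a set of document ids. Returns the matching
    ids and the number of index probes that were needed.
    """
    # Rarest term first: its list becomes the initial candidate filter.
    ordered = sorted(terms, key=lambda t: len(index[t]))
    candidates = sorted(index[ordered[0]])
    probes = 0
    for term in ordered[1:]:
        probes += len(candidates)  # one probe per surviving candidate
        candidates = [d for d in candidates if d in index[term]]
    return candidates, probes


# Hypothetical toy index, not taken from any real collection.
index = {
    "brutus": {1, 2, 4, 11},
    "caesar": {1, 2, 4, 5, 6, 11, 16, 57},
    "calpurnia": {2, 31},
}

result, probes = and_query(index, ["caesar", "brutus", "calpurnia"])
print(result, probes)  # [2] 3
```

Starting from calpurnia (2 documents) means only 2 probes into brutus and 1 probe into caesar. Starting from caesar would have required 8 probes in the first step alone.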

Assume we are interested in evaluating the keyword query a b c, where a, b and c are words in documents. Also assume the number of documents matching are as follows:

a --> 20
b --> 100
c --> 1000
a+b --> 10
a+c --> 15
b+c --> 50
a+b+c --> 5

Note that (a JOIN b) has size 10 and (b JOIN c) has size 50. Thus the first will require 10 accesses to the index on c, while the second requires 50 accesses to the index on a. With a hash-based or a tree-based index, each such access costs about the same and is usually done in a single I/O, so the total cost is driven by the number of accesses: 10 versus 50.
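The comparison can be tabulated for all three possible first joins. This is only a sketch of the counting argument: the match counts are copied from the example, while the helper name plan_costs and the simplification that only the probes into the remaining term's index are counted are assumptions made here.

```python
from itertools import combinations

# Match counts copied from the example above.
matches = {
    frozenset("a"): 20,
    frozenset("b"): 100,
    frozenset("c"): 1000,
    frozenset("ab"): 10,
    frozenset("ac"): 15,
    frozenset("bc"): 50,
    frozenset("abc"): 5,
}


def plan_costs(terms="abc"):
    """Map each choice of first join to the number of index accesses
    needed on the remaining term (one access per intermediate result)."""
    costs = {}
    for pair in combinations(terms, 2):
        rest = (set(terms) - set(pair)).pop()
        label = f"({pair[0]} JOIN {pair[1]}) then probe {rest}"
        costs[label] = matches[frozenset(pair)]
    return costs


for plan, cost in plan_costs().items():
    print(plan, cost)
# (a JOIN b) then probe c 10
# (a JOIN c) then probe b 15
# (b JOIN c) then probe a 50
```

Joining the two rarest terms first gives the smallest intermediate result and therefore the fewest accesses to the third index.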

An important thing to realize is that because of the sorting, which you mentioned already, the inverted lists can be searched for any given document id very efficiently (generally, in logarithmic time), for example using binary search.

To see the effect of that, assume a query caesar AND brutus, and assume that there are occ_caesar pages for caesar and occ_brutus pages for brutus (i.e. occ_X denotes the length of the pages list for a term X). Now assume, for the sake of the example, that occ_caesar > occ_brutus, i.e. caesar occurs more frequently in the content than brutus.

What you do then is to iterate through all pages for brutus first, and search for each of them in the pages list for caesar. If indeed the lists can be searched in logarithmic time, this means you need

occ_brutus * log(occ_caesar)

computational steps to identify all pages that contain both terms.

If you had done it the other way around (i.e. iterating through the caesar list and searching for each of its pages in the brutus list), the smaller number would end up in the logarithm and the greater number would become a factor, so the total time the evaluation takes would be longer.

Having said this, it is also important to realize that in practice things are more complicated than this, because (a) the lists are not only sorted but also compressed, which makes search harder, and (b) parts of the lists may be stored on disk rather than in memory, which means the total number of disk accesses is overwhelmingly more important than the total number of computational steps. Hence, the algorithm described above might not apply in its purest form, but the principle is as described.
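A minimal sketch of this approach, assuming both page lists are plain sorted Python lists held in memory. The function names and the sample page ids are invented for illustration, and estimated_steps is only the rough cost model from the formula above, not a measurement.

```python
import math
from bisect import bisect_left


def intersect(short_list, long_list):
    """Return ids present in both sorted lists by iterating over
    short_list and binary-searching each id in long_list."""
    result = []
    for doc_id in short_list:
        pos = bisect_left(long_list, doc_id)
        if pos < len(long_list) and long_list[pos] == doc_id:
            result.append(doc_id)
    return result


def estimated_steps(iterated_len, searched_len):
    """Rough cost model: iterated_len * log2(searched_len)."""
    return iterated_len * math.log2(searched_len)


# Hypothetical page lists; brutus is the shorter one.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132, 173, 174, 200, 311]

print(intersect(brutus, caesar))  # [1, 2, 4, 173, 174]
print(estimated_steps(len(brutus), len(caesar)))  # about 28.7
print(estimated_steps(len(caesar), len(brutus)))  # 36.0
```

Even on lists this small, iterating over the shorter list gives the lower estimate. The gap widens quickly as the difference in list lengths grows.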