Why does Document at a time scoring allow us to skip over parts of longer lists while doing intersection?_问答_开发者

Why does Document at a time scoring allow us to skip over parts of longer lists while doing intersection?

开发者 https://www.devze.com 2023-03-04 18:05 出处：网络

This is a information retrieval question. When we do document at a time scoring, why does DAAT allow us to skip over parts of longer lists. I am reading a research paper called Using Graphics Processors for High Perfomance IR Query Processing, in which they just mention the above property without any explanation. AN example will be apprecia开发者_开发问答ted. Thanks

Considering a "AND" query, for example:

"Gaga AND CD"

You can imagine that Gaga is much more rare than CD. In other words, the posting list (or inverted list as you will) of Gaga is much shorter than the one for CD.

Let's assume two small posting lists for the two words (I'm showing only the doc_ids, which are the objects of interest here):

Gaga -> [2, 10, 1023, 2030]
CD   -> [1, 2, 6, 8, 15, 32, 43, 52, 92, 115, 326, 401, 560, 564, ... , 1924, 2030, ...]

In Document-at-a-time retrieval, we iterate through posting lists in paralell looking for docs that match the query (in a AND query, just every doc that occurs in both posting lists).

In this type of retrieval, we can skip documents by knowing the most rare term (Gaga). That way we can use its posting list as a "pivot". The first doc_id to look for is 2, than is 10. Note that, we can skip all doc_ids between 2 and 10 in CD posting lists, because we know it won't match anything. Similarly, the next doc_id processed is 1023. When processing 1023 we can skip at least 10 documents (from 15 to something after 564), because we know it won't match anything.

The algorithm (for the AND query) is basically an array intersection. When you got a intersection you process it. Otherwise you keep skipping.

UPDATE: Many implementations use Skip Lists to avoid doing comparisons while processing inverted lists. In the example above, the system could use the skip list to "jump" to the next position of CD inverted list that has a doc_id close to 10. That way it won't need to compare with 6 and 8.