开发者

Challenge! Complex MySQL Query

开发者 https://www.devze.com 2023-03-02 21:25 出处:网络
We are coding a small search engine. Database Tables: Documents (DocumentID, Title, Abstract, Author, ...)

We are coding a small search engine. Database Tables:

Documents (DocumentID, Title, Abstract, Author, ...)
InvertedIndex (DocumentID, Word, Count)
Stopwords (Word)

Where InvertedIndex has an entry for each word in each Document and the number of times it appears. Stopwords is simply a list of words that I dont care about. The engine is queried with lists of terms separated by or. For example:

  • term1 term2
  • term1 or term2
  • term1 term2 or term3

...etc. Search results based on Relevance, calculated for each document with the boolean extended model. and-ed terms (all terms that are not ored) are multiplied and ors are summed. For example, considering the query term1 term2 or term3, if the terms appear in a document 3, 4, and 5 times respectively, the document relevance would be (3*4)+5 = 12. Also, ignore terms that exist in Stopwords.

OK Now... my professor has told us that calculating the relevance for all documents can be done in a single query. That's what I need help on.

I've prepared some pseudocode fo开发者_如何学JAVAr the example query term1 term2 or term3. So this is how I would calculate the relevance for every document, but I would like to perform a single MySQL query instead. I include this just as clarification for the relevance formula.

foreach document
    relevance = 0
    foreach term_set // where (term1 term2) would be a term_set and (term3) would be the other
        product = 1
        foreach term
            if term not in stopwords
                SELECT Count FROM InvertedIndex WHERE Word=term AND DocumentID=document
                product *= Count
        relevance += product

(EXP(SUM(LOG(COALESCE(Column,1))) apparently is a way to perform aggregate multiplication.

Any help would be greatly appreciated. Sorry if this was a chore to get through. It's 2 oclock and I probably didnt explain this very well.


If I understand your issue, this might help you get started (but you'll have to check syntax, since my MySQL is rusty):

Select DocumentId, Word, Count
From Documents
Inner Join InvertedIndex On Documents.DocumentID = InvertedIndex.DocumentID
Where Word In (term1, term2, term3)

This query would give you a list of DocumentIds, the "search" terms, and the count for each document that contains the search term. You can use this as a starting point for aggregating on DocumentId, using a Group By DocumentId, and then figuring your aggregate multiplication function (which I kindly leave to you).

I haven't worked with MySQL enough to know how to exclude the words in the Stopwords table (you can use EXCEPT in SQL Server) but something like this might work:

Select DocumentId, Word, Count
From Documents
Inner Join InvertedIndex On Documents.DocumentID = InvertedIndex.DocumentID
Where Word In (term1, term2, term3)
And Where Not Exists (
    Select DocumentId, Word, Count
    From Documents
    Inner Join InvertedIndex On Documents.DocumentID = InvertedIndex.DocumentID
    Inner Join Stopwords On InvertedIndex.Word = Stopwords.Word
    Where Word In (term1, term2, term3)
)

Good luck with your assignment. Let us know how it turns out!

0

精彩评论

暂无评论...
验证码 换一张
取 消