Efficient algorithm for finding related submissions

Developer https://www.devze.com 2022-12-08 22:37 Source: Internet

I recently launched my humble side project and would like to add a "related submissions" section when viewing a submission. Exactly like what SO is doing here - see the right column, titled "Related".

Considering that each submission has a title and a set of tags, what is the most effective (best results) and most efficient (fast, memory-friendly) way to query the database for related submissions?

I can think of one way to do this (which I'll post as an answer) but I'm very interested to see what others have to say. Or perhaps there's already a standard way of achieving this?

Here's my two-cent solution:
To get the best output, we need to put a “weight” on the query results.

To start with, each submission in the database is assumed to have a weight of zero. Then, for each tag a submission in the "pool" shares with the current submission, we add +3 to that submission's weight. So a submission that shares one tag gets +3, and one that shares two tags gets +6.

Next, we split/tokenize the title of the current submission and remove “stop words”.
I’ve seen a list of stop words from Google, but for now I’ll define my stop words to be: [“of”, “a”, “the”, “in”]

Example:
Title: “The Best Submission of All Times”
Resulting array: [“The”, “Best”, “Submission”, “of”, “All”, “Times”]
After removing stop words: [“Best”, “Submission”, “All”, “Times”]
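
The tokenizing step above could be sketched in Python roughly like this. The function name title_keywords and the regular expression used for splitting are my own assumptions, and the stop-word list is just the four-word one from the post, matched case-insensitively so that “The” is dropped too.

```python
import re

# Stop words as defined in the post; a real list would be much longer.
STOP_WORDS = {"of", "a", "the", "in"}


def title_keywords(title):
    """Split a title into words and drop stop words (case-insensitive)."""
    words = re.findall(r"[A-Za-z0-9']+", title)
    return [w for w in words if w.lower() not in STOP_WORDS]


print(title_keywords("The Best Submission of All Times"))
# ['Best', 'Submission', 'All', 'Times']
```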

Then we query the database for submissions whose titles contain any of the remaining words, and for each match we add +2 to the weight.
Finally, we sort the list by weight in descending order and take the top N results.
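
Putting the whole weighting scheme together, a minimal in-memory sketch might look like the following. A plain list of dicts stands in for the database here, and the field names (id, title, tags), the related function, and the sample data are all made up for illustration. I am also assuming the +2 is added once per shared title word, which the description leaves open.

```python
import re

STOP_WORDS = {"of", "a", "the", "in"}  # the short list from the post
TAG_WEIGHT = 3    # added once per shared tag
WORD_WEIGHT = 2   # added once per shared title word (an assumption)


def keywords(title):
    """Lower-cased set of title words with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9']+", title.lower())
            if w not in STOP_WORDS}


def related(current, pool, n=5):
    """Return up to n (weight, id) pairs, highest weight first."""
    cur_tags = set(current["tags"])
    cur_words = keywords(current["title"])
    scored = []
    for sub in pool:
        if sub["id"] == current["id"]:
            continue  # never list a submission as related to itself
        weight = TAG_WEIGHT * len(cur_tags & set(sub["tags"]))
        weight += WORD_WEIGHT * len(cur_words & keywords(sub["title"]))
        if weight > 0:
            scored.append((weight, sub["id"]))
    # Sort by weight descending; ties are broken by id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:n]


current = {"id": 1, "title": "The Best Submission of All Times",
           "tags": ["python", "search"]}
pool = [
    current,
    {"id": 2, "title": "Best search tricks", "tags": ["python", "search"]},
    {"id": 3, "title": "A Submission in Time", "tags": ["search"]},
    {"id": 4, "title": "Cooking pasta", "tags": ["food"]},
]
print(related(current, pool))
# [(8, 2), (5, 3)]
```

Submission 2 shares two tags (+6) and the word “best” (+2); submission 3 shares one tag (+3) and the word “submission” (+2). Note that “Time” does not match “Times”, so stemming would be a natural next step.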

What do you think? (be gentle!)

If I understand correctly, you need a technique to find whether two posts are "similar" to each other. You may want to use a probabilistic model for that:

http://en.wikipedia.org/wiki/Mutual_information

The idea would be to say that if two posts share a lot of "uncommon" words, they are probably about the same topic. For detecting uncommon words, depending on your application, you may use a general table of word frequencies or, maybe better, build one yourself from the words in your own posts (but you will need enough posts for the frequencies to be meaningful).
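
As a rough illustration of the "shared uncommon words" idea, here is a sketch that builds the frequency table from the posts themselves and weights each shared word by its inverse document frequency. To be clear, this is a simpler stand-in I am swapping in, not mutual information itself, and the function names and sample posts are invented for the example.

```python
import math
import re
from collections import Counter


def words(text):
    """Lower-cased set of the words in a post."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def build_idf(posts):
    """Map each word to log(N / number of posts that contain it)."""
    doc_freq = Counter()
    for text in posts:
        doc_freq.update(words(text))
    n = len(posts)
    return {w: math.log(n / df) for w, df in doc_freq.items()}


def similarity(a, b, idf):
    """Sum the rarity weights of the words two posts have in common."""
    return sum(idf.get(w, 0.0) for w in words(a) & words(b))


posts = [
    "the best sorting algorithm in python",
    "the fastest sorting algorithm for large lists",
    "the weather in paris",
    "the best pasta recipe",
]
idf = build_idf(posts)

# "the" appears in every post, so it carries no weight at all.
print(idf["the"])  # 0.0
# Posts 0 and 1 share the rarer words "sorting" and "algorithm",
# so they score higher than posts 0 and 2, which share only "the" and "in".
print(similarity(posts[0], posts[1], idf) > similarity(posts[0], posts[2], idf))  # True
```

A nice side effect is that very common words weight themselves out, so a hand-written stop-word list becomes less important once there are enough posts.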

I would not limit myself to the title and tags, but I would weight them more heavily in the search.

This kind of idea is very common in spam filtering. I unfortunately don't have the time to do a full review, but a quick Google search gives:

http://www.aclweb.org/anthology/P/P04/P04-3024.pdf
karlmicha.googlepages.com/acl2004_poster.pdf