I'm working on a customized search feature for a website, and I was curious whether using only tf-idf to rank documents in my corpus would also weigh documents that contain multiple search terms higher than documents that contain only one search term.
Example: Search = "poland spring water". Theoretically, would the above query (using traditional tf-idf) weigh a document higher if the document contained "poland" 100 times and "water" zero times? Or would it weigh a document higher if it contained "poland" 10 times and "water" 10 times?
I'm aware that it all depends on the tf-idf values of "poland" and "water", but theoretically, on an even playing field, would the algorithm help bring documents to the top of the results more if there were multiple terms in the document, or is it really term independent?
It is term independent. Remember, the tf-idf weighting scheme treats the query as a bag of words, and each document is seen as a vector. For the above example, suppose that in doc x the tf for "poland" is 100 and its idf is 1. Also suppose that in doc y the tf for "poland" is 10 and the tf for "water" is 2, and that the idf of "water" is also 1. Summing tf × idf over the query terms gives:
score of doc x = 100 × 1 = 100
score of doc y = 10 × 1 + 2 × 1 = 12
Doc x is ranked higher even though it contains only one of the terms.
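The arithmetic in this answer can be reproduced with a minimal sketch. It assumes the score is a plain sum of tf × idf over the query terms, with no length normalization or cosine similarity, and it reduces the query to the two terms the answer discusses. The function name `tfidf_score` and the dictionaries are hypothetical, chosen only for illustration.

```python
def tfidf_score(query_terms, doc_tf, idf):
    """Sum tf * idf over the query terms.

    A query term that is missing from the document has tf = 0,
    so it simply contributes nothing to the score.
    """
    return sum(doc_tf.get(term, 0) * idf.get(term, 0.0) for term in query_terms)


# Values taken from the answer above; idf = 1 for both terms is its assumption.
query = ["poland", "water"]
idf = {"poland": 1.0, "water": 1.0}

doc_x = {"poland": 100}              # contains only "poland"
doc_y = {"poland": 10, "water": 2}   # contains both terms

score_x = tfidf_score(query, doc_x, idf)
score_y = tfidf_score(query, doc_y, idf)

print(score_x, score_y)  # 100.0 12.0
```

Each term adds to the total independently, so nothing in this scoring rewards a document for matching more distinct terms. A search engine that normalizes by document length or uses cosine similarity may rank the two documents differently.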
It's term independent. The outcome depends on the ratio of how many documents contain "poland" to how many contain "water". If it's half and half, then the second document wins. If the ratio is 100:1, then the first document wins, since that ratio is closer to the in-document distribution of the words.