Ngram IDF smoothing

开发者 https://www.devze.com 2023-01-02 20:59 Source: Internet

I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents.

I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others.

The problem I am running into is that some 3-grams and 4-grams in my data which have super-high IDF actually consist of component unigrams and bigrams which have really low IDF.

For example, "you've never tried" has a very high IDF, while each of its component unigrams has a very low IDF.

I need to come up with a function that can take in document frequencies of an n-gram and all its component (n-k)-grams and return a more meaningful measure of how much this phrase will distinguish the parent document from the rest.

If I were dealing with probabilities, I would try interpolation or backoff models. I am not sure what assumptions or intuitions those models rely on to perform well, and so I don't know how well they would work for IDF scores.

Does anybody have any better ideas?


I take it that "you've never tried" is a phrase that you don't want to extract, but which has high IDF. The problem will be that there are going to be a vast number of n-grams that only occur in one document and so have the largest possible IDF score.
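To make that concrete, here is a minimal sketch on a toy corpus. It assumes the plain definition IDF = log(N / df) and whitespace tokenization, neither of which is stated in the question; the documents and the helper names (`ngrams`, `document_frequencies`, `idf`) are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_frequencies(docs, max_n=3):
    """Count, for every n-gram up to max_n, how many documents contain it."""
    df = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # assumption: naive whitespace tokenizer
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        df.update(seen)  # each document contributes at most 1 per n-gram
    return df

def idf(gram, df, n_docs):
    """Unsmoothed IDF, assumed here to be log(N / df)."""
    return math.log(n_docs / df[gram])

# Hypothetical toy corpus, not data from the question.
docs = [
    "you've never tried this recipe",
    "you've tried it but never liked it",
    "never say you've tried everything",
    "tried and tested",
]
df = document_frequencies(docs)
n_docs = len(docs)

trigram = ("you've", "never", "tried")
trigram_idf = idf(trigram, df, n_docs)       # occurs in one document only
unigram_idfs = [idf((w,), df, n_docs) for w in trigram]
max_idf = math.log(n_docs)                   # ceiling reached when df == 1
```

Here the trigram occurs in a single document, so it sits at the ceiling `max_idf`, while "tried" occurs in every document and scores zero. Any n-gram with a document frequency of 1 ties with it, whether or not it is an interesting phrase.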

There are lots of smoothing techniques in NLP. This paper [Chen&Goodman] is a pretty good summary of many of them. In particular, you sound like you might be interested in the Kneser-Ney smoothing algorithm, which works in the way you suggest (backing off to lower-order n-grams).
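For a feel of what Kneser-Ney does, here is a rough sketch of the interpolated bigram version. It is a simplified illustration, not the full algorithm from the paper: it uses a single fixed discount of 0.75 (an assumed value), handles bigrams only, and the function name `kneser_ney_bigram` is invented for this example.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Build an interpolated Kneser-Ney bigram model; returns prob(w, v) = P(w | v)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()      # c(v): how often v is followed by anything
    followers = defaultdict(set)   # distinct words seen after v
    preceders = defaultdict(set)   # distinct words seen before w
    for (v, w), c in bigrams.items():
        context_count[v] += c
        followers[v].add(w)
        preceders[w].add(v)
    n_bigram_types = len(bigrams)

    def continuation(w):
        # Share of distinct bigram types that end in w.
        return len(preceders.get(w, ())) / n_bigram_types

    def prob(w, v):
        c_v = context_count.get(v, 0)
        if c_v == 0:
            return continuation(w)  # unseen context: lower-order estimate only
        discounted = max(bigrams.get((v, w), 0) - discount, 0) / c_v
        backoff_weight = discount * len(followers[v]) / c_v
        return discounted + backoff_weight * continuation(w)

    return prob

# Hypothetical toy sequence.
tokens = "a b a c a b".split()
prob = kneser_ney_bigram(tokens)
total = sum(prob(w, "a") for w in set(tokens))
```

The mass removed from each seen bigram by the discount is redistributed according to how many distinct contexts a word appears in, not how often it occurs, so `total` still sums to 1 for a seen context.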

These methods are usually used for the task of language modelling, i.e. to estimate the probability of an n-gram occurring given a really big corpus of the language. I don't really know how you might integrate them with IDF scores, or even if that's really what you want to do.
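If you want to experiment anyway, one option is to borrow the shape of linear interpolation and mix an n-gram's IDF with the scores of its two (n-1)-gram components. This is only a heuristic sketch, not an established method and not something taken from the paper: the function `smoothed_idf`, the weight `lam`, and the document frequencies below are all hypothetical.

```python
import math

def smoothed_idf(gram, df, n_docs, lam=0.5):
    """Recursively mix an n-gram's IDF with the mean score of its two (n-1)-grams.

    Heuristic assumption: lam weights the n-gram's own IDF, (1 - lam) weights
    the lower-order evidence. Unigrams are returned unsmoothed.
    """
    raw = math.log(n_docs / df[gram])
    if len(gram) == 1:
        return raw
    parts = (gram[:-1], gram[1:])  # drop the last word, drop the first word
    lower = sum(smoothed_idf(p, df, n_docs, lam) for p in parts) / len(parts)
    return lam * raw + (1 - lam) * lower

# Made-up document frequencies for a corpus of 1000 documents.
n_docs = 1000
df = {
    ("you've",): 400,
    ("never",): 500,
    ("tried",): 300,
    ("you've", "never"): 50,
    ("never", "tried"): 40,
    ("you've", "never", "tried"): 1,
}

gram = ("you've", "never", "tried")
raw_score = math.log(n_docs / df[gram])
mixed_score = smoothed_idf(gram, df, n_docs)
```

With these numbers the phrase is pulled down from its raw IDF because its parts are common, whereas a phrase built from rare words would keep most of its score. The value of `lam` would need tuning against phrases you have judged by hand.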
