I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents.
I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others. The problem I am running into is that some (3,4)-grams in my data which have super-high IDF actually consist of component unigrams and bigrams with really low IDF. For example, "you've never tried" has a very high IDF, while each of the component unigrams has a very low IDF. I need to come up with a function that takes the document frequencies of an n-gram and all its component (n-k)-grams and returns a more meaningful measure of how much the phrase distinguishes its parent document from the rest. If I were dealing with probabilities, I would try interpolation or backoff models. I am not sure what assumptions/intuitions those models leverage to perform well, though, and so how well they would carry over to IDF scores. Does anybody have any better ideas?

I take it that "you've never tried" is a phrase that you don't want to extract, but which has high IDF. The problem is that there are going to be a vast number of n-grams that occur in only one document and so have the largest possible IDF score.
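To make that concrete, here is a minimal sketch (the toy corpus and token lists are made up) of computing per-order document frequencies and IDF. Any n-gram with a document frequency of 1 gets the maximum possible score, even when its component unigrams appear almost everywhere:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def doc_freqs(docs, n):
    """Document frequency of every n-gram: number of docs it appears in."""
    df = Counter()
    for tokens in docs:
        for g in set(ngrams(tokens, n)):
            df[g] += 1
    return df

def idf(df_count, num_docs):
    """Standard IDF; df_count == 1 gives the largest possible score."""
    return math.log(num_docs / df_count)

# Toy corpus (hypothetical): the trigram "you've never tried" occurs in
# a single document, so its IDF is maximal, while its unigrams are common.
docs = [
    "you've never tried this before".split(),
    "you've tried this before".split(),
    "never say never".split(),
]
tri_df = doc_freqs(docs, 3)
uni_df = doc_freqs(docs, 1)
print(idf(tri_df[("you've", "never", "tried")], len(docs)))  # maximal IDF
print(idf(uni_df[("never",)], len(docs)))                     # much lower
```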
There are lots of smoothing techniques in NLP. This paper [Chen & Goodman] is a pretty good summary of many of them. In particular, it sounds like you might be interested in the Kneser-Ney smoothing algorithm, which works in the way you suggest (backing off to lower-order n-grams).
These methods are usually used for language modelling, i.e. to estimate the probability of an n-gram occurring given a really big corpus of the language. I don't really know how you might integrate them with IDF scores, or even if that's really what you want to do.
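If you do want to try combining them, one rough possibility (just a sketch, not Kneser-Ney; the `lam` weight and the default document frequency for unseen grams are arbitrary assumptions) is to interpolate an n-gram's own IDF with the average score of its component (n-1)-grams, recursing down to unigrams. This reuses the `doc_freqs` and `idf` helpers from the sketch above:

```python
def interpolated_idf(gram, df_by_order, num_docs, lam=0.5):
    """
    Rough sketch: mix an n-gram's own IDF with the average interpolated
    IDF of its two component (n-1)-grams, so a phrase whose parts are
    all common gets pulled down even if the full phrase is rare.
    df_by_order[n] maps n-grams of length n to document frequencies.
    """
    n = len(gram)
    own = idf(df_by_order[n].get(gram, 1), num_docs)
    if n == 1:
        return own
    parts = [gram[i:i + n - 1] for i in range(2)]  # the two (n-1)-grams
    backoff = sum(interpolated_idf(p, df_by_order, num_docs, lam)
                  for p in parts) / len(parts)
    return lam * own + (1 - lam) * backoff

# Usage with the toy corpus above:
df_by_order = {n: doc_freqs(docs, n) for n in (1, 2, 3)}
print(interpolated_idf(("you've", "never", "tried"), df_by_order, len(docs)))
```

Whether a fixed interpolation weight is sensible here is an open question; you would probably want to tune `lam` (or make it depend on the counts) against phrases you judge to be genuinely distinctive.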