开发者

Incremental clustering algorithm for grouping news articles?

开发者 https://www.devze.com 2023-01-13 17:51 出处:网络
I\'m doing a little research on开发者_运维百科 how to cluster articles into \'news stories\' ala Google News.

I'm doing a little research on开发者_运维百科 how to cluster articles into 'news stories' ala Google News.

Looking at previous questions here on the subject, I often see it recommended to simply pull out a vector of words from an article, weight some of the words more if they're in certain parts of the article (e.g. the headline), and then to use something like a k-means algorithm to cluster the articles.

But this leads to a couple of questions:

  • With k-means, how do you know in advance how much k should be? In a dynamic news environment you may have a very variable number of stories, and you won't know in advance how many stories a collection of articles represents.

  • With hierarchal clustering algorithms, how do you decide which clusters to use as your stories? You'll have clusters at the bottom of the tree that are just single articles, which you obviously won't want to use, and a cluster at the root of the tree which has all of the articles, which again you won't want...but how do you know which clusters in between should be used to represent stories?

  • Finally, with either k-means or hierarchal algorithms, most literature I have read seems to assume you have a preset collection of documents you want to cluster, and it clusters them all at once. But what of a situation where you have new articles coming in every so often. What happens? Do you have to cluster all the articles from scratch, now that there's an additional one? This is why I'm wondering if there are approaches that let you 'add' articles as you go without re-clustering from scratch. I can't imagine that's very efficient.


I worked on a start-up that built exactly this: an incremental clustering engine for news articles. We based our algorithm on this paper: Web Document Clustering Using Document Index Graph (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=4289851). Worked well for us for 10K articles / day.

It has two main advantages: 1) It's incremental, which addresses the problem you have with having to deal with a stream of incoming articles (rather than clustering all at once) 2) It uses phrase-based modeling, as opposed to just "bag of words", which results in much higher accuracy.

A Google search pops up http://www.similetrix.com, they might have what you're looking for.


I would do a search for adaptive K-means clustering algorithms. There is a good section of research devoted to the problems you describe. Here is one such paper (pdf)

0

精彩评论

暂无评论...
验证码 换一张
取 消