I have run into an unusual problem. I have a set of noun phrases extracted from a large corpus of documents. Each phrase is two or three words long. The phrases need to be clustered because there are far too many of them to show the user as a flat list.
We are looking for a simple way to cluster them. Is there a quick tool/software/method I could use to cluster these phrases so that all phrases inside a cluster belong to a particular theme/topic, given a fixed number of topics chosen up front? I don't have a training set or any existing clusters to learn from.
Topic classification is not an easy problem.
Conventional methods for classifying long documents (hundreds of words) are usually based on frequent words and are not well suited to very short texts. I believe your problem is quite similar to tweet classification.
Two very interesting papers are:
- Discovering Context: Classifying Tweets through a Semantic Transform Based on Wikipedia (presented at HCI International 2011)
- Eddi: Interactive Topic-based Browsing of Social Status Streams (presented at UIST'10)
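That said, if you just want a quick unsupervised baseline with a fixed number of clusters, a bag-of-words approach is easy to try, even though (as noted above) it can struggle with very short texts. Here's a minimal sketch using scikit-learn; the phrase list and k value are placeholders, not part of your actual pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder phrases; in practice these come from your extraction step.
phrases = ["machine learning", "deep neural network", "stock market",
           "interest rate", "neural network training", "market volatility"]

k = 3  # fixed number of topics, chosen up front as you describe

# Word-unigram TF-IDF is usually enough for 2-3 word phrases.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(phrases)

km = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = km.fit_predict(X)

for phrase, label in zip(phrases, labels):
    print(label, phrase)
```

Because the phrases are so short, this only groups phrases that share surface words ("neural network" variants), which is exactly the limitation the papers above try to get around.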
If you want to include knowledge about the world so that, e.g., cat and dog will be clustered together, you can use WordNet's domains hierarchy.
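As a rough illustration of the idea, here is a sketch using NLTK's plain WordNet interface. Note this uses the hypernym hierarchy via Wu-Palmer similarity as a stand-in for the full domains labeling, since WordNet Domains is a separate resource not bundled with NLTK:

```python
from nltk.corpus import wordnet as wn
# Run nltk.download('wordnet') once beforehand if needed.

def word_similarity(w1, w2):
    """Max Wu-Palmer similarity over noun senses; 0.0 if either word is unknown."""
    synsets1 = wn.synsets(w1, pos=wn.NOUN)
    synsets2 = wn.synsets(w2, pos=wn.NOUN)
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in synsets1 for s2 in synsets2]
    return max(scores, default=0.0)

print(word_similarity("cat", "dog"))    # high: both descend from 'carnivore'
print(word_similarity("cat", "stock"))  # much lower
```

You could use pairwise similarities like these to build a distance matrix over your phrases (e.g., averaging over the words in each phrase) and feed that to an agglomerative clusterer.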