Is there any algorithm or trick of how to determine the number of gaussians which should be identified within a set of data before applying the expectation maximization algorithm?
For example, in the above illustrated plot of 2 - Dimensional data, when I apply the Expectation Maximization algorithm, I try to fit 4 gaussians to the data and I would obtain the following result.
But what if I wouldn't knew the number of gauss开发者_StackOverflowians within the data? Is there any algorithm or trick which I could apply so that I could find out this detail?
This might be a bit of a retread, since others already linked the wiki article of the actual cluster number determination, but I found that article a lil overly dense, so I thought I'd provide a brief, intuitive answer:
Basically, there isn't a universally 'correct' answer for the number of clusters in a data set -- the fewer clusters, the smaller the description length but the higher the variance, and in all non-trivial datasets the variance won't completely go away unless you have a Gaussian for each point, which renders the clustering useless (this is a case of the more general phenomena known as the 'futility of bias free learning': A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances).
So you basically have to pick some feature of your dataset to maximize via the number of clusters (see the wiki article on inductive bias for some example features)
In other sad news, in all such cases finding the number of clusters is known to be NP-hard, so the best you can expect is a good heuristic approach.
Wikipedia has an article on this subject. I am not too familiar with the subject, but I've been told that clustering algorithms that don't require specifying the number of clusters instead need some density information about the clusters or some minimum distance between clusters.
- Non parametric bayesian clustering is now getting lot of attention. You dont need to specify clusters.
- Autoclass is algorithm that automatically identify number of clusters from mixture.
精彩评论