I'm working on a cluster analysis program that takes a set of points S as an input and labels each point with that index of the cluster it belong to. I've implemented the DBScan and OPTICS algorithms and they both work as expected. However, the results of those algorithms can be very different depending on the initial values of MinPts and Epsilon. I've searched all over the net and read lots of papers about data mining and cluster analysis and yet I can't seem to find a way of analyzing the data without needing MinPts and Epsilon to determine if a point is in such a cluster. I'm guessing density based cluster analysis is not the way to go in my case.
Does anyone have an idea or know about an algorithm I could use that wont require that kind of configuration ? Or simply point me in the right direction. Any help is welcome.
Thanks!
It's a school project I'm trying to finish, in which I have a set of 2D coords representing points on a plane, and I have to determine what cluster each point belongs to. Now I've done that using OPTICS and it work fine but I need to tweak the Eps value so that my output matches the example outputs I'm given. But since I have no description of what a cluster is in the subject, or what it's characteristics are, there is no way I can base myself solely on the distance between the points, or the density of points in a given region. Also, I do not know the number of clusters in advance, hence my use of the OPTICS algorithm. So in my opinion, either I'm doing it 开发者_开发问答very wrong, either there is a crucial piece of info missing in the subject. And also, I'm not looking for anyone to do my homework or give me any source code, just some ideas or guidance, since I'm pretty much lost as how get the exact results given in the data set examples (I'm also not allowed to get any wrong values, if I do they consider the project is a failure, so algorithms with error margins can't be used).
Thanks again, and sorry for the long post.
In general a set of points can be assigned to clusters in more than one way (e.g. they can all be assigned to one big cluster, or divided into two or three), so you have to have some parameters.
Why do you object to MinPts and Epsilon? If you don't like what happens when you change them, don't change them. Seriously.
EDIT:
What a bizarre assignment! Your clustering must match theirs perfectly, with no other clues? I will assume that they are neither idiots nor sadists and make the following guess: in the examples, there is a "natural" clustering that is obvious to the eye. Am I right? If so then there is a way we can set the parameters programmatically, as a function of distances in the point set. How many examples are there, and is it possible to post one?
EDIT:
Hah! I knew it! Here is a rule that will correctly divide this case into clusters: find the biggest distance from any point to its nearest neighbor, and if any two points are less than twice that distance apart, they belong to the same cluster. I'll bet it will work on the other cases too.
You could try looking into the many other cluster algorithms out there. You have probabilistic clustering (EM), partitional clustering (KMeans), hierarchical clustering, and many others... Of course each requires a different kind of configuration
Also make sure to try Weka, an open source tool containing lots and lots of machine learning algorithms (classification, clustering, pre-processing, ...). I believe it has an implementation (Java) for all those algorithms mentioned.
Edit: The question of determining which clustering is best is very domain-dependent. And it all comes down to how the clusters are put to use in the context of your application that determines how useful they are (Besides there could be more than one natural clustering of your data).
精彩评论