I am looking for a tool that generates datasets of different shapes (square, circle, rectangle, etc.) with outliers, for cluster analysis.
Can anyone recommend a good dataset generator for cluster analysis? Is there any way to generate such datasets in a language like R?
You should probably look into the mlbench package, especially its synthetic dataset generators (the mlbench.* functions).
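As a rough illustration of the mlbench.* generators, here is a minimal sketch. It assumes mlbench is installed from CRAN; the sample sizes, noise levels and seed are arbitrary choices of mine.

```r
# Assumes mlbench is installed: install.packages("mlbench")
library(mlbench)

set.seed(42)  # arbitrary seed, only for reproducibility

# Each generator returns a list with a coordinate matrix `x`
# and a factor `classes` holding the true cluster labels.
spirals <- mlbench.spirals(300, cycles = 1.5, sd = 0.05)  # two noisy spirals
smiley  <- mlbench.smiley(500, sd1 = 0.1, sd2 = 0.05)     # eyes, nose, mouth
cassini <- mlbench.cassini(500)                           # two bananas and a circle
shapes  <- mlbench.shapes(500)                            # four differently shaped blobs

# mlbench objects have a plot method that colours points by class
plot(shapes)

# Plain data frame for use with clustering functions
spirals_df <- data.frame(spirals$x, class = spirals$classes)
head(spirals_df)
```

The true labels in `classes` are handy for checking a clustering result afterwards.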
Other datasets or utility functions are probably best found on the Cluster Task View on CRAN. As @Roman said, adding outliers is not really difficult, especially when you work in only two dimensions.
I would create a shape and extract its bounding coordinates. You can then populate the shape with random points using the splancs package.
Here's a small snippet from one of my programs:
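# First we create a circle, into which uniform random points will be generated (kudos to Barry Rowlingson, r-sig-geo).
library(splancs)  # provides csr()
# x, y: centre; r: radius; n: number of vertices of the approximating polygon
circle <- function(x, y, r, n) {
  t <- seq(from = 0, to = 2 * pi, length.out = n + 1)[-1]
  poly <- cbind(x = x + r * sin(t), y = y + r * cos(t))
  # repeat the first vertex to close the polygon
  rbind(poly, poly[1, ])
}
# 1000 uniform random points inside a 30-sided "circle" of radius 100
pts <- csr(circle(0, 0, 100, 30), 1000)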
Feel free to add outliers. One way of going about this is sampling different shapes and joining them in different ways.
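A minimal sketch of that idea in base R only, so it does not depend on splancs. The helper names runif_disc and runif_rect, the shape sizes and the 20% padding are my own illustrative choices, not part of any package.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: uniform points in a disc.
# Taking sqrt() of the uniform radius keeps the density uniform over the area.
runif_disc <- function(n, cx, cy, r) {
  rho   <- r * sqrt(runif(n))
  theta <- runif(n, 0, 2 * pi)
  cbind(x = cx + rho * cos(theta), y = cy + rho * sin(theta))
}

# Hypothetical helper: uniform points in an axis-aligned rectangle.
runif_rect <- function(n, xmin, xmax, ymin, ymax) {
  cbind(x = runif(n, xmin, xmax), y = runif(n, ymin, ymax))
}

disc     <- runif_disc(500, 0, 0, 100)
rect     <- runif_rect(500, 150, 350, -50, 50)
clusters <- rbind(disc, rect)

# Outliers: uniform noise over the bounding box, enlarged by 20% on each side
pad   <- 0.2 * apply(clusters, 2, function(v) diff(range(v)))
lo    <- apply(clusters, 2, min) - pad
hi    <- apply(clusters, 2, max) + pad
noise <- runif_rect(25, lo["x"], hi["x"], lo["y"], hi["y"])

dataset <- data.frame(rbind(clusters, noise),
                      label = rep(c("disc", "rect", "noise"), c(500, 500, 25)))
plot(dataset$x, dataset$y, col = factor(dataset$label), asp = 1)
```

Note that some noise points will land inside the shapes; if you need strict outliers, filter those out before joining.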
There is a flexible data generator in ELKI that can generate various distributions in arbitrary dimensionality. It can also generate Gamma-distributed variables, for example.
There is documentation on the Wiki: http://elki.dbs.ifi.lmu.de/wiki/DataSetGenerator