I am trying to produce some example graphics using ggplot2, and one of the examples I picked was the birthday problem, here using code 'borrowed' from a Revolution computing presentation at Oscon.
birthday<-function(n){
ntests<-1000
pop<-1:365
anydup<-function(i){
any(duplicated(sample(pop,n,replace=TRUE)))
}
sum(sapply(seq(ntests), anydup))/ntests
}
x<-data.frame(x=rep(1:100, each=5))
x<-ddply(x, .(x),开发者_StackOverflow社区 function(df) {return(data.frame(x=df$x, prob=birthday(df$x)))})
birthdayplot<-ggplot(x, aes(x, prob))+
geom_point()+geom_smooth()+
theme_bw()+
opts(title = "Probability that at least two people share a birthday in a random group")+
labs(x="Size of Group", y="Probability")
Here my graph is what I would describe as exponential, but the geom_smooth doesn't fit the data particularly well. I've tried the loess method but this didn't change things much. Can anyone suggest how to add a better smooth ?
Thanks
Paul.
The smoothing routine does not react to the sudden change for low values of x
fast enough (and it has no way of knowing that the values of prob
are restricted to a 0-1 range). Since you have so low variability, a quick solution is to reduce the span of values over which smoothing at each point is done. Check out the red line in this plot:
birthdayplot + geom_smooth(span=0.1, colour="red")
The problem is that the probabilities follow a logistic curve. You could fit a proper smoothing line if you change the birthday function to return the raw successes and failures instead of the probabilities.
birthday<-function(n){
ntests<-1000
pop<-1:365
anydup<-function(i){
any(duplicated(sample(pop,n,replace=TRUE)))
}
data.frame(Dups = sapply(seq(ntests), anydup) * 1, n = n)
}
x<-ddply(x, .(x),function(df) birthday(df$x))
Now, you'll have to add the points as a summary, and specify a logistic regression as the smoothing type.
ggplot(x, aes(n, Dups)) +
stat_summary(fun.y = mean, geom = "point") +
stat_smooth(method = "glm", family = binomial)
精彩评论