I am trying to recreate the following plot with R. Minitab describes it as a normal probability plot.
probplot (from the e1071 package) gets me most of the way there. Unfortunately, I cannot figure out how to add the confidence interval bands around the plot.
Similarly, ggplot2's stat_qq() seems to present similar information with a transformed x axis. geom_smooth() would be the likely candidate to add the bands, but I haven't figured that out either.
Finally, the Getting Genetics Done guys describe something similar here.
Sample data to recreate the plot above:
x <- c(40.2, 43.1, 45.5, 44.5, 39.5, 38.5, 40.2, 41.0, 41.6, 43.1, 44.9, 42.8)
If anyone has a solution with base graphics or ggplot, I'd appreciate it!
EDIT
After looking at the details of probplot, I've determined this is how it generates the fit line on the graph:
> xl <- quantile(x, c(0.25, 0.75))
> yl <- qnorm(c(0.25, 0.75))
> slope <- diff(yl)/diff(xl)
> int <- yl[1] - slope * xl[1]
> slope
75%
0.4151
> int
75%
-17.36
Indeed, these results match what you get out of the probplot object very well:
> check <- probplot(x)
> str(check)
List of 3
$ qdist:function (p)
$ int : Named num -17.4
..- attr(*, "names")= chr "75%"
$ slope: Named num 0.415
..- attr(*, "names")= chr "75%"
- attr(*, "class")= chr "probplot"
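For convenience, the quartile-line computation can be wrapped in a small base-R helper. This is my own sketch following the same first/third-quartile rule shown above, not probplot's code verbatim:

```r
# Reference line through the (0.25, 0.75) quantile pair:
# sample quartiles on x, theoretical normal quartiles on y.
quartile_line <- function(x, dist_quantile = qnorm) {
  xl <- quantile(x, c(0.25, 0.75))    # sample quartiles
  yl <- dist_quantile(c(0.25, 0.75))  # theoretical quartiles
  slope <- diff(yl) / diff(xl)
  int <- yl[1] - slope * xl[1]
  c(intercept = unname(int), slope = unname(slope))
}

x <- c(40.2, 43.1, 45.5, 44.5, 39.5, 38.5,
       40.2, 41.0, 41.6, 43.1, 44.9, 42.8)
quartile_line(x)  # intercept ~ -17.36, slope ~ 0.415, as above
```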
However, incorporating this information into ggplot2 or base graphics does not yield the same results.
probplot(x)
Versus:
ggplot(data = df, aes(x = x, y = y)) + geom_point() + geom_abline(intercept = int, slope = slope)
I get similar results using R's base graphics:
plot(df$x, df$y)
abline(int, slope, col = "red")
Lastly, I've learned that the last two rows of the legend refer to the Anderson-Darling test for normality and can be reproduced with the nortest package.
> ad.test(x)
Anderson-Darling normality test
data: x
A = 0.2303, p-value = 0.7502
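Incidentally, the statistic itself is easy to reproduce in base R. Here is a sketch of the textbook Anderson-Darling formula, with mean and sd estimated from the sample, which agrees with the A reported above (the p-value additionally requires nortest's small-sample correction):

```r
# Anderson-Darling statistic for normality, with mean and sd
# estimated from the sample:
#   A^2 = -n - (1/n) * sum((2i - 1) * (log(p_i) + log(1 - p_(n+1-i))))
# where p_i = Phi(z_(i)) for the sorted, standardized sample.
ad_stat <- function(x) {
  x <- sort(x)
  n <- length(x)
  p <- pnorm((x - mean(x)) / sd(x))
  i <- seq_len(n)
  -n - mean((2 * i - 1) * (log(p) + log(1 - rev(p))))
}

x <- c(40.2, 43.1, 45.5, 44.5, 39.5, 38.5,
       40.2, 41.0, 41.6, 43.1, 44.9, 42.8)
ad_stat(x)  # ~0.23, agreeing with ad.test(x) above
```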
Try the qqPlot function in the QTLRel package.
require("QTLRel")
qqPlot(rnorm(100))
Perhaps this will be something you can build on. By default, stat_smooth() uses level=0.95.
df <- data.frame(sort(x), ppoints(x))
colnames(df) <- c("x","y")
ggplot(df, aes(x,y)) +
geom_point() +
stat_smooth() +
scale_y_continuous(limits = c(0, 1), breaks = seq(from = 0.05, to = 1, by = 0.05),
                   labels = scales::percent)  # 'formatter = "percent"' in old ggplot2 versions
You are using the incorrect "y"; the values should be quantiles (labeled with probabilities). The following shows the line in the right spot:
df <- data.frame(x = sort(x), y = qnorm(ppoints(length(x))))
probs <- c(0.01, 0.05, seq(0.1, 0.9, by = 0.1), 0.95, 0.99)
qprobs <- qnorm(probs)
xl <- quantile(x, c(0.25, 0.75))
yl <- qnorm(c(0.25, 0.75))
slope <- diff(yl) / diff(xl)
int <- yl[1] - slope * xl[1]
ggplot(data = df, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = int, slope = slope) +
  scale_y_continuous(limits = range(qprobs), breaks = qprobs, labels = 100 * probs) +
  labs(y = "Percent", x = "Data")
To add the confidence bounds as in Minitab, you can do the following:
fd <- fitdistr(x, "normal")  # maximum-likelihood fitting of a univariate distribution, from MASS
xp_hat <- fd$estimate[1] + qprobs * fd$estimate[2]  # estimated percentiles for the fitted normal
v_xp_hat <- fd$sd[1]^2 + qprobs^2 * fd$sd[2]^2 + 2 * qprobs * fd$vcov[1, 2]  # variance of estimated percentiles
xpl <- xp_hat + qnorm(0.025) * sqrt(v_xp_hat)  # lower bound
xpu <- xp_hat + qnorm(0.975) * sqrt(v_xp_hat)  # upper bound
df.bound <- data.frame(xp = xp_hat, xpl = xpl, xpu = xpu, nquant = qprobs)
Then add the following lines to your ggplot from above (the xp line replaces the slope-and-intercept approach with the estimated percentiles; the other two are the bounds):
geom_line(data = df.bound, aes(x = xp, y = nquant)) +
geom_line(data = df.bound, aes(x = xpl, y = nquant)) +
geom_line(data = df.bound, aes(x = xpu, y = nquant))
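For what it's worth, the fitdistr step can also be done in closed form. For a normal MLE the estimates are the sample mean and the sd with an n denominator, and the asymptotic variances are sigma^2/n and sigma^2/(2n) with zero covariance, which is what fitdistr(x, "normal") returns. A base-R-only sketch of the same band computation:

```r
# Closed-form version of the fitdistr-based bands (base R only).
x <- c(40.2, 43.1, 45.5, 44.5, 39.5, 38.5,
       40.2, 41.0, 41.6, 43.1, 44.9, 42.8)
probs <- c(0.01, 0.05, seq(0.1, 0.9, by = 0.1), 0.95, 0.99)
qprobs <- qnorm(probs)

n <- length(x)
mu_hat <- mean(x)
sigma_hat <- sqrt((n - 1) / n) * sd(x)  # MLE uses the n denominator

xp_hat <- mu_hat + qprobs * sigma_hat   # estimated percentiles
v_xp_hat <- sigma_hat^2 / n + qprobs^2 * sigma_hat^2 / (2 * n)
xpl <- xp_hat + qnorm(0.025) * sqrt(v_xp_hat)  # lower bound
xpu <- xp_hat + qnorm(0.975) * sqrt(v_xp_hat)  # upper bound
df.bound <- data.frame(xp = xp_hat, xpl = xpl, xpu = xpu, nquant = qprobs)
```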
I know it's an old question, but for others who are still looking for a solution: have a look at ggqqplot from the ggpubr package.
library(ggpubr)
ggqqplot(data$sample)
(It's related to the answer from Julie B above: https://stackoverflow.com/a/9215532/5885615)
This is an old topic, but people may still want to do this (I did recently). I found one issue that produces slightly different results between R and Minitab: the QQ plots are similar, but the end points are shifted further outward. After digging into the code, I found the difference.
The function ppoints is used to generate the plotting positions for the sample:
df<-data.frame(x=sort(x),y=qnorm(ppoints(length(x))))
In R its source code is:
function (n, a = if (n <= 10) 3/8 else 1/2)  # ppoints
{
    if (length(n) > 1L)
        n <- length(n)
    if (n > 0)
        (1L:n - a)/(n + 1 - 2 * a)
    else numeric()
}
where the parameter "a", depending on "n", can be 3/8 or 1/2.
Minitab uses a = 0.3 for all "n".
The most visible effect is on the end points of the sample.
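To see the size of the effect, compare R's default plotting positions with Minitab's a = 0.3 rule, (i - 0.3)/(n + 0.4) (Benard's median-rank approximation), for the 12-point sample; the tail positions differ the most:

```r
n <- 12
i <- seq_len(n)
pp_r <- ppoints(n)                   # a = 1/2 here, since n > 10
pp_minitab <- (i - 0.3) / (n + 0.4)  # a = 0.3, as Minitab uses

# The first/last positions differ most, so qnorm() pushes R's
# end points further out on the probability scale than Minitab's.
round(rbind(R = pp_r, Minitab = pp_minitab)[, c(1, n)], 4)
```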