What I got: A matrix where I got the predicted probability of an outcome (from a logistic regression model) and the known outcome. For those curious I actually got two regression models and an independent test dataset where I wish to compare these two models by doing this.
> head(matrixComb)
probComb outComb
[1,] 0.9999902 1
[2,] 0.9921736 0
[3,] 0.9901175 1
[4,] 0.9815581 0
[5,] 0.7692992 0
[6,] 0.7369990 0
What I want: A graph where I can plot how often my prediction model yields correct outcomes (one line for positives and one line for negatives) as a function of the cut off value for the probability. My problem is that I am unable to figure out how to do this without switching to Perl and use to For-loop to iterate through the matrix.
In Perl I would just start at probability 0.1 and in reach run of the for-loop increase the value by 0.1. In the first iteration I would count all probabilities <0.1 and outcome = 0 as true negatives, probability < 0.1 and outcome 1 as false negatives probability > 0.1 and outcome = 0 as false positives and probability > 0.1 and outcome = 1 as true positives.
The process would then be repeated and the results of each iteration would be printed as [probability, true positives/total positives, true negatives/total negatives]. Thus make it easy for me to print it out in open office开发者_开发问答 calc.
The reason that I am asking this is that the operation is too complex for me to find a similar case here on stackoverflow or in a tutorial. But I would really like to learn a way to do this in an efficient manner in the R environment.
You can get R to draw the curves which are based on ROC analysis. This is a crude version using the ROCR
package and could easily be made prettier
ss <- 1000 # sample size
mydf <- data.frame(probComb = runif(ss)) # predictions illustration
mydf$outComb <- 0 + (runif(ss) < mydf$probComb) # actuals illustration
library(ROCR)
pred <- prediction(mydf$probComb, mydf$outComb)
perfp <- performance(pred, "tpr")
perfn <- performance(pred, "tnr")
plot(perfp, col="green", ylab="True positive (green) and true negative (red) rates")
plot(perfn, col="red", ylab="True negative rate", add=TRUE)
to produce
If you must, you can find the data in perfp
and perfn
.
Here's a way to do this manually:
#Create some sample data
dat <- data.frame(x=runif(100),y=sample(0:1,100,replace=TRUE))
#Function to compute tp and tn
myFun <- function(x){
tbl <- table(dat$x > x,dat$y)
marg <- margin.table(tbl,2)
tn <- tbl[1,1]/marg[1]
tp <- tbl[2,2]/marg[2]
rs <- c(tp,tn)
names(rs) <- c('truePos','trueNeg')
return(rs)
}
#Decision thresholds
thresh <- seq(0.1,0.9, by = 0.1)
#Loop using lapply
temp <- as.data.frame(do.call(rbind,lapply(thresh,myFun)))
temp$thresh <- thresh
#Melt and plot using ggplot
tempMelt <- melt(temp,id.vars="thresh")
ggplot(tempMelt,aes(x=thresh,y=value)) +
geom_line(aes(group=variable,colour=variable))
Alternatively, as mentioned above in the comments, there are a plethora or ROC functions in R which can be found using ??ROC
. For example, using roc
from the caret
package:
temp <- as.data.frame(roc(dat$x,factor(dat$y)))
tempMelt <- melt(temp,id.vars="cutoff")
ggplot(tempMelt,aes(x=cutoff,y=value)) +
geom_line(aes(group=variable,colour=variable))
Maybe something like this:
# A function for counting outcomes for a certain probability
f <- function(d, p) {
lp <- d$prob < p
c(TNeg=sum(lp & d$out==0), TPos=sum(!lp & d$out==1))
}
# Make it accept a vector of probabilities
vf <- Vectorize(f, 'p')
# Sample data
n <- 100
d <- data.frame(prob=runif(n), out=round(runif(n)))
# Probabilities to plot
p <- seq(0,1, len=20)
res <- vf(d, p)
colnames(res) <- paste('p(', p, ')', sep='')
matplot(p, t(res), type='l', xlab='prob', ylab='count')
精彩评论