ROC曲线

ROC Curve(ReceiverOperating Characteristics Curve,指受试者工作特征曲线 / 接收器操作特性曲线),在评估和比较二分类模型(结果通常标记为Positive或Negative)的性能时非常有用,它是一个二维的曲线,Y轴用sensitive(敏感性)表示,X轴用1-specificity(特异性)表示。它通过将连续变量设定出多个不同的临界值(cut-off points),从而计算出一系列sensitive和specificity。在ROC曲线上,最靠近坐标图左上方的点为敏感性和特异性均较高的临界值。


一个分类模型将不同的观测值分为两类(Positive,Negative),那么这个分类模型产生的结果共有四种可能:



一些重要的定义:

  1. type I error = FP;
  2. type II error= FN;
  3. TP (True, Positive): 实际为Positive, 并且模型识别为Positive;
  4. TN (True, Negative): 实际为Negative, 并且模型识别也为Negative;
  5. FP (False,Positive): 实际为Negative,但是模型识别为Positive。比如,一个人并没有患癌症,但是模型识别这个人患有癌症,那么结果为FP;
  6. FN (False Negative):  实际为Positive,但是模型识别为Negative。比如,一个人确实有病,但是模型识别他没有病,那么结果为FN;
  7. TPR(True Positive Rate):模型识别出的Positive实例占所有Positive实例的比例,TPR=TP/ (TP+ FN);
  8. FPR(True Negative Rate):模型识别出的Positive但实际为Negative的实例占所有Negative实例的比例,FPR= FP/ (FP + TN);
  9. Sensitivity = power = TPR = TP / (TP + FN)
  10. Specificity= TN / (TN + FP) = 1 – FPR
  11. Accuracy = (TP+TN) / (P+N)
  12. PPV(positive predictive value ) = precision = TP /(TP + FP)
  13. NPV (negative predictive value ) = TN / (TN + FN)
  14. false discovery rate = 1 - PPV = FP / (FP + TP) 

接下来我们给出画ROC的方法。


下图蓝色为positive,红色为negative。

 

TP,FP,TN,FN的具体表示:


 

ROC的计算示例:

 

ROC曲线好坏的评价:


 

其中AUC(Area Under the ROC Curve,ROC曲线下的面积)。


下面给出ROC曲线的R代码:


1.  tpr--fpr 之间的曲线

## computing a simple ROC curve (x-axis: fpr, y-axis: tpr)
library(ROCR.xval)
data(ROCR.xval)
pred <- prediction(ROCR.xval$predictions, ROCR.xval$labels)
perf <- performance(pred,"tpr","fpr")
plot(perf,col="grey82",lty=3)
plot(perf,lwd=3,avg="vertical",spread.estimate="boxplot",add=TRUE)

                          

2.  recall--precision之间的曲线
## precision/recall curve (x-axis: recall, y-axis: precision)
perf1 <- performance(pred, "prec", "rec")
plot(perf1,col="grey82",lty=3)
plot(perf1,lwd=3,avg="vertical",spread.estimate="boxplot",add=TRUE)
                            

3.  specificity--sensitivity之间的曲线
## sensitivity/specificity curve (x-axis: specificity, y-axis: sensitivity)
perf1 <- performance(pred, "sens", "spec")
plot(perf1,col="grey82",lty=3)
plot(perf1,lwd=3,avg="vertical",spread.estimate="boxplot",add=TRUE)

                                 

ROCR一些常用方法:

pred <- prediction(ROCR.simple$predictions,ROCR.simple$labels)

# performance
roc.perf <- performance(pred, measure = "tpr", x.measure = "fpr")

# draw roc curve
plot(roc.perf, col=rainbow(10))
abline(a=0, b= 1)

# precise and recall
rec.perf <- performance(pred, "prec", "rec")
plot(rec.perf, col=rainbow(10))

# accuracy
acc.perf = performance(pred, "acc")
plot(acc.perf, avg= "vertical", spread.estimate="boxplot", show.spread.at= seq(0.1, 0.9, by=0.1))

ind = which.max( slot(acc.perf, "y.values")[[1]] )
acc = slot(acc.perf, "y.values")[[1]][ind]
cutoff = slot(acc.perf, "x.values")[[1]][ind]
print(c(accuracy= acc, cutoff = cutoff))

# sensitivity and specificity
sens.perf<- performance(pred, "sens", "spec")
plot(sens.perf)

# check calibration
cal.perf <- performance(pred, "cal", window.size=50)
plot(cal.perf)

# cutoff density plot
plot(0,0,type="n", xlim= c(0,1), ylim=c(0,7), xlab="Cutoff", ylab="Density",
main="How well do the predictions separate the classes?")

for (runi in 1:length(pred@predictions)) {
  lines(density(pred@predictions[[runi]][pred@labels[[runi]]=="1"]), col= "red")
  lines(density(pred@predictions[[runi]][pred@labels[[runi]]=="0"]), col="green")
}

# find optimal cutoff
opt.cut = function(perf, pred){
    cut.ind = mapply(FUN=function(x, y, p){
        d = (x - 0)^2 + (y-1)^2
        ind = which(d == min(d))
        c(sensitivity = y[[ind]], specificity = 1-x[[ind]], cutoff = p[[ind]])
    }, perf@x.values, perf@y.values, pred@cutoffs)
}

print(opt.cut(roc.perf, pred))

# calculate AUC value
auc.perf = performance(pred, measure = "auc")
auc.perf@y.values

acc_perf <- performance(pred, measure = "acc")
plot(acc_perf)

rec_perf <- performance(pred, measure = "rec")
plot(rec_perf)


指标:

performance_measures <- function(pred) {
    
    tpr <- performance(pred, "tpr")@y.values[[1]]
    fpr <- performance(pred, "fpr")@y.values[[1]]
    acc <- performance(pred, "acc")@y.values[[1]]
    err <- performance(pred, "err")@y.values[[1]]

    rec <- performance(pred, "rec")@y.values[[1]]
    sens<- performance(pred, "sens")@y.values[[1]]
    fnr <- performance(pred, "fnr")@y.values[[1]]
    tnr <- performance(pred, "tnr")@y.values[[1]]
    spec<- performance(pred, "spec")@y.values[[1]]
    ppv <- performance(pred, "ppv")@y.values[[1]]
    prec<- performance(pred, "prec")@y.values[[1]]
    npv <- performance(pred, "npv")@y.values[[1]]
   
    fall<- performance(pred, "fall")@y.values[[1]]
    miss<- performance(pred, "miss")@y.values[[1]]
    pcfall <- performance(pred, "pcfall")@y.values[[1]]
    pcmiss <- performance(pred, "pcmiss")@y.values[[1]]
    rpp <- performance(pred, "rpp")@y.values[[1]]
    rnp <- performance(pred, "rnp")@y.values[[1]]
    
    auc <- performance(pred, "auc")@y.values[[1]]
    prbe<- performance(pred, "prbe")@y.values[[1]]
    rch <- performance(pred, "rch")@y.values[[1]]

    mxe <- performance(pred, "mxe")@y.values[[1]]
    rmse<- performance(pred, "rmse")@y.values[[1]]

    phi <- performance(pred, "phi")@y.values[[1]]
    mat <- performance(pred, "mat")@y.values[[1]]
    mi  <- performance(pred, "mi")@y.values[[1]]
    chisq<- performance(pred, "chisq")@y.values[[1]]
    odds<- performance(pred, "odds")@y.values[[1]]
    lift<- performance(pred, "lift")@y.values[[1]]
    f   <- performance(pred, "f")@y.values[[1]]
    sar <- performance(pred,"sar")@y.values[[1]]
    ecost  <- performance(pred, "ecost")@y.values[[1]]
    cost  <- performance(pred, "cost")@y.values[[1]]
    return(list(tpr=tpr, fpr=fpr, acc=acc, err=err,
                rec=rec, sens=sens, fnr=fnr, tnr=tnr,
                spec=spec, ppv=ppv, prec=prec, npv=npv, 
                fall=fall, miss=miss, pcfall=pcfall, pcmiss=pcmiss, rpp=rpp, rnp=rnp,
                auc=auc, prbe=prbe, rch=rch, mxe=mxe, 
                rmse=rmse, phi=phi, mat=mat, mi=mi, chisq=chisq, odds=odds,
                lift=lift, f=f, sar=sar, ecost=ecost, cost=cost))

}



 参考资料:

  1. Stats for Clinical Trials, Math 150, JoHardin, Info on ROC curves
  2. Receiver Operating Characteristic Analysis:A Tool for the Quantitative Evaluation of Observer Performance and ImagingSystems.
  3. Package 'ROCR'





版权声明:本文为fenghuangdesire原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接和本声明。