The ROC curve (Receiver Operating Characteristic curve) is very useful for evaluating and comparing the performance of binary classification models (whose outcomes are usually labeled Positive or Negative). It is a two-dimensional curve: the Y axis shows sensitivity and the X axis shows 1 - specificity. By setting a series of different cut-off points on a continuous prediction score, it produces a series of sensitivity and specificity values. On a ROC curve, the point closest to the top-left corner of the plot is the cut-off at which both sensitivity and specificity are high.
A classification model assigns each observation to one of two classes (Positive or Negative), so its predictions fall into four possible categories.
Some important definitions:
- TP (True Positive): actually Positive, and the model predicts Positive;
- TN (True Negative): actually Negative, and the model predicts Negative;
- FP (False Positive): actually Negative, but the model predicts Positive. For example, if a person does not have cancer but the model says they do, the result is a FP;
- FN (False Negative): actually Positive, but the model predicts Negative. For example, if a person really is ill but the model says they are not, the result is a FN;
- type I error = FP; type II error = FN;
- TPR (True Positive Rate): the proportion of all Positive instances that the model identifies as Positive, TPR = TP / (TP + FN);
- FPR (False Positive Rate): the proportion of all Negative instances that the model wrongly identifies as Positive, FPR = FP / (FP + TN);
- Sensitivity = power = TPR = TP / (TP + FN);
- Specificity = TN / (TN + FP) = 1 - FPR;
- Accuracy = (TP + TN) / (P + N), where P = TP + FN and N = TN + FP are the total numbers of actual Positive and Negative instances;
- PPV (positive predictive value) = precision = TP / (TP + FP);
- NPV (negative predictive value) = TN / (TN + FN);
- false discovery rate = 1 - PPV = FP / (FP + TP).
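To make these definitions concrete, here is a minimal R sketch that computes the main quantities from one hypothetical set of confusion-matrix counts (the values of tp, fn, tn, and fp below are made up for illustration):

# hypothetical confusion-matrix counts (made-up values)
tp <- 90; fn <- 10   # actual Positive instances, by model prediction
tn <- 80; fp <- 20   # actual Negative instances, by model prediction
tpr  <- tp / (tp + fn)                    # sensitivity = power = TPR
fpr  <- fp / (fp + tn)                    # FPR
spec <- tn / (tn + fp)                    # specificity = 1 - FPR
acc  <- (tp + tn) / (tp + fn + tn + fp)   # accuracy
ppv  <- tp / (tp + fp)                    # precision = PPV
npv  <- tn / (tn + fn)                    # NPV
fdr  <- fp / (fp + tp)                    # false discovery rate = 1 - PPV
round(c(TPR = tpr, FPR = fpr, specificity = spec, accuracy = acc,
        PPV = ppv, NPV = npv, FDR = fdr), 3)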
Next we describe how to draw a ROC curve.
[Figure: score distributions of the two classes; blue denotes positive, red denotes negative.]
[Figure: how TP, FP, TN, and FN are determined by the cutoff.]
[Figure: a worked example of computing a ROC curve.]
[Figure: evaluating how good a ROC curve is.]
The quality of a ROC curve is summarized by the AUC (Area Under the ROC Curve): the closer the AUC is to 1, the better the model, while an AUC of 0.5 is no better than random guessing.
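Before turning to ROCR, the cutoff-sweeping idea can be written out by hand. The sketch below uses made-up scores and labels, treats every observed score as a cutoff, computes the resulting (FPR, TPR) pairs, and approximates the AUC with the trapezoidal rule:

# toy data: predicted scores and true labels (made-up values)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
labels <- c(1,   1,   0,   1,   1,    0,   0,   1,   0,   0)
# sweep the cutoffs from above the maximum score down to the minimum
cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(cutoffs, function(t) sum(scores >= t & labels == 1) / sum(labels == 1))
fpr <- sapply(cutoffs, function(t) sum(scores >= t & labels == 0) / sum(labels == 0))
plot(fpr, tpr, type = "b", xlab = "FPR (1 - specificity)",
     ylab = "TPR (sensitivity)", main = "Hand-rolled ROC curve")
abline(a = 0, b = 1, lty = 2)   # the diagonal: random guessing, AUC = 0.5
# AUC by the trapezoidal rule over the (fpr, tpr) points
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc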
The R code below draws ROC curves with the ROCR package:
1. The tpr vs. fpr curve (the ROC curve itself)
## computing a simple ROC curve (x-axis: fpr, y-axis: tpr)
library(ROCR)      # provides prediction() and performance()
data(ROCR.xval)    # example data: predictions from 10 cross-validation runs
pred <- prediction(ROCR.xval$predictions, ROCR.xval$labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "grey82", lty = 3)
plot(perf, lwd = 3, avg = "vertical", spread.estimate = "boxplot", add = TRUE)
2. The precision vs. recall curve
## precision/recall curve (x-axis: recall, y-axis: precision)
perf1 <- performance(pred, "prec", "rec")
plot(perf1, col = "grey82", lty = 3)
plot(perf1, lwd = 3, avg = "vertical", spread.estimate = "boxplot", add = TRUE)
3. The sensitivity vs. specificity curve
## sensitivity/specificity curve (x-axis: specificity, y-axis: sensitivity)
perf2 <- performance(pred, "sens", "spec")
plot(perf2, col = "grey82", lty = 3)
plot(perf2, lwd = 3, avg = "vertical", spread.estimate = "boxplot", add = TRUE)
Some commonly used ROCR functions, illustrated on the ROCR.simple example data:
data(ROCR.simple)
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
# performance object for the ROC curve (tpr vs. fpr)
roc.perf <- performance(pred, measure = "tpr", x.measure = "fpr")
# draw the ROC curve with the chance diagonal
plot(roc.perf, col = rainbow(10))
abline(a = 0, b = 1)
# precision and recall
rec.perf <- performance(pred, "prec", "rec")
plot(rec.perf, col = rainbow(10))
# accuracy as a function of the cutoff
acc.perf <- performance(pred, "acc")
plot(acc.perf, avg = "vertical", spread.estimate = "boxplot",
     show.spread.at = seq(0.1, 0.9, by = 0.1))
# the cutoff that maximizes accuracy
ind <- which.max(slot(acc.perf, "y.values")[[1]])
acc <- slot(acc.perf, "y.values")[[1]][ind]
cutoff <- slot(acc.perf, "x.values")[[1]][ind]
print(c(accuracy = acc, cutoff = cutoff))
# sensitivity and specificity
sens.perf <- performance(pred, "sens", "spec")
plot(sens.perf)
# check calibration
cal.perf <- performance(pred, "cal", window.size=50)
plot(cal.perf)
# cutoff density plot: score distributions of the two classes
plot(0, 0, type = "n", xlim = c(0, 1), ylim = c(0, 7), xlab = "Cutoff", ylab = "Density",
     main = "How well do the predictions separate the classes?")
for (runi in 1:length(pred@predictions)) {
  lines(density(pred@predictions[[runi]][pred@labels[[runi]] == "1"]), col = "red")
  lines(density(pred@predictions[[runi]][pred@labels[[runi]] == "0"]), col = "green")
}
# find the optimal cutoff: the ROC point closest to the ideal corner (fpr = 0, tpr = 1)
opt.cut <- function(perf, pred) {
  mapply(FUN = function(x, y, p) {
    d <- (x - 0)^2 + (y - 1)^2   # squared distance to (0, 1)
    ind <- which.min(d)
    c(sensitivity = y[[ind]], specificity = 1 - x[[ind]], cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(roc.perf, pred))
# calculate the AUC value
auc.perf <- performance(pred, measure = "auc")
auc.perf@y.values
# accuracy and recall as functions of the cutoff
acc_perf <- performance(pred, measure = "acc")
plot(acc_perf)
rec_perf <- performance(pred, measure = "rec")
plot(rec_perf)
The performance measures available in ROCR:
# collect the performance measures that ROCR supports for a prediction object
performance_measures <- function(pred) {
  tpr    <- performance(pred, "tpr")@y.values[[1]]    # true positive rate
  fpr    <- performance(pred, "fpr")@y.values[[1]]    # false positive rate
  acc    <- performance(pred, "acc")@y.values[[1]]    # accuracy
  err    <- performance(pred, "err")@y.values[[1]]    # error rate
  rec    <- performance(pred, "rec")@y.values[[1]]    # recall
  sens   <- performance(pred, "sens")@y.values[[1]]   # sensitivity
  fnr    <- performance(pred, "fnr")@y.values[[1]]    # false negative rate
  tnr    <- performance(pred, "tnr")@y.values[[1]]    # true negative rate
  spec   <- performance(pred, "spec")@y.values[[1]]   # specificity
  ppv    <- performance(pred, "ppv")@y.values[[1]]    # positive predictive value
  prec   <- performance(pred, "prec")@y.values[[1]]   # precision
  npv    <- performance(pred, "npv")@y.values[[1]]    # negative predictive value
  fall   <- performance(pred, "fall")@y.values[[1]]   # fallout (= fpr)
  miss   <- performance(pred, "miss")@y.values[[1]]   # miss rate (= fnr)
  pcfall <- performance(pred, "pcfall")@y.values[[1]] # prediction-conditioned fallout
  pcmiss <- performance(pred, "pcmiss")@y.values[[1]] # prediction-conditioned miss
  rpp    <- performance(pred, "rpp")@y.values[[1]]    # rate of positive predictions
  rnp    <- performance(pred, "rnp")@y.values[[1]]    # rate of negative predictions
  auc    <- performance(pred, "auc")@y.values[[1]]    # area under the ROC curve
  prbe   <- performance(pred, "prbe")@y.values[[1]]   # precision/recall break-even point
  rch    <- performance(pred, "rch")@y.values[[1]]    # ROC convex hull
  mxe    <- performance(pred, "mxe")@y.values[[1]]    # mean cross-entropy
  rmse   <- performance(pred, "rmse")@y.values[[1]]   # root mean squared error
  phi    <- performance(pred, "phi")@y.values[[1]]    # phi correlation coefficient
  mat    <- performance(pred, "mat")@y.values[[1]]    # Matthews correlation coefficient
  mi     <- performance(pred, "mi")@y.values[[1]]     # mutual information
  chisq  <- performance(pred, "chisq")@y.values[[1]]  # chi-square test statistic
  odds   <- performance(pred, "odds")@y.values[[1]]   # odds ratio
  lift   <- performance(pred, "lift")@y.values[[1]]   # lift
  f      <- performance(pred, "f")@y.values[[1]]      # F measure
  sar    <- performance(pred, "sar")@y.values[[1]]    # SAR: squared error, accuracy, ROC area
  ecost  <- performance(pred, "ecost")@y.values[[1]]  # expected cost
  cost   <- performance(pred, "cost")@y.values[[1]]   # misclassification cost
  return(list(tpr = tpr, fpr = fpr, acc = acc, err = err,
              rec = rec, sens = sens, fnr = fnr, tnr = tnr,
              spec = spec, ppv = ppv, prec = prec, npv = npv,
              fall = fall, miss = miss, pcfall = pcfall, pcmiss = pcmiss,
              rpp = rpp, rnp = rnp, auc = auc, prbe = prbe, rch = rch, mxe = mxe,
              rmse = rmse, phi = phi, mat = mat, mi = mi, chisq = chisq, odds = odds,
              lift = lift, f = f, sar = sar, ecost = ecost, cost = cost))
}

References:
- Jo Hardin, Stats for Clinical Trials (Math 150), notes on ROC curves.
- Receiver Operating Characteristic Analysis: A Tool for the Quantitative Evaluation of Observer Performance and Imaging Systems.
- Package 'ROCR' (CRAN reference manual).