confusion_matrix {qwraps2} | R Documentation |
Confusion Matrices (Contingency Tables)
Description
Construction of confusion matrices, accuracy, sensitivity, specificity, and confidence intervals (Wilson's method and, optionally, bootstrapping).
Usage
confusion_matrix(
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)
## Default S3 method:
confusion_matrix(
  truth,
  predicted,
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)
## S3 method for class 'formula'
confusion_matrix(
  formula,
  data = parent.frame(),
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)
## S3 method for class 'glm'
confusion_matrix(
  x,
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)
## S3 method for class 'qwraps2_confusion_matrix'
print(x, ...)
Arguments
... |
pass through |
thresholds |
a numeric vector of thresholds to be used to define the
confusion matrix (one threshold) or matrices (two or more thresholds). If
|
confint_method |
character string denoting whether the logit (default), binomial, or Wilson score method is used to derive confidence intervals |
alpha |
alpha level for 100 * (1 - alpha)% confidence intervals |
truth |
an integer vector with the values 0 and 1 |
predicted |
a numeric vector. See Details. |
formula |
column (known) ~ row (test) for building the confusion matrix |
data |
environment containing the variables listed in the formula |
x |
a |
Details
The confusion matrix:
                    |   | True Condition | True Condition |
                    |   | +              | -              |
Predicted Condition | + | TP             | FP             |
Predicted Condition | - | FN             | TN             |
where
FN: False Negative = truth = 1 & prediction < threshold,
FP: False Positive = truth = 0 & prediction >= threshold,
TN: True Negative = truth = 0 & prediction < threshold, and
TP: True Positive = truth = 1 & prediction >= threshold.
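The four definitions above can be applied directly in base R. The following is a minimal sketch, not the package's implementation; it reuses the data from Example 1 on this page, and the variable names (truth, predicted, threshold, counts) are chosen only for illustration.

```r
# Data borrowed from Example 1 of this page
truth     <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0)
threshold <- 1

# Apply the four definitions: a prediction counts as positive
# when it is greater than or equal to the threshold
TP <- sum(truth == 1 & predicted >= threshold)
FP <- sum(truth == 0 & predicted >= threshold)
FN <- sum(truth == 1 & predicted <  threshold)
TN <- sum(truth == 0 & predicted <  threshold)

counts <- c(TP = TP, FP = FP, FN = FN, TN = TN)
print(counts)  # TP = 6, FP = 1, FN = 2, TN = 3
```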
The statistics returned in the cm_stats element are:
accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity, aka true positive rate or recall = TP / (TP + FN)
specificity, aka true negative rate = TN / (TN + FP)
positive predictive value (PPV), aka precision = TP / (TP + FP)
negative predictive value (NPV) = TN / (TN + FN)
false negative rate (FNR) = 1 - Sensitivity
false positive rate (FPR) = 1 - Specificity
false discovery rate (FDR) = 1 - PPV
false omission rate (FOR) = 1 - NPV
F1 score = 2 * TP / (2 * TP + FP + FN)
Matthews Correlation Coefficient (MCC) = ((TP * TN) - (FP * FN)) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
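As a worked illustration, the statistics listed above can be computed by hand from the four cell counts. This sketch assumes the counts TP = 6, FP = 1, FN = 2, TN = 3 (the counts that the data in Example 1 yield at a threshold of 1) and uses the standard F1 definition as the harmonic mean of PPV and sensitivity; it does not reproduce the internals of confusion_matrix.

```r
# Assumed cell counts (from the Example 1 data at threshold = 1)
TP <- 6; FP <- 1; FN <- 2; TN <- 3

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
ppv         <- TP / (TP + FP)
npv         <- TN / (TN + FN)

# Standard F1: harmonic mean of PPV and sensitivity
f1 <- 2 * ppv * sensitivity / (ppv + sensitivity)

# Matthews Correlation Coefficient, with explicit products in the denominator
mcc <- ((TP * TN) - (FP * FN)) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

stats <- c(
  accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
  ppv = ppv, npv = npv,
  fnr = 1 - sensitivity, fpr = 1 - specificity,
  fdr = 1 - ppv, fomr = 1 - npv,  # "fomr" because `for` is a reserved word in R
  f1 = f1, mcc = mcc
)
print(round(stats, 4))
```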
Synonyms for the statistics:
Sensitivity: true positive rate (TPR), recall, hit rate
Specificity: true negative rate (TNR), selectivity
PPV: precision
FNR: miss rate
Sensitivity and PPV can, in some cases, be indeterminate because of division by zero. To address this, the following rule, based on the DICE group's definitions (https://github.com/dice-group/gerbil/wiki/Precision,-Recall-and-F1-measure), is used: if TP, FP, and FN are all 0, then PPV, sensitivity, and F1 are defined to be 1. If TP is 0 and FP + FN > 0, then PPV, sensitivity, and F1 are all defined to be 0.
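The rule above can be sketched as a small helper. The function name ppv_sens_f1 is hypothetical and is not part of qwraps2; the non-degenerate branch uses the standard definitions of PPV, sensitivity, and F1.

```r
# Hypothetical helper (not part of qwraps2) illustrating the zero-division rule
ppv_sens_f1 <- function(TP, FP, FN) {
  if (TP == 0 && FP == 0 && FN == 0) {
    # No positives were observed or predicted: define all three as 1
    return(c(ppv = 1, sensitivity = 1, f1 = 1))
  }
  if (TP == 0) {
    # TP is 0 and FP + FN > 0: define all three as 0
    return(c(ppv = 0, sensitivity = 0, f1 = 0))
  }
  # Otherwise the usual ratios are well defined
  c(
    ppv         = TP / (TP + FP),
    sensitivity = TP / (TP + FN),
    f1          = 2 * TP / (2 * TP + FP + FN)
  )
}

ppv_sens_f1(TP = 0, FP = 0, FN = 0)  # all defined to be 1
ppv_sens_f1(TP = 0, FP = 3, FN = 0)  # all defined to be 0
ppv_sens_f1(TP = 6, FP = 1, FN = 2)  # usual ratios
```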
Value
confusion_matrix returns a list with elements
- cm_stats: a data.frame with columns:
- auroc: numeric value for the area under the receiver operating characteristic (ROC) curve
- auroc_ci: a numeric vector of length two with the lower and upper bounds for a 100(1-alpha)% confidence interval about the auroc
- auprc: numeric value for the area under the precision recall curve
- auprc_ci: a numeric vector of length two with the lower and upper limits for a 100(1-alpha)% confidence interval about the auprc
- confint_method: a character string reporting the method used to build the auroc_ci and auprc_ci
- alpha: the alpha level of the confidence intervals
- prevalence: the proportion of positive cases in the input, that is (TP + FN) / (TP + FN + FP + TN) = P / (P + N)
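To make the role of alpha concrete, the following is a minimal sketch of the textbook Wilson score interval for a single proportion, such as sensitivity = TP / (TP + FN). The function wilson_ci is hypothetical and is not the package's code; the intervals confusion_matrix reports may be computed differently.

```r
# Hypothetical helper: textbook Wilson score interval for x successes in n trials
wilson_ci <- function(x, n, alpha = 0.05) {
  z      <- qnorm(1 - alpha / 2)  # two-sided critical value
  p      <- x / n                 # observed proportion
  denom  <- 1 + z^2 / n
  center <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(lower = center - half, upper = center + half)
}

# 95% interval for a sensitivity of 6 / 8
wilson_ci(x = 6, n = 8)

# A smaller alpha gives a wider interval
wilson_ci(x = 6, n = 8, alpha = 0.01)
```

In base R, prop.test(x, n, correct = FALSE)$conf.int returns the same uncorrected Wilson interval and can serve as a cross-check.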
Examples
# Example 1: known truth and prediction status
df <-
  data.frame(
      truth = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
    , pred  = c(1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0)
  )
confusion_matrix(df$truth, df$pred, thresholds = 1)
# Example 2: Use with a logistic regression model
mod <- glm(
    formula = spam ~ word_freq_our + word_freq_over + capital_run_length_total
  , data = spambase
  , family = binomial()
)
confusion_matrix(mod)
confusion_matrix(mod, thresholds = 0.5)