pinterval_bccp {pintervals} | R Documentation |
Bin-conditional conformal prediction intervals for continuous predictions
Description
This function calculates bin-conditional conformal prediction intervals with a confidence level of 1-alpha for a vector of (continuous) predicted values using inductive conformal prediction on a bin-by-bin basis. The intervals are computed using either a calibration set with predicted and true values or a set of pre-computed non-conformity scores from the calibration set. In addition the function requires either a set of breaks or a vector of bin identifiers for the calibrations set, either as a standalone vector or as the third column of the calibration dataset if the calibration data is provided as a tibble. The function returns a tibble containing the predicted values along with the lower and upper bounds of the prediction intervals. Bin-conditional conformal prediction intervals are useful when the prediction error is not constant across the range of predicted values and ensures that the coverage is (approximately) correct for each bin under the assumption that the non-conformity scores are exchangeable within each bin.
Usage
pinterval_bccp(
pred,
calib = NULL,
calib_truth = NULL,
calib_bins = NULL,
breaks = NULL,
nbins = NULL,
alpha = 0.1,
ncs_function = "absolute_error",
ncs = NULL,
min_step = 0.01,
grid_size = NULL,
right = TRUE,
weighted_cp = FALSE,
contiguize = FALSE
)
Arguments
pred |
Vector of predicted values |
calib |
A numeric vector of predicted values in the calibration partition or a 2 or 3 column tibble or matrix with the first column being the predicted values and the second column being the truth values and (optionally) the third column being the bin values if bins are not provided as a standalone vector or if breaks are not provided |
calib_truth |
A numeric vector of true values in the calibration partition |
calib_bins |
A vector of bin identifiers for the calibration set |
breaks |
A vector of break points for the bins to manually define the bins. If NULL, lower and upper bounds of the bins are calculated as the minimum and maximum values of each bin in the calibration set. Must be provided if calib_bins or nbins are not provided, either as a vector or as the last column of a calib tibble. |
nbins |
Automatically chop the calibration set into nbins based on the true values with approximately equal number of observations in each bin. Must be provided if calib_bins or breaks are not provided. |
alpha |
The confidence level for the prediction intervals. Must be a single numeric value between 0 and 1 |
ncs_function |
A function or a character string matching a function that takes two arguments, a vector of predicted values and a vector of true values, in that order. The function should return a numeric vector of nonconformity scores. Default is 'absolute_error' which returns the absolute difference between the predicted and true values. |
ncs |
An optional numeric vector of pre-computed nonconformity scores from a calibration partition. If provided, calib will be ignored. If provided, bins must be provided in calib_bins and breaks as well. |
min_step |
The minimum step size for the grid search. Default is 0.01. Useful to change if predictions are made on a discrete grid or if the resolution of the interval is too coarse or too fine. |
grid_size |
Alternative to min_step, the number of points to use in the grid search between the lower and upper bound. If provided, min_step will be ignored. |
right |
Logical, if TRUE the bins are right-closed (a,b] and if FALSE the bins are left-closed '[ a,b)'. Only used if breaks or nbins are provided. |
weighted_cp |
Logical, if TRUE the prediction intervals are created by bootstrapping the ncs scores giving a higher weight to the ncs scores that are closer to the predicted value. Default is FALSE. Experimental, so use with caution. |
contiguize |
logical indicating whether to contiguize the intervals. TRUE will consider all bins for each prediction using the lower and upper endpoints as interval limits to avoid non-contiguous intervals. FALSE will allows for non-contiguous intervals. TRUE guarantees at least appropriate coverage in each bin, but may suffer from over-coverage in certain bins. FALSE will have appropriate coverage in each bin but may have non-contiguous intervals. Default is FALSE. |
Value
A tibble with the predicted values, the lower and upper bounds of the prediction intervals. If treat_noncontiguous is 'non_contiguous', the lower and upper bounds are set in a list variable called 'intervals' where all non-contiguous intervals are stored.
Examples
library(dplyr)
library(tibble)
x1 <- runif(1000)
x2 <- runif(1000)
y <- rlnorm(1000, meanlog = x1 + x2, sdlog = 0.5)
bin <- cut(y, breaks = quantile(y, probs = seq(0, 1, 1/4)),
include.lowest = TRUE, labels =FALSE)
df <- tibble(x1, x2, y, bin)
df_train <- df %>% slice(1:500)
df_cal <- df %>% slice(501:750)
df_test <- df %>% slice(751:1000)
mod <- lm(log(y) ~ x1 + x2, data=df_train)
calib <- exp(predict(mod, newdata=df_cal))
calib_truth <- df_cal$y
calib_bins <- df_cal$bin
pred_test <- exp(predict(mod, newdata=df_test))
pinterval_bccp(pred = pred_test,
calib = calib,
calib_truth = calib_truth,
calib_bins = calib_bins,
alpha = 0.1,
grid_size = 10000)