cram_bandit {cramR}    R Documentation
Cram Bandit: On-policy Statistical Evaluation in Contextual Bandits
Description
Performs the Cram method for on-policy statistical evaluation in contextual bandits.
Usage
cram_bandit(pi, arm, reward, batch = 1, alpha = 0.05)
Arguments
pi
An array of shape (T × B, T, K) or (T × B, T), where T is the number of learning steps (policy updates), B is the batch size, K is the number of arms, and T × B is the total number of contexts. If 3D, pi[j, t, a] gives the probability that policy pi_t assigns arm a in context X_j. If 2D, pi[j, t] gives the probability that policy pi_t assigns arm A_j (the arm actually chosen under X_j in the history) in context X_j. Please see the vignette for more details; a sketch of deriving the 2D form from the 3D form appears after this argument list.
arm
A vector of length T × B indicating which arm was selected for each context.
reward
A vector of observed rewards of length T × B.
batch
(Optional) A vector or an integer. If a vector, it gives the batch assignment of each context. If an integer, it is interpreted as the batch size, and contexts are assigned to batches in the order of the dataset (see the batching sketch after this argument list). Default is 1.
alpha
Significance level for the confidence intervals used to compute the empirical coverage. Default is 0.05 (95% confidence).
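Two short sketches may clarify the argument shapes; variable names here (pi3d, pi2d, batch_vec) are illustrative, not part of the package API. First, when the full 3D array and the logged arms are available, the 2D form of pi can be derived by selecting, for each context, the probability of the arm actually chosen:

n_contexts <- dim(pi3d)[1]   # T x B rows, one per context
n_steps    <- dim(pi3d)[2]   # T policy updates
pi2d <- matrix(NA_real_, nrow = n_contexts, ncol = n_steps)
for (j in 1:n_contexts) {
  pi2d[j, ] <- pi3d[j, , arm[j]]  # probability of chosen arm A_j under each pi_t
}

Second, assuming contexts are ordered by batch, passing an integer batch size is equivalent to passing an explicit assignment vector; with 100 contexts and a batch size of 5:

batch_vec <- rep(1:20, each = 5)  # explicit batch assignment per context
# cram_bandit(pi, arm, reward, batch = batch_vec) is equivalent to
# cram_bandit(pi, arm, reward, batch = 5)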
Value
A list containing:
raw_results
A data frame summarizing key metrics: Empirical Bias on Policy Value, Average Relative Error on Policy Value, RMSE using relative errors on Policy Value, and Empirical Coverage of Confidence Intervals.
interactive_table
An interactive table summarizing the same key metrics in a user-friendly interface.
Examples
# Example with batch size of 1
# Set random seed for reproducibility
set.seed(42)
# Define parameters
T <- 100 # Number of timesteps
K <- 4 # Number of arms
# Simulate a 3D array pi of shape (T*B, T, K); with batch = 1, this is (T, T, K)
# - First dimension: contexts X_j
# - Second dimension: policy updates pi_t
# - Third dimension: arms
pi <- array(runif(T * T * K, 0.1, 1), dim = c(T, T, K))
# Normalize probabilities so that each row sums to 1 across arms
for (t in 1:T) {
for (j in 1:T) {
pi[j, t, ] <- pi[j, t, ] / sum(pi[j, t, ])
}
}
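# Equivalent vectorized normalization (a sketch; same result as the loop above):
# pi <- sweep(pi, MARGIN = c(1, 2), STATS = apply(pi, c(1, 2), sum), FUN = "/")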
# Simulate arm selections (randomly choosing an arm)
arm <- sample(1:K, T, replace = TRUE)
# Simulate rewards (assume normally distributed rewards)
reward <- rnorm(T, mean = 1, sd = 0.5)
result <- cram_bandit(pi, arm, reward, batch = 1, alpha = 0.05)
result$raw_results
result$interactive_table
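# Example with batch size of 2 (a sketch, not from the original documentation;
# shapes follow the pi argument: the first dimension holds T * B contexts)
set.seed(43)
T_steps <- 50       # number of policy updates
B <- 2              # batch size
K <- 4              # number of arms
n <- T_steps * B    # total number of contexts
pi_b <- array(runif(n * T_steps * K, 0.1, 1), dim = c(n, T_steps, K))
for (t in 1:T_steps) {
  for (j in 1:n) {
    pi_b[j, t, ] <- pi_b[j, t, ] / sum(pi_b[j, t, ])  # normalize across arms
  }
}
arm_b <- sample(1:K, n, replace = TRUE)
reward_b <- rnorm(n, mean = 1, sd = 0.5)
result_b <- cram_bandit(pi_b, arm_b, reward_b, batch = 2, alpha = 0.05)
result_b$raw_results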