EBcoBART {EBcoBART} | R Documentation |
Learning prior covariate weights for BART models with empirical Bayes and co-data.
Description
Function that estimates the prior probabilities of variables being selected in the splitting rules of Bayesian Additive Regression Trees (BART). Estimation is performed using empirical Bayes and co-data, i.e. external information on the explanatory variables.
Usage
EBcoBART(
Y,
X,
model,
CoData,
nIter = 10,
EB_k = FALSE,
EB_alpha = FALSE,
EB_sigma = FALSE,
Prob_Init = c(rep(1/ncol(X), ncol(X))),
verbose = FALSE,
ndpost = 5000,
nskip = 5000,
nchain = 5,
keepevery = 1,
ntree = 50,
alpha = 0.95,
beta = 2,
k = 2,
sigest = stats::sd(Y) * 0.667,
sigdf = 10,
sigquant = 0.75
)
Arguments
Y |
Response variable that can be either continuous or binary. Should be a numeric. |
X |
Explanatory variables. Should be a matrix. If X is a data.frame and contains factors, consider using the function Dat_EBcoBART. |
model |
Type of the response variable Y. Can be either "continuous" or "binary". |
CoData |
The co-data model matrix with co-data information on the explanatory variables in X. Should be a matrix, not a data.frame. If grouping information is present, please encode it yourself using dummy variables that indicate which group an explanatory variable belongs to. The number of rows of the co-data matrix should equal the number of columns of X. If no co-data is available, but one aims to estimate prior parameter k, alpha, or sigma, please specify CoData = NULL. |
nIter |
Number of iterations of the EM algorithm |
EB_k |
Logical (TRUE/FALSE). If TRUE, the EM algorithm also estimates prior parameter k (of the leaf node parameter prior). Defaults to FALSE. Setting it to TRUE increases computational time. |
EB_alpha |
Logical (TRUE/FALSE). If TRUE, the EM algorithm also estimates prior parameter alpha (of the tree structure prior). Defaults to FALSE. Setting it to TRUE increases computational time. |
EB_sigma |
Logical (TRUE/FALSE). If TRUE, the EM algorithm also estimates the prior parameters of the error variance. To do so, the algorithm estimates the degrees of freedom (sigdf) and the quantile (sigest) at which sigquant of the probability mass is placed. Thus, the specified sigquant is kept fixed and sigdf and sigest are updated. Defaults to FALSE. |
Prob_Init |
Initial vector of splitting probabilities for explanatory variables X. Length should equal number of columns of X (and number of rows in CoData). Defaults to 1/p, i.e. equal weight for each variable. |
verbose |
Logical. Asks whether algorithm progress should be printed. Defaults to FALSE. |
ndpost |
Number of posterior samples returned by dbarts after burn-in. Same as in dbarts. Defaults to 5000. |
nskip |
Number of burn-in samples. Same as in dbarts. Defaults to 5000. |
nchain |
Number of independent MCMC chains. Same as in dbarts. Defaults to 5. |
keepevery |
Thinning. Same as in dbarts. Defaults to 1. |
ntree |
Number of trees in the BART model. Same as in dbarts. Defaults to 50. |
alpha |
Alpha parameter of tree structure prior. Called base in dbarts. Defaults to 0.95. If EB_alpha is TRUE, this parameter will be the starting value. |
beta |
Beta parameter of tree structure prior. Called power in dbarts. Defaults to 2. |
k |
Parameter for leaf node parameter prior. Same as in dbarts. Defaults to 2. If EB_k is TRUE, this parameter will be the starting value. |
sigest |
Only for continuous response. Estimate of error variance used to set scaled inverse Chi^2 prior on error variance. Same as in dbarts. Defaults to 0.667*sd(Y). If EB_sigma is TRUE, this parameter will be the starting value. |
sigdf |
Only for continuous response. Degrees of freedom for error variance prior. Same as in dbarts. Defaults to 10. If EB_sigma is TRUE, this parameter will be the starting value. |
sigquant |
Only for continuous response. Quantile at which sigest is placed. Same as in dbarts. Defaults to 0.75. If EB_sigma is TRUE, this parameter will be fixed; only sigdf and sigest will be updated. |
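The relation between sigest, sigdf, and sigquant can be sketched numerically. The snippet below assumes the usual BART calibration of Chipman et al. (2010), in which the error variance has a scaled inverse Chi^2 prior, sigma^2 ~ sigdf * lambda / Chi^2_sigdf, and lambda is chosen so that a fraction sigquant of the prior mass of sigma lies below sigest. It also assumes that sigest is on the standard-deviation scale, as the default stats::sd(Y) * 0.667 suggests. The helper sigma_prior_scale is hypothetical and is not part of EBcoBART or dbarts.

```r
# Hypothetical helper (not part of EBcoBART or dbarts): scale lambda of the
# scaled inverse Chi^2 prior such that P(sigma < sigest) = sigquant,
# assuming sigma^2 ~ sigdf * lambda / Chi^2_sigdf
sigma_prior_scale <- function(sigest, sigdf = 10, sigquant = 0.75) {
  sigest^2 * stats::qchisq(1 - sigquant, df = sigdf) / sigdf
}

set.seed(1)
Y <- stats::rnorm(100, sd = 2)   # toy continuous response
sigest <- stats::sd(Y) * 0.667   # default starting value from Usage
lambda <- sigma_prior_scale(sigest, sigdf = 10, sigquant = 0.75)

# Check by simulation: the fraction of prior draws of sigma below sigest
# should be close to sigquant
sigma_draws <- sqrt(10 * lambda / stats::rchisq(1e5, df = 10))
frac_below <- mean(sigma_draws < sigest)
print(frac_below)
```

Under these assumptions, updating sigdf and sigest while keeping sigquant fixed changes both the spread and the location of this prior.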
Value
An object with the estimated variable weights, i.e. the probabilities that variables are selected in the splitting rules. Additionally, the final co-data model is returned. If EB_k, EB_alpha, or EB_sigma is set to TRUE, estimates of k, alpha, or (sigdf, sigest), respectively, are also returned. The returned object is an S3 object for which print(), summary(), and plot() are available. Function print() prints convergence details of the algorithm, summary() prints prior parameter estimates of EBcoBART, and plot() plots the estimated prior variable weights (including a vertical line for equal variable weights).
The prior parameter estimates can then be used in any BART R package that supports manually setting the vector of splitting variable probabilities (e.g. dbarts and bartMachine).
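Before passing estimated weights to another package, it can help to confirm that they form a valid probability vector with one entry per column of X. The check below is a minimal sketch on a toy vector. The helper check_split_probs is hypothetical and not a function of EBcoBART, and no assumption is made about whether a downstream package renormalizes the weights itself.

```r
# Hypothetical helper (not part of EBcoBART): validate a vector of
# splitting probabilities for p explanatory variables
check_split_probs <- function(probs, p, tol = 1e-8) {
  stopifnot(is.numeric(probs), length(probs) == p)
  stopifnot(all(is.finite(probs)), all(probs >= 0))
  if (abs(sum(probs) - 1) > tol) {
    probs <- probs / sum(probs)  # renormalize so the weights sum to one
  }
  probs
}

p <- 4
w <- c(0.5, 0.3, 0.1, 0.2)       # toy weights that sum to 1.1
w_ok <- check_split_probs(w, p)
print(sum(w_ok))
```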
Author(s)
Jeroen M. Goedhart, j.m.goedhart@amsterdamumc.nl
References
Jerome H. Friedman. "Multivariate Adaptive Regression Splines." The Annals of Statistics, 19(1) 1-67 March, 1991.
Hugh A. Chipman, Edward I. George, Robert E. McCulloch. "BART: Bayesian additive regression trees." The Annals of Applied Statistics, 4(1) 266-298 March 2010.
Jeroen M. Goedhart, Thomas Klausch, Jurriaan Janssen, Mark A. van de Wiel. "Co-data Learning for Bayesian Additive Regression Trees." arXiv preprint arXiv:2311.09997. 2023 Nov 16.
Examples
###################################
### Binary response example ######
###################################
# For continuous response example, see README.
# Use data set provided in R package
# We set EB_k = TRUE and EB_alpha = TRUE, indicating that we also estimate
# tree structure prior parameter alpha
# and leaf node prior parameter k
data("Lymphoma")
Xtr <- as.matrix(Lymphoma$Xtrain) # Xtr should be matrix object
Ytr <- Lymphoma$Ytrain
Xte <- as.matrix(Lymphoma$Xtest) # Xte should be matrix object
Yte <- Lymphoma$Ytest
CoDat <- Lymphoma$CoData
CoDat <- stats::model.matrix(~., CoDat) # encode grouping by dummies
#(include intercept)
set.seed(4) # for reproducible results
Fit <- EBcoBART(Y = Ytr, X = Xtr, CoData = CoDat,
nIter = 2, # Low! Only for illustration
model = "binary",
EB_k = TRUE, EB_alpha = TRUE,
EB_sigma = FALSE,
verbose = TRUE,
ntree = 5, # Low! Only for illustration
nchain = 3,
nskip = 500, # Low! Only for illustration
ndpost = 500, # Low! Only for illustration
Prob_Init = rep(1/ncol(Xtr), ncol(Xtr)),
k = 2, alpha = .95, beta = 2)
EstProbs <- Fit$SplitProbs # estimated prior weights of variables
alpha_EB <- Fit$alpha_est
k_EB <- Fit$k_est
print(Fit)
summary(Fit)
# The prior parameter estimates EstProbs, alpha_EB,
# and k_EB can then be used in your favorite BART fitting package
# We use dbarts:
FinalFit <- dbarts::bart(x.train = Xtr, y.train = Ytr,
x.test = Xte,
ntree = 5, # Low! Only for illustration
nchain = 3, # Low! Only for illustration
nskip = 200, # Low! Only for illustration
ndpost = 200, # Low! Only for illustration
k = k_EB, base = alpha_EB, power = 2,
splitprobs = EstProbs,
combinechains = TRUE, verbose = FALSE)