Dat_Tree {fusedTree}    R Documentation

Construct design data used for fitting fusedTree models

Description

Prepares the full design data used to fit a fusedTree model: dummy-encoded indicators of clinical leaf-node membership, optional continuous clinical variables, and a block-diagonal omics matrix with one block per leaf node.

Usage

Dat_Tree(Tree, X, Z, LinVars = TRUE)

Arguments

Tree

A fitted tree object, created using rpart. Support for trees fitted with other packages (e.g., partykit) may be added in the future.

X

A numeric omics data matrix with dimensions (sample size × number of omics variables). Must be a matrix.

Z

A data.frame of clinical covariates used in tree fitting. Must be the same data used to construct Tree.

LinVars

Logical. Whether to include continuous clinical variables linearly in the model (in addition to tree clustering). Recommended, as trees may not capture linear effects well. Defaults to TRUE.

Details

This function allows users to inspect the exact data structure used in fusedTree model fitting. PenOpt() and fusedTreeFit() call this function internally, so there is no need to call it yourself to set up the data format. It is provided only so that users can check how the design data are constructed.

Value

A list with the following components:

Clinical

A matrix encoding the clinical structure:

  • Dummy variables indicating membership of the leaf nodes of the tree,

  • Continuous clinical covariates (if LinVars = TRUE).

Each row corresponds to a sample.
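The structure of this component can be illustrated with a minimal sketch in plain R. It is not the package's implementation: it assumes that leaf membership is read from the where element of the rpart fit, and the names leaf_dummies and clinical_sketch, the simulated data, and the column order are hypothetical.

```r
library(rpart)

set.seed(1)
N <- 60
Z <- data.frame(V1 = runif(N), V2 = runif(N))       # toy clinical covariates
Y <- 3 * (Z$V1 > 0.5) + Z$V2 + rnorm(N, sd = 0.1)   # outcome with a clear split on V1
fit <- rpart(Y ~ ., data = cbind.data.frame(Y, Z),
             control = rpart.control(minbucket = 10))

# fit$where holds, for each sample, the row of fit$frame of its leaf node
leaves <- sort(unique(fit$where))

# One 0/1 indicator column per leaf node
leaf_dummies <- 1 * outer(fit$where, leaves, "==")
colnames(leaf_dummies) <- paste0("Node", leaves)

# Assumed analogue of the Clinical component when LinVars = TRUE:
# leaf indicators followed by the continuous clinical covariates
clinical_sketch <- cbind(leaf_dummies, as.matrix(Z))
dim(clinical_sketch)
```

Each row of leaf_dummies contains exactly one 1, because every sample falls into exactly one leaf node.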

Omics

A matrix of omics data per leaf node, with dimensions sample size × (number of leaf nodes × number of omics variables). For each observation, only the block of omics variables corresponding to its leaf node is populated; all other blocks are set to zero.
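The block layout described here can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: the leaf membership vector node is made up, the names omics_sketch, node, and leaves are hypothetical, and the ordering of blocks (leaf by leaf, with all omics variables inside each block) is assumed and may differ from what Dat_Tree returns.

```r
set.seed(1)
N <- 6                                  # sample size
p <- 2                                  # number of omics variables
X <- matrix(rnorm(N * p), nrow = N)     # toy omics matrix
node <- c(1, 2, 3, 1, 2, 3)             # hypothetical leaf membership per sample
leaves <- sort(unique(node))
K <- length(leaves)                     # number of leaf nodes

# Start from zeros and fill, for each leaf, only the rows that belong to it
omics_sketch <- matrix(0, nrow = N, ncol = K * p)
for (k in seq_len(K)) {
  rows <- node == leaves[k]
  cols <- ((k - 1) * p + 1):(k * p)     # columns of the k-th block
  omics_sketch[rows, cols] <- X[rows, , drop = FALSE]
}

dim(omics_sketch)                       # N rows, K * p columns
```

Every row of omics_sketch carries the p omics values of that sample in the block of its own leaf and zeros elsewhere, which is what allows leaf-specific omics effects to be estimated from a single design matrix.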

Examples

p <- 5        # number of omics variables (low for illustration)
p_Clin <- 5   # number of clinical variables
N <- 100      # sample size
# simulate from Friedman-like function
g <- function(z) {
  15 * sin(pi * z[,1] * z[,2]) + 10 * (z[,3] - 0.5)^2 + 2 * exp(z[,4]) + 2 * z[,5]
}
Z <- as.data.frame(matrix(runif(N * p_Clin), nrow = N))
X <- matrix(rnorm(N * p), nrow = N)            # omics data
betas <- c(1,-1,3,4,2)                         # omics effects
Y <- g(Z) + X %*% betas + rnorm(N)             # continuous outcome
Y <- as.vector(Y)
dat <- cbind.data.frame(Y, Z)                  # outcome and clinical covariates for rpart
library(rpart)
rp <- rpart::rpart(Y ~ ., data = dat,
                   control = rpart::rpart.control(xval = 5, minbucket = 10),
                   model = TRUE)
# complexity parameter with the lowest cross-validated error
cp <- rp$cptable[which.min(rp$cptable[, "xerror"]), "CP"]
Treefit <- rpart::prune(rp, cp = cp)           # pruned tree
plot(Treefit)
Dat_fusedTree <- Dat_Tree(Tree = Treefit, X = X, Z = Z, LinVars = FALSE)
Omics <- Dat_fusedTree$Omics
Clinical <- Dat_fusedTree$Clinical

[Package fusedTree version 1.0.1 Index]