roseRF_plm {roseRF}    R Documentation

ROSE random forest estimator for the partially linear model

Description

Estimates the parameter of interest \theta_0 in the partially linear model

\mathbb{E}[Y|X,Z] = X\theta_0 + f_0(Z),

which can be re-expressed in terms of the ‘nuisance functions’ (\mathbb{E}[Y|Z], \mathbb{E}[X|Z]) as

\mathbb{E}[Y|X,Z]-\mathbb{E}[Y|Z] = (X-\mathbb{E}[X|Z])\theta_0.
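
To see why this second form identifies \theta_0, the following R sketch simulates a partially linear model in which both nuisance functions are known in closed form, then regresses the residualised Y on the residualised X. The data-generating process, sample size and variable names are illustrative assumptions and are not part of the package.

```r
set.seed(1)
n      <- 5000
theta0 <- 2                          # illustrative true parameter
Z <- runif(n, -2, 2)
X <- sin(Z) + rnorm(n)               # so that E[X|Z] = sin(Z)
Y <- X * theta0 + Z^2 + rnorm(n)     # f_0(Z) = Z^2

# Oracle nuisance functions, available only because the data are simulated
EX_Z <- sin(Z)
EY_Z <- sin(Z) * theta0 + Z^2        # E[Y|Z] = E[X|Z] * theta0 + f_0(Z)

# Least squares through the origin of (Y - E[Y|Z]) on (X - E[X|Z])
x_res <- X - EX_Z
y_res <- Y - EY_Z
theta_oracle <- sum(x_res * y_res) / sum(x_res^2)
theta_oracle
```

In practice the nuisance functions are unknown and are replaced by fitted regressions, which is what the learner arguments below control.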

Usage

roseRF_plm(
  y_formula,
  y_learner,
  y_pars = list(),
  x_formula,
  x_learner,
  x_pars = list(),
  M1_formula = x_formula,
  M1_learner = x_learner,
  M1_pars = x_pars,
  M2_formula = NA,
  M2_learner = NA,
  M2_pars = list(),
  M3_formula = NA,
  M3_learner = NA,
  M3_pars = list(),
  M4_formula = NA,
  M4_learner = NA,
  M4_pars = list(),
  M5_formula = NA,
  M5_learner = NA,
  M5_pars = list(),
  data,
  K = 5,
  S = 1,
  max.depth = 10,
  num.trees = 500,
  min.node.size = max(10, ceiling(0.01 * (K - 1)/K * nrow(data))),
  replace = TRUE,
  sample.fraction = 0.8
)

Arguments

y_formula

a two-sided formula object describing the model for \mathbb{E}[Y|Z].

y_learner

a string specifying the regression method to fit the regression of Y on Z as given by y_formula (e.g. randomforest, xgboost, neuralnet, gam).

y_pars

a list containing hyperparameters for the y_learner chosen. Default is an empty list, which performs hyperparameter tuning.

x_formula

a two-sided formula object describing the model for \mathbb{E}[X|Z].

x_learner

a string specifying the regression method to fit the regression of X on Z as given by x_formula (e.g. randomforest, xgboost, neuralnet, gam).

x_pars

a list containing hyperparameters for the x_learner chosen. Default is an empty list, which performs hyperparameter tuning.

M1_formula

a two-sided formula object for the model \mathbb{E}[M_1(X)|Z]. Default is M_1(X)=X.

M1_learner

a string specifying the regression method for \mathbb{E}[M_1(X)|Z] estimation.

M1_pars

a list containing hyperparameters for the M1_learner chosen.

M2_formula

a two-sided formula object for the model \mathbb{E}[M_2(X)|Z]. Default is no formula / regression (i.e. J=1).

M2_learner

a string specifying the regression method for \mathbb{E}[M_2(X)|Z] estimation.

M2_pars

a list containing hyperparameters for the M2_learner chosen.

M3_formula

a two-sided formula object for the model \mathbb{E}[M_3(X)|Z]. Default is no formula / regression (i.e. J=1).

M3_learner

a string specifying the regression method for \mathbb{E}[M_3(X)|Z] estimation.

M3_pars

a list containing hyperparameters for the M3_learner chosen.

M4_formula

a two-sided formula object for the model \mathbb{E}[M_4(X)|Z]. Default is no formula / regression (i.e. J=1).

M4_learner

a string specifying the regression method for \mathbb{E}[M_4(X)|Z] estimation.

M4_pars

a list containing hyperparameters for the M4_learner chosen.

M5_formula

a two-sided formula object for the model \mathbb{E}[M_5(X)|Z]. Default is no formula / regression (i.e. J=1).

M5_learner

a string specifying the regression method for \mathbb{E}[M_5(X)|Z] estimation.

M5_pars

a list containing hyperparameters for the M5_learner chosen.

data

a data frame containing the variables for the partially linear model.

K

the number of folds used for K-fold cross-fitting. Default is 5.

S

the number of repeats of K-fold cross-fitting, used to mitigate the dependence of the estimator on the random sample splits. Default is 1.

max.depth

Maximum depth parameter used for ROSE random forests. Default is 10.

num.trees

Number of trees used for a single ROSE random forest. Default is 500.

min.node.size

Minimum node size of a leaf in each tree. Default is max(10, ceiling(0.01 * (K - 1)/K * nrow(data))).

replace

Whether sampling for a single random tree is performed with (bootstrap) or without replacement. Default is TRUE (i.e. bootstrap).

sample.fraction

Proportion of data used for each random tree. Default is 0.8.
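
As a worked illustration of the min.node.size default: each cross-fitting fold trains on a fraction (K-1)/K of the rows, and the default is one percent of that training size, with a lower bound of 10. The helper name `default_min_node_size` and the sample sizes are hypothetical and used only for this example.

```r
# Default min.node.size from the Usage section, written as a helper.
# `default_min_node_size` is a hypothetical name, not a package function.
default_min_node_size <- function(n, K = 5) {
  max(10, ceiling(0.01 * (K - 1) / K * n))
}

small  <- default_min_node_size(500)     # 0.01 * 0.8 * 500   = 4,  so the bound 10 applies
medium <- default_min_node_size(2000)    # 0.01 * 0.8 * 2000  = 16
large  <- default_min_node_size(10000)   # 0.01 * 0.8 * 10000 = 80
c(small = small, medium = medium, large = large)
```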

Details

The estimator of \theta_0 solves the estimating equation

\sum_{i}\psi(Y_i,X_i,Z_i; \theta,\hat{\eta}(Z),\hat{w}(Z)) = 0,

\psi(Y,X,Z;\theta,\eta_0,w) := \sum_{j=1}^J w_j(Z) \big( M_j(X) - \mathbb{E}[M_j(X)|Z]\big) \Big( \big(Y-\mathbb{E}[Y|Z]\big)-\big(X-\mathbb{E}[X|Z]\big)\theta \Big),

\eta_0 := \big(\mathbb{E}[Y|Z=\cdot], \mathbb{E}[X|Z=\cdot]\big),

where M_1(X),\ldots,M_J(X) denote user-chosen functions of X and w(Z)=\big(w_1(Z),\ldots,w_J(Z)\big) denotes weights estimated via ROSE random forests. The default takes J=1 and M_1(X)=X; if taking J\geq 2, we recommend checking carefully that any additional user-chosen regression tasks are applicable and appropriate.
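
With the default J=1 and M_1(X)=X, the estimating equation is linear in \theta and has a closed-form root. The R sketch below illustrates this on simulated residuals. The residual vectors and the weight vector `w` are stand-ins for cross-fitted nuisance residuals and ROSE random forest weights, and the sandwich formula for the standard error is a standard heteroscedasticity-robust form; it is an assumption, not confirmed here, that it matches what the function reports as stderror.

```r
set.seed(2)
n      <- 2000
theta0 <- 1                                   # illustrative true parameter
Z      <- runif(n)
x_res  <- rnorm(n)                            # stand-in for X - E[X|Z]
sigma  <- 0.5 + 2 * Z                         # heteroscedastic noise scale
y_res  <- x_res * theta0 + sigma * rnorm(n)   # stand-in for Y - E[Y|Z]

# Stand-in weights (inverse conditional variance); a ROSE forest would estimate w
w <- 1 / sigma^2

# Root of sum_i w_i * x_res_i * (y_res_i - x_res_i * theta) = 0
theta_hat <- sum(w * x_res * y_res) / sum(w * x_res^2)

# Sandwich standard error: sqrt(sum_i psi_i^2) / |sum_i d psi_i / d theta|
psi    <- w * x_res * (y_res - x_res * theta_hat)
se_hat <- sqrt(sum(psi^2)) / sum(w * x_res^2)

c(theta = theta_hat, stderror = se_hat)
```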

The parameter of interest \theta_0 is estimated using a DML2 / K-fold cross-fitting framework, which allows arbitrary learners for \hat{\eta} provided their estimation error decays faster than n^{-1/4}; that is, by solving the estimating equation

\sum_{k \in [K]}\sum_{i \in I_k}\psi(Y_i,X_i,Z_i; \theta,\hat{\eta}^{(k)}(Z),\hat{w}^{(k)}(Z)) = 0,

where I_1,\ldots,I_K form a partition of the index set for the datapoints (Y_i,X_i,Z_i), \hat{\eta}^{(k)} denotes an estimator for \eta_0 trained on the data indexed by I_k^c, and \hat{w}^{(k)} denotes a ROSE random forest (again trained on the data indexed by I_k^c).
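
The cross-fitting scheme can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: cubic polynomial least-squares fits stand in for the user-chosen learners, the weights are fixed at w = 1 in place of a ROSE random forest, and the simulated data and all variable names are assumptions made for the example.

```r
set.seed(3)
n      <- 1000
K      <- 5
theta0 <- 1.5                       # illustrative true parameter
Z <- runif(n, -1, 1)
X <- Z^2 + rnorm(n)
Y <- X * theta0 + sin(2 * Z) + rnorm(n)
dat <- data.frame(Y = Y, X = X, Z = Z)

# Random partition I_1, ..., I_K of the row indices
fold <- sample(rep(seq_len(K), length.out = n))

x_res <- numeric(n)
y_res <- numeric(n)
for (k in seq_len(K)) {
  test  <- fold == k                # rows in I_k
  train <- !test                    # rows in I_k^c
  # Nuisance estimators trained on I_k^c only
  fit_y <- lm(Y ~ poly(Z, 3), data = dat[train, ])
  fit_x <- lm(X ~ poly(Z, 3), data = dat[train, ])
  # Out-of-fold residuals evaluated on I_k
  y_res[test] <- dat$Y[test] - predict(fit_y, newdata = dat[test, ])
  x_res[test] <- dat$X[test] - predict(fit_x, newdata = dat[test, ])
}

# DML2: pool all folds and solve one estimating equation (here with w = 1)
theta_dml2 <- sum(x_res * y_res) / sum(x_res^2)
theta_dml2
```

Pooling the folds before solving, rather than averaging K per-fold estimates, is what distinguishes DML2 from DML1.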

Value

A list containing:

theta

The estimator of \theta_0.

stderror

Huber robust estimate of the standard error of the \theta_0-estimator.

coefficients

Table containing the \theta_0 estimate, its standard error, z-value and p-value.
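
Examples

A call might look as follows. This is a hedged sketch rather than a tested recipe: the argument names, the learner string "randomforest" and the returned list elements are taken from this page, but the simulated data, the formulas `Y ~ Z` and `X ~ Z`, and the reduced num.trees are assumptions chosen for illustration. The call is skipped when the roseRF package is not installed.

```r
set.seed(4)
n <- 200
Z <- runif(n, -1, 1)
X <- Z + rnorm(n)
Y <- X + sin(Z) + rnorm(n)          # theta_0 = 1, f_0(Z) = sin(Z)
sim <- data.frame(Y = Y, X = X, Z = Z)

if (requireNamespace("roseRF", quietly = TRUE)) {
  # Formulas and settings below are illustrative assumptions
  fit <- roseRF::roseRF_plm(
    y_formula = Y ~ Z, y_learner = "randomforest",
    x_formula = X ~ Z, x_learner = "randomforest",
    data = sim, K = 5, S = 1, num.trees = 50
  )
  print(fit$theta)          # estimate of theta_0
  print(fit$stderror)       # robust standard error
  print(fit$coefficients)   # estimate, standard error, z-value, p-value
}
```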


[Package roseRF version 0.1.0 Index]