spar {spareg}                R Documentation

Sparse Projected Averaged Regression

Description

Apply Sparse Projected Averaged Regression to high-dimensional data by building an ensemble of generalized linear models, where the high-dimensional predictors can be screened using a screening coefficient and then projected using data-agnostic or data-informed random projection matrices. This function performs the procedure for a given grid of thresholds \nu and a grid of the number of marginal models to be employed in the ensemble. This function is also used in the cross-validated procedure spar.cv.
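As a rough illustration of the projection step, the following base R sketch builds one sparse random projection in the style of Clarkson and Woodruff (2013), with a single random sign per column, and applies it to a simulated predictor matrix. The names n, p, m, Phi and z are illustrative assumptions, and this is not necessarily how the package constructs its projections.

```r
set.seed(42)
n <- 20; p <- 12; m <- 4          # m is a hypothetical reduced dimension
x <- matrix(rnorm(n * p), n, p)   # simulated predictors

# Sparse embedding: each predictor (column) is assigned to exactly one
# of the m rows, with a random sign.
rows  <- sample.int(m, p, replace = TRUE)
signs <- sample(c(-1, 1), p, replace = TRUE)
Phi <- matrix(0, nrow = m, ncol = p)
Phi[cbind(rows, seq_len(p))] <- signs

z <- x %*% t(Phi)                 # n x m matrix of reduced predictors
dim(z)
```

A data-informed variant would presumably replace the random signs with coefficients computed from x and y, so that the projection carries information about the response.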

Usage

spar(
  x,
  y,
  family = gaussian("identity"),
  model = NULL,
  rp = NULL,
  screencoef = NULL,
  xval = NULL,
  yval = NULL,
  nnu = 20,
  nus = NULL,
  nummods = c(20),
  measure = c("deviance", "mse", "mae", "class", "1-auc"),
  avg_type = c("link", "response"),
  parallel = FALSE,
  inds = NULL,
  RPMs = NULL,
  seed = NULL,
  ...
)

spareg(
  x,
  y,
  family = gaussian("identity"),
  model = NULL,
  rp = NULL,
  screencoef = NULL,
  xval = NULL,
  yval = NULL,
  nnu = 20,
  nus = NULL,
  nummods = c(20),
  measure = c("deviance", "mse", "mae", "class", "1-auc"),
  avg_type = c("link", "response"),
  parallel = FALSE,
  inds = NULL,
  RPMs = NULL,
  seed = NULL,
  ...
)

Arguments

x

n x p numeric matrix of predictor variables.

y

quantitative response vector of length n.

family

a family object used for the marginal generalized linear model, default gaussian("identity").

model

function creating a 'sparmodel' object; defaults to spar_glm() for gaussian family with identity link and to spar_glmnet() for all other family-link combinations.

rp

function creating a 'randomprojection' object. Defaults to NULL. In this case rp_cw(data = TRUE) is used.

screencoef

function creating a 'screeningcoef' object. Defaults to NULL. In this case no screening is used.

xval

optional matrix of predictor variable observations used for validation of threshold nu and number of models; x is used if not provided.

yval

optional response observations used for validation of threshold nu and number of models; y is used if not provided.

nnu

number of different threshold values \nu to consider for thresholding; ignored when nus are given; defaults to 20.

nus

optional vector of \nu's to consider for thresholding; if not provided, nnu values ranging from 0 to the maximum absolute marginal coefficient are used.
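The role of the threshold grid can be sketched in base R as follows. The coefficient matrix beta_mat is simulated, the equally spaced grid is an assumption about how the nnu values are placed, and the rule of zeroing coefficients below \nu in absolute value before averaging is assumed here rather than taken from the package code.

```r
set.seed(1)
p <- 10; M <- 4
# Hypothetical p x M matrix of coefficients, one column per marginal model
beta_mat <- matrix(rnorm(p * M), nrow = p, ncol = M)

nnu <- 5
# Grid from 0 to the largest absolute coefficient (equal spacing assumed)
nus <- seq(0, max(abs(beta_mat)), length.out = nnu)

# Zero out small coefficients in every marginal model, then average
threshold_and_average <- function(beta_mat, nu) {
  beta_thr <- beta_mat
  beta_thr[abs(beta_thr) < nu] <- 0
  rowMeans(beta_thr)
}

# Number of active (nonzero) averaged coefficients for each threshold
num_active <- vapply(
  nus,
  function(nu) sum(threshold_and_average(beta_mat, nu) != 0),
  numeric(1)
)
num_active
```

In this sketch, \nu = 0 keeps every coefficient and larger thresholds give sparser averaged coefficient vectors.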

nummods

vector of numbers of marginal models to consider for validation; defaults to c(20).

measure

loss to use for validation; defaults to "deviance", which is available for all families. Other options are "mse" or "mae" (between responses and predicted means, for all families), and "class" (misclassification error) or "1-auc" (one minus area under the ROC curve), both available only for the binomial family.
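To make the loss options concrete, the sketch below computes each of them in base R for a small hypothetical binary example. The data are made up, the 0.5 cutoff for the misclassification error and the use of the summed (rather than averaged) deviance are assumptions, and the package may compute these quantities differently.

```r
y  <- c(0, 0, 1, 1, 1)               # hypothetical binary responses
mu <- c(0.2, 0.6, 0.4, 0.7, 0.9)     # hypothetical predicted means

fam <- binomial()
# Deviance from the family's unit deviance residuals (unit weights)
dev <- sum(fam$dev.resids(y, mu, wt = rep(1, length(y))))

mse <- mean((y - mu)^2)              # mean squared error
mae <- mean(abs(y - mu))             # mean absolute error

# Misclassification error with an assumed cutoff of 0.5
class_err <- mean((mu > 0.5) != y)

# AUC via the rank (Mann-Whitney) formula, then one minus AUC
r  <- rank(mu)
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
one_minus_auc <- 1 - auc

c(deviance = dev, mse = mse, mae = mae,
  class = class_err, one_minus_auc = one_minus_auc)
```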

avg_type

type of averaging of the marginal models; either on the link level (default) or on the response level. This is used in computing the validation measure.
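The difference between the two averaging levels can be seen in a small base R sketch. The linear predictors in eta are hypothetical, and the interpretation of "link" as averaging linear predictors before applying the inverse link, and "response" as averaging predicted means, is an assumption based on the argument description.

```r
# Hypothetical linear predictors of three marginal models for one observation
eta <- c(-2, 0, 3)

fam <- binomial("logit")
# "link": average the linear predictors, then map to the response scale
pred_link <- fam$linkinv(mean(eta))
# "response": map each model to the response scale, then average
pred_response <- mean(fam$linkinv(eta))
c(link = pred_link, response = pred_response)

# With an identity link the two kinds of averaging coincide
gaus <- gaussian("identity")
all.equal(gaus$linkinv(mean(eta)), mean(gaus$linkinv(eta)))
```

For nonlinear links the two predictions generally differ, so the choice can affect the validation measure.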

parallel

logical indicating whether the estimation of the marginal models should be parallelized; this requires that a parallel backend is loaded and available. Defaults to FALSE.

inds

optional list of length max(nummods) containing the index vectors of the variables kept after screening in each marginal model; dimensions need to fit those of RPMs.

RPMs

optional list of length max(nummods) containing the projection matrices used in each marginal model; diagonal elements will be overwritten with a coefficient depending only on the given x and y.

seed

integer seed to be set at the beginning of the SPAR algorithm. Defaults to NULL, in which case no seed is set.

...

further arguments, mainly to ensure backward compatibility.

Value

object of class 'spar' with elements

If a parallel backend is registered and parallel = TRUE, the foreach function is used to estimate the marginal models in parallel.

References

Parzer R, Filzmoser P, Vana-Gür L (2024). “Sparse Data-Driven Random Projection in Regression for High-Dimensional Data.” Technical Report 2312.00130, arXiv.org E-Print Archive. doi:10.48550/arXiv.2312.00130.

Parzer R, Filzmoser P, Vana-Gür L (2024). “Data-Driven Random Projection and Screening for High-Dimensional Generalized Linear Models.” Technical Report 2410.00971, arXiv.org E-Print Archive. doi:10.48550/arXiv.2410.00971.

Clarkson KL, Woodruff DP (2013). “Low Rank Approximation and Regression in Input Sparsity Time.” In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC '13, 81–90. ISBN 9781450320290, doi:10.1145/2488608.2488620.

Achlioptas D (2003). “Database-Friendly Random Projections: Johnson-Lindenstrauss with Binary Coins.” Journal of Computer and System Sciences, 66(4), 671–687. ISSN 0022-0000, doi:10.1016/S0022-0000(03)00025-4, Special Issue on PODS 2001.

See Also

spar.cv, coef.spar, predict.spar, plot.spar, print.spar

Examples

# Simulate training data (n = 200, p = 400) and 100 test observations
example_data <- simulate_spareg_data(n = 200, p = 400, ntest = 100)
# Fit SPAR, validating the threshold and the number of models on the test set
spar_res <- spar(example_data$x, example_data$y, xval = example_data$xtest,
  yval = example_data$ytest, nummods = c(5, 10, 15, 20, 25, 30))
# Extract coefficients and predict on the training predictors
coefs <- coef(spar_res)
pred <- predict(spar_res, xnew = example_data$x)
# Validation measure and number of active variables along each grid
plot(spar_res)
plot(spar_res, plot_type = "val_measure", plot_along = "nummod", nu = 0)
plot(spar_res, plot_type = "val_measure", plot_along = "nu", nummod = 10)
plot(spar_res, plot_type = "val_numactive", plot_along = "nummod", nu = 0)
plot(spar_res, plot_type = "val_numactive", plot_along = "nu", nummod = 10)
# Residuals against fitted values on the test set, and the coefficients
plot(spar_res, plot_type = "res_vs_fitted", xfit = example_data$xtest,
  yfit = example_data$ytest)
plot(spar_res, plot_type = "coefs", prange = c(1, 400))

# The same fit using spareg(), which takes the same arguments as spar()
spar_res <- spareg(example_data$x, example_data$y, xval = example_data$xtest,
  yval = example_data$ytest, nummods = c(5, 10, 15, 20, 25, 30))

[Package spareg version 1.1.0 Index]