spar {spareg}R Documentation

Sparse Projected Averaged Regression

Description

Apply Sparse Projected Averaged Regression to high-dimensional data by building an ensemble of generalized linear models, where the high-dimensional predictors can be screened using a screening coefficient and then projected using data-agnostic or data-informed random projection matrices. This function performs the procedure for a given grid of thresholds \nu and a grid of the number of marginal models to be employed in the ensemble. This function is also used in the cross-validated procedure spar.cv.

Usage

spar(
  x,
  y,
  family = gaussian("identity"),
  model = NULL,
  rp = NULL,
  screencoef = NULL,
  xval = NULL,
  yval = NULL,
  nnu = 20,
  nus = NULL,
  nummods = c(20),
  measure = c("deviance", "mse", "mae", "class", "1-auc"),
  parallel = FALSE,
  inds = NULL,
  RPMs = NULL,
  seed = NULL,
  set.seed.iteration = FALSE,
  ...
)

spareg(
  x,
  y,
  family = gaussian("identity"),
  model = NULL,
  rp = NULL,
  screencoef = NULL,
  xval = NULL,
  yval = NULL,
  nnu = 20,
  nus = NULL,
  nummods = c(20),
  measure = c("deviance", "mse", "mae", "class", "1-auc"),
  parallel = FALSE,
  inds = NULL,
  RPMs = NULL,
  seed = NULL,
  set.seed.iteration = FALSE,
  ...
)

Arguments

x

n x p numeric matrix of predictor variables.

y

quantitative response vector of length n.

family

a family object used for the marginal generalized linear model, default gaussian("identity").

model

function creating a 'sparmodel' object; defaults to spar_glm() for gaussian family with identity link and to spar_glmnet() for all other family-link combinations.

rp

function creating a 'randomprojection' object. Defaults to NULL. In this case rp_cw(data = TRUE) is used.

screencoef

function creating a 'screeningcoef' object. Defaults to NULL. In this case no screening is used is used.

xval

optional matrix of predictor variables observations used for validation of threshold nu and number of models; x is used if not provided.

yval

optional response observations used for validation of threshold nu and number of models; y is used if not provided.

nnu

number of different threshold values \nu to consider for thresholding; ignored when nus are given; defaults to 20.

nus

optional vector of \nu's to consider for thresholding; if not provided, nnu values ranging from 0 to the maximum absolute marginal coefficient are used.

nummods

vector of numbers of marginal models to consider for validation; defaults to c(20).

measure

loss to use for validation; defaults to "deviance" available for all families. Other options are "mse" or "mae" (between responses and predicted means, for all families), "class" (misclassification error) and "1-auc" (one minus area under the ROC curve) both just for binomial family.

parallel

assuming a parallel backend is loaded and available, a logical indicating whether the function should use it in parallelizing the estimation of the marginal models. Defaults to FALSE.

inds

optional list of index-vectors corresponding to variables kept after screening in each marginal model of length max(nummods); dimensions need to fit those of RPMs.

RPMs

optional list of projection matrices used in each marginal model of length max(nummods), diagonal elements will be overwritten with a coefficient only depending on the given x and y.

seed

integer seed to be set at the beginning of the SPAR algorithm. Default to NULL, in which case no seed is set.

set.seed.iteration

a boolean indicating whether a different seed should be set in each marginal model i. Defaults to FALSE. If TRUE, seed will be set to seed + i in each marginal model i.

...

further arguments mainly to ensure back-compatibility

Value

object of class 'spar' with elements

If a parallel backend is registered and parallel = TRUE, the foreach function is used to estimate the marginal models in parallel.

References

Parzer R, Filzmoser P, Vana-Gür L (2024). “Sparse Data-Driven Random Projection in Regression for High-Dimensional Data.” Technical Report 2312.00130, arXiv.org E-Print Archive. doi:10.48550/arXiv.2312.00130.

Parzer R, Filzmoser P, Vana-Gür L (2024). “Data-Driven Random Projection and Screening for High-Dimensional Generalized Linear Models.” Technical Report 2410.00971, arXiv.org E-Print Archive. doi:10.48550/arXiv.2410.00971.

Clarkson KL, Woodruff DP (2013). “Low Rank Approximation and Regression in Input Sparsity Time.” In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC '13, 81–90. ISBN 9781450320290, doi:10.1145/2488608.2488620.

Achlioptas D (2003). “Database-Friendly Random Projections: Johnson-Lindenstrauss with Binary Coins.” Journal of Computer and System Sciences, 66(4), 671-687. ISSN 0022-0000, doi:10.1016/S0022-0000(03)00025-4, Special Issue on PODS 2001.

See Also

spar.cv,coef.spar,predict.spar,plot.spar,print.spar

Examples

example_data <- simulate_spareg_data(n = 200, p = 2000, ntest = 100)
spar_res <- spar(example_data$x, example_data$y, xval = example_data$xtest,
  yval = example_data$ytest, nummods=c(5, 10, 15, 20, 25, 30))
coefs <- coef(spar_res)
pred <- predict(spar_res, xnew = example_data$x)
plot(spar_res)
plot(spar_res, plot_type = "Val_Meas", plot_along = "nummod", nu = 0)
plot(spar_res, plot_type = "Val_Meas", plot_along = "nu", nummod = 10)
plot(spar_res, plot_type = "Val_numAct",  plot_along = "nummod", nu = 0)
plot(spar_res, plot_type = "Val_numAct",  plot_along = "nu", nummod = 10)
plot(spar_res, plot_type = "res-vs-fitted",  xfit = example_data$xtest,
  yfit = example_data$ytest)
plot(spar_res, plot_type = "coefs", prange = c(1,400))

spar_res <- spareg(example_data$x, example_data$y, xval = example_data$xtest,
  yval = example_data$ytest, nummods=c(5, 10, 15, 20, 25, 30))

[Package spareg version 1.0.0 Index]