grid_search_cv {pcpr}    R Documentation

Cross-validated grid search for PCP models

Description

grid_search_cv() conducts a Monte Carlo style cross-validated grid search of PCP parameters for a given data matrix D, PCP function pcp_fn, and grid of parameter settings to search through, grid. The run time of the grid search can be reduced using bespoke parallelization settings. The call to grid_search_cv() can be wrapped in a call to progressr::with_progress() for progress bar updates. See the sections below for details.

Usage

grid_search_cv(
  D,
  pcp_fn,
  grid,
  ...,
  parallel_strategy = "sequential",
  num_workers = 1,
  perc_test = 0.05,
  num_runs = 100,
  return_all_tests = FALSE,
  verbose = TRUE
)

Arguments

D

The input data matrix (can contain NA values). Note that PCP will converge much more quickly when D has been standardized in some way (e.g. scaling columns by their standard deviations, or column-wise min-max normalization).

pcp_fn

The PCP function to use when grid searching. Must be either rrmc or root_pcp (passed as the bare function name, without parentheses).

grid

A data.frame of dimension j by k containing the j-many unique settings of k-many parameters to try. NOTE: The columns of grid should be named after the required parameters in the function header of pcp_fn. For example, if pcp_fn = root_pcp and you want to search through lambda and mu, then names(grid) must be set to c("lambda", "mu"). If instead you want to keep e.g. lambda fixed and search through only mu, you can either have a grid with only one column, mu, and pass lambda as a constant via ..., or you can have names(grid) set to c("lambda", "mu") where lambda is constant. The same logic applies for pcp_fn = rrmc and eta and r.
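
The three ways of laying out grid described above can be sketched in base R as follows. The numeric values are illustrative assumptions only, not recommended settings, and the object names grid_both, grid_mu_only, and grid_const are hypothetical.

```r
# Option 1: search over both lambda and mu.
# expand.grid() builds every combination: 3 lambdas x 2 mus = 6 unique rows.
grid_both <- expand.grid(
  lambda = c(0.05, 0.1, 0.2),
  mu = c(0.5, 1)
)

# Option 2: search over mu only. lambda is then left out of the grid
# and would be supplied to grid_search_cv() as a constant via `...`.
grid_mu_only <- data.frame(mu = c(0.5, 1, 2))

# Option 3: keep lambda as a grid column, but hold it constant.
# data.frame() recycles the single lambda value across all three rows.
grid_const <- data.frame(lambda = 0.1, mu = c(0.5, 1, 2))

names(grid_both)
```

With Option 2, lambda must not also appear as a column in grid, since passing the same parameter both ways is ambiguous.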

...

Any parameters required by pcp_fn that should be kept constant throughout the grid search, or those parameters that cannot be stored in grid (e.g. the LOD parameter). A parameter should not be passed with ... if it is already a column in grid, as that behavior is ambiguous.

parallel_strategy

(Optional) The parallelization strategy used when conducting the grid search (to be passed on to the future::plan() function). Must be one of: "sequential", "multisession", "multicore" or "cluster". By default, parallel_strategy = "sequential", which runs the grid search serially and ignores the num_workers argument. The option parallel_strategy = "multisession" parallelizes the search via sockets in separate R sessions. The option parallel_strategy = "multicore" is not supported on Windows machines, nor in .Rmd files (it must be run in a .R script), but parallelizes the search much faster than "multisession" since it runs separate forked R processes. The option parallel_strategy = "cluster" parallelizes using separate R sessions, typically running on one or more machines. Support for other parallel strategies will be added in a future release of pcpr. It is recommended to use parallel_strategy = "multicore" or "multisession" when possible.

num_workers

(Optional) An integer specifying the number of workers to use when parallelizing the grid search, to be passed on to future::plan(). By default, num_workers = 1. When possible, it is recommended to use num_workers = parallel::detectCores(logical = FALSE), which computes the number of physical CPU cores available on the machine (see parallel::detectCores()). num_workers is ignored when parallel_strategy = "sequential", and must be > 1 otherwise.
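
A minimal sketch of choosing parallel_strategy and num_workers before calling grid_search_cv() is shown below. The fallback to 2 workers when the core count cannot be determined, and the Windows check, are assumptions made for this illustration rather than behavior of pcpr. The grid search itself only runs if pcpr is installed.

```r
# Count physical cores; detectCores() can return NA on some platforms,
# so fall back to 2 (an assumption made for this sketch).
n_cores <- parallel::detectCores(logical = FALSE)
if (is.na(n_cores)) n_cores <- 2L

# num_workers must be greater than 1 for any non-sequential strategy.
num_workers <- max(2L, as.integer(n_cores))

# "multicore" relies on forked processes, which Windows does not support.
parallel_strategy <- if (.Platform$OS.type == "windows") {
  "multisession"
} else {
  "multicore"
}

if (requireNamespace("pcpr", quietly = TRUE)) {
  data <- pcpr::sim_data()
  gs <- pcpr::grid_search_cv(
    data$D,
    pcpr::rrmc,
    data.frame("eta" = 0.35),
    r = 3,
    parallel_strategy = parallel_strategy,
    num_workers = num_workers,
    num_runs = 2
  )
}
```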

perc_test

(Optional) The fraction of entries of D that will be randomly masked as NA (missing) values to form the test set. Can be anything in the range [0, 1). By default, perc_test = 0.05. See the Best practices section for more details.

num_runs

(Optional) The number of times to test a given parameter setting. By default, num_runs = 100. See the Best practices section for more details.

return_all_tests

(Optional) A logical indicating if you would like the output from all the calls made to pcp_fn over the course of the grid search to be returned to you in list format. If set to FALSE, then only statistics on the parameters tested will be returned. If set to TRUE, then every L and S matrix recovered during the grid search will be returned in the lists L_mats and S_mats, every test set matrix will be returned in the list test_mats, the original input matrix will be returned as original_mat, and the parameters passed in to ... will be returned in the constant_params list. By default, return_all_tests = FALSE, which is highly recommended. Setting return_all_tests = TRUE can consume a massive amount of memory depending on the size of grid, the input matrix D, and the value for num_runs.

verbose

(Optional) A logical indicating if you would like verbose output displayed or not. By default, verbose = TRUE. To obtain progress bar updates, the user must wrap the grid_search_cv() call with a call to progressr::with_progress(). The progress bar does not depend on the value passed for verbose.
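
A hedged sketch of the progress bar wrapping described above follows. It assumes both pcpr and progressr are installed and does nothing otherwise. The variable have_pkgs is a hypothetical name used only for this illustration.

```r
# Only attempt the search when both packages are available.
have_pkgs <- requireNamespace("pcpr", quietly = TRUE) &&
  requireNamespace("progressr", quietly = TRUE)

if (have_pkgs) {
  data <- pcpr::sim_data()
  # with_progress() evaluates the wrapped call, reports progress updates
  # while it runs, and returns the call's value.
  gs <- progressr::with_progress(
    pcpr::grid_search_cv(
      data$D,
      pcpr::rrmc,
      data.frame("eta" = 0.35),
      r = 3,
      num_runs = 2,
      verbose = FALSE # the progress bar does not depend on verbose
    )
  )
  gs$summary_stats
}
```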

Value

A list containing:

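If return_all_tests = TRUE then the following are also returned as part of the list:

  L_mats: A list of every L matrix recovered during the grid search.

  S_mats: A list of every S matrix recovered during the grid search.

  test_mats: A list of every test set matrix.

  original_mat: The original input matrix.

  constant_params: A list of the parameters passed in to ....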

The Monte Carlo style cross-validation procedure

Each hyperparameter setting is cross-validated by:

  1. Randomly masking a fraction perc_test of the entries in D as missing (i.e. NA values), yielding D_tilde. Done via sim_na().

  2. Running the PCP function pcp_fn on D_tilde, yielding estimates L and S.

  3. Recording the relative recovery error of L compared with the input data matrix D for only those values that were masked as missing during the corruption step (step 1 above). Mathematically, calculating: ||P_{\Omega^c}(D - L)||_F / ||P_{\Omega^c}(D)||_F, where P_{\Omega^c} selects only those entries where is.na(D_tilde) == TRUE.

  4. Repeating steps 1-3 a total of num_runs times, where each "run" has a unique random seed from 1 to num_runs associated with it.

  5. Calculating performance statistics for each "run", and then summarizing them across all runs to obtain average model performance statistics.
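
The steps above can be sketched in base R as follows. This is an illustration only, not the pcpr implementation: mask_entries() is a hypothetical stand-in for sim_na(), and lowrank_standin() (column-mean filling followed by a truncated SVD) is a deliberately simple substitute for pcp_fn that recovers only an L estimate, with no sparse component S.

```r
# Step 1 (hypothetical stand-in for sim_na()): mask a fraction of the
# observed entries of D as NA, using a run-specific seed.
mask_entries <- function(D, perc_test, seed) {
  set.seed(seed)
  observed <- which(!is.na(D))
  test_idx <- sample(observed, round(perc_test * length(observed)))
  D_tilde <- D
  D_tilde[test_idx] <- NA
  list(D_tilde = D_tilde, test_idx = test_idx)
}

# Step 2 (stand-in for pcp_fn, NOT PCP): fill NAs with column means,
# then keep the leading r singular components.
lowrank_standin <- function(D_tilde, r) {
  filled <- D_tilde
  na_pos <- which(is.na(filled), arr.ind = TRUE)
  filled[na_pos] <- colMeans(D_tilde, na.rm = TRUE)[na_pos[, "col"]]
  s <- svd(filled, nu = r, nv = r)
  s$u %*% diag(s$d[seq_len(r)], nrow = r) %*% t(s$v)
}

# Step 3: relative recovery error restricted to the held-out entries.
rel_err <- function(D, L, test_idx) {
  sqrt(sum((D[test_idx] - L[test_idx])^2)) / sqrt(sum(D[test_idx]^2))
}

# Simulated rank-3 data matrix (100 x 10), purely for illustration.
set.seed(1)
D <- matrix(rnorm(100 * 3), 100, 3) %*% matrix(rnorm(3 * 10), 3, 10)

# Step 4: repeat for several runs, with seeds 1 to num_runs.
num_runs <- 5
errs <- vapply(seq_len(num_runs), function(seed) {
  m <- mask_entries(D, perc_test = 0.05, seed = seed)
  L <- lowrank_standin(m$D_tilde, r = 3)
  rel_err(D, L, m$test_idx)
}, numeric(1))

# Step 5: summarize across runs.
mean(errs)
```

In a grid search, this loop would be repeated once per row of grid, and the per-setting averages compared.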

Best practices for perc_test and num_runs

Experimentally, this grid search procedure retrieves the best performing PCP parameter settings when perc_test is relatively low, e.g. perc_test = 0.05, or 5%, and num_runs is relatively high, e.g. num_runs = 100.

The larger perc_test is, the more the test set turns into a matrix completion problem, rather than the desired matrix decomposition problem. To better resemble the actual problem PCP will be faced with come inference time, perc_test should therefore be kept relatively low.

Choosing a reasonable value for num_runs is dependent on the need to keep perc_test relatively low. Ideally, a large enough num_runs is used so that many (if not all) of the entries in D are likely to eventually be tested. Note that since test set entries are chosen randomly for all runs 1 through num_runs, in the pathologically worst case scenario, the same exact test set could be drawn each time. In the best case scenario, a different test set is obtained each run, providing balanced coverage of D. Viewed another way, the smaller num_runs is, the more the results are susceptible to overfitting to the relatively few selected test sets.
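
To make the coverage argument concrete, the sketch below estimates the expected fraction of entries of D that land in at least one test set. It assumes each run draws its test set independently and uniformly at random, so that any given entry is held out with probability perc_test in each run. The function name expected_coverage is hypothetical.

```r
# Probability that a given entry is never tested in num_runs runs is
# (1 - perc_test)^num_runs, so the expected coverage is its complement.
expected_coverage <- function(perc_test, num_runs) {
  1 - (1 - perc_test)^num_runs
}

# Coverage at perc_test = 0.05 for an increasing number of runs.
expected_coverage(0.05, c(10, 50, 100))
```

Under this assumption, 10 runs at perc_test = 0.05 test only about 40% of the entries at least once, 50 runs about 92%, and 100 runs about 99%, which is consistent with pairing a low perc_test with a high num_runs.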

Interpretation of results

Once the grid search has been conducted, the optimal hyperparameters can be chosen by examining the output statistics summary_stats. Below are a few suggestions for how to interpret the summary_stats table:

See Also

sim_na(), sparsity(), matrix_rank(), get_pcp_defaults()

Examples

#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-3 component as the ground truth L matrix;
# 2. A ground truth sparse component S w/outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data()
#### -------Tiny grid search-------####
# Here is a tiny grid search just to test the function quickly.
# In practice we would recommend a larger grid search.
# For examples of larger searches, see the vignettes.
gs <- grid_search_cv(
  data$D,
  rrmc,
  data.frame("eta" = 0.35),
  r = 3,
  num_runs = 2
)
gs$summary_stats

[Package pcpr version 1.0.0 Index]