grid_search_cv {pcpr}    R Documentation
Cross-validated grid search for PCP models
Description

grid_search_cv() conducts a Monte Carlo style cross-validated grid search
of PCP parameters for a given data matrix D, PCP function pcp_fn, and grid
of parameter settings to search through grid. The run time of the grid
search can be sped up using bespoke parallelization settings. The call to
grid_search_cv() can be wrapped in a call to progressr::with_progress()
for progress bar updates. See the sections below for details.
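As a quick sketch, a parallelized call with a progress bar might look like
the following. This is an assumption-laden illustration: it presumes pcpr
is loaded, that `data` has been simulated as in the Examples section, and
that "multisession" is an accepted future-style parallel_strategy.

```r
# Hypothetical parallel grid search with a progress bar. Assumes pcpr is
# loaded, `data` comes from sim_data() (see Examples), and that
# "multisession" is a valid future-style parallel_strategy.
library(progressr)

gs <- with_progress(
  grid_search_cv(
    data$D,
    rrmc,
    grid = data.frame(eta = c(0.1, 0.3)),
    r = 3,
    parallel_strategy = "multisession",
    num_workers = 2
  )
)
```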
Usage
grid_search_cv(
  D,
  pcp_fn,
  grid,
  ...,
  parallel_strategy = "sequential",
  num_workers = 1,
  perc_test = 0.05,
  num_runs = 100,
  return_all_tests = FALSE,
  verbose = TRUE
)
Arguments
D: The input data matrix (can contain NA values).

pcp_fn: The PCP function to use when grid searching. Must be either
rrmc or root_pcp (passed without parentheses).

grid: A data.frame containing the grid of parameter settings to search
through, where each row is one distinct setting and each column
corresponds to a parameter of pcp_fn.

...: Any parameters required by pcp_fn that should be held constant
throughout the grid search.

parallel_strategy: (Optional) The parallelization strategy used when
conducting the grid search (to be passed on to future::plan()). By
default, parallel_strategy = "sequential", for a serial grid search.

num_workers: (Optional) An integer specifying the number of workers to
use when parallelizing the grid search, to be passed on to
future::plan(). By default, num_workers = 1.

perc_test: (Optional) The fraction of entries of D that will be
randomly corrupted as NA values to serve as a held-out test set. By
default, perc_test = 0.05.

num_runs: (Optional) The number of times to test a given parameter
setting. By default, num_runs = 100.

return_all_tests: (Optional) A logical indicating if you would like
the output from all the calls made to pcp_fn during the grid search to
be returned with the results. By default, return_all_tests = FALSE.

verbose: (Optional) A logical indicating if you would like verbose
output displayed or not. By default, verbose = TRUE.
Value
A list containing:

- all_stats: A data.frame containing the statistics of every run
comprising the grid search. These statistics include the parameter
settings for the run, along with the run_num (used as the seed for the
corruption step, step 1 in the grid search procedure), the relative
error for the run rel_err, the rank of the recovered L matrix L_rank,
the sparsity of the recovered S matrix S_sparsity, the number of
iterations PCP took to reach convergence (for root_pcp() only), and
the error status run_error of the PCP run (NA if no error, otherwise a
character string).
- summary_stats: A data.frame containing a summary of the information
in all_stats, made by column-wise averaging the results in all_stats.
- metadata: A character string containing the metadata associated with
the grid search instance.

If return_all_tests = TRUE, then the following are also returned as
part of the list:

- L_mats: A list containing all the L matrices returned from PCP
throughout the grid search, so length(L_mats) == nrow(all_stats). Row
i in all_stats corresponds to L_mats[[i]].
- S_mats: A list containing all the S matrices returned from PCP
throughout the grid search, so length(S_mats) == nrow(all_stats). Row
i in all_stats corresponds to S_mats[[i]].
- test_mats: A list of length num_runs containing all the corrupted
test matrices (and their masks) used throughout the grid search. Note:
all_stats$run_num[i] corresponds to test_mats[[i]].
- original_mat: The original data matrix D.
- constant_params: A copy of the constant parameters that were
originally passed to the grid search (for record keeping).
The Monte Carlo style cross-validation procedure
Each hyperparameter setting is cross-validated by:

1. Randomly corrupting perc_test percent of the entries in D as
missing (i.e. NA values), yielding D_tilde. Done via sim_na().
2. Running the PCP function pcp_fn on D_tilde, yielding estimates L
and S.
3. Recording the relative recovery error of L compared with the input
data matrix D for only those values that were imputed as missing
during the corruption step (step 1 above). Mathematically, calculate:
||P_{\Omega^c}(D - L)||_F / ||P_{\Omega^c}(D)||_F, where P_{\Omega^c}
selects only those entries where is.na(D_tilde) == TRUE.
4. Repeating steps 1-3 for a total of num_runs-many times, where each
"run" has a unique random seed from 1 to num_runs associated with it.

Performance statistics can then be calculated for each run, and then
summarized across all runs for average model performance statistics.
Best practices for perc_test and num_runs

Experimentally, this grid search procedure retrieves the best performing
PCP parameter settings when perc_test is relatively low, e.g.
perc_test = 0.05 (5%), and num_runs is relatively high, e.g.
num_runs = 100.

The larger perc_test is, the more the test set turns the task into a
matrix completion problem rather than the desired matrix decomposition
problem. To better resemble the actual problem PCP will face at
inference time, perc_test should therefore be kept relatively low.

Choosing a reasonable value for num_runs depends on the need to keep
perc_test relatively low. Ideally, num_runs is large enough that many
(if not all) of the entries in D are likely to eventually be tested.
Note that since test set entries are chosen randomly for all runs 1
through num_runs, in the pathological worst case the same exact test
set could be drawn each time. In the best case, a different test set
is obtained each run, providing balanced coverage of D. Viewed another
way, the smaller num_runs is, the more susceptible the results are to
overfitting to the relatively few selected test sets.
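Assuming each run draws its test entries independently, the chance that a
given entry of D is never held out across the whole grid search is
(1 - perc_test)^num_runs, so the defaults leave almost no entry untested:

```r
# Probability that a given entry of D is never held out, assuming each
# run independently corrupts a perc_test fraction of the entries.
perc_test <- 0.05
num_runs <- 100
p_never_tested <- (1 - perc_test)^num_runs
round(p_never_tested, 4)  # about 0.0059 with the defaults
```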
Interpretation of results

Once the grid search has been conducted, the optimal hyperparameters can
be chosen by examining the output statistics summary_stats. Below are a
few suggestions for how to interpret the summary_stats table:
- Generally speaking, the first thing a user will want to inspect is
the rel_err statistic, capturing the relative discrepancy between
recovered test sets and their original, observed (yet possibly noisy)
values. Lower rel_err means the PCP model was better able to recover
the held-out test set. So, in general, the best parameter settings are
those with the lowest rel_err. Having said this, it is important to
remember that this statistic should be taken with a grain of salt:
because no ground truth L matrix exists, the rel_err measurement is
forced to rely on the comparison between the noisy observed data
matrix D and the estimated low-rank model L. So the rel_err metric is
an "apples to oranges" relative error. For data that is a priori
expected to be subject to a high degree of noise, it may actually be
better to discard parameter settings with suspiciously low rel_errs
(in which case the solution may be hallucinating an inaccurate
low-rank structure from the observed noise).
- For grid searches using root_pcp() as the PCP model, parameters that
fail to converge can be discarded. Generally, fewer root_pcp()
iterations (num_iter) taken to reach convergence portend a more
reliable / stable solution. In rare cases, the user may need to
increase root_pcp()'s max_iter argument to reach convergence. rrmc()
does not report convergence metadata, as its optimization scheme runs
for a fixed number of iterations.
- Parameter settings with unreasonable sparsity or rank measurements
can also be discarded. Here, "unreasonable" means these reported
metrics flagrantly contradict prior assumptions, knowledge, or work.
For instance, most air pollution datasets contain a number of extreme
exposure events, so PCP solutions returning sparse S models with 100%
sparsity have obviously been regularized too heavily. Solutions with
lower sparsities should be preferred. Note that reported sparsity and
rank measurements are estimates heavily dependent on the thresh set by
the sparsity() and matrix_rank() functions. E.g. it could be that the
actual average matrix rank is much higher or lower when a threshold
that better takes into account the relative scale of the singular
values is used. Likewise for the sparsity estimations. Also, recall
that the given value for perc_test artificially sets a sparsity floor,
since those missing entries in the test set cannot be recovered in the
S matrix. E.g. if perc_test = 0.05, then no parameter setting will
have an estimated sparsity lower than 5%.
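The filtering heuristics above might be applied to a summary table like
so; the toy data.frame here merely mimics the summary_stats columns
documented above and stands in for a real grid search result:

```r
# Toy stand-in for gs$summary_stats, mimicking the documented columns
ss <- data.frame(
  eta        = c(0.1, 0.2, 0.3),
  rel_err    = c(0.31, 0.18, 0.25),
  L_rank     = c(5, 3, 2),
  S_sparsity = c(42, 61, 100)
)

# Discard degenerate settings whose S matrix is 100% sparse, then rank
# the remaining settings by mean relative recovery error (lower = better)
best <- ss[ss$S_sparsity < 100, ]
best[order(best$rel_err), ]  # eta = 0.2 comes out on top here
```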
See Also
sim_na()
, sparsity()
, matrix_rank()
, get_pcp_defaults()
Examples
#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-3 component as the ground truth L matrix;
# 2. A ground truth sparse component S w/outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data()
#### -------Tiny grid search-------####
# Here is a tiny grid search just to test the function quickly.
# In practice we would recommend a larger grid search.
# For examples of larger searches, see the vignettes.
gs <- grid_search_cv(
  data$D,
  rrmc,
  data.frame("eta" = 0.35),
  r = 3,
  num_runs = 2
)
gs$summary_stats