optimize_gps {vecmatch} | R Documentation |
Optimize the Matching Process via Random Search
Description
The optimize_gps() function performs a random search to identify optimal
combinations of parameters for the match_gps() and estimate_gps() functions.
The goal is to maximize the percentage of matched samples (perc_matched)
while minimizing the maximum standardized mean difference (smd), thereby
improving the overall balance of covariates across treatment groups. The
function supports parallel execution through the foreach and future
packages, enabling multithreaded computation to accelerate the optimization
process, particularly when dealing with large datasets or complex parameter
spaces.
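As a rough illustration of the balance criterion, the sketch below computes a maximum standardized mean difference for a single covariate across all pairwise group comparisons. It assumes the common pooled-standard-deviation definition of the SMD; the helper names smd_pair() and max_smd() and the toy data are hypothetical, and the exact formula used internally by the package may differ.

```r
# Pooled-SD standardized mean difference between two numeric samples
# (one common definition; assumed here, not taken from the package).
smd_pair <- function(x, y) {
  abs(mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
}

# Maximum SMD over all pairwise group comparisons for one covariate.
max_smd <- function(x, group) {
  pairs <- combn(unique(as.character(group)), 2, simplify = FALSE)
  max(vapply(
    pairs,
    function(p) smd_pair(x[group == p[1]], x[group == p[2]]),
    numeric(1)
  ))
}

# Toy data: three treatment groups with three observations each.
age   <- c(50, 54, 58, 60, 64, 68, 52, 56, 60)
group <- rep(c("a", "b", "c"), each = 3)

max_smd(age, group)
```

A lower value indicates better balance for that covariate; in the optimization, this quantity is traded off against the percentage of matched samples.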
Usage
optimize_gps(
  data = NULL,
  formula,
  ordinal_treat = NULL,
  n_iter = 1000,
  n_cores = 1,
  opt_args = NULL
)
Arguments
data |
A |
formula |
A valid formula object used to estimate the generalized
propensity scores (GPS). The treatment variable appears on the left-hand
side, and covariates on the right-hand side. Interactions can be specified
using |
ordinal_treat |
An atomic vector defining the ordered levels of the
treatment variable. This confirms the variable is ordinal and adjusts its
levels accordingly using
|
n_iter |
Integer. Number of unique parameter combinations to evaluate
during optimization. Higher values generally yield better results but
increase computation time. For large datasets or high-dimensional parameter
spaces, increasing |
n_cores |
Integer. Number of CPU cores to use for parallel execution. If
set to a value greater than 1, a parallel backend is registered using
|
opt_args |
An object of class |
Details
The output is an S3 object of class best_opt_result. Its core component is a
data.frame containing the parameter settings for the best-performing models,
grouped and ranked based on their balance quality.
Optimization results are categorized into seven bins based on the maximum standardized mean difference (SMD):
0.00-0.05
0.05-0.10
0.10-0.15
0.15-0.20
0.20-0.25
0.25-0.30
Greater than 0.30
Within each SMD group, the parameter combination(s) achieving the highest
perc_matched (i.e., percentage of matched samples) are selected. In cases
where multiple combinations yield identical smd and perc_matched, all such
results are retained. Combinations for which matching failed or GPS
estimation did not converge have NA in the result columns (e.g.,
perc_matched, smd).
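The binning and selection rule described above can be sketched in base R. This is a minimal illustration on an invented results table, not the package's implementation; in particular, the treatment of values that fall exactly on a bin boundary (right-closed intervals here) is an assumption.

```r
# Toy results table; all values are made up for illustration only.
toy <- data.frame(
  iter_ID      = 1:8,
  smd          = c(0.03, 0.04, 0.04, 0.12, 0.08, 0.31, NA, 0.12),
  perc_matched = c(80, 92, 92, 70, 65, 99, NA, 75)
)

# Seven SMD bins: six of width 0.05 plus an open-ended top bin.
breaks <- c(0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, Inf)
labels <- c("0.00-0.05", "0.05-0.10", "0.10-0.15", "0.15-0.20",
            "0.20-0.25", "0.25-0.30", ">0.30")

# Drop failed runs (NA results) before ranking.
toy_ok <- toy[!is.na(toy$smd) & !is.na(toy$perc_matched), ]
toy_ok$smd_group <- cut(toy_ok$smd, breaks = breaks, labels = labels,
                        include.lowest = TRUE)

# Within each bin, keep every row that attains the bin's maximum
# perc_matched, so that ties are retained.
best_in_bin <- ave(toy_ok$perc_matched, as.character(toy_ok$smd_group),
                   FUN = max)
best <- toy_ok[toy_ok$perc_matched == best_in_bin, ]
best <- best[order(best$smd_group, best$iter_ID), ]
print(best)
```

In this toy table, rows 2 and 3 tie within the lowest bin and are both kept, while row 7 is excluded because its results are NA.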
The returned data.frame includes the following columns (depending on the
number of treatment levels):
- iter_ID: Unique identifier for each parameter combination
- method_match: Matching method used in match_gps(), e.g., "nnm" or "fullopt"
- caliper: Caliper value used in match_gps()
- order: Ordering of GPS scores prior to matching
- kmeans_cluster: Number of k-means clusters used
- replace: Whether replacement was used in matching (nnm only)
- ties: Tie-breaking rule in nearest-neighbor matching (nnm only)
- ratio: Control-to-treated ratio for nnm
- min_controls, max_controls: Minimum and maximum controls for fullopt
- reference: Reference group used in both estimate_gps() and match_gps()
- perc_matched: Percentage of matched samples (from balqual())
- smd: Maximum standardized mean difference (from balqual())
- p_{group_name}: Percent matched per treatment group (based on group sample size)
- method_gps: GPS estimation method used (from estimate_gps())
- link: Link function used in GPS model
- smd_group: SMD range category for the row
The resulting best_opt_result object also includes a custom print() method
that summarizes:
The number of optimal parameter sets per SMD group
Their associated SMD and match rates
Total combinations tested
Total runtime of the optimization loop
Value
An S3 object of class best_opt_result. The core component is a data.frame
containing the parameter combinations and results of the optimization
procedure. You can access it using attr(result, "opt_results") or by calling
View(result), where result is your best_opt_result object.
The object contains the following custom attributes:
- opt_results: A data.frame of optimization results. Each row corresponds to a unique parameter combination tested. For a complete description of columns, see the Details section.
- optimization_time: Time (in seconds) taken by the optimization loop (i.e., the core for-loop that evaluates combinations). This does not include the time needed for GPS estimation, pre-processing, or merging of results after loop completion. On large datasets, these excluded steps can still be substantial.
- combinations_tested: Total number of unique parameter combinations evaluated during optimization.
- smd_results: A detailed table of standardized mean differences (SMDs) for all pairwise treatment group comparisons and for all covariates specified in the formula. This is used by the select_opt() function to filter optimal models based on covariate-level balance across groups.
- treat_names: A character vector with the names of the unique treatment groups.
- model_covs: A character vector listing the model covariates (main effects and interactions) used in the formula. These names correspond to the variables shown in the smd_results table.
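The attribute-based access pattern can be tried without running an optimization. The sketch below builds a mock object with base R's structure(); it only stands in for a real best_opt_result, its underlying type (an empty list) is an assumption, and every value in it is invented for illustration.

```r
# Mock object standing in for the return value of optimize_gps().
# The list() base type and all attribute values are assumptions.
result <- structure(
  list(),
  class = "best_opt_result",
  opt_results = data.frame(
    iter_ID      = 1:2,
    smd          = c(0.12, 0.04),
    perc_matched = c(75, 92)
  ),
  optimization_time   = 12.5,
  combinations_tested = 2L,
  treat_names         = c("a", "b", "c")
)

# Pull out the results table and sort it by balance quality.
opt <- attr(result, "opt_results")
opt_sorted <- opt[order(opt$smd), ]
print(opt_sorted)

# Scalar attributes are retrieved the same way.
attr(result, "combinations_tested")
attr(result, "optimization_time")
```

With a real result, the same attr() calls apply to the other attributes listed above, such as smd_results and model_covs.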
Examples
# Define formula for GPS estimation and matching
formula_cancer <- formula(status ~ age * sex)
# Set up the optimization parameter space
opt_args <- make_opt_args(cancer, formula_cancer, gps_method = "m1")
# Run optimization with 2000 random parameter sets and a fixed seed
## Not run:
withr::with_seed(
  8252,
  {
    optimize_gps(
      data = cancer,
      formula = formula_cancer,
      opt_args = opt_args,
      n_iter = 2000
    )
  }
)
## End(Not run)