ma_projection {sae.projection}    R Documentation
Model-Assisted Projection Estimator
Description
The function addresses the problem of combining information from two or more independent surveys, a common challenge in survey sampling. It focuses on cases where:

- Survey 1: A large sample collects only auxiliary information.
- Survey 2: A much smaller sample collects both the variables of interest and the auxiliary variables.

The function implements a model-assisted projection estimation method based on a working model. Supported working models include several machine-learning models, listed in the Details section; a sketch of the expected data layout follows below.
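To make the two-survey setup concrete, the sketch below builds toy samples in the shape the function expects (a hypothetical illustration; df_small, df_large, and all column names are placeholders, not package data):

# Hypothetical small sample (data_model): target y plus auxiliaries
set.seed(1)
df_small <- data.frame(
  y = rnorm(100),
  x1 = runif(100), x2 = runif(100),
  dom = sample(c("A", "B"), 100, replace = TRUE),
  PSU = rep(1:20, each = 5),
  WT = runif(100, 1, 5)
)
# Hypothetical large sample (data_proj): auxiliaries only, no y
df_large <- data.frame(
  x1 = runif(1000), x2 = runif(1000),
  dom = sample(c("A", "B"), 1000, replace = TRUE),
  PSU = rep(1:200, each = 5),
  WT = runif(1000, 1, 5)
)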
Usage
ma_projection(
formula,
cluster_ids,
weight,
strata = NULL,
domain,
summary_function = "mean",
working_model,
data_model,
data_proj,
model_metric,
cv_folds = 3,
tuning_grid = 10,
parallel_over = "resamples",
seed = 1,
return_yhat = FALSE,
...
)
Arguments
formula
A model formula. All variables used must exist in both data_model and data_proj.
cluster_ids
Column name (character) or formula specifying cluster identifiers from highest to lowest level. Use ~0 or ~1 if there are no clusters.
weight
Column name in data_model representing the survey weights.
strata
Column name for stratification; use NULL if the design is not stratified.
domain
Character vector specifying domain variable names in both datasets.
summary_function
A function to compute domain-level estimates (default: "mean").
working_model
A parsnip model object specifying the working model (see the Details section for supported models).
data_model
Data frame (small sample) containing both target and auxiliary variables.
data_proj
Data frame (large sample) containing only auxiliary variables.
model_metric
A yardstick::metric_set() used to evaluate and select the tuned working model; if NULL, a default metric set is used.
cv_folds
Number of folds for k-fold cross-validation.
tuning_grid
Either a data frame with tuning parameters or a positive integer specifying the number of grid search candidates.
parallel_over
Specifies the parallelization mode: "resamples", "everything", or NULL (see tune::tune_grid()).
seed
Integer seed for reproducibility.
return_yhat
Logical; if TRUE, the predicted values of y for data_model are also returned.
...
Additional arguments passed to survey::svydesign().
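For instance, a metric set for model_metric can be built with yardstick (a minimal sketch assuming a regression working model; rmse is one of several metrics yardstick offers):

library(yardstick)
reg_metric <- metric_set(rmse)
# then pass model_metric = reg_metric to ma_projection()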
Details
The following working models are supported via the parsnip interface:
- linear_reg() – Linear regression
- logistic_reg() – Logistic regression
- linear_reg(engine = "stan") – Bayesian linear regression
- logistic_reg(engine = "stan") – Bayesian logistic regression
- poisson_reg() – Poisson regression
- decision_tree() – Decision tree
- nearest_neighbor() – k-Nearest Neighbors (k-NN)
- naive_bayes() – Naive Bayes classifier
- mlp() – Multi-layer perceptron (neural network)
- svm_linear() – Support vector machine with linear kernel
- svm_poly() – Support vector machine with polynomial kernel
- svm_rbf() – Support vector machine with radial basis function (RBF) kernel
- bag_tree() – Bagged decision tree
- bart() – Bayesian Additive Regression Trees (BART)
- rand_forest(engine = "ranger") – Random forest (via ranger)
- rand_forest(engine = "aorsf") – Accelerated oblique random forest (AORF; Jaeger et al. 2022, 2024)
- boost_tree(engine = "lightgbm") – Gradient boosting (LightGBM)
- boost_tree(engine = "xgboost") – Gradient boosting (XGBoost)
For a complete list of supported models and engines, see Tidy Modeling With R.
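As an illustration, a tunable random forest working model from this list can be declared with parsnip (a sketch; the choice of tuned parameters is arbitrary):

library(parsnip)
library(tune)
rf_model <- rand_forest(
  mtry = tune(), trees = 500, min_n = tune(),
  engine = "ranger", mode = "regression"
)
# rf_model can then be supplied as working_model = rf_model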
Value
A list containing:

- model – The fitted working model object.
- prediction – A vector of predictions from the working model.
- df_result – A data frame with:
  - domain – Domain identifier.
  - ypr – Projection estimator results for each domain.
  - var_ypr – Estimated variance of the projection estimator.
  - rse_ypr – Relative standard error (in %).
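These components are plain list elements; for example, with a fitted object such as lm_proj from the Examples below:

lm_proj$df_result          # data frame: domain, ypr, var_ypr, rse_ypr
head(lm_proj$prediction)   # working-model predictions
lm_proj$model              # fitted working model object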
References
Kim, J. K., & Rao, J. N. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85-100.
Examples
## Not run:
library(sae.projection)
library(dplyr)
library(bonsai)
df_svy22_income <- df_svy22 %>% filter(!is.na(income))
df_svy23_income <- df_svy23 %>% filter(!is.na(income))
# Linear regression
lm_proj <- ma_projection(
income ~ age + sex + edu + disability,
cluster_ids = "PSU", weight = "WEIGHT", strata = "STRATA",
domain = c("PROV", "REGENCY"),
working_model = linear_reg(),
data_model = df_svy22_income,
data_proj = df_svy23_income,
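# nest = TRUE is passed on through ... to survey::svydesign()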
nest = TRUE
)
df_svy22_neet <- df_svy22 %>% filter(between(age, 15, 24))
df_svy23_neet <- df_svy23 %>% filter(between(age, 15, 24))
# LightGBM regression with hyperparameter tuning
show_engines("boost_tree")
lgbm_model <- boost_tree(
mtry = tune(), trees = tune(), min_n = tune(),
tree_depth = tune(), learn_rate = tune(),
engine = "lightgbm"
)
lgbm_proj <- ma_projection(
formula = neet ~ sex + edu + disability,
cluster_ids = "PSU",
weight = "WEIGHT",
strata = "STRATA",
domain = c("PROV", "REGENCY"),
working_model = lgbm_model,
data_model = df_svy22_neet,
data_proj = df_svy23_neet,
cv_folds = 3,
tuning_grid = 3,
nest = TRUE
)
## End(Not run)