ma_projection {sae.projection}                R Documentation

Model-Assisted Projection Estimator

Description

The function addresses the problem of combining information from two or more independent surveys, a common challenge in survey sampling. It focuses on cases where one survey (the large sample, data_proj) observes only auxiliary variables, while a second, much smaller survey (data_model) observes both the variable of interest and the auxiliary variables.

The function implements a model-assisted projection estimator based on a working model. The working model can be any of several machine learning models available through the parsnip interface; see the Details section.
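
The general idea, sketched loosely here (notation is illustrative and follows the basic projection estimator of Kim and Rao, 2012, listed in References): the working model is fitted on data_model, the small sample in which the response is observed; predictions are then computed for every unit of data_proj, the large sample; and the domain-level summary is taken over those predictions using the data_proj survey weights. For a domain mean, for example,

  \hat{\bar{Y}}_d = \frac{\sum_{i \in s_1 \cap d} w_i \, \hat{m}(x_i)}{\sum_{i \in s_1 \cap d} w_i}

where \hat{m} is the fitted working model, s_1 denotes the large sample (data_proj), and w_i are its survey weights.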

Usage

ma_projection(
  formula,
  cluster_ids,
  weight,
  strata = NULL,
  domain,
  summary_function = "mean",
  working_model,
  data_model,
  data_proj,
  model_metric,
  cv_folds = 3,
  tuning_grid = 10,
  parallel_over = "resamples",
  seed = 1,
  return_yhat = FALSE,
  ...
)

Arguments

formula

A model formula. All variables used must exist in both data_model and data_proj.

cluster_ids

Column name (character) or formula specifying cluster identifiers from highest to lowest level. Use ~0 or ~1 if there are no clusters.

weight

Column name in data_proj representing the survey weights.

strata

Column name for stratification; use NULL if no strata are used.

domain

Character vector specifying domain variable names in both datasets.

summary_function

A character string specifying the domain-level summary to compute: "mean" (default), "total", or "variance".

working_model

A parsnip model object specifying the working model (see Details).

data_model

Data frame (small sample) containing both target and auxiliary variables.

data_proj

Data frame (large sample) containing only auxiliary variables.

model_metric

A yardstick::metric_set() function, or NULL to use default metrics.

cv_folds

Number of folds for k-fold cross-validation.

tuning_grid

Either a data frame of tuning-parameter candidates or a positive integer specifying the number of grid-search candidates (see the sketch at the end of this argument list).

parallel_over

Specifies parallelization mode: "resamples", "everything", or NULL. If "resamples", then tuning will be performed in parallel over resamples alone. Within each resample, the preprocessor (i.e. recipe or formula) is processed once, and is then reused across all models that need to be fit. If "everything", then tuning will be performed in parallel at two levels. An outer parallel loop will iterate over resamples. Additionally, an inner parallel loop will iterate over all unique combinations of preprocessor and model tuning parameters for that specific resample. This will result in the preprocessor being re-processed multiple times, but can be faster if that processing is extremely fast.

seed

Integer seed for reproducibility.

return_yhat

Logical; if TRUE, returns predicted y values for data_model.

...

Additional arguments passed to survey::svydesign().
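
A minimal sketch of how model_metric and tuning_grid can be supplied (the metric choices and grid values below are illustrative only, not defaults of the package):

  library(yardstick)

  # Custom evaluation metrics for model tuning
  my_metrics <- metric_set(rmse, mae)

  # A custom tuning grid; column names match the tune()-marked arguments
  my_grid <- expand.grid(trees = c(200, 500, 1000), min_n = c(5, 10, 20))

  # ma_projection(..., model_metric = my_metrics, tuning_grid = my_grid)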

Details

Working models are supplied as parsnip model specifications; the examples below use linear regression (linear_reg()) and gradient-boosted trees (boost_tree() with the "lightgbm" engine provided by the bonsai package). A brief sketch of typical specifications follows. For a complete list of supported models and engines, see Tidy Modeling with R (https://www.tmwr.org/).
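
As a rough illustration (the engines shown are examples only, not an authoritative list of what ma_projection supports), working models are ordinary parsnip specifications:

  library(parsnip)
  library(bonsai)  # registers the "lightgbm" engine for boost_tree()

  # A plain linear working model (default "lm" engine)
  wm_lm <- linear_reg()

  # A random forest with tunable hyperparameters
  wm_rf <- rand_forest(mtry = tune(), min_n = tune(),
                       engine = "ranger", mode = "regression")

  # Gradient boosting via LightGBM, as in the Examples below
  wm_lgbm <- boost_tree(trees = tune(), learn_rate = tune(),
                        engine = "lightgbm")

Any tune() placeholders are resolved by k-fold cross-validation, as controlled by cv_folds and tuning_grid.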

Value

A list containing the domain-level projection estimates (the summary specified by summary_function, computed for each domain) and, when return_yhat = TRUE, the predicted values of the response for data_model.

References

  1. Kim, J. K., & Rao, J. N. K. (2012). Combining data from two independent surveys: A model-assisted approach. Biometrika, 99(1), 85-100.

Examples

## Not run: 
library(sae.projection)
library(dplyr)
library(bonsai)

df_svy22_income <- df_svy22 %>% filter(!is.na(income))
df_svy23_income <- df_svy23 %>% filter(!is.na(income))

# Linear regression
lm_proj <- ma_projection(
  income ~ age + sex + edu + disability,
  cluster_ids = "PSU", weight = "WEIGHT", strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = linear_reg(),
  data_model = df_svy22_income,
  data_proj = df_svy23_income,
  nest = TRUE
)


df_svy22_neet <- df_svy22 %>% filter(between(age, 15, 24))
df_svy23_neet <- df_svy23 %>% filter(between(age, 15, 24))


# LightGBM regression with hyperparameter tuning
show_engines("boost_tree")
lgbm_model <- boost_tree(
  mtry = tune(), trees = tune(), min_n = tune(),
  tree_depth = tune(), learn_rate = tune(),
  engine = "lightgbm"
)

lgbm_proj <- ma_projection(
  formula = neet ~ sex + edu + disability,
  cluster_ids = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = lgbm_model,
  data_model = df_svy22_neet,
  data_proj = df_svy23_neet,
  cv_folds = 3,
  tuning_grid = 3,
  nest = TRUE
)

## End(Not run)

[Package sae.projection version 0.1.4 Index]