projection_randomforest {sae.projection}    R Documentation
Projection Estimator with Random Forest Algorithm
Description
According to Kim and Rao (2012), synthetic data obtained through the model-assisted projection method can provide a useful tool for efficient domain estimation when the sample in survey B is much larger than the sample in survey A.
The function projects estimated values from a small survey (survey A) onto an independent large survey (survey B) using the random forest classification algorithm.
The two surveys are statistically independent, but the projection relies on shared auxiliary variables.
The process includes data preprocessing, feature selection, model training, and domain-specific estimation following the "two stages, one phase" survey design principle.
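The projection idea described above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's implementation: it swaps the random forest for a logistic regression (glm) to stay dependency-free, averages predicted probabilities rather than predicted classes, runs on simulated data, and uses hypothetical names throughout (survey_a, survey_b, x1, y, w, province).

```r
set.seed(1)

# Small survey A: observes both the auxiliary variable x1 and the target y
n_a <- 200
survey_a <- data.frame(x1 = rnorm(n_a))
survey_a$y <- rbinom(n_a, 1, plogis(0.5 + survey_a$x1))

# Large survey B: observes x1, weights, and domains, but not y
n_b <- 2000
survey_b <- data.frame(
  x1       = rnorm(n_b),
  w        = runif(n_b, 1, 3),
  province = sample(c("P1", "P2", "P3"), n_b, replace = TRUE)
)

# Step 1: fit a working model on survey A (stand-in for the random forest)
fit <- glm(y ~ x1, family = binomial, data = survey_a)

# Step 2: project the model onto survey B to obtain synthetic values
survey_b$y_hat <- predict(fit, newdata = survey_b, type = "response")

# Step 3: weighted domain estimates computed from the synthetic values
num <- tapply(survey_b$w * survey_b$y_hat, survey_b$province, sum)
den <- tapply(survey_b$w, survey_b$province, sum)
domain_est <- num / den
print(domain_est)
```

The sketch ignores the sampling design when computing variances; a design-based variance would additionally need the psu, ssu, and strata information.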
The function automatically selects standard or bias-corrected estimation based on the bias_correction parameter. bias_correction = TRUE can only be used if psu, ssu, and strata are present in data_model. If they are not, the function automatically falls back to bias_correction = FALSE.
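The fallback rule for bias_correction can be illustrated as follows. This is only a sketch of the documented behaviour: resolve_bias_correction is a hypothetical helper written for illustration, not a function exported by the package, and the default column names are assumptions.

```r
# Hypothetical helper: returns the bias_correction setting that is
# actually usable given the columns present in data_model.
resolve_bias_correction <- function(data_model, bias_correction,
                                    design_cols = c("psu", "ssu", "strata")) {
  has_design <- all(design_cols %in% names(data_model))
  if (bias_correction && !has_design) {
    message("Design columns missing from data_model; using bias_correction = FALSE")
    return(FALSE)
  }
  bias_correction
}

with_design    <- data.frame(y = 1, psu = 1, ssu = 1, strata = 1)
without_design <- data.frame(y = 1)

resolve_bias_correction(with_design, TRUE)     # TRUE
resolve_bias_correction(without_design, TRUE)  # FALSE, with a message
```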
Usage
projection_randomforest(
  data_model,
  target_column,
  predictor_cols,
  data_proj,
  domain1,
  domain2,
  psu,
  ssu,
  strata,
  weights,
  split_ratio = 0.8,
  metric = "Accuracy",
  bias_correction = FALSE
)
Arguments
data_model
  The training dataset, consisting of auxiliary variables and the target variable.
target_column
  The name of the target column in the data_model.
predictor_cols
  A vector of predictor column names.
data_proj
  The data for projection (prediction), onto which the trained model is projected. It must contain the same auxiliary variables as the data_model.
domain1
  Domain variable for survey estimation (e.g., "province").
domain2
  Domain variable for survey estimation (e.g., "regency").
psu
  Primary sampling units, representing the structure of the sampling frame.
ssu
  Secondary sampling units, representing the structure of the sampling frame.
strata
  Stratification variable, ensuring that specific subgroups are represented.
weights
  Weights used for the direct estimation from data_model.
split_ratio
  Proportion of data used for training (default is 0.8, meaning 80 percent for training and 20 percent for validation).
metric
  The metric used for model evaluation (default is "Accuracy"; other options include "AUC", "F1", etc.).
bias_correction
  Logical; if TRUE, bias-corrected estimation is used; if FALSE (the default), standard estimation is used.
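To make the meaning of split_ratio concrete, the sketch below performs an 80/20 split with base R. It is an assumption-laden illustration: the documentation does not say how the function partitions the data internally, so a simple random split is used here, and data_model is a hypothetical toy data frame.

```r
set.seed(42)

# Hypothetical stand-in for data_model
data_model <- data.frame(id = 1:100, y = rbinom(100, 1, 0.5))
split_ratio <- 0.8

# Draw row indices for the training part; the rest is held out for validation
train_idx <- sample(seq_len(nrow(data_model)),
                    size = floor(split_ratio * nrow(data_model)))
train_set <- data_model[train_idx, ]
valid_set <- data_model[-train_idx, ]

nrow(train_set)  # 80
nrow(valid_set)  # 20
```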
Value
A list containing the following elements:
- model
  The trained Random Forest model.
- importance
  Feature importance showing which features contributed most to the model's predictions.
- train_accuracy
  Accuracy of the model on the training set.
- validation_accuracy
  Accuracy of the model on the validation set.
- validation_performance
  Confusion matrix for the validation set, showing performance metrics like accuracy, precision, recall, etc.
- data_proj
  The projection data with predicted values.
If bias_correction = FALSE:
- Domain1
  Estimations for Domain 1, including estimated values, variance, and relative standard error (RSE).
- Domain2
  Estimations for Domain 2, including estimated values, variance, and relative standard error (RSE).
If bias_correction = TRUE:
- Direct
  Direct estimations for Domain 1, including estimated values, variance, and relative standard error (RSE).
- Domain1_corrected_bias
  Bias-corrected estimations for Domain 1, including estimated values, variance, and relative standard error (RSE).
- Domain2_corrected_bias
  Bias-corrected estimations for Domain 2, including estimated values, variance, and relative standard error (RSE).
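The relative standard error is conventionally the standard error divided by the estimate, expressed in percent. Assuming that definition (the documentation does not spell out the formula), the relationship between the three reported quantities can be sketched as follows; the data frame and its column names are hypothetical.

```r
# Hypothetical domain-level output: estimates and their variances
domain_est <- data.frame(
  province = c("P1", "P2"),
  estimate = c(0.40, 0.25),
  variance = c(0.0004, 0.0009)
)

# RSE (%) = standard error / estimate * 100
domain_est$rse <- sqrt(domain_est$variance) / domain_est$estimate * 100
print(domain_est)
```

With these toy numbers the RSE is about 5 percent for P1 and 12 percent for P2; a lower RSE indicates a more precise domain estimate.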
References
Kim, J. K., & Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85-100.
Examples
library(survey)
library(caret)
library(dplyr)
data_A <- df_svy_A
data_B <- df_svy_B
x_predictors <- data_A %>% select(7:32) %>% names()
# If psu, ssu, and strata are not present in data_model, one alternative is
# to join them from survey B using the shared identifiers:
data_A <- data_A %>%
  left_join(
    data_B %>% select(psu, ssu, strata, no_sample, no_household),
    by = c("no_sample", "no_household"),
    multiple = "any"
  )
# Run projection_randomforest without bias correction
result_standard <- projection_randomforest(
  data_model = data_A,
  target_column = "Y",
  predictor_cols = x_predictors,
  data_proj = data_B,
  domain1 = "province",
  domain2 = "regency",
  psu = "psu",
  ssu = "ssu",
  strata = "strata",
  weights = "weight",
  metric = "Accuracy",
  bias_correction = FALSE
)
print(result_standard)
# Run projection_randomforest with bias correction
result_bias_corrected <- projection_randomforest(
  data_model = data_A,
  target_column = "Y",
  predictor_cols = x_predictors,
  data_proj = data_B,
  domain1 = "province",
  domain2 = "regency",
  psu = "psu",
  ssu = "ssu",
  strata = "strata",
  weights = "weight",
  metric = "Accuracy",
  bias_correction = TRUE
)
print(result_bias_corrected)