auxsurvey {AuxSurvey}    R Documentation

Auxiliary Variables in Survey Analysis

Description

This function provides a user-friendly interface for various estimators in survey analysis when working with discretized auxiliary variables. Probability surveys often use continuous data from administrative records as auxiliary variables, but the utility of this data is diminished when discretized for confidentiality purposes. This package offers different estimators that handle discretized auxiliary variables effectively.

Usage

auxsurvey(
  formula,
  auxiliary = NULL,
  samples,
  population = NULL,
  subset = NULL,
  family = gaussian(),
  method = c("sample_mean", "rake", "postStratify", "MRP", "GAMP", "linear", "BART"),
  weights = NULL,
  levels = c(0.95, 0.8, 0.5),
  stan_verbose = TRUE,
  show_plot = TRUE,
  nskip = 1000,
  npost = 1000,
  nchain = 4,
  HPD_interval = FALSE,
  seed = NULL
)

Arguments

formula

A string or formula specifying the outcome model. For non-model-based methods (e.g., sample mean, raking, post-stratification), only include the outcome variable (e.g., "~Y"). For model-based methods (e.g., MRP, GAMP, linear regression), additional fixed effect predictors can be specified, such as "Y ~ X1 + X2 + I(X^2)". For GAMP, smooth functions can be specified as "Y ~ X1 + s(X2, 10) + s(X3, by = X1)". Categorical variables are automatically treated as dummy variables in model-based methods.

auxiliary

A string specifying the formula for the auxiliary variables. For sample mean and BART, this should be NULL. For raking, post-stratification, and GAMP, this should be an additive model (e.g., "Z1 + Z2 + Z3"). For MRP, specify random effects for terms in this parameter, such as "Z1 + Z2 + Z3" or "Z1 + Z2:Z3".
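To make the "additive" specification for raking more concrete: an additive set of auxiliary variables means that weights are adjusted to match each variable's population margin separately, not the joint cells. The base R sketch below shows this with iterative proportional fitting on toy data. The margins, the data, and the helper match_margin are hypothetical, and this is only an illustration of the idea, not the package's implementation.

```r
# Toy sample with two categorical auxiliary variables.
samples <- data.frame(
  Z1 = c("a", "a", "a", "b", "b", "b", "b", "b"),
  Z2 = c("x", "y", "y", "x", "x", "x", "y", "y"),
  stringsAsFactors = FALSE
)

# Hypothetical known population totals for each margin.
target_Z1 <- c(a = 500, b = 500)
target_Z2 <- c(x = 300, y = 700)

# Hypothetical helper: rescale weights so that the weighted totals of one
# variable match its population margin.
match_margin <- function(w, groups, target) {
  current <- tapply(w, groups, sum)
  ratio <- target[names(current)] / as.vector(current)
  w * unname(ratio[groups])
}

# Start from equal weights and alternate between the two margins.
w <- rep(1, nrow(samples))
for (iteration in 1:50) {
  w <- match_margin(w, samples$Z1, target_Z1)
  w <- match_margin(w, samples$Z2, target_Z2)
}

# Weighted totals after raking, one per level of each variable.
raked_Z1 <- tapply(w, samples$Z1, sum)
raked_Z2 <- tapply(w, samples$Z2, sum)
```

After the loop, raked_Z1 and raked_Z2 should agree with the target margins up to numerical tolerance, while the joint Z1 by Z2 cells are left to whatever the sample's association implies.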

samples

A dataframe or tibble containing all variables specified in formula and auxiliary. This is typically a subset of the population.

population

A dataframe or tibble containing all variables specified in formula and auxiliary. This is the entire population used for estimation.

subset

A character vector representing filtering conditions to select subsets of samples and population. Default is NULL, in which case the analysis is performed on the entire dataset. If subsets are specified, estimates for both the whole data and the subsets will be calculated.
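As a rough illustration of what a filtering condition such as "Z1 == 1 & Z2 == 1" selects, the base R sketch below evaluates each condition string against a toy data frame and computes a mean for the whole data and for each subset. The data frame toy and the helper rows_matching are hypothetical, and this is not necessarily how auxsurvey evaluates subset internally.

```r
# Toy data; column names mirror those used in the Examples section.
toy <- data.frame(
  Z1 = c(1, 1, 0, 1, 0),
  Z2 = c(1, 0, 1, 1, 0),
  Y1 = c(2.0, 3.5, 1.0, 4.0, 2.5)
)

# Hypothetical helper: evaluate one condition string inside the data frame
# and return a logical vector marking the rows that satisfy it.
rows_matching <- function(condition, data) {
  eval(parse(text = condition), envir = data)
}

conditions <- c("Z1 == 1", "Z1 == 1 & Z2 == 1")

# Mean of Y1 over the whole data, then within each subset.
whole_mean <- mean(toy$Y1)
subset_means <- sapply(conditions, function(condition) {
  mean(toy$Y1[rows_matching(condition, toy)])
})
```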

family

The distribution family of the outcome variable. Supported options are: gaussian for continuous outcomes and binomial for binary outcomes.

method

A string specifying the model to use. Options include "sample_mean", "rake", "postStratify", "MRP", "GAMP", "linear", and "BART".

weights

A numeric vector of case weights. The length should match the number of cases in samples.
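A minimal sketch of how case weights change a mean estimate, assuming inverse probability weights built from known inclusion probabilities (as in the Examples section, where ipw = 1 / samples$true_pi). The toy values below are made up for illustration.

```r
# Toy sample: outcome values and (assumed known) inclusion probabilities.
y <- c(10, 12, 20, 22)
inclusion_prob <- c(0.5, 0.5, 0.1, 0.1)

# Inverse probability weights, one per sampled case.
ipw <- 1 / inclusion_prob
stopifnot(length(ipw) == length(y))

# Unweighted mean treats every case equally.
unweighted_mean <- mean(y)

# Normalized weighted mean: sum(w * y) / sum(w).
weighted_mean <- sum(ipw * y) / sum(ipw)
```

Cases with a small inclusion probability stand in for more population units, so they pull the weighted mean toward their values.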

levels

A numeric vector specifying the confidence levels for the confidence intervals (CIs). Multiple values can be specified to calculate multiple CIs.
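To show how several levels translate into several intervals, the sketch below builds one interval per level around a sample mean using a normal approximation. The normal approximation and the toy data are assumptions made for illustration; the package may construct its intervals differently depending on the method.

```r
# Toy outcome values.
y <- c(4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.7)

estimate <- mean(y)
std_error <- sd(y) / sqrt(length(y))

levels <- c(0.95, 0.8, 0.5)

# Two-sided normal critical value for each level.
z <- qnorm(1 - (1 - levels) / 2)

# One row per level: lower and upper bounds.
intervals <- cbind(
  lower = estimate - z * std_error,
  upper = estimate + z * std_error
)
rownames(intervals) <- paste0(levels * 100, "%")
```

Higher levels give wider intervals, and every interval is centered on the same point estimate.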

stan_verbose

A logical scalar; if TRUE, prints all messages when running Stan models. Default is TRUE, as shown in Usage. This parameter only applies to Bayesian models.

show_plot

A logical scalar; if TRUE, shows diagnostic plots for Stan models. Default is TRUE, as shown in Usage. This parameter only applies to Bayesian models.

nskip

An integer specifying the number of burn-in iterations for each chain in MCMC for Stan models. Default is 1000. This parameter only applies to Bayesian models.

npost

An integer specifying the number of posterior sampling iterations for each chain in MCMC for Stan models. Default is 1000. This parameter only applies to Bayesian models.

nchain

An integer specifying the number of MCMC chains for Stan models. Default is 4. This parameter only applies to Bayesian models.

HPD_interval

A logical scalar; if TRUE, calculates the highest posterior density (HPD) intervals for the CIs of Stan models. Default is FALSE, in which case symmetric intervals are calculated. This parameter only applies to Bayesian models.
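The difference between the two interval types can be sketched from a vector of posterior draws. The example below assumes that "symmetric" means an equal-tailed quantile interval and computes the HPD interval as the shortest window containing the requested share of sorted draws. The helper hpd_interval and the simulated draws are hypothetical stand-ins, not the package's code.

```r
set.seed(42)
# Stand-in for posterior draws from a right-skewed posterior.
draws <- rgamma(4000, shape = 2, rate = 1)
level <- 0.95

# Equal-tailed interval: (1 - level) / 2 probability in each tail.
symmetric <- quantile(
  draws,
  probs = c((1 - level) / 2, 1 - (1 - level) / 2),
  names = FALSE
)

# Hypothetical helper: shortest interval covering `level` of the draws.
hpd_interval <- function(x, level) {
  x <- sort(x)
  n <- length(x)
  k <- ceiling(level * n)            # number of draws the interval must cover
  starts <- seq_len(n - k + 1)       # candidate left endpoints
  widths <- x[starts + k - 1] - x[starts]
  best <- which.min(widths)
  c(x[best], x[best + k - 1])
}

hpd <- hpd_interval(draws, level)
```

For a skewed posterior such as this one, the HPD interval is never wider than the equal-tailed interval and is typically shifted toward the mode.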

seed

An integer specifying the random seed for reproducibility. Default is NULL.

Details

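The available estimators, selected through method, are the sample mean ("sample_mean"), raking ("rake"), post-stratification ("postStratify"), MRP ("MRP"), GAMP ("GAMP"), linear regression ("linear"), and BART ("BART").

The Bayesian models among these are implemented using the rstan and rstanarm packages.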
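As a worked illustration of one of these estimators, the base R sketch below computes a post-stratified mean by hand: cell means from the sample are combined using cell shares from the population. The toy data and variable names are hypothetical, and the sketch only shows the idea, not the package's implementation.

```r
# Toy population: only the auxiliary variable is needed for cell shares.
population <- data.frame(Z1 = c(rep(0, 60), rep(1, 40)))

# Toy sample that over-represents the Z1 == 1 cell.
samples <- data.frame(
  Z1 = c(0, 0, 1, 1, 1, 1),
  Y  = c(1, 3, 10, 12, 14, 12)
)

# Population share of each cell.
pop_share <- prop.table(table(population$Z1))

# Sample mean of the outcome within each cell.
cell_means <- tapply(samples$Y, samples$Z1, mean)

# Align shares with cell means by label, then combine.
ps_estimate <- sum(
  as.numeric(pop_share[names(cell_means)]) * as.numeric(cell_means)
)

# Unadjusted sample mean, for comparison.
naive_mean <- mean(samples$Y)
```

Because the sample holds proportionally more Z1 == 1 cases than the population does, the unadjusted mean is pulled upward, while the post-stratified estimate reweights each cell to its population share.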

Value

A list containing the sample mean estimates and CIs for the subset and/or the whole dataset. Each element in the list includes:

- estimate: The point estimate of the sample mean.
- CI: Confidence intervals for the sample mean.
- Other elements for each confidence level specified in levels.

Examples


## Simulate data with nonlinear association (setting 3).
data = simulate(N = 3000, discretize = 10, setting = 3, seed = 123)
population = data$population
samples = data$samples
ipw = 1 / samples$true_pi
true_mean = mean(population$Y1)

## IPW Sample Mean
IPW_sample_mean = auxsurvey("~Y1", auxiliary = NULL, weights = ipw,
                            samples = samples, population = population,
                            subset = c("Z1 == 1 & Z2 == 1"), method = "sample_mean",
                            levels = 0.95)

## Raking
rake = auxsurvey("~Y1", auxiliary = "Z1 + Z2 + Z3 + auX_10", samples = samples,
                 population = population, subset = c("Z1 == 1", "Z1 == 1 & Z2 == 1"),
                 method = "rake", levels = 0.95)

## MRP
MRP = auxsurvey("Y1 ~ 1 + Z1", auxiliary = "Z2 + Z3:auX_10", samples = samples,
                population = population, subset = c("Z1 == 1", "Z1 == 1 & Z2 == 1"),
                method = "MRP", levels = 0.95, nskip = 4000, npost = 4000,
                nchain = 1, stan_verbose = FALSE, HPD_interval = TRUE)

## GAMP
GAMP = auxsurvey("Y1 ~ 1 + Z1 + Z2 + Z3", auxiliary = "s(auX_10) + s(logit_true_pi, by = Z1)",
                 samples = samples, population = population, method = "GAMP",
                 levels = 0.95, nskip = 4000, npost = 4000, nchain = 1,
                 stan_verbose = FALSE, HPD_interval = TRUE)

## BART
BART = auxsurvey("Y1 ~ Z1 + Z2 + Z3 + auX_10", auxiliary = NULL, samples = samples,
                 population = population, method = "BART", levels = 0.95,
                 nskip = 4000, npost = 4000, nchain = 1, HPD_interval = TRUE)




[Package AuxSurvey version 1.0 Index]