modgo {modgo}    R Documentation

MOck Data GeneratiOn

Description

modgo creates a mock dataset from a real one by using the rank-based inverse normal transformation. Data with perturbed characteristics can also be generated.

Usage

modgo(
  data,
  ties_method = "max",
  variables = colnames(data),
  bin_variables = NULL,
  categ_variables = NULL,
  count_variables = NULL,
  n_samples = nrow(data),
  sigma = NULL,
  nrep = 100,
  noise_mu = FALSE,
  pertr_vec = NULL,
  change_cov = NULL,
  change_amount = 0,
  seed = 1,
  thresh_var = NULL,
  thresh_force = FALSE,
  var_prop = NULL,
  var_infl = NULL,
  infl_cov_stable = FALSE,
  tol = 1e-06,
  stop_sim = FALSE,
  new_mean_sd = NULL,
  multi_sugg_prop = NULL,
  generalized_mode = FALSE,
  generalized_mode_model = NULL,
  generalized_mode_lmbds = NULL
)

Arguments

data

a data frame containing the data whose characteristics are to be mimicked during the data simulation.

ties_method

Method for handling equal values (ties) during the rank transformation. Acceptable input: "max", "average", "min". This parameter is passed by rbi_normal_transform to the parameter ties.method of rank.

variables

a vector naming the variables to transform. Default: colnames(data)

bin_variables

a character vector listing the binary variables.

categ_variables

a character vector listing the ordinal categorical variables.

count_variables

a character vector listing the count variables, which are a subcategory of the categorical variables. Count variables must also be included in the categ_variables vector. Count variables are treated differently when gldex is used to simulate them.

n_samples

Number of rows of each simulated data set. Default is the number of rows of data.

sigma

an N x N covariance matrix (N = number of variables) provided by the user to bypass the covariance matrix calculations.

nrep

number of repetitions, i.e. the number of simulated data sets to generate.

noise_mu

Logical value indicating whether noise is applied to the multivariate mean. Default: FALSE

pertr_vec

A named vector. The vector's names are the continuous variables that the user wants to perturb. The variance of the simulated data sets mimics the variance of the original data.

change_cov

change the covariance of a specific pair of variables.

change_amount

the amount of change in the covariance of a specific pair of variables.

seed

A numeric value specifying the random seed. If seed = NA, no random seed is set.

thresh_var

A data frame that contains the left and right thresholds of the specified variables (1st column: variable names, 2nd column: left thresholds, 3rd column: right thresholds).

thresh_force

A logical value indicating whether to force the thresholds when the proportion of samples that can surpass the threshold is less than 10%.

var_prop

A named vector that provides the proportion of value = 1 for a specific binary variable (the name of the vector element). This will be the proportion of that value in the simulated data sets. Note that this may increase execution time drastically.

var_infl

A named vector. The vector's names are the continuous variables that the user wants to perturb by increasing their variance.

infl_cov_stable

Logical value. If TRUE, the perturbation is applied to the original data set and the simulated values mimic the perturbed original data set; the covariance matrix used for simulation equals the original data's correlations. If FALSE, the perturbation is applied to the simulated data sets.

tol

A numeric value that sets the tolerance (relative to the largest variance) for numerical lack of positive-definiteness in Sigma.

stop_sim

A logical value indicating if the analysis should stop before simulation and produce only the correlation matrix

new_mean_sd

A matrix with two columns named "Mean" and "SD" in which the user specifies the desired means and standard deviations in the simulated data sets for specific continuous variables. The variables must be declared as row names of the matrix.

multi_sugg_prop

A named vector that provides the proportion of value = 1 for specific binary variables (the names of the vector elements). The proportion of that value in the simulated data sets will be close to the proportion given.

generalized_mode

A logical value indicating if generalized lambda/poisson distributions or set up thresholds will be used to generate the simulated values

generalized_mode_model

A matrix that contains two columns named "Variable" and "Model". This matrix can be used only if a generalized_mode_model argument is provided. It specifies which model should be used for each variable. Model values should be "rmfmkl", "rprs", "star" or a combination of them, e.g. "rmfmkl-rprs" or "star-star", in case the user wants a bimodal simulation. The user can select the Generalized Poisson model for Poisson variables, but this model cannot be included in a bimodal simulation.

generalized_mode_lmbds

A matrix that contains the lambda values for each of the variables of the data set, to be used for the Generalized Lambda Distribution, the Generalized Poisson Distribution, or setting up thresholds.
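
As a minimal sketch, several of the structured arguments described above might be built in base R as shown here. The variable names Age and Cholesterol, the column names of thresh_var, and all numeric values are hypothetical placeholders chosen for illustration; only the layouts (column order, the "Mean"/"SD" and "Variable"/"Model" column names, named vectors) follow the argument descriptions. CAD is taken from the Examples section.

```r
# Hypothetical variable names and values; replace with columns of your own data.

# thresh_var: 1st column variable names, 2nd left thresholds, 3rd right thresholds.
thresh_var <- data.frame(
  Variable = c("Age", "Cholesterol"),
  Left     = c(30, 150),
  Right    = c(80, 400),
  stringsAsFactors = FALSE
)

# new_mean_sd: columns "Mean" and "SD", with the variables as row names.
new_mean_sd <- matrix(
  c(55, 9,
    240, 50),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Age", "Cholesterol"), c("Mean", "SD"))
)

# generalized_mode_model: character matrix with columns "Variable" and "Model".
generalized_mode_model <- cbind(
  Variable = c("Age", "Cholesterol"),
  Model    = c("rmfmkl", "rmfmkl-rprs")
)

# var_prop: named vector giving the target proportion of 1s for one binary variable.
var_prop <- c(CAD = 0.6)
```

These objects would then be passed to modgo() through the arguments of the same names.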

Details

Simulated data is generated based on available data. The simulated data mimics the characteristics of the original data. The algorithm used is based on the rank-based inverse normal transformation (Koliopanos et al. (2023)).
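
To illustrate the transformation itself, the following base R sketch applies one common form of the rank-based inverse normal transformation and shows how the tie-handling method changes the ranks. The helper rbi_sketch is hypothetical and is not the package's rbi_normal_transform; in particular, scaling the ranks by (n + 1) is an assumption, and the package may use a different offset.

```r
# Illustrative helper, not part of modgo.
# Assumption: ranks are mapped to probabilities as rank / (n + 1).
rbi_sketch <- function(x, ties_method = "max") {
  r <- rank(x, ties.method = ties_method)  # ranks, with ties resolved as requested
  qnorm(r / (length(x) + 1))               # map to standard normal quantiles
}

x <- c(3.1, 7.4, 7.4, 1.2, 9.8)

rank(x, ties.method = "max")      # 2 4 4 1 5: tied values both get the larger rank
rank(x, ties.method = "average")  # 2.0 3.5 3.5 1.0 5.0: tied values share the mean rank

z <- rbi_sketch(x)                # normal scores preserving the ordering of x
```

Because the transformation depends only on ranks, it is monotone: larger values of x never receive smaller normal scores, and tied values receive identical scores.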

Value

A list with the following components:

simulated_data

A list of data frames containing the simulated data.

original_data

A data frame with the input data.

correlations

a list of correlation matrices. The ith element is the correlation matrix for the ith simulated dataset. The (nrep + 1)th (last) element of the list is the average of the correlation matrices.

bin_variables

character vector listing the binary variables

categ_variables

a character vector listing the ordinal categorical variables

covariance_matrix

Covariance matrix used when generating observations from a multivariate normal distribution.

seed

Random seed used.

samples_produced

Number of rows of each simulated dataset.

sim_dataset_number

Number of simulated datasets produced.

Author(s)

Francisco M. Ojeda, George Koliopanos

References

Koliopanos, G., Ojeda, F. and Ziegler, A. (2023), “A simple-to-use R package for mimicking study data by simulations,” Methods Inf Med.

Examples

data("Cleveland", package = "modgo")
test_modgo <- modgo(
  data = Cleveland,
  bin_variables = c("CAD", "HighFastBloodSugar", "Sex", "ExInducedAngina"),
  categ_variables = c("Chestpaintype")
)

[Package modgo version 1.0.1 Index]