generateSampleDataCat {VICatMix}    R Documentation

generateSampleDataCat

Description

Generates sample clustered categorical data with cluster labels. For each clustering variable, the probability of a '1' (more generally, of each category) in each cluster is drawn at random from a Dirichlet (1, ..., cat) distribution, where cat is the number of categories for each variable. For noisy variables, the probabilities are also drawn from a Dirichlet (1, ..., cat) distribution, but they are the same regardless of the cluster membership of the observation. An outcome variable associated with the clustering structure can optionally be generated with a different number of categories (ycat), with probabilities also drawn from a Dirichlet distribution. The package 'gtools' must be installed to use this function.
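
The generative scheme described above can be sketched in base R. This is a minimal illustration, not the package's source: the helper names rdirichlet_one and sketch_generate_cat are hypothetical, the Dirichlet draw uses normalised gamma variates in place of 'gtools', categories are assumed to be coded 1, ..., cat, and the concentration vector is assumed to be seq_len(cat), which is one literal reading of "Dirichlet (1, ..., cat)".

```r
# Hypothetical helper: one Dirichlet draw via normalised gamma variates
# (a base-R stand-in for a Dirichlet sampler such as the one in 'gtools').
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

# Hypothetical sketch of the generative scheme; NOT the package's source.
# Assumptions: categories are coded 1..cat and alpha = seq_len(cat).
sketch_generate_cat <- function(n, K, w, p, Irrp, cat = 2,
                                alpha = seq_len(cat)) {
  # Draw a cluster label for every observation using the mixture weights.
  labels <- sample(seq_len(K), n, replace = TRUE, prob = w)
  data <- matrix(NA_integer_, nrow = n, ncol = p + Irrp)

  # Clustering variables: a separate probability vector for each cluster.
  for (j in seq_len(p)) {
    for (k in seq_len(K)) {
      idx <- which(labels == k)
      probs <- rdirichlet_one(alpha)
      data[idx, j] <- sample(seq_len(cat), length(idx),
                             replace = TRUE, prob = probs)
    }
  }

  # Noisy variables: one probability vector shared by all clusters.
  for (j in seq_len(Irrp)) {
    probs <- rdirichlet_one(alpha)
    data[, p + j] <- sample(seq_len(cat), n, replace = TRUE, prob = probs)
  }

  list(data = data, trueClusters = labels)
}

set.seed(1)
sim <- sketch_generate_cat(n = 200, K = 3, w = c(0.2, 0.3, 0.5),
                           p = 5, Irrp = 2, cat = 3)
print(dim(sim$data))
```

In this sketch the only difference between a clustering variable and a noisy one is whether the probability vector is redrawn per cluster or shared across clusters.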

Usage

generateSampleDataCat(n, K, w, p, Irrp, yout = FALSE, cat = 2, ycat = 2)

Arguments

n

Number of observations in dataset.

K

Number of clusters desired.

w

A vector of mixture weights (proportion of population in each cluster).

p

Number of clustering variables/covariates in dataset.

Irrp

Number of irrelevant/noisy variables/covariates in dataset. Note that these variables will be the final Irrp columns in the simulated dataset. Total data dimension is p + Irrp.

yout

Logical. Indicates whether an outcome variable associated with the clustering is required. Default is FALSE.

cat

Number of categories in each covariate. Default is 2.

ycat

Number of categories for the outcome variable. Default is 2.
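
Because the noisy variables occupy the final Irrp columns, the two groups of columns can be separated by index. A small sketch, using a toy matrix as a stand-in for the simulated data (the names toy, clusteringCols, and noisyCols are illustrative only):

```r
p <- 4     # number of clustering variables
Irrp <- 2  # number of noisy variables

# Toy stand-in for the simulated data matrix: 10 rows, p + Irrp columns.
set.seed(1)
toy <- matrix(sample(1:3, 10 * (p + Irrp), replace = TRUE), nrow = 10)

clusteringCols <- seq_len(p)    # first p columns
noisyCols <- p + seq_len(Irrp)  # final Irrp columns

# drop = FALSE keeps the result a matrix even for a single column.
relevant <- toy[, clusteringCols, drop = FALSE]
noise <- toy[, noisyCols, drop = FALSE]
```

Using seq_len() rather than 1:Irrp keeps the indexing correct when Irrp is 0, in which case noise is a matrix with zero columns.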

Value

A list with the following components:

data

A matrix consisting of the simulated data.

trueClusters

A vector with the simulated cluster assignments.

outcome

If yout = TRUE, this will be a vector with the outcome variable.

Examples

# Simulate 1000 observations in 4 clusters, with 100 three-category
# clustering variables and no noisy variables
generatedData <- generateSampleDataCat(1000, 4, c(0.1, 0.2, 0.3, 0.4), 100, 0, cat = 3)
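
A further sketch requests an outcome variable and adds noisy variables, then inspects the returned components. It uses only the arguments and list components documented above. The call is guarded so that it runs only when 'VICatMix' and 'gtools' are installed, and the seed and argument values are arbitrary choices for illustration.

```r
sim <- NULL
if (requireNamespace("VICatMix", quietly = TRUE) &&
    requireNamespace("gtools", quietly = TRUE)) {
  set.seed(42)  # arbitrary seed, for reproducibility only
  sim <- VICatMix::generateSampleDataCat(500, 3, c(0.2, 0.3, 0.5), 20, 5,
                                         yout = TRUE, cat = 3, ycat = 2)
}

if (!is.null(sim)) {
  print(dim(sim$data))                         # observations by p + Irrp columns
  print(table(sim$trueClusters))               # simulated cluster sizes
  print(table(sim$trueClusters, sim$outcome))  # outcome against cluster
}
```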


[Package VICatMix version 1.0 Index]