simulate_gmm {outlierMBC}    R Documentation

Simulate data from a Gaussian mixture model with outliers.

Description

Simulates data from a Gaussian mixture model, then simulates outliers from a hyper-rectangle, with a rejection step to ensure that the outliers are sufficiently unlikely under the model.

Usage

simulate_gmm(
  n,
  mu,
  sigma,
  outlier_num,
  seed = NULL,
  crit_val = 0.9999,
  range_multiplier = 1.5,
  verbose = TRUE,
  max_rejection = 1e+06
)

Arguments

n

Vector of component sizes.

mu

List of component mean vectors.

sigma

List of component covariance matrices.

outlier_num

Desired number of outliers.

seed

Seed for random number generation.

crit_val

Critical value used when rejecting Uniform samples (see Details).

range_multiplier

Factor by which the range of the Uniform samples exceeds the range of the Gaussian samples in each dimension.

verbose

Whether to print a message when a large number of outliers is being simulated. A large number suggests that many simulated outliers are being rejected and that the other arguments may need to be adjusted.

max_rejection

Maximum number of simulated outliers to be rejected.

Details

The simulated outliers are sampled from a Uniform distribution over a hyper-rectangle. For each dimension, the hyper-rectangle is centred at the midpoint between the maximum and minimum values for that variable from all of the Gaussian observations. Its width in that dimension is the distance between the minimum and maximum values for that variable multiplied by the value of range_multiplier. If range_multiplier = 1, then this hyper-rectangle is the axis-aligned minimum bounding box for all of the Gaussian data points in this data set.
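The hyper-rectangle construction described above can be sketched in base R. This is a minimal illustration, not the package's internal code: the matrix gauss is a hypothetical stand-in for the simulated Gaussian observations, and the names centre, width, lower, upper and proposals are introduced here for illustration only.

```r
# Hypothetical stand-in for the Gaussian observations (not package output).
set.seed(1)
gauss <- cbind(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 5, sd = 2))

range_multiplier <- 1.5
outlier_num <- 10

# Per-dimension minimum and maximum of the Gaussian data.
mins <- apply(gauss, 2, min)
maxs <- apply(gauss, 2, max)

# Centre at the midpoint; width is the observed range times range_multiplier.
centre <- (mins + maxs) / 2
width <- (maxs - mins) * range_multiplier

# Bounds of the hyper-rectangle in each dimension.
lower <- centre - width / 2
upper <- centre + width / 2

# Draw proposals uniformly within the bounds, one column per dimension.
proposals <- sapply(
  seq_along(lower),
  function(j) runif(outlier_num, min = lower[j], max = upper[j])
)
```

Setting range_multiplier to 1 in this sketch makes lower and upper coincide with mins and maxs, which is the minimum bounding box case mentioned above.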

The crit_val argument ensures that it would have been sufficiently unlikely for a simulated outlier to have been sampled from any of the Gaussian components. The Mahalanobis distances of a proposed outlier from each component's mean vector with respect to that component's covariance matrix are computed. If any of these Mahalanobis distances is smaller than the critical value of the appropriate Chi-squared distribution, then the proposed outlier is rejected. In summary, for a Uniform sample to be accepted, it must be sufficiently far from every component in terms of Mahalanobis distance.
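The rejection rule can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: it assumes that crit_val is a probability converted to a Chi-squared quantile with degrees of freedom equal to the number of variables, and that this quantile is compared with the squared Mahalanobis distance. The helper is_accepted is hypothetical and not part of outlierMBC.

```r
mu <- list(c(-1, 0), c(1, -1), c(1, 1))
sigma <- list(diag(c(0.2, 0.8)), diag(c(0.2, 0.2)), diag(c(0.2, 0.2)))
crit_val <- 0.9999

# Hypothetical helper, not part of outlierMBC.
is_accepted <- function(x, mu, sigma, crit_val) {
  # Assumption: crit_val is a probability level for a Chi-squared quantile
  # with degrees of freedom equal to the dimension of x.
  threshold <- qchisq(crit_val, df = length(x))

  # stats::mahalanobis() returns the squared Mahalanobis distance.
  sq_dists <- mapply(
    function(m, s) mahalanobis(x, center = m, cov = s),
    mu, sigma
  )

  # Accept only if the point is far from every component.
  all(sq_dists >= threshold)
}

is_accepted(c(-1, 0), mu, sigma, crit_val)   # at a component mean: FALSE
is_accepted(c(10, 10), mu, sigma, crit_val)  # far from all components: TRUE
```

A proposal that fails this check for even one component is discarded, which is why a small hyper-rectangle or a crit_val very close to 1 can lead to many rejections.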

Value

simulate_gmm returns a data.frame with continuous variables X1, X2, ..., followed by a mixture component label vector G in which outliers are denoted by 0.
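As a brief illustration of this layout, the sketch below separates outliers from component members using the G column. The data frame sim is a hand-made stand-in with the documented column names, not real output from simulate_gmm, and its values are arbitrary.

```r
# Hand-made stand-in with the documented layout (not real simulate_gmm output).
sim <- data.frame(
  X1 = c(-1.1, 0.9, 1.2, 3.5),
  X2 = c(0.3, -1.2, 0.8, -4.0),
  G = c(1, 2, 3, 0)
)

# Outliers are the rows labelled 0.
is_outlier <- sim$G == 0
outliers <- sim[is_outlier, c("X1", "X2")]
gaussian_points <- sim[!is_outlier, c("X1", "X2")]

# Count the observations per label, including the outlier label 0.
label_counts <- table(sim$G)
```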

Examples

gmm_k3n1000o10 <- simulate_gmm(
  n = c(500, 250, 250),
  mu = list(c(-1, 0), c(+1, -1), c(+1, +1)),
  sigma = list(diag(c(0.2, 4 * 0.2)), diag(c(0.2, 0.2)), diag(c(0.2, 0.2))),
  outlier_num = 10,
  seed = 123,
  crit_val = 0.9999,
  range_multiplier = 1.5
)

plot(
  gmm_k3n1000o10[, c("X1", "X2")],
  col = gmm_k3n1000o10$G + 1, pch = gmm_k3n1000o10$G + 1
)

[Package outlierMBC version 0.0.1]