simulate_gmm {outlierMBC} | R Documentation
Simulate data from a Gaussian mixture model with outliers.
Description
Simulates data from a Gaussian mixture model, then simulates outliers from a hyper-rectangle, with a rejection step to ensure that the outliers are sufficiently unlikely under the model.
Usage
simulate_gmm(
n,
mu,
sigma,
outlier_num,
seed = NULL,
crit_val = 0.9999,
range_multiplier = 1.5,
verbose = TRUE,
max_rejection = 1e+06
)
Arguments
n | Vector of component sizes.
mu | List of component mean vectors.
sigma | List of component covariance matrices.
outlier_num | Desired number of outliers.
seed | Seed for random number generation.
crit_val | Critical value for uniform sample rejection.
range_multiplier | How much greater should the range of the Uniform samples be than the range of the Normal samples?
verbose | Whether a message should be printed if a high number of outliers are being simulated. This suggests that many simulated outliers are being rejected and the other arguments may need to be adjusted.
max_rejection | Maximum number of simulated outliers to be rejected.
Details
The simulated outliers are sampled from a Uniform distribution over a
hyper-rectangle. For each dimension, the hyper-rectangle is centred at the
midpoint between the maximum and minimum values for that variable from all of
the Gaussian observations. Its width in that dimension is the distance
between the minimum and maximum values for that variable multiplied by the
value of range_multiplier. If range_multiplier = 1, then this
hyper-rectangle is the axis-aligned minimum bounding box for all of the
Gaussian data points in this data set.
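The construction of the hyper-rectangle can be sketched in base R as follows. This is only an illustration of the description above, not the package's own code: outlier_box is a hypothetical helper, and the Gaussian sample is generated here just to have something to bound.

```r
# Hypothetical helper (not part of outlierMBC): bounds of the sampling
# hyper-rectangle for a numeric matrix `x` with one column per variable.
outlier_box <- function(x, range_multiplier = 1.5) {
  x <- as.matrix(x)
  lo <- apply(x, 2, min)                      # per-variable minimum
  hi <- apply(x, 2, max)                      # per-variable maximum
  centre <- (lo + hi) / 2                     # midpoint of each range
  half_width <- range_multiplier * (hi - lo) / 2
  rbind(lower = centre - half_width, upper = centre + half_width)
}

set.seed(1)
gauss <- matrix(rnorm(200), ncol = 2)         # stand-in Gaussian observations
box <- outlier_box(gauss, range_multiplier = 1.5)

# Draw 5 proposals uniformly over the box, one column per variable.
proposals <- sapply(
  seq_len(ncol(box)),
  function(j) runif(5, min = box["lower", j], max = box["upper", j])
)
```

With range_multiplier = 1 the bounds returned by this sketch coincide with the per-variable minima and maxima, which is the minimum bounding box case mentioned above.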
The crit_val ensures that it would have been sufficiently unlikely for a
simulated outlier to have been sampled from any of the Gaussian components.
The Mahalanobis distances of a proposed outlier from each component's mean
vector with respect to that component's covariance matrix are computed. If
any of these Mahalanobis distances are smaller than the critical value of the
appropriate Chi-squared distribution, then the proposed outlier is rejected.
In summary, for a Uniform sample to be accepted, it must be sufficiently far
from each component in terms of Mahalanobis distance.
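The acceptance rule can be illustrated with a short base R sketch. It rests on two assumptions that the text above does not state explicitly: that the squared Mahalanobis distance (as returned by stats::mahalanobis) is the quantity being compared, and that the "appropriate" Chi-squared distribution has degrees of freedom equal to the number of variables. accept_proposal is a hypothetical helper, not a function exported by the package.

```r
# Hypothetical helper (not part of outlierMBC) illustrating the acceptance rule.
# Assumption: squared Mahalanobis distances are compared with the chi-squared
# quantile at `crit_val`, with df equal to the number of variables.
accept_proposal <- function(x, mu, sigma, crit_val = 0.9999) {
  threshold <- qchisq(crit_val, df = length(x))
  # Squared distance of the proposal from each component in turn.
  d2 <- mapply(
    function(m, s) mahalanobis(x, center = m, cov = s),
    mu, sigma
  )
  # Accept only if the proposal is far from every component.
  all(d2 >= threshold)
}

mu <- list(c(-1, 0), c(1, -1), c(1, 1))
sigma <- list(diag(c(0.2, 0.8)), diag(c(0.2, 0.2)), diag(c(0.2, 0.2)))

accept_proposal(c(-1, 0), mu, sigma)   # FALSE: sits on a component mean
accept_proposal(c(10, 10), mu, sigma)  # TRUE: far from every component
```

Under these assumptions, raising crit_val raises the threshold, so more proposals are rejected and the accepted outliers lie further from the components.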
Value
simulate_gmm returns a data.frame with continuous variables X1, X2, ...,
followed by a mixture component label vector G with outliers denoted by 0.
Examples
gmm_k3n1000o10 <- simulate_gmm(
  n = c(500, 250, 250),
  mu = list(c(-1, 0), c(+1, -1), c(+1, +1)),
  sigma = list(diag(c(0.2, 4 * 0.2)), diag(c(0.2, 0.2)), diag(c(0.2, 0.2))),
  outlier_num = 10,
  seed = 123,
  crit_val = 0.9999,
  range_multiplier = 1.5
)
plot(
  gmm_k3n1000o10[, c("X1", "X2")],
  col = gmm_k3n1000o10$G + 1, pch = gmm_k3n1000o10$G + 1
)