simulate_lcwm {outlierMBC} | R Documentation |
Simulate data from a linear cluster-weighted model with outliers.
Description
Simulates data from a linear cluster-weighted model, then simulates outliers from a region around each mixture component, with a rejection step to control how unlikely the outliers are under the model.
Usage
simulate_lcwm(
n,
mu,
sigma,
beta,
error_sd,
outlier_num,
outlier_type = c("x_and_y", "x_only", "y_only"),
seed = NULL,
prob_range = c(1e-08, 1e-06),
range_multipliers = c(3, 3),
more_extreme = FALSE
)
Arguments
n |
Vector of component sizes. |
mu |
List of component mean vectors. |
sigma |
List of component covariance matrices. |
beta |
List of component regression coefficient vectors. |
error_sd |
Vector of component regression error standard deviations. |
outlier_num |
Desired number of outliers. |
outlier_type |
Character string governing whether the outliers are outlying with respect to the explanatory variable only ("x_only"), the response variable only ("y_only"), or both ("x_and_y", the default). |
seed |
Seed for the random number generator. |
prob_range |
Vector of two probability bounds used to reject proposed outliers sampled from the Uniform distribution (see Details). |
range_multipliers |
Vector of two multipliers. The sampling region for the Uniform distribution used to simulate proposed outliers is controlled by multiplying the component widths by these values: the first applies to every explanatory variable, the second to the regression errors. |
more_extreme |
Whether to return a column in the data frame consisting of the probabilities of sampling more extreme true observations than the simulated outliers. |
Details
simulate_lcwm samples a user-defined number of outliers for each component.
However, even though an outlier may be associated with one component, it must
be outlying with respect to every component.
The covariate values of the simulated outliers for a given component g are
sampled from a Uniform distribution over a hyper-rectangle which is specific
to that component. For each covariate dimension, the hyper-rectangle is
centred at the midpoint between the maximum and minimum values for that
variable from all of the Gaussian observations from component g. Its width
in that dimension is the distance between the minimum and maximum values for
that variable multiplied by the value of range_multipliers[1].
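The covariate sampling region described above can be sketched in base R. This is only an illustration of the description, not the package's internal code: the helper sample_covariate_box() and the toy matrix x_g are hypothetical names introduced here.

```r
# Hypothetical helper (not part of outlierMBC): propose outlier covariates
# for one component by sampling uniformly over a box built from that
# component's Gaussian observations.
sample_covariate_box <- function(x_g, n_prop, multiplier) {
  x_g <- as.matrix(x_g)
  lower <- apply(x_g, 2, min)
  upper <- apply(x_g, 2, max)
  centre <- (lower + upper) / 2                    # midpoint of observed range
  half_width <- multiplier * (upper - lower) / 2   # widened by the multiplier
  p <- ncol(x_g)
  # Standard uniforms, then rescale each column to its own interval
  u <- matrix(runif(n_prop * p), nrow = n_prop, ncol = p)
  sweep(sweep(u, 2, 2 * half_width, `*`), 2, centre - half_width, `+`)
}

set.seed(1)
x_g <- cbind(rnorm(300, mean = 3), rnorm(300, mean = -1, sd = 2))
proposals <- sample_covariate_box(x_g, n_prop = 5, multiplier = 3)
```

Here multiplier plays the role of range_multipliers[1]; a value of 1 would restrict proposals to the observed range of the component.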
The response values of the simulated outliers for a given component g are
obtained by sampling random errors from a Uniform distribution over a
univariate interval, simulating covariate values as discussed above,
computing the mean response value for those covariate values, then adding
the simulated error to that mean response. The error sampling interval is
centred at the midpoint between the maximum and minimum errors from all of
the Gaussian observations from component g. Its width is the distance between
the minimum and maximum errors multiplied by the value of
range_multipliers[2].
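The response step can be sketched in the same spirit. The helper sample_outlier_response() is hypothetical, and the sketch assumes that each coefficient vector stores the intercept first, followed by the slopes; it is not the package's internal code.

```r
# Hypothetical helper (not part of outlierMBC): propose outlier responses
# for one component, given already-proposed covariate values x_new.
sample_outlier_response <- function(x_new, x_g, y_g, beta_g, multiplier) {
  x_g <- as.matrix(x_g)
  x_new <- as.matrix(x_new)
  # Errors of the Gaussian observations around the component's regression
  # (assumption: beta_g = c(intercept, slopes))
  errors_g <- y_g - drop(cbind(1, x_g) %*% beta_g)
  centre <- (min(errors_g) + max(errors_g)) / 2
  half_width <- multiplier * (max(errors_g) - min(errors_g)) / 2
  new_errors <- runif(nrow(x_new), centre - half_width, centre + half_width)
  # Mean response at the proposed covariates, plus the simulated error
  drop(cbind(1, x_new) %*% beta_g) + new_errors
}

set.seed(2)
beta_g <- c(0, 5)
x_g <- rnorm(400, mean = 3)
y_g <- beta_g[1] + beta_g[2] * x_g + rnorm(400, sd = 1)
x_new <- runif(4, min = 0, max = 6)   # stand-in for proposed covariates
y_new <- sample_outlier_response(x_new, x_g, y_g, beta_g, multiplier = 2)
```

Here multiplier plays the role of range_multipliers[2].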
A proposed outlier for component g is rejected if the probability of
sampling a more extreme point from any of the components is greater than
prob_range[2] or if the probability of sampling a more extreme point from
component g is less than prob_range[1]. This can be visualised as a pair
of inner and outer envelopes around each component. To be accepted, a
proposed outlier must lie inside the outer envelope for its component and
outside the inner envelopes of all components. Setting prob_range[1] = 0
will eliminate the outer envelope, while setting prob_range[2] = 1 will
eliminate the inner envelope.
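The envelope idea can be illustrated for covariates alone. The sketch below assumes, purely for illustration, that "more extreme" is measured by the squared Mahalanobis distance, so the tail probability comes from a chi-squared distribution; the package may measure extremeness differently, particularly when the response is involved. The helpers prob_more_extreme() and accept_proposal() are hypothetical.

```r
# Tail probability of a more extreme point under one Gaussian component,
# measured here (by assumption) through the squared Mahalanobis distance.
prob_more_extreme <- function(x, mu, sigma) {
  x <- matrix(x, ncol = length(mu))
  d2 <- mahalanobis(x, center = mu, cov = sigma)
  pchisq(d2, df = length(mu), lower.tail = FALSE)
}

# Keep a proposal for component g only if it lies inside the outer envelope
# of g and outside the inner envelope of every component.
accept_proposal <- function(x, g, mu, sigma, prob_range) {
  p_all <- vapply(
    seq_along(mu),
    function(h) prob_more_extreme(x, mu[[h]], sigma[[h]]),
    numeric(1)
  )
  inside_outer <- p_all[g] >= prob_range[1]
  outside_inner <- all(p_all <= prob_range[2])
  inside_outer && outside_inner
}

mu <- list(c(0, 0), c(10, 0))
sigma <- list(diag(2), diag(2))
prob_range <- c(1e-8, 1e-6)

at_centre <- accept_proposal(c(0, 0), g = 1, mu, sigma, prob_range)
in_band <- accept_proposal(c(-5.5, 0), g = 1, mu, sigma, prob_range)
too_far <- accept_proposal(c(-20, 0), g = 1, mu, sigma, prob_range)
```

Under these assumptions the point at the centre of component 1 is rejected because it is far too likely, the point at (-5.5, 0) falls between the two envelopes and is accepted, and the point at (-20, 0) is rejected because it lies beyond the outer envelope.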
By setting outlier_type = "x_only" and giving arbitrary values to error_sd
(e.g. a zero vector) and beta (e.g. a list of zero vectors), then ignoring
the simulated Y variable, simulate_lcwm can be used to simulate a Gaussian
mixture model. Since simulate_lcwm simulates component-specific outliers from
sampling regions around each component, rather than a single sampling region
around all of the components, this will not be equivalent to simulate_gmm.
simulate_lcwm also allows the user to set an upper bound on how unlikely an
outlier is, as well as a lower bound, whereas simulate_gmm only sets a lower
bound.
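A possible way to set this up is sketched below. The argument values are arbitrary illustrations, the coefficient vectors are assumed to have one entry per covariate plus an intercept, and the call is guarded so that the snippet still runs when outlierMBC is not installed.

```r
# Arguments for a two-component, two-covariate mixture; beta and error_sd
# are placeholders because the simulated response is discarded afterwards.
gmm_args <- list(
  n = c(200, 300),
  mu = list(c(0, 0), c(5, 5)),
  sigma = list(diag(2), diag(2)),
  beta = list(rep(0, 3), rep(0, 3)),  # assumed layout: intercept + 2 slopes
  error_sd = c(1, 1),
  outlier_num = c(5, 5),
  outlier_type = "x_only",
  seed = 1
)

if (requireNamespace("outlierMBC", quietly = TRUE)) {
  sim <- do.call(outlierMBC::simulate_lcwm, gmm_args)
  # Drop the response to keep only the covariates and the labels
  gmm_data <- sim[, names(sim) != "Y"]
}
```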
Value
simulate_lcwm returns a data.frame with continuous variables X1, X2, ...,
followed by a continuous response variable, Y, and a mixture component label
vector G with outliers denoted by 0. The optional variable more_extreme may
be included, if specified by the corresponding argument.
Examples
lcwm_k3n1000o10 <- simulate_lcwm(
n = c(300, 300, 400),
mu = list(c(3), c(6), c(3)),
sigma = list(as.matrix(1), as.matrix(0.1), as.matrix(1)),
beta = list(c(0, 0), c(-75, 15), c(0, 5)),
error_sd = c(1, 1, 1),
outlier_num = c(3, 3, 4),
outlier_type = "x_and_y",
seed = 123,
prob_range = c(1e-8, 1e-6),
range_multipliers = c(1, 2)
)
plot(
lcwm_k3n1000o10[, c("X1", "Y")],
col = lcwm_k3n1000o10$G + 1,
pch = lcwm_k3n1000o10$G + 1
)