simulate_lcwm {outlierMBC}    R Documentation

Simulate data from a linear cluster-weighted model with outliers.

Description

Simulates data from a linear cluster-weighted model, then simulates outliers from a region around each mixture component, with a rejection step to control how unlikely the outliers are under the model.

Usage

simulate_lcwm(
  n,
  mu,
  sigma,
  beta,
  error_sd,
  outlier_num,
  outlier_type = c("x_and_y", "x_only", "y_only"),
  seed = NULL,
  prob_range = c(1e-08, 1e-06),
  range_multipliers = c(3, 3),
  more_extreme = FALSE
)

Arguments

n

Vector of component sizes.

mu

List of component mean vectors.

sigma

List of component covariance matrices.

beta

List of component regression coefficient vectors.

error_sd

Vector of component regression error standard deviations.

outlier_num

Desired number of outliers.

outlier_type

Character string governing whether the outliers are outlying with respect to the explanatory variable only ("x_only"), the response variable only ("y_only"), or both ("x_and_y"). "x_and_y" is the default value.

seed

Seed for the random number generator.

prob_range

Vector of two probabilities used as the lower and upper bounds in the rejection step applied to proposed outliers (see Details).

range_multipliers

The sampling region for the Uniform distribution used to simulate proposed outliers is controlled by multiplying the component widths by these values. The first value applies to the explanatory variables and the second to the regression errors (see Details).

more_extreme

Whether to return a column in the data frame consisting of the probabilities of sampling more extreme true observations than the simulated outliers.

Details

simulate_lcwm samples a user-defined number of outliers for each component. However, even though an outlier may be associated with one component, it must be outlying with respect to every component.

The covariate values of the simulated outliers for a given component g are sampled from a Uniform distribution over a hyper-rectangle which is specific to that component. For each covariate dimension, the hyper-rectangle is centred at the midpoint between the maximum and minimum values for that variable from all of the Gaussian observations from component g. Its width in that dimension is the distance between the minimum and maximum values for that variable multiplied by the value of range_multipliers[1].
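The hyper-rectangle construction can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's internal code: sample_box is a hypothetical helper, x_g stands in for the Gaussian covariates of component g, and multiplier plays the role of range_multipliers[1].

```r
# Hypothetical helper, NOT part of outlierMBC: draws `m` proposed covariate
# vectors uniformly from a box centred on the midpoint of each column's range.
sample_box <- function(x, m, multiplier) {
  x <- as.matrix(x)
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  centre <- (lo + hi) / 2
  # Box width is multiplier * (max - min), so the half-width is half of that.
  half_width <- multiplier * (hi - lo) / 2
  # One runif() call per covariate dimension.
  out <- vapply(
    seq_len(ncol(x)),
    function(j) stats::runif(m, centre[j] - half_width[j], centre[j] + half_width[j]),
    numeric(m)
  )
  matrix(out, nrow = m, ncol = ncol(x))
}

set.seed(1)
x_g <- matrix(rnorm(200), ncol = 2)  # stand-in for component g's covariates
proposals <- sample_box(x_g, m = 5, multiplier = 3)
```

With multiplier = 1 the box coincides with the observed range of the component, so larger values are needed for proposals to fall outside it.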

The response values of the simulated outliers for a given component g are obtained by simulating covariate values as discussed above, computing the mean response value for those covariate values, then adding a random error sampled from a Uniform distribution over a univariate interval. The error sampling interval is centred at the midpoint between the maximum and minimum errors from all of the Gaussian observations from component g. Its width is the distance between the minimum and maximum errors multiplied by the value of range_multipliers[2].
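The response step can be sketched in the same way. This is an illustration only: sample_outlier_response is a hypothetical helper, the coefficient vector is assumed to hold the intercept first followed by the slopes, and the errors are assumed to be measured against the true regression line of component g. The package may differ on any of these points.

```r
# Hypothetical helper, NOT part of outlierMBC.
# x_g, y_g: covariates and responses of component g's Gaussian observations
# beta_g:   ASSUMED layout c(intercept, slopes)
# x_new:    proposed outlier covariates (e.g. drawn from the hyper-rectangle)
sample_outlier_response <- function(x_g, y_g, beta_g, x_new, multiplier) {
  design <- function(x) cbind(1, as.matrix(x))
  # Errors of the Gaussian observations around the regression line.
  errors <- as.numeric(y_g - design(x_g) %*% beta_g)
  centre <- (min(errors) + max(errors)) / 2
  half_width <- multiplier * (max(errors) - min(errors)) / 2
  new_errors <- stats::runif(
    nrow(as.matrix(x_new)), centre - half_width, centre + half_width
  )
  # Mean response at the proposed covariates plus the simulated error.
  as.numeric(design(x_new) %*% beta_g) + new_errors
}

set.seed(2)
x_g <- matrix(rnorm(100, mean = 3), ncol = 1)
beta_g <- c(0, 5)
y_g <- as.numeric(cbind(1, x_g) %*% beta_g) + rnorm(100)
x_new <- matrix(c(0, 3, 6), ncol = 1)
y_new <- sample_outlier_response(x_g, y_g, beta_g, x_new, multiplier = 2)
```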

A proposed outlier for component g is rejected if the probability of sampling a more extreme point from any of the components is greater than prob_range[2] or if the probability of sampling a less extreme point from component g is less than prob_range[1]. This can be visualised as a pair of inner and outer envelopes around each component. To be accepted, a proposed outlier must lie inside the outer envelope for its component and outside the inner envelopes of all components. Setting prob_range[1] = 0 will eliminate the outer envelope, while setting prob_range[2] = 0 will eliminate the inner envelope.
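The envelope picture can be made concrete with a small sketch for the covariates alone. The helpers tail_prob and accept_outlier are hypothetical, and the sketch assumes that "the probability of sampling a more extreme point" is the chi-squared upper tail of the squared Mahalanobis distance; how the package actually computes this probability, and how it combines covariates and response, is not stated here.

```r
# Hypothetical helpers, NOT part of outlierMBC. Covariates only.
# ASSUMPTION: tail probability = chi-squared upper tail of the squared
# Mahalanobis distance to the component.
tail_prob <- function(x, mu, sigma) {
  d2 <- stats::mahalanobis(x, center = mu, cov = sigma)
  stats::pchisq(d2, df = length(mu), lower.tail = FALSE)
}

accept_outlier <- function(x, g, mu, sigma, prob_range) {
  probs <- vapply(
    seq_along(mu),
    function(h) tail_prob(x, mu[[h]], sigma[[h]]),
    numeric(1)
  )
  # Inside the outer envelope of its own component g.
  inside_outer <- probs[g] >= prob_range[1]
  # Outside the inner envelope of every component.
  outside_inner <- all(probs <= prob_range[2])
  inside_outer && outside_inner
}

mu <- list(c(0, 0), c(6, 6))
sigma <- list(diag(2), diag(2))
prob_range <- c(1e-8, 1e-6)

res_typical <- accept_outlier(c(0.5, 0.5), g = 1, mu, sigma, prob_range)  # FALSE
res_ok      <- accept_outlier(c(-4, -4),   g = 1, mu, sigma, prob_range)  # TRUE
res_far     <- accept_outlier(c(-20, -20), g = 1, mu, sigma, prob_range)  # FALSE
```

The first point is too typical of component 1, the third lies beyond the outer envelope, and only the second falls in the band between the two envelopes.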

By setting outlier_type = "x_only" and giving arbitrary values to error_sd (e.g. a zero vector) and beta (e.g. a list of zero vectors), then ignoring the simulated Y variable, simulate_lcwm can be used to simulate a Gaussian mixture model. Since simulate_lcwm simulates component-specific outliers from sampling regions around each component, rather than a single sampling region around all of the components, this will not be equivalent to simulate_gmm. simulate_lcwm also allows the user to set an upper bound on how unlikely an outlier is, as well as a lower bound, whereas simulate_gmm only sets a lower bound.

Value

simulate_lcwm returns a data.frame with continuous variables X1, X2, ..., followed by a continuous response variable, Y, and a mixture component label vector G with outliers denoted by 0. The optional variable more_extreme may be included, if specified by the corresponding argument.

Examples

lcwm_k3n1000o10 <- simulate_lcwm(
  n = c(300, 300, 400),
  mu = list(c(3), c(6), c(3)),
  sigma = list(as.matrix(1), as.matrix(0.1), as.matrix(1)),
  beta = list(c(0, 0), c(-75, 15), c(0, 5)),
  error_sd = c(1, 1, 1),
  outlier_num = c(3, 3, 4),
  outlier_type = "x_and_y",
  seed = 123,
  prob_range = c(1e-8, 1e-6),
  range_multipliers = c(1, 2)
)

plot(
  lcwm_k3n1000o10[, c("X1", "Y")],
  col = lcwm_k3n1000o10$G + 1,
  pch = lcwm_k3n1000o10$G + 1
)

[Package outlierMBC version 0.0.1 Index]