ombc_gmm {outlierMBC}    R Documentation

Sequentially identify outliers while fitting a Gaussian mixture model.

Description

This function performs model-based clustering and outlier identification. It does so by iteratively fitting a Gaussian mixture model and removing the observation that is least likely under the model. Its procedure is summarised below:

  1. Fit a Gaussian mixture model to the data.

  2. Compute a dissimilarity between the theoretical and observed distributions of the scaled squared sample Mahalanobis distances for each mixture component.

  3. Aggregate across the components to obtain a single dissimilarity value.

  4. Remove the observation with the lowest mixture density.

  5. Repeat Steps 1-4 until max_out observations have been removed.

  6. Identify the number of outliers which minimised the aggregated dissimilarity, remove only those observations, and fit a Gaussian mixture model to the remaining data.
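The steps above can be sketched in base R for the simplest case of a single Gaussian component. This is only an illustration under stated assumptions, not the package's implementation: the "model fit" is just the sample mean and covariance, the Kolmogorov-Smirnov statistic stands in for whatever dissimilarity ombc_gmm actually uses, and the Beta reference distribution is the standard result for scaled squared sample Mahalanobis distances of Gaussian data. With one component, the observation with the lowest density is the one with the largest Mahalanobis distance. The sketch also evaluates the dissimilarity once more after the final removal, which is a choice made here for illustration. All variable names are hypothetical.

```r
# Illustrative single-component sketch of the trimming loop (assumptions above).
set.seed(1)
p <- 2
x <- rbind(
  matrix(rnorm(200 * p), ncol = p),        # 200 Gaussian observations
  matrix(runif(10 * p, -8, 8), ncol = p)   # 10 uniformly scattered points
)
max_out <- 20

keep <- rep(TRUE, nrow(x))        # observations still in the data set
removed <- integer(max_out)       # indices in order of removal
dissim <- numeric(max_out + 1)    # dissimilarity with 0, 1, ..., max_out removed

for (i in seq_len(max_out + 1)) {
  xi <- x[keep, , drop = FALSE]
  n <- nrow(xi)

  # Step 1: "fit" the one-component model (sample mean and covariance).
  mu <- colMeans(xi)
  sigma <- cov(xi)

  # Steps 2-3: compare the scaled squared Mahalanobis distances with their
  # Beta(p / 2, (n - p - 1) / 2) reference; one component, so no aggregation.
  d2 <- mahalanobis(xi, mu, sigma)
  scaled <- d2 * n / (n - 1)^2
  dissim[i] <- ks.test(scaled, "pbeta", p / 2, (n - p - 1) / 2)$statistic

  # Step 4: remove the lowest-density (largest-distance) observation.
  if (i <= max_out) {
    worst <- which(keep)[which.max(d2)]
    removed[i] <- worst
    keep[worst] <- FALSE
  }
}

# Step 6: keep only the removals up to the dissimilarity minimum.
best_num <- which.min(dissim) - 1
outliers <- removed[seq_len(best_num)]
```

Plotting dissim against 0:max_out gives a curve whose minimum marks the selected number of outliers in this sketch.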

Usage

ombc_gmm(
  x,
  comp_num,
  max_out,
  gross_outs = rep(FALSE, nrow(x)),
  init_scheme = c("update", "reinit", "reuse"),
  mnames = "VVV",
  nmax = 1000,
  atol = 1e-08,
  init_z = NULL,
  init_model = NULL,
  init_method = c("hc", "kmpp"),
  init_scaling = FALSE,
  kmpp_seed = 123,
  fixed_labels = NULL,
  verbose = TRUE
)

Arguments

x

Data, with one observation per row.

comp_num

Number of mixture components.

max_out

Maximum number of outliers.

gross_outs

Logical vector identifying gross outliers.

init_scheme

Which initialisation scheme to use.

mnames

Model names for mixture::gpcm.

nmax

Maximum number of iterations for mixture::gpcm.

atol

EM convergence tolerance threshold for mixture::gpcm.

init_z

Initial component assignment probability matrix.

init_model

Initial mixture model (mixture::gpcm best_model).

init_method

Method used to initialise each mixture model.

init_scaling

Logical value controlling whether the data should be scaled for initialisation.

kmpp_seed

Optional seed for k-means++ initialisation.

fixed_labels

Cluster labels that are known a priori. See the label argument in mixture::gpcm.

verbose

Whether the iteration count is printed.
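As an illustration of the kind of object init_z describes, the sketch below builds a hard (0/1) assignment matrix from k-means labels using base R only. It assumes the matrix has one row per observation and one column per mixture component, with each row summing to 1; that layout is an assumption made here, not something this page states. The simulated data and the names x, km and comp_num are purely illustrative.

```r
# Hypothetical construction of an initial assignment probability matrix.
set.seed(42)
comp_num <- 3

# Simulated two-dimensional data with three well-separated groups.
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Hard cluster labels from k-means (stats package).
km <- kmeans(x, centers = comp_num, nstart = 5)

# Row i is the indicator vector of the component assigned to observation i.
init_z <- diag(comp_num)[km$cluster, , drop = FALSE]
```

Under the layout assumed above, a matrix like this is what would be supplied as init_z = init_z in a call to ombc_gmm.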

Value

ombc_gmm returns an object of class "outliermbc_gmm", which is essentially a list with the following elements:

labels

Vector of mixture component labels with outliers denoted by 0.

outlier_bool

Logical vector indicating if an observation has been classified as an outlier.

outlier_num

Number of observations classified as outliers.

outlier_rank

Order in which observations were removed from the data set. Every observation that was provisionally removed, including those not ultimately classified as outliers, is ranked from 1 to max_out. All gross outliers have rank 1. If there are gross_num gross outliers, the observations removed during the main algorithm are numbered from gross_num + 1 to max_out. Observations that were never removed have rank 0.

gross_outs

Logical vector identifying the gross outliers. This is identical to the gross_outs argument passed to this function.

mix

Output from mixture::gpcm fitted to the non-outlier observations.

loglike

Vector of log-likelihood values for each iteration.

removal_dens

Vector of mixture densities for the removed observations. These are the lowest mixture densities at each iteration.

distrib_diff_vec

Vector of aggregated cross-component dissimilarity values for each iteration.

distrib_diff_mat

Matrix of component-specific dissimilarity values for each iteration.

call

Argument values used in this function call.

version

Version of outlierMBC used in this function call.

conv_status

Logical vector indicating which iterations' mixture models reached convergence during model-fitting.
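The ranking convention described for outlier_rank can be made concrete with a toy example. The sketch below does not call ombc_gmm; it rebuilds a rank vector by hand for eight hypothetical observations, two of which are gross outliers, with max_out = 4. The removal order c(7, 3) is invented for illustration.

```r
# Toy reconstruction of the outlier_rank convention (hypothetical values).
n <- 8
max_out <- 4
gross_outs <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
gross_num <- sum(gross_outs)

# Indices removed by the main algorithm, in order of removal (invented).
removal_order <- c(7, 3)

outlier_rank <- integer(n)                 # never removed: rank 0
outlier_rank[gross_outs] <- 1L             # all gross outliers: rank 1
outlier_rank[removal_order] <- gross_num + seq_along(removal_order)

outlier_rank
# [1] 0 1 4 0 1 0 3 0
```

Observations 2 and 5 share rank 1 as gross outliers, observation 7 was removed first by the main algorithm (rank 3), and observation 3 was removed last (rank 4, equal to max_out).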

Examples

ombc_gmm_k3n1000o10 <- ombc_gmm(
  gmm_k3n1000o10[, 1:2],
  comp_num = 3, max_out = 20
)

plot_curve(ombc_gmm_k3n1000o10)

[Package outlierMBC version 0.0.1 Index]