ombc_lcwm {outlierMBC}    R Documentation

Sequentially identify outliers while fitting a linear cluster-weighted model.

Description

This function performs model-based clustering, clusterwise regression, and outlier identification. It does so by iteratively fitting a linear cluster-weighted model and removing the observation that is least likely under the model. Its procedure is summarised below:

  1. Fit a linear cluster-weighted model to the data.

  2. Compute a dissimilarity between the theoretical and observed distributions of the scaled squared sample Mahalanobis distances for each mixture component.

  3. Compute a dissimilarity between the theoretical and observed distributions of the scaled squared studentised residuals for each mixture component.

  4. Aggregate these two dissimilarities to obtain one dissimilarity value for each component.

  5. Aggregate across the components to obtain a single dissimilarity value.

  6. Remove the observation with the lowest mixture density.

  7. Repeat Steps 1-6 until max_out observations have been removed.

  8. Identify the number of outliers which minimised the aggregated dissimilarity, remove only those observations, and fit a linear cluster-weighted model to the remaining data.
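The remove-and-refit loop above can be sketched in base R. This is a minimal illustration under simplifying assumptions, not the package's implementation: it fits a single Gaussian instead of a linear cluster-weighted model, skips the residual-based Step 3 and the aggregation in Steps 4 and 5, and uses the mean absolute difference between empirical and theoretical CDFs as an assumed dissimilarity. All variable names are hypothetical.

```r
# Simplified stand-in: one Gaussian component and covariates only (no regression step).
set.seed(1)
x <- rbind(
  matrix(rnorm(2 * 200), ncol = 2),                  # 200 bulk observations
  matrix(runif(2 * 3, min = 8, max = 10), ncol = 2)  # 3 planted outliers (rows 201 to 203)
)
max_out <- 10
removed <- integer(0)
dissim <- numeric(max_out + 1)  # dissim[i + 1] is the value after i removals

for (i in 0:max_out) {
  keep <- setdiff(seq_len(nrow(x)), removed)
  xi <- x[keep, , drop = FALSE]
  n <- nrow(xi)
  p <- ncol(xi)

  # Step 1 (simplified): fit a single Gaussian.
  mu <- colMeans(xi)
  sigma <- cov(xi)

  # Step 2: for Gaussian data, n * D^2 / (n - 1)^2 follows Beta(p / 2, (n - p - 1) / 2).
  d2 <- mahalanobis(xi, mu, sigma)
  theo_cdf <- pbeta(sort(n * d2 / (n - 1)^2), p / 2, (n - p - 1) / 2)
  emp_cdf <- (seq_len(n) - 0.5) / n
  dissim[i + 1] <- mean(abs(emp_cdf - theo_cdf))  # assumed dissimilarity measure

  # Step 6: with one Gaussian, the lowest density is the largest Mahalanobis distance.
  if (i < max_out) removed <- c(removed, keep[which.max(d2)])
}

# Step 8: keep only the removals up to the minimum dissimilarity.
best_out <- which.min(dissim) - 1
outliers <- removed[seq_len(best_out)]
```

Recording the dissimilarity before each removal means that zero outliers remains a possible answer, and the final selection simply truncates the removal order at the minimum.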

Usage

ombc_lcwm(
  xy,
  x,
  y_formula,
  comp_num,
  max_out,
  gross_outs = rep(FALSE, nrow(x)),
  init_scheme = c("update", "reinit", "reuse"),
  mnames = "VVV",
  nmax = 1000,
  atol = 1e-08,
  init_z = NULL,
  init_method = c("hc", "kmpp"),
  init_scaling = TRUE,
  kmpp_seed = 123,
  verbose = TRUE,
  dd_weight = 0.5
)

Arguments

xy

data.frame containing covariates and response.

x

Covariate data only.

y_formula

Regression formula for the response, for example Y ~ X1.

comp_num

Number of mixture components.

max_out

Maximum number of outliers.

gross_outs

Logical vector identifying gross outliers.

init_scheme

Which initialisation scheme to use: one of "update", "reinit", or "reuse".

mnames

Model names for mixture::gpcm.

nmax

Maximum number of iterations for flexCWM::cwm.

atol

EM convergence threshold for flexCWM::cwm.

init_z

Initial component assignment probability matrix.

init_method

Method used to initialise each mixture model: either "hc" or "kmpp" (k-means++).

init_scaling

Logical value controlling whether the data should be scaled for initialisation.

kmpp_seed

Optional seed for k-means++ initialisation.

verbose

Logical value controlling whether the iteration count is printed.

dd_weight

A value between 0 and 1 which controls the weighting of the response and covariate dissimilarities when aggregating.
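One way to read dd_weight is as the weight in a convex combination of the two component-level dissimilarities. The sketch below assumes exactly that, and it also assumes that dd_weight multiplies the covariate (Mahalanobis distance) term; neither is confirmed by this page. The helper aggregate_dissim is hypothetical and is not part of outlierMBC.

```r
# Hypothetical helper: which dissimilarity dd_weight multiplies is an assumption.
aggregate_dissim <- function(covariate_diff, response_diff, dd_weight = 0.5) {
  stopifnot(dd_weight >= 0, dd_weight <= 1)
  dd_weight * covariate_diff + (1 - dd_weight) * response_diff
}

equal_weight <- aggregate_dissim(0.10, 0.30)                    # default weighting
covariate_only <- aggregate_dissim(0.10, 0.30, dd_weight = 1)   # response term ignored
```

Under this reading, the default of 0.5 treats both dissimilarities equally, while values near 0 or 1 let one of them dominate.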

Value

ombc_lcwm returns an object of class "outliermbc_lcwm", which is essentially a list with the following elements:

labels

Vector of mixture component labels with outliers denoted by 0.

outlier_bool

Logical vector indicating if an observation has been classified as an outlier.

outlier_num

Number of observations classified as outliers.

outlier_rank

Order in which observations are removed from the data set. Observations which were provisionally removed, including those that were eventually not classified as outliers, are ranked from 1 to max_out. All gross outliers have rank 1. If there are gross_num gross outliers, then the observations removed during the main algorithm itself will be numbered from gross_num + 1 to max_out. Observations that were never removed have rank 0.

gross_outs

Logical vector identifying the gross outliers. This is identical to the gross_outs vector passed to this function as an argument.

lcwm

Output from flexCWM::cwm fitted to the non-outlier observations.

loglike

Vector of log-likelihood values for each iteration.

removal_dens

Vector of mixture densities for the removed observations. These are the lowest mixture densities at each iteration.

distrib_diff_vec

Vector of aggregated cross-component dissimilarity values for each iteration.

distrib_diff_mat

Matrix of component-specific dissimilarity values for each iteration.

distrib_diff_arr

Array of component-specific response and covariate dissimilarity values for each iteration.

call

Argument values used in this function call.

version

Version of outlierMBC used in this function call.

conv_status

Logical vector indicating which iterations' mixture models reached convergence during model-fitting.
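The conventions for labels, outlier_bool, outlier_num, and outlier_rank described above can be illustrated with hand-built vectors. The values below are invented for illustration and do not come from a fitted model; removal_order and the chosen labels are hypothetical.

```r
# Hand-built stand-in for the returned elements; all values are invented.
n <- 10
gross_outs <- rep(FALSE, n)
gross_outs[c(2, 7)] <- TRUE
gross_num <- sum(gross_outs)   # 2 gross outliers

removal_order <- c(4, 9, 1)    # hypothetical order of removal by the main algorithm

outlier_rank <- integer(n)     # 0 marks observations that were never removed
outlier_rank[gross_outs] <- 1L # all gross outliers share rank 1
outlier_rank[removal_order] <- gross_num + seq_along(removal_order)
print(outlier_rank)
# [1] 5 1 0 3 0 0 1 0 4 0

# Suppose observations 2, 7, and 4 are finally classified as outliers,
# while observations 9 and 1 were only provisionally removed.
labels <- c(1, 0, 2, 0, 1, 3, 0, 2, 3, 1)  # 0 denotes an outlier
outlier_bool <- labels == 0
outlier_num <- sum(outlier_bool)           # 3
```

Here observations 1 and 9 have non-zero ranks but keep a component label, which matches the note that provisionally removed observations need not end up classified as outliers.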

Examples

gross_lcwm_k3n1000o10 <- find_gross(lcwm_k3n1000o10, 20)

ombc_lcwm_k3n1000o10 <- ombc_lcwm(
  xy = lcwm_k3n1000o10[, c("X1", "Y")],
  x = lcwm_k3n1000o10$X1,
  y_formula = Y ~ X1,
  comp_num = 3,
  max_out = 20,
  mnames = "V",
  gross_outs = gross_lcwm_k3n1000o10$gross_bool
)

[Package outlierMBC version 0.0.1 Index]