ombc_gmm {outlierMBC} | R Documentation |
Sequentially identify outliers while fitting a Gaussian mixture model.
Description
This function performs model-based clustering and outlier identification. It does so by iteratively fitting a Gaussian mixture model and removing the observation that is least likely under the model. Its procedure is summarised below:
1. Fit a Gaussian mixture model to the data.
2. Compute a dissimilarity between the theoretical and observed distributions of the scaled squared sample Mahalanobis distances for each mixture component.
3. Aggregate across the components to obtain a single dissimilarity value.
4. Remove the observation with the lowest mixture density.
5. Repeat Steps 1-4 until max_out observations have been removed.
6. Identify the number of outliers which minimised the aggregated dissimilarity, remove only those observations, and fit a Gaussian mixture model to the remaining data.
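The procedure above can be sketched in base R for the simplified case of a single Gaussian component. This is an illustration only, not the package's implementation: sketch_removal is a hypothetical helper, the sample mean and covariance stand in for the mixture::gpcm fit, and the dissimilarity (mean absolute gap between the empirical CDF and the Beta reference CDF) is an assumed choice. The Beta reference relies on the standard result that, for Gaussian data with sample estimates, n / (n - 1)^2 times the squared Mahalanobis distance follows a Beta(p / 2, (n - p - 1) / 2) distribution.

```r
# Hypothetical single-component sketch of the removal loop (base R only).
sketch_removal <- function(x, max_out) {
  x <- as.matrix(x)
  p <- ncol(x)
  keep <- rep(TRUE, nrow(x))
  removal_order <- integer(max_out)
  diffs <- numeric(max_out + 1)  # entry i corresponds to i - 1 removals

  for (i in seq_len(max_out + 1)) {
    xi <- x[keep, , drop = FALSE]
    n <- nrow(xi)

    # Step 1: "fit" the model (sample mean and covariance stand in for gpcm).
    mu <- colMeans(xi)
    sigma <- cov(xi)

    # Step 2: compare scaled squared distances with their Beta reference.
    d2 <- mahalanobis(xi, mu, sigma)
    scaled <- sort(d2 * n / (n - 1)^2)
    theo <- pbeta(scaled, p / 2, (n - p - 1) / 2)
    emp <- seq_len(n) / n

    # Step 3: with one component there is nothing to aggregate.
    diffs[i] <- mean(abs(emp - theo))  # assumed dissimilarity measure

    if (i > max_out) break

    # Step 4: for one Gaussian, lowest density means largest distance.
    worst <- which(keep)[which.max(d2)]
    keep[worst] <- FALSE
    removal_order[i] <- worst
  }

  # Step 6: keep the number of removals that minimised the dissimilarity.
  outlier_num <- which.min(diffs) - 1
  list(
    outlier_num = outlier_num,
    outliers = removal_order[seq_len(outlier_num)],
    removal_order = removal_order,
    diffs = diffs
  )
}

# Toy data: 200 standard normal points plus three planted far-away points.
set.seed(1)
clean <- matrix(rnorm(400), ncol = 2)
planted <- matrix(c(8, 8, -9, 7, 10, -8), ncol = 2, byrow = TRUE)
res <- sketch_removal(rbind(clean, planted), max_out = 10)
```

In this toy run the three planted points (rows 201 to 203) are the first to be removed, and res$diffs plays the role of the curve whose minimum selects the number of outliers.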
Usage
ombc_gmm(
  x,
  comp_num,
  max_out,
  gross_outs = rep(FALSE, nrow(x)),
  init_scheme = c("update", "reinit", "reuse"),
  mnames = "VVV",
  nmax = 1000,
  atol = 1e-08,
  init_z = NULL,
  init_model = NULL,
  init_method = c("hc", "kmpp"),
  init_scaling = FALSE,
  kmpp_seed = 123,
  fixed_labels = NULL,
  verbose = TRUE
)
Arguments
x |
Data. |
comp_num |
Number of mixture components. |
max_out |
Maximum number of outliers. |
gross_outs |
Logical vector identifying gross outliers. |
init_scheme |
Which initialisation scheme to use. |
mnames |
Model names for mixture::gpcm. |
nmax |
Maximum number of iterations for mixture::gpcm. |
atol |
EM convergence tolerance threshold for mixture::gpcm. |
init_z |
Initial component assignment probability matrix. |
init_model |
Initial mixture model ( |
init_method |
Method used to initialise each mixture model. |
init_scaling |
Logical value controlling whether the data should be scaled for initialisation. |
kmpp_seed |
Optional seed for k-means++ initialisation. |
fixed_labels |
Cluster labels that are known a priori. See |
verbose |
Whether the iteration count is printed. |
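As an illustration of the shapes these arguments take, the following base R sketch builds a gross_outs vector and a hard-assignment init_z matrix. It rests on assumptions not stated above: that init_z has one row per observation and one column per component with rows summing to 1, and that labels from stats::kmeans are an acceptable starting point. The toy data and variable names are hypothetical.

```r
# Toy data: two well-separated groups of 50 points each.
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)
comp_num <- 2

# gross_outs: one logical per row of x; the default flags nothing.
gross_outs <- rep(FALSE, nrow(x))

# init_z (assumed layout): rows are observations, columns are components,
# and each row sums to 1. Hard labels from kmeans give a 0/1 version.
km <- kmeans(x, centers = comp_num, nstart = 5)
init_z <- matrix(0, nrow = nrow(x), ncol = comp_num)
init_z[cbind(seq_len(nrow(x)), km$cluster)] <- 1
```

These objects would then be passed as ombc_gmm(x, comp_num = comp_num, max_out = ..., gross_outs = gross_outs, init_z = init_z), with max_out chosen by the user.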
Value
ombc_gmm returns an object of class "outliermbc_gmm", which is essentially a list with the following elements:
labels
Vector of mixture component labels with outliers denoted by 0.
outlier_bool
Logical vector indicating if an observation has been classified as an outlier.
outlier_num
Number of observations classified as outliers.
outlier_rank
Order in which observations are removed from the data set. Observations which were provisionally removed, including those that were eventually not classified as outliers, are ranked from 1 to max_out. All gross outliers have rank 1. If there are gross_num gross outliers, then the observations removed during the main algorithm itself will be numbered from gross_num + 1 to max_out. Observations that were never removed have rank 0.
gross_outs
Logical vector identifying the gross outliers. This is identical to the gross_outs vector passed to this function as an argument / input.
mix
Output from mixture::gpcm fitted to the non-outlier observations.
loglike
Vector of log-likelihood values for each iteration.
removal_dens
Vector of mixture densities for the removed observations. These are the lowest mixture densities at each iteration.
distrib_diff_vec
Vector of aggregated cross-component dissimilarity values for each iteration.
distrib_diff_mat
Matrix of component-specific dissimilarity values for each iteration.
call
Arguments / parameter values used in this function call.
version
Version of outlierMBC used in this function call.
conv_status
Logical vector indicating which iterations' mixture models reached convergence during model-fitting.
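To make the relationship between labels, outlier_bool, outlier_num and outlier_rank concrete, here is a hand-built mock of those elements for eight observations with max_out = 3 and no gross outliers. The values are invented for illustration and do not come from a real ombc_gmm call. The final check assumes that the observations classified as outliers are exactly those with ranks 1 to outlier_num.

```r
# Mock result (hypothetical values, no gross outliers, max_out = 3).
mock <- list(
  labels       = c(1, 1, 2, 0, 2, 0, 1, 2),
  outlier_rank = c(0, 3, 0, 1, 0, 2, 0, 0),
  outlier_num  = 2
)
mock$outlier_bool <- mock$labels == 0

# Observation 2 was provisionally removed third (rank 3) but kept its label,
# so it was not classified as an outlier in the end.
provisional_only <- which(mock$outlier_rank > mock$outlier_num)

# Consistency checks implied by the element descriptions above.
same_count <- sum(mock$outlier_bool) == mock$outlier_num
same_set <- identical(
  mock$outlier_bool,
  mock$outlier_rank > 0 & mock$outlier_rank <= mock$outlier_num
)
```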
Examples
ombc_gmm_k3n1000o10 <- ombc_gmm(
  gmm_k3n1000o10[, 1:2],
  comp_num = 3, max_out = 20
)
plot_curve(ombc_gmm_k3n1000o10)