ombc_lcwm {outlierMBC} | R Documentation |
Sequentially identify outliers while fitting a linear cluster-weighted model.
Description
This function performs model-based clustering, clusterwise regression, and outlier identification. It does so by iteratively fitting a linear cluster-weighted model and removing the observation that is least likely under the model. Its procedure is summarised below:
1. Fit a linear cluster-weighted model to the data.
2. Compute a dissimilarity between the theoretical and observed distributions of the scaled squared sample Mahalanobis distances for each mixture component.
3. Compute a dissimilarity between the theoretical and observed distributions of the scaled squared studentised residuals for each mixture component.
4. Aggregate these two dissimilarities to obtain one dissimilarity value for each component.
5. Aggregate across the components to obtain a single dissimilarity value.
6. Remove the observation with the lowest mixture density.
7. Repeat Steps 1-6 until max_out observations have been removed.
8. Identify the number of outliers which minimised the aggregated dissimilarity, remove only those observations, and fit a linear cluster-weighted model to the remaining data.
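The loop described above can be illustrated with a deliberately simplified sketch in base R. It is not the package's implementation: it assumes a single Gaussian component, omits the regression (studentised residual) part, and uses a hypothetical helper, maha_dissimilarity, that measures the mean absolute difference between the empirical CDF of the scaled squared Mahalanobis distances and a Beta reference CDF. The Beta reference is the classical result for one Gaussian sample; the exact scaling and dissimilarity measure used by ombc_lcwm are not stated on this page.

```r
# Illustrative sketch only: one Gaussian component, base R, no regression part.
set.seed(1)
x_good <- matrix(rnorm(2 * 200), ncol = 2)             # 200 regular points
x_bad <- matrix(runif(2 * 10, min = 4, max = 8), ncol = 2)  # 10 planted outliers
x_all <- rbind(x_good, x_bad)

# Hypothetical helper (an assumption, not a package function): dissimilarity
# between the observed scaled squared Mahalanobis distances and their
# Beta(p / 2, (n - p - 1) / 2) reference distribution.
maha_dissimilarity <- function(x) {
  n <- nrow(x)
  p <- ncol(x)
  d2 <- mahalanobis(x, colMeans(x), cov(x))
  scaled <- d2 * n / (n - 1)^2
  theo <- pbeta(sort(scaled), p / 2, (n - p - 1) / 2)
  obs <- seq_len(n) / n
  mean(abs(theo - obs))
}

max_out <- 20
keep <- rep(TRUE, nrow(x_all))
removal_order <- integer(max_out)
dissim <- numeric(max_out + 1)   # dissimilarity with 0, 1, ..., max_out removed

for (i in seq_len(max_out + 1)) {
  x_cur <- x_all[keep, , drop = FALSE]
  dissim[i] <- maha_dissimilarity(x_cur)
  if (i > max_out) break
  # For a single Gaussian, the lowest density is the largest Mahalanobis distance.
  d2_all <- mahalanobis(x_all, colMeans(x_cur), cov(x_cur))
  d2_all[!keep] <- -Inf          # never pick an already removed observation
  worst <- which.max(d2_all)
  keep[worst] <- FALSE
  removal_order[i] <- worst
}

# Keep only as many removals as minimised the dissimilarity.
outlier_num <- which.min(dissim) - 1
outliers <- removal_order[seq_len(outlier_num)]
```

Recording the dissimilarity before each removal gives max_out + 1 candidate solutions, so the final choice can also be "no outliers" when the untrimmed fit already matches the reference distribution best.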
Usage
ombc_lcwm(
  xy,
  x,
  y_formula,
  comp_num,
  max_out,
  gross_outs = rep(FALSE, nrow(x)),
  init_scheme = c("update", "reinit", "reuse"),
  mnames = "VVV",
  nmax = 1000,
  atol = 1e-08,
  init_z = NULL,
  init_method = c("hc", "kmpp"),
  init_scaling = TRUE,
  kmpp_seed = 123,
  verbose = TRUE,
  dd_weight = 0.5
)
Arguments
xy |
Covariate and response data. |
x |
Covariate data only. |
y_formula |
Regression formula. |
comp_num |
Number of mixture components. |
max_out |
Maximum number of outliers. |
gross_outs |
Logical vector identifying gross outliers. |
init_scheme |
Which initialisation scheme to use. |
mnames |
Model names for mixture::gpcm. |
nmax |
Maximum number of iterations for |
atol |
EM convergence threshold for |
init_z |
Initial component assignment probability matrix. |
init_method |
Method used to initialise each mixture model. |
init_scaling |
Logical value controlling whether the data should be scaled for initialisation. |
kmpp_seed |
Optional seed for k-means++ initialisation. |
verbose |
Whether the iteration count is printed. |
dd_weight |
A value between |
Value
ombc_lcwm returns an object of class "outliermbc_lcwm", which is essentially a list with the following elements:
labels
Vector of mixture component labels with outliers denoted by 0.
outlier_bool
Logical vector indicating if an observation has been classified as an outlier.
outlier_num
Number of observations classified as outliers.
outlier_rank
Order in which observations are removed from the data set. Observations which were provisionally removed, including those that were eventually not classified as outliers, are ranked from 1 to max_out. All gross outliers have rank 1. If there are gross_num gross outliers, then the observations removed during the main algorithm itself will be numbered from gross_num + 1 to max_out. Observations that were never removed have rank 0.
gross_outs
Logical vector identifying the gross outliers. This is identical to the gross_outs vector passed to this function as an argument / input.
lcwm
Output from flexCWM::cwm fitted to the non-outlier observations.
loglike
Vector of log-likelihood values for each iteration.
removal_dens
Vector of mixture densities for the removed observations. These are the lowest mixture densities at each iteration.
distrib_diff_vec
Vector of aggregated cross-component dissimilarity values for each iteration.
distrib_diff_mat
Matrix of component-specific dissimilarity values for each iteration.
distrib_diff_arr
Array of component-specific response and covariate dissimilarity values for each iteration.
call
Arguments / parameter values used in this function call.
version
Version of outlierMBC used in this function call.
conv_status
Logical vector indicating which iterations' mixture models reached convergence during model-fitting.
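The rank encoding described for outlier_rank can be checked with a small worked example. The values below are mock data for a 10-observation data set, not real ombc_lcwm output, and the rule used to derive outlier_bool from the ranks is an assumption made for illustration.

```r
# Mock values only: 10 observations, 2 gross outliers, max_out = 5.
max_out <- 5
gross_outs <- c(FALSE, FALSE, TRUE, FALSE, FALSE,
                FALSE, TRUE, FALSE, FALSE, FALSE)
gross_num <- sum(gross_outs)

# Suppose the main algorithm then removed observations 9, 1 and 4, in that order.
main_removed <- c(9, 1, 4)

outlier_rank <- integer(length(gross_outs))   # never removed: rank 0
outlier_rank[gross_outs] <- 1                 # all gross outliers share rank 1
outlier_rank[main_removed] <- gross_num + seq_along(main_removed)  # ranks 3, 4, 5

# Suppose the dissimilarity was minimised with 3 outliers. Assumed rule:
# an observation is an outlier if its rank lies between 1 and outlier_num.
outlier_num <- 3
outlier_bool <- outlier_rank > 0 & outlier_rank <= outlier_num
```

Here observations 3 and 7 (the gross outliers) and observation 9 are flagged, while observations 1 and 4 were provisionally removed but are not classified as outliers.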
Examples
# Flag gross outliers first so they can be excluded from the start.
gross_lcwm_k3n1000o10 <- find_gross(lcwm_k3n1000o10, 20)

# Fit the model, provisionally removing up to 20 observations.
ombc_lcwm_k3n1000o10 <- ombc_lcwm(
  xy = lcwm_k3n1000o10[, c("X1", "Y")],
  x = lcwm_k3n1000o10$X1,
  y_formula = Y ~ X1,
  comp_num = 3,
  max_out = 20,
  mnames = "V",
  gross_outs = gross_lcwm_k3n1000o10$gross_bool
)