oclust {oclust}    R Documentation
The OCLUST Algorithm
Description
oclust is a trimming method for model-based clustering. It iterates over possible values for the number of outliers and returns the parameters of the best model, as determined by the minimum KL divergence. If kuiper = TRUE, oclust calculates an approximate p-value using the Kuiper test and stops the algorithm once the p-value exceeds the specified threshold.
Usage
oclust(
X,
maxO,
G,
grossOuts = NULL,
modelNames = "VVV",
mc.cores = 1,
nmax = 1000,
kuiper = FALSE,
pval = 0.05,
B = 100,
verb = FALSE,
scale = TRUE
)
Arguments
X: A matrix or data frame with n rows of observations and p columns.
maxO: An upper bound for the number of outliers.
G: The number of clusters.
grossOuts: The indices of the initial outliers to remove. Default is NULL.
modelNames: The model to fit using the gpcm function in the mixture package. Default is "VVV" (unconstrained). If modelNames = NULL, all models are fitted for each subset at each iteration, and the best model for each subset is chosen by BIC.
mc.cores: The number of cores to use if running in parallel. Default is 1.
nmax: The maximum number of iterations for each EM algorithm. Decreasing nmax may speed up the algorithm but may reduce the precision of the estimated log-likelihoods.
kuiper: A logical specifying whether to use the Kuiper test (Kuiper, 1960) to stop the algorithm when the p-value exceeds the specified threshold. Default is FALSE.
pval: The p-value threshold for the Kuiper test. Default is 0.05.
B: The number of samples used to calculate the approximate p-value. Default is 100.
verb: A logical specifying whether to print the current iteration number. Default is FALSE.
scale: A logical specifying whether to centre and scale the data. Default is TRUE.
Details
Gross outlier indices can be found with the findGrossOuts function.
References
Kuiper, N. H. (1960). Tests concerning random points on a circle. Nederl. Akad. Wetensch. Proc. Ser. A, 63, 38–47.
Value
oclust returns a list of class oclust with the following components:
data: A list containing the raw and scaled data.
numO: The predicted number of outliers.
outliers: The most likely outliers in the optimal solution, in order of likelihood.
class: The classification for the optimal solution.
model: The model selected for the optimal solution.
G: The number of clusters.
pi.g: The group proportions for the optimal solution.
mu: The cluster means for the optimal solution.
sigma: The cluster variances for the optimal solution.
KL: The KL divergence for each iteration, with the first value being for the initial dataset with the gross outliers removed.
allCand: All outlier candidates, in order of likelihood.
Examples
# simulate 4D dataset
library(mvtnorm)
set.seed(123)
data <- rbind(rmvnorm(250, rep(-3, 4), diag(4)),
              rmvnorm(250, rep(3, 4), diag(4)))
# add outliers
noisy <- simOuts(data = data, alpha = 0.02, seed = 123)
# Find gross outliers
findGrossOuts(X = noisy, minPts = 10)
# Elbow between 5 and 10. Specify limits of graph
findGrossOuts(X = noisy, minPts = 10, xlim = c(5, 10))
# Elbow at 9
gross <- findGrossOuts(X = noisy, minPts = 10, elbow = 9)
# run algorithm
if (interactive()) {
# This example takes a few minutes to run
result <- oclust(X = noisy, maxO = 15, G = 2, grossOuts = gross,
modelNames = "EEE", mc.cores = 1, nmax = 50,
kuiper = FALSE, verb = TRUE, scale = TRUE)
}
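# Inspect the fitted object (a minimal sketch, assuming the run above
# completed; component names follow the Value section)
if (interactive()) {
  result$numO              # predicted number of outliers
  result$outliers          # most likely outliers, in order of likelihood
  table(result$class)      # classification for the optimal solution
  # KL divergence for each iteration; the first value corresponds to the
  # initial dataset with the gross outliers removed
  plot(result$KL, type = "b", xlab = "Iteration", ylab = "KL divergence")
}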
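# Optional: stop the algorithm early with the Kuiper test. An illustrative
# sketch only; the pval and B values are arbitrary choices, not
# recommendations (see the Arguments section)
if (interactive()) {
  resultK <- oclust(X = noisy, maxO = 15, G = 2, grossOuts = gross,
                    modelNames = "EEE", mc.cores = 1, nmax = 50,
                    kuiper = TRUE, pval = 0.05, B = 100, verb = TRUE,
                    scale = TRUE)
}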