GIBcont {IBclust}    R Documentation

Cluster Continuous Data Using the Generalised Information Bottleneck Algorithm

Description

The GIBcont function implements the Generalised Information Bottleneck (GIB) algorithm for fuzzy clustering of continuous data. This method optimizes an information-theoretic objective to preserve relevant information while forming concise and interpretable cluster representations (Strouse and Schwab 2017).

Usage

GIBcont(X, ncl, beta, alpha, randinit = NULL, s = -1, scale = TRUE,
        maxiter = 100, nstart = 100,
        verbose = FALSE)

Arguments

X

A numeric matrix or data frame containing the continuous data to be clustered. All variables should be of type numeric.

ncl

An integer specifying the number of clusters to form.

beta

Regularisation strength.

alpha

Strength of relative entropy term.

randinit

Optional. A vector specifying initial cluster assignments. If NULL, cluster assignments are initialized randomly.

s

A numeric value or vector specifying the bandwidth parameter(s) for continuous variables. User-supplied values must be greater than 0. The default value, -1, is a special setting that enables automatic selection of the bandwidth(s).

scale

A logical value indicating whether the continuous variables should be scaled to have unit variance before clustering. Defaults to TRUE.

maxiter

The maximum number of iterations allowed for the clustering algorithm. Defaults to 100.

nstart

The number of random initializations to run. The best clustering result (based on the information-theoretic criterion) is returned. Defaults to 100.

verbose

Logical. If TRUE, progress messages are printed during fitting. Defaults to FALSE, which suppresses them.

Details

The GIBcont function applies the Generalised Information Bottleneck algorithm to perform fuzzy clustering of datasets comprising only continuous variables. The method optimizes an information-theoretic objective that trades off data compression against the preservation of relevant information about the underlying data distribution. Setting \alpha = 1 recovers the Information Bottleneck, and setting \alpha = 0 recovers its Deterministic variant. If \alpha = 0, the algorithm ignores the value of the regularisation parameter \beta.
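
For orientation, the objective of the Generalised Information Bottleneck can be written as below, following the notation of Strouse and Schwab (2017): X denotes the observations, Y the relevance variable, and T the cluster assignment. This is the published form of the objective and is given as a reading aid; it is not a transcription of the package's internal code.

```latex
\mathcal{L}_{\alpha} = H(T) - \alpha \, H(T \mid X) - \beta \, I(T;Y)
```

With \alpha = 1, the identity H(T) - H(T \mid X) = I(X;T) gives the Information Bottleneck objective I(X;T) - \beta I(T;Y). With \alpha = 0, the conditional entropy term vanishes and the objective reduces to H(T) - \beta I(T;Y), the Deterministic Information Bottleneck.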

The function utilizes the Gaussian kernel (Silverman 1998) for estimating probability densities of continuous features. The kernel is defined as:

K_c\left(\frac{x - x'}{s}\right) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{\left(x - x'\right)^2}{2s^2}\right\}, \quad s > 0.

The bandwidth parameter s, which controls the smoothness of the density estimate, is determined automatically by the algorithm when the user leaves s at its default value of -1.
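
The kernel formula above can be sketched in base R as follows. This is an illustrative sketch only, not the package's internal code: the helper names gauss_kernel and kde_1d are hypothetical, and the standard kernel density normalisation 1 / (n s) is an assumption about how the kernel would be used.

```r
# Gaussian kernel as defined above:
# K((x - x') / s) = exp(-(x - x')^2 / (2 s^2)) / sqrt(2 * pi)
# (hypothetical helper, not part of IBclust)
gauss_kernel <- function(x, x_prime, s) {
  stopifnot(all(s > 0))
  exp(-(x - x_prime)^2 / (2 * s^2)) / sqrt(2 * pi)
}

# Standard kernel density estimate of one feature at the points in `grid`:
# f(g) = (1 / (n * s)) * sum_i K((g - obs_i) / s)
# (hypothetical helper, not part of IBclust)
kde_1d <- function(grid, obs, s) {
  sapply(grid, function(g) mean(gauss_kernel(g, obs, s)) / s)
}

set.seed(1)
obs  <- rnorm(50)                      # simulated feature values
grid <- seq(-3, 3, length.out = 7)     # evaluation points

# A smaller bandwidth follows the data more closely; a larger one smooths more
est_narrow <- kde_1d(grid, obs, s = 0.2)
est_wide   <- kde_1d(grid, obs, s = 1.0)
```

Comparing est_narrow and est_wide on the same grid shows how s controls the smoothness of the estimate.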

Value

A list containing the following elements:

Cluster

A cluster membership matrix.

Entropy

A numeric value representing the entropy of the cluster assignment, H(T).

RelEntropy

A numeric value representing the relative entropy of the cluster assignment given the observation weights, H(T \mid X).

MutualInfo

A numeric value representing the mutual information, I(Y;T), between the original labels (Y) and the cluster assignments (T).

beta

A numeric value of the regularisation strength beta used.

alpha

A numeric value of the strength of the relative entropy term, alpha, used.

s

A numeric vector of bandwidth parameters used for the continuous variables.

ht

A numeric vector tracking the entropy value of the cluster assignments across iterations.

hy_t

A numeric vector tracking the relative entropy values between the cluster assignments and observation weights across iterations.

iyt

A numeric vector tracking the mutual information values between original labels and cluster assignments across iterations.

losses

A numeric vector tracking the final loss values across iterations.
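
As a reading aid for the Cluster and Entropy elements, the sketch below works with a small hand-made membership matrix rather than real GIBcont output. Several things here are assumptions: that rows are observations and columns are clusters (check dim(result$Cluster) on real output), that observations are weighted uniformly, and that entropy is measured with the natural logarithm. The object names memb, hard, p_t and h_t are hypothetical.

```r
# Hypothetical fuzzy membership matrix: 4 observations (rows), 2 clusters (columns).
# Each row sums to 1. The orientation is an assumption, not confirmed package behaviour.
memb <- matrix(c(0.9, 0.1,
                 0.2, 0.8,
                 0.6, 0.4,
                 0.5, 0.5), ncol = 2, byrow = TRUE)

# Hard labels: the most probable cluster for each observation.
# which.max returns the first maximum, so exact ties go to the lower cluster index.
hard <- apply(memb, 1, which.max)

# Marginal cluster probabilities p(t), assuming uniform observation weights
p_t <- colMeans(memb)

# Entropy of the cluster assignment H(T), in nats (natural log is an assumption);
# clusters with p(t) = 0 are dropped because they contribute nothing
nz  <- p_t > 0
h_t <- -sum(p_t[nz] * log(p_t[nz]))
```

An entropy close to log(ncl) indicates clusters of roughly equal total weight, while a value near 0 indicates that one cluster absorbs almost all observations.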

Author(s)

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

References

Strouse DJ, Schwab DJ (2017). “The Deterministic Information Bottleneck.” Neural Computation, 29(6), 1611–1630.

Silverman BW (1998). Density Estimation for Statistics and Data Analysis (1st Ed.). Routledge.

See Also

GIBmix, GIBcat

Examples

# Generate simulated continuous data
set.seed(123)
X <- matrix(rnorm(200 * 5), ncol = 5)  # 200 observations, 5 features

# Run GIBcont with automatic bandwidth selection and multiple initializations
result <- GIBcont(X = X, ncl = 2, beta = 50, alpha = 0.75, s = -1, nstart = 20)

# Print clustering results
print(result$Cluster)       # Cluster membership matrix
print(result$Entropy)       # Entropy of final clustering
print(result$RelEntropy)    # Relative entropy of final clustering
print(result$MutualInfo)    # Mutual information between Y and T

[Package IBclust version 1.2 Index]