DIBcat {IBclust}		R Documentation

Cluster Categorical Data Using the Deterministic Information Bottleneck Algorithm

Description

The DIBcat function implements the Deterministic Information Bottleneck (DIB) algorithm for clustering datasets containing categorical variables. This method balances information retention and data compression to create meaningful clusters, leveraging bandwidth parameters to handle different categorical data types (nominal and ordinal) effectively (Costa et al. 2025).

Usage

DIBcat(X, ncl, randinit = NULL, lambda = -1,
       maxiter = 100, nstart = 100,
       verbose = FALSE)

Arguments

X

A data frame containing the categorical data to be clustered. All variables should be categorical, either factor (for nominal variables) or ordered (for ordinal variables).

ncl

An integer specifying the number of clusters to form.

randinit

Optional. A vector specifying initial cluster assignments. If NULL, cluster assignments are initialized randomly.

lambda

A numeric value or vector specifying the bandwidth parameter for categorical variables. The default value is -1, which enables automatic determination of the optimal bandwidth. For nominal variables, the maximum allowable value of lambda is (l - 1)/l, where l represents the number of categories. For ordinal variables, the maximum allowable value of lambda is 1.

maxiter

The maximum number of iterations for the clustering algorithm. Defaults to 100.

nstart

The number of random initializations to run. The best clustering result (based on the information-theoretic criterion) is returned. Defaults to 100.

verbose

Logical. Defaults to FALSE, which suppresses progress messages. Set to TRUE to print progress messages during the iterative procedure.

Details

The DIBcat function applies the Deterministic Information Bottleneck algorithm to cluster datasets containing only categorical variables, both nominal and ordinal. The algorithm optimizes an information-theoretic objective to balance the trade-off between data compression and the retention of information about the original distribution.
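In its generic form (a sketch following the standard deterministic IB formulation; the exact parameterization used internally by the package may differ), the criterion minimized over hard cluster assignments T can be written as

L_{\text{DIB}} = H(T) - \beta I(Y; T),

where H(T) is the entropy of the cluster assignments (compression), I(Y;T) is the mutual information retained about the data distribution, and \beta > 0 controls the trade-off between the two. These quantities are reported in the Entropy, MutualInfo and beta components of the returned object.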

To estimate the distributions of categorical features, the function utilizes specialized kernel functions, as follows:

K_u(x, x'; \lambda) = \begin{cases} 1 - \lambda, & \text{if } x = x' \\ \frac{\lambda}{\ell - 1}, & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq \frac{\ell - 1}{\ell},

where \ell is the number of categories, and \lambda controls the smoothness of the Aitchison & Aitken kernel for nominal variables (Aitchison and Aitken 1976).
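For illustration, a minimal R sketch of this nominal kernel (a hypothetical helper for exposition only, not a function exported by IBclust):

aa_kernel <- function(x, xprime, lambda, ell) {
  # 1 - lambda when the categories match, lambda / (ell - 1) otherwise
  ifelse(x == xprime, 1 - lambda, lambda / (ell - 1))
}
aa_kernel("a", c("a", "b", "c"), lambda = 0.5, ell = 3)  # 0.50 0.25 0.25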

K_o(x, x'; \nu) = \begin{cases} 1, & \text{if } x = x' \\ \nu^{|x - x'|}, & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1,

where \nu is the bandwidth parameter for ordinal variables, accounting for the ordinal relationship between categories (Li and Racine 2003).
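Analogously, a minimal sketch of the ordinal kernel (again a hypothetical helper, not part of the package; ordered levels are assumed to be integer-coded, e.g. via as.integer()):

lr_kernel <- function(x, xprime, nu) {
  # 1 when the levels match, nu^|x - x'| otherwise, so closer levels receive more weight
  ifelse(x == xprime, 1, nu^abs(x - xprime))
}
lr_kernel(1, c(1, 2, 3), nu = 0.5)  # 1.00 0.50 0.25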

Here, \lambda and \nu are bandwidth (smoothing) parameters, while \ell is the number of levels of the categorical variable. If lambda is not provided by the user, it is determined automatically by the algorithm. For ordinal variables, the value supplied through the lambda argument is used as \nu.

Value

A list containing the following elements:

Cluster

An integer vector indicating the cluster assignment for each data point at convergence.

Entropy

A numeric value representing the entropy of the cluster assignments at the end of the iterative procedure.

MutualInfo

A numeric value representing the mutual information, I(Y;T), between the data distribution and the cluster assignments.

lambda

A numeric vector of bandwidth parameters for categorical variables, controlling how categories are weighted in the clustering.

beta

A numeric vector of the final beta values used during the iterative optimization.

ents

A numeric vector tracking the entropy values across iterations, providing insights into the convergence pattern.

mis

A numeric vector tracking the mutual information values across iterations.

Author(s)

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

References

Costa E, Papatsouma I, Markos A (2025). “A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data.” doi:10.48550/arXiv.2407.03389, arXiv:2407.03389, https://arxiv.org/abs/2407.03389.

Aitchison J, Aitken CG (1976). “Multivariate binary discrimination by the kernel method.” Biometrika, 63(3), 413–420.

Li Q, Racine J (2003). “Nonparametric estimation of distributions with categorical and continuous data.” Journal of Multivariate Analysis, 86(2), 266–292.

See Also

DIBmix, DIBcont

Examples

# Simulated categorical data
set.seed(123)
X <- data.frame(
  Var1 = as.factor(sample(letters[1:3], 200, replace = TRUE)),  # Nominal variable
  Var2 = as.factor(sample(letters[4:6], 200, replace = TRUE)),  # Nominal variable
  Var3 = factor(sample(c("low", "medium", "high"), 200, replace = TRUE),
                levels = c("low", "medium", "high"), ordered = TRUE)  # Ordinal variable
)

# Run DIBcat with automatic lambda selection and multiple initializations
result <- DIBcat(X = X, ncl = 3, lambda = -1, nstart = 50)

# Print clustering results
print(result$Cluster)       # Cluster assignments
print(result$Entropy)       # Final entropy
print(result$MutualInfo)    # Mutual information
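
# Further illustrative sketch (not from the package examples): user-specified
# bandwidths and a fixed initial partition. The lambda vector is assumed to hold
# one bandwidth per variable; values must respect the bounds in the Arguments
# section ((l - 1)/l for nominal variables, here 2/3, and 1 for ordinal ones).
# randinit is assumed to take one integer label in 1:ncl per row of X.
lambdas <- c(0.5, 0.5, 0.8)
init <- sample(1:3, nrow(X), replace = TRUE)
result2 <- DIBcat(X = X, ncl = 3, lambda = lambdas, randinit = init)

# Inspect the convergence pattern of the returned run via the iteration traces
plot(result$ents, type = "b", xlab = "Iteration", ylab = "Entropy H(T)")
plot(result$mis, type = "b", xlab = "Iteration", ylab = "Mutual information I(Y;T)")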
