IBcat {IBclust} | R Documentation |
Cluster Categorical Data Using the Information Bottleneck Algorithm
Description
The IBcat
function implements the Information Bottleneck (IB) algorithm
for fuzzy clustering of datasets containing categorical variables. This method balances information retention
and data compression to create meaningful clusters, leveraging bandwidth parameters to handle
different categorical data types (nominal and ordinal) effectively (Strouse and Schwab 2019).
Usage
IBcat(X, ncl, beta, randinit = NULL, lambda = -1,
maxiter = 100, nstart = 100, verbose = FALSE)
Arguments
X |
A data frame containing the categorical data to be clustered. All variables should be categorical,
either |
beta |
Regularisation strength. |
ncl |
An integer specifying the number of clusters to form. |
randinit |
Optional. A vector specifying initial cluster assignments. If |
lambda |
A numeric value or vector specifying the bandwidth parameter for categorical variables. The default value is |
maxiter |
The maximum number of iterations for the clustering algorithm. Defaults to |
nstart |
The number of random initializations to run. The best clustering result (based on the information-theoretic criterion) is returned. Defaults to |
verbose |
Logical. Default to |
Details
The IBcat
function applies the Information Bottleneck algorithm to do fuzzy clustering of datasets containing only categorical variables, both nominal and ordinal. The algorithm optimizes an information-theoretic objective to balance the trade-off between data compression and the retention of information about the original distribution.
To estimate the distributions of categorical features, the function utilizes specialized kernel functions, as follows:
K_u(x = x'; \lambda) = \begin{cases}
1 - \lambda, & \text{if } x = x' \\
\frac{\lambda}{\ell - 1}, & \text{otherwise}
\end{cases}, \quad 0 \leq \lambda \leq \frac{\ell - 1}{\ell},
where \ell
is the number of categories, and \lambda
controls the smoothness of the Aitchison & Aitken kernel for nominal variables (Aitchison and Aitken 1976).
K_o(x = x'; \nu) = \begin{cases}
1, & \text{if } x = x' \\
\nu^{|x - x'|}, & \text{otherwise}
\end{cases}, \quad 0 \leq \nu \leq 1,
where \nu
is the bandwidth parameter for ordinal variables, accounting for the ordinal relationship between categories (Li and Racine 2003).
Here, \lambda
, and \nu
are bandwidth or smoothing parameters, while \ell
is the number of levels of the categorical variable. The lambda parameter is automatically determined by the algorithm if not provided by the user. For ordinal variables, the lambda parameter of the function is used to define \nu
.
Value
A list containing the following elements:
Cluster |
A cluster membership matrix. |
InfoXT |
A numeric value representing the mutual information, |
InfoYT |
A numeric value representing the mutual information, |
beta |
A numeric value of the regularisation strength beta used. |
lambda |
A numeric vector of bandwidth parameters for categorical variables, controlling how categories are weighted in the clustering. |
ixt |
A numeric vector tracking the mutual information values between original observation weights and cluster assignments across iterations. |
iyt |
A numeric vector tracking the mutual information values between original labels and cluster assignments across iterations. |
losses |
A numeric vector tracking the final loss values across iterations. |
Author(s)
Efthymios Costa, Ioanna Papatsouma, Angelos Markos
References
Strouse DJ, Schwab DJ (2019). “The information bottleneck and geometric clustering.” Neural Computation, 31(3), 596–612.
Aitchison J, Aitken CG (1976). “Multivariate binary discrimination by the kernel method.” Biometrika, 63(3), 413–420.
Li Q, Racine J (2003). “Nonparametric estimation of distributions with categorical and continuous data.” Journal of Multivariate Analysis, 86(2), 266–292.
See Also
Examples
# Simulated categorical data
set.seed(123)
X <- data.frame(
Var1 = as.factor(sample(letters[1:3], 200, replace = TRUE)), # Nominal variable
Var2 = as.factor(sample(letters[4:6], 200, replace = TRUE)), # Nominal variable
Var3 = factor(sample(c("low", "medium", "high"), 200, replace = TRUE),
levels = c("low", "medium", "high"), ordered = TRUE) # Ordinal variable
)
# Run IBcat with automatic lambda selection and multiple initializations
result <- IBcat(X = X, ncl = 3, beta = 15, lambda = -1, nstart = 5)
# Print clustering results
print(result$Cluster) # Cluster membership matrix
print(result$InfoXT) # Mutual information between X and T
print(result$InfoYT) # Mutual information between Y and T