AIBcat {IBclust} | R Documentation |
Cluster Categorical Data Using the Agglomerative Information Bottleneck Algorithm
Description
The AIBcat
function implements the Agglomerative Information Bottleneck (AIB) algorithm
for hierarchical clustering of datasets containing categorical variables. This method merges clusters
so that information retention is maximised at each step to create meaningful clusters,
leveraging bandwidth parameters to handle
different categorical data types (nominal and ordinal) effectively (Slonim and Tishby 1999).
Usage
AIBcat(X, lambda = -1)
Arguments
X |
A data frame containing the categorical data to be clustered. All variables should be categorical,
either |
lambda |
A numeric value or vector specifying the bandwidth parameter for categorical variables. The default value is |
Details
The AIBcat
function applies the Agglomerative Information Bottleneck algorithm to do hierarchical clustering of datasets containing only categorical variables, both nominal and ordinal. The algorithm uses an information-theoretic criterion to merge clusters so that information retention is maximised at each step to create meaningful clusters with maximal information about the original distribution.
To estimate the distributions of categorical features, the function utilizes specialized kernel functions, as follows:
K_u(x = x'; \lambda) = \begin{cases}
1 - \lambda, & \text{if } x = x' \\
\frac{\lambda}{\ell - 1}, & \text{otherwise}
\end{cases}, \quad 0 \leq \lambda \leq \frac{\ell - 1}{\ell},
where \ell
is the number of categories, and \lambda
controls the smoothness of the Aitchison & Aitken kernel for nominal variables (Aitchison and Aitken 1976).
K_o(x = x'; \nu) = \begin{cases}
1, & \text{if } x = x' \\
\nu^{|x - x'|}, & \text{otherwise}
\end{cases}, \quad 0 \leq \nu \leq 1,
where \nu
is the bandwidth parameter for ordinal variables, accounting for the ordinal relationship between categories (Li and Racine 2003).
Here, \lambda
, and \nu
are bandwidth or smoothing parameters, while \ell
is the number of levels of the categorical variable. The lambda parameter is automatically determined by the algorithm if not provided by the user. For ordinal variables, the lambda parameter of the function is used to define \nu
.
Value
A list containing the following elements:
merges |
A data frame with 2 columns and |
merge_costs |
A numeric vector tracking the cost incurred by each merge |
partitions |
A list containing |
I_Z_Y |
A numeric vector including the mutual information |
I_X_Y |
A numeric value of the mutual information |
info_ret |
A numeric vector of length |
dendrogram |
A dendrogram visualising the cluster hierarchy. The height is determined by the cost of cluster merges. |
Author(s)
Efthymios Costa, Ioanna Papatsouma, Angelos Markos
References
Slonim N, Tishby N (1999). “Agglomerative Information Bottleneck.” Advances in Neural Information Processing Systems, 12.
Aitchison J, Aitken CG (1976). “Multivariate binary discrimination by the kernel method.” Biometrika, 63(3), 413–420.
Li Q, Racine J (2003). “Nonparametric estimation of distributions with categorical and continuous data.” Journal of Multivariate Analysis, 86(2), 266–292.
See Also
Examples
# Simulated categorical data
set.seed(123)
X <- data.frame(
Var1 = as.factor(sample(letters[1:3], 200, replace = TRUE)), # Nominal variable
Var2 = as.factor(sample(letters[4:6], 200, replace = TRUE)), # Nominal variable
Var3 = factor(sample(c("low", "medium", "high"), 200, replace = TRUE),
levels = c("low", "medium", "high"), ordered = TRUE) # Ordinal variable
)
# Run AIBcat with automatic lambda selection
result <- AIBcat(X = X, lambda = -1)
# Print clustering results
plot(result$dendrogram, xlab = "", sub = "") # Plot dendrogram