AIBmix {IBclust} | R Documentation |
Agglomerative Information Bottleneck Clustering for Mixed-Type Data
Description
The AIBmix function implements the Agglomerative Information Bottleneck (AIB) algorithm (Slonim and Tishby 1999)
for hierarchical clustering of datasets containing mixed-type variables, including categorical (nominal and ordinal)
and continuous variables. At each step the method merges clusters so that information retention is maximised,
and it uses bandwidth parameters to handle the different categorical data types (nominal and ordinal).
Usage
AIBmix(X, catcols, contcols, lambda = -1, s = -1, scale = TRUE)
Arguments
X |
A data frame containing the mixed-type (categorical and continuous) data to be clustered. Categorical variables should be
either |
catcols |
A vector indicating the indices of the categorical variables in |
contcols |
A vector indicating the indices of the continuous variables in |
lambda |
A numeric value or vector specifying the bandwidth parameter for categorical variables. The default value is |
s |
A numeric value or vector specifying the bandwidth parameter(s) for continuous variables. The values must be greater than |
scale |
A logical value indicating whether the continuous variables should be scaled to have unit variance before clustering. Defaults to |
Details
The AIBmix function produces a hierarchical agglomerative clustering of the data while retaining maximal information about the original variable
distributions. At each step, the Agglomerative Information Bottleneck algorithm uses an information-theoretic criterion to choose the merge that maximises information retention.
Bandwidth parameters for categorical (nominal, ordinal) and continuous variables are determined adaptively if not provided.
This process identifies stable and interpretable cluster assignments by maximising mutual information while controlling complexity.
The method is well suited to datasets with mixed-type variables and integrates information from all variable types.
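The merge criterion can be illustrated with the cost defined by Slonim and Tishby (1999): the information lost when two clusters are merged equals their combined probability times the weighted Jensen-Shannon divergence between their conditional distributions. The following is a minimal base R sketch of that formula, not the package's internal code; the function names `kl_divergence` and `merge_cost` are hypothetical, and the use of natural logarithms is an assumption.

```r
# Kullback-Leibler divergence between two discrete distributions (natural log).
# Terms with p == 0 contribute zero by convention.
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Information lost about Y when clusters i and j are merged (hypothetical helper).
# p_zi, p_zj     : cluster probabilities p(z_i) and p(z_j)
# p_y_zi, p_y_zj : conditional distributions p(y | z_i) and p(y | z_j)
merge_cost <- function(p_zi, p_zj, p_y_zi, p_y_zj) {
  p_merged <- p_zi + p_zj
  w_i <- p_zi / p_merged
  w_j <- p_zj / p_merged
  # Conditional distribution of the merged cluster
  p_y_merged <- w_i * p_y_zi + w_j * p_y_zj
  # Weighted Jensen-Shannon divergence between the two clusters
  js <- w_i * kl_divergence(p_y_zi, p_y_merged) +
    w_j * kl_divergence(p_y_zj, p_y_merged)
  p_merged * js
}

# Identical clusters cost nothing to merge; disjoint ones cost the most.
merge_cost(0.5, 0.5, c(0.5, 0.5), c(0.5, 0.5))
merge_cost(0.5, 0.5, c(1, 0), c(0, 1))
```

A greedy agglomerative pass would evaluate this cost for every pair of current clusters and merge the pair with the smallest value.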
The following kernel functions are used to estimate densities for the clustering procedure:
- Continuous variables: Gaussian kernel
K_c\left(\frac{x-x'}{s}\right) = \frac{1}{\sqrt{2\pi}} \exp\left\{ - \frac{\left(x-x'\right)^2}{2s^2} \right\}, \quad s > 0.
- Nominal categorical variables: Aitchison & Aitken kernel, where \ell is the number of levels of the variable
K_u\left(x = x' ; \lambda\right) = \begin{cases} 1-\lambda & \text{if } x = x' \\ \frac{\lambda}{\ell-1} & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq \frac{\ell-1}{\ell}.
- Ordinal categorical variables: Li & Racine kernel
K_o\left(x = x' ; \nu\right) = \begin{cases} 1 & \text{if } x = x' \\ \nu^{|x - x'|} & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1.
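The three kernels above translate directly into base R. The sketch below is illustrative only and is not taken from the package source; the function names and the `n_levels` argument (standing for \ell) are hypothetical, and ordinal levels are assumed to be compared by their integer rank.

```r
# Gaussian kernel for continuous variables, evaluated at u = (x - x') / s.
gaussian_kernel <- function(x, x_prime, s) {
  stopifnot(s > 0)
  exp(-(x - x_prime)^2 / (2 * s^2)) / sqrt(2 * pi)
}

# Aitchison & Aitken kernel for a nominal variable with n_levels categories.
nominal_kernel <- function(x, x_prime, lambda, n_levels) {
  stopifnot(lambda >= 0, lambda <= (n_levels - 1) / n_levels)
  same <- as.character(x) == as.character(x_prime)
  ifelse(same, 1 - lambda, lambda / (n_levels - 1))
}

# Li & Racine kernel for an ordinal variable.
# Assumption: |x - x'| is the difference between integer level ranks.
ordinal_kernel <- function(x, x_prime, nu) {
  stopifnot(nu >= 0, nu <= 1)
  d <- abs(as.integer(x) - as.integer(x_prime))
  ifelse(d == 0, 1, nu^d)
}

lev <- c("low", "medium", "high")
lo <- factor("low", levels = lev, ordered = TRUE)
hi <- factor("high", levels = lev, ordered = TRUE)

gaussian_kernel(0, 0, s = 1)                       # peak value 1 / sqrt(2 * pi)
nominal_kernel("a", "b", lambda = 0.3, n_levels = 3)
ordinal_kernel(lo, hi, nu = 0.5)                   # two ranks apart: nu^2
```

With the Aitchison & Aitken kernel the weights over all \ell levels sum to one, since (1 - \lambda) + (\ell - 1) \cdot \lambda / (\ell - 1) = 1; at \lambda = 0 the kernel reduces to an indicator of equality.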
Value
A list containing the following elements:
merges |
A data frame with 2 columns and |
merge_costs |
A numeric vector tracking the cost incurred by each merge |
partitions |
A list containing |
I_Z_Y |
A numeric vector including the mutual information |
I_X_Y |
A numeric value of the mutual information |
info_ret |
A numeric vector of length |
dendrogram |
A dendrogram visualising the cluster hierarchy. The height is determined by the cost of cluster merges. |
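Several of the returned elements are mutual information values. As a reminder of what such a quantity measures, the sketch below computes mutual information from a joint probability table in base R. It is a generic illustration rather than the package's computation; the function name `mutual_information`, the toy table, and the use of natural logarithms are assumptions.

```r
# Mutual information (in nats) of a joint probability table p_xy,
# with rows indexing one variable and columns the other.
mutual_information <- function(p_xy) {
  p_xy <- p_xy / sum(p_xy)            # normalise in case counts were supplied
  p_x <- rowSums(p_xy)
  p_y <- colSums(p_xy)
  expected <- outer(p_x, p_y)         # joint distribution under independence
  keep <- p_xy > 0                    # zero cells contribute nothing
  sum(p_xy[keep] * log(p_xy[keep] / expected[keep]))
}

# Toy joint table with three row groups and two outcomes (illustrative values).
p_xy <- matrix(c(0.30, 0.05,
                 0.05, 0.30,
                 0.15, 0.15), nrow = 3, byrow = TRUE)

# Merging the first two rows into one cluster can only lose information.
p_merged <- rbind(p_xy[1, ] + p_xy[2, ], p_xy[3, ])

mi_before <- mutual_information(p_xy)
mi_after <- mutual_information(p_merged)
```

In this toy table the merged rows have the same conditional distribution as the third row, so all information about the outcome is lost after the merge, which is the kind of loss an agglomerative scheme tries to postpone.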
Author(s)
Efthymios Costa, Ioanna Papatsouma, Angelos Markos
References
Slonim N, Tishby N (1999). “Agglomerative Information Bottleneck.” Advances in Neural Information Processing Systems, 12.
Aitchison J, Aitken CG (1976). “Multivariate binary discrimination by the kernel method.” Biometrika, 63(3), 413–420.
Li Q, Racine J (2003). “Nonparametric estimation of distributions with categorical and continuous data.” Journal of Multivariate Analysis, 86(2), 266–292.
Silverman BW (1998). Density Estimation for Statistics and Data Analysis (1st Ed.). Routledge.
See Also
Examples
# Example dataset with categorical, ordinal, and continuous variables
set.seed(123)
data <- data.frame(
  cat_var = factor(sample(letters[1:3], 100, replace = TRUE)), # Nominal categorical variable
  ord_var = factor(sample(c("low", "medium", "high"), 100, replace = TRUE),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE), # Ordinal variable
  cont_var1 = rnorm(100), # Continuous variable 1
  cont_var2 = runif(100) # Continuous variable 2
)
# Perform Mixed-Type Hierarchical Clustering with Agglomerative IB
result <- AIBmix(X = data, catcols = 1:2, contcols = 3:4, lambda = -1, s = -1, scale = TRUE)
# Plot the dendrogram of the cluster hierarchy
plot(result$dendrogram, xlab = "", sub = "")
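A flat partition can usually be read off a dendrogram by cutting it into a chosen number of clusters. The sketch below shows this with base R on a stand-in hierarchy so that it runs without the package; it assumes the object behaves like a standard `dendrogram`, which the documentation above does not confirm for `result$dendrogram`, and the toy data and the choice `k = 3` are purely illustrative. The `partitions` element of the returned list may already provide such assignments.

```r
# Stand-in hierarchy built with base R, used only so the sketch runs without IBclust.
set.seed(1)
toy <- matrix(rnorm(40), nrow = 20)
dend <- as.dendrogram(hclust(dist(toy)))

# Convert the dendrogram back to an hclust object and cut it into k clusters.
k <- 3
labels_k <- cutree(as.hclust(dend), k = k)

# Cluster sizes for the chosen cut
table(labels_k)
```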