AIBcont {IBclust} | R Documentation |
Cluster Continuous Data Using the Agglomerative Information Bottleneck Algorithm
Description
The AIBcont function implements the Agglomerative Information Bottleneck (AIB) algorithm
for hierarchical clustering of datasets containing continuous variables. The method merges clusters
so that information retention is maximised at each step, using bandwidth parameters to control
the kernel density estimates of the continuous variables (Slonim and Tishby 1999).
Usage
AIBcont(X, s = -1, scale = TRUE)
Arguments
X |
A data frame containing the continuous data to be clustered. All variables should be numeric. |
s |
A numeric value or vector specifying the bandwidth parameter(s) for continuous variables. The values must be greater than 0. The default, -1, requests automatic bandwidth selection. |
scale |
A logical value indicating whether the continuous variables should be scaled to have unit variance before clustering. Defaults to TRUE. |
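The effect of scaling to unit variance can be illustrated with base R. This is a minimal sketch, not the package's internal code: it assumes the standard `scale()` function, which also centres each column, and whether AIBcont centres the data is an assumption here. The data `X_demo` is made up for illustration.

```r
# Illustrative only: rescale each column to unit variance with base R.
# X_demo is hypothetical data with very different column spreads.
set.seed(1)
X_demo <- cbind(a = rnorm(100, sd = 5), b = rnorm(100, sd = 0.1))

# scale() centres each column and divides it by its standard deviation.
X_scaled <- scale(X_demo, center = TRUE, scale = TRUE)

# After scaling, every column has variance 1, so a single bandwidth
# acts on all variables on a comparable footing.
col_vars <- apply(X_scaled, 2, var)
print(col_vars)
```

Without scaling, a variable with a large spread would dominate a kernel that uses one common bandwidth.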
Details
The AIBcont function applies the Agglomerative Information Bottleneck algorithm to perform hierarchical clustering of datasets containing only continuous variables. The algorithm uses an information-theoretic criterion to choose which clusters to merge, so that at each step the resulting partition retains as much information about the original distribution as possible.
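The merge criterion of Slonim and Tishby (1999) can be sketched on a toy discrete example. The information lost by merging clusters i and j is (p_i + p_j) times the weighted Jensen-Shannon divergence between their conditional distributions. The functions `kl_div` and `merge_cost` below are hypothetical helpers written for illustration; they are not part of IBclust and do not reproduce how the package computes its costs for continuous data.

```r
# Kullback-Leibler divergence between two discrete distributions (in nats).
kl_div <- function(p, q) {
  keep <- p > 0                      # terms with p = 0 contribute nothing
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Information lost by merging clusters i and j:
# (p_i + p_j) * weighted Jensen-Shannon divergence of p(y | i) and p(y | j).
merge_cost <- function(p_i, p_j, py_i, py_j) {
  w_i <- p_i / (p_i + p_j)
  w_j <- p_j / (p_i + p_j)
  py_merged <- w_i * py_i + w_j * py_j   # conditional of the merged cluster
  js <- w_i * kl_div(py_i, py_merged) + w_j * kl_div(py_j, py_merged)
  (p_i + p_j) * js
}

# Toy comparison: identical conditionals versus very different ones.
cost_same <- merge_cost(0.3, 0.2, c(0.5, 0.5), c(0.5, 0.5))
cost_diff <- merge_cost(0.3, 0.2, c(0.9, 0.1), c(0.1, 0.9))
print(c(same = cost_same, different = cost_diff))
```

An agglomerative procedure of this kind evaluates the cost for every candidate pair and merges the pair with the smallest value, which is why clusters with similar conditional distributions are joined first.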
The function utilizes the Gaussian kernel (Silverman 1998) for estimating probability densities of continuous features. The kernel is defined as:
K_c\left(\frac{x - x'}{s}\right) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{\left(x - x'\right)^2}{2s^2}\right\}, \quad s > 0.
The bandwidth parameter s, which controls the smoothness of the density estimate, is automatically determined by the algorithm if not provided by the user.
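To show how the bandwidth affects a Gaussian kernel estimate, here is a minimal base R sketch. The helpers `gauss_kernel` and `kde_at` are hypothetical names introduced for illustration, and the division by s in `kde_at` is the usual kernel density normalisation; how AIBcont uses the kernel internally is not specified here.

```r
# Gaussian kernel evaluated at (x - x_prime) / s, matching the formula above.
gauss_kernel <- function(x, x_prime, s) {
  stopifnot(all(s > 0))
  exp(-(x - x_prime)^2 / (2 * s^2)) / sqrt(2 * pi)
}

# Kernel density estimate at a single point x0, from sample x with bandwidth s.
kde_at <- function(x0, x, s) {
  mean(gauss_kernel(x0, x, s)) / s
}

set.seed(123)
x_demo <- rnorm(200)

# Density estimates at 0 for a small, a moderate and a large bandwidth.
bandwidths <- c(0.1, 0.5, 2)
estimates <- sapply(bandwidths, function(s) kde_at(0, x_demo, s))
print(data.frame(s = bandwidths, density_at_0 = estimates))
```

Small values of s follow the sample closely and give a rough estimate, while large values smooth the estimate toward a flat curve.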
Value
A list containing the following elements:
merges |
A data frame with 2 columns and |
merge_costs |
A numeric vector tracking the cost incurred by each merge |
partitions |
A list containing |
I_Z_Y |
A numeric vector including the mutual information |
I_X_Y |
A numeric value of the mutual information |
info_ret |
A numeric vector of length |
dendrogram |
A dendrogram visualising the cluster hierarchy. The height is determined by the cost of cluster merges. |
Author(s)
Efthymios Costa, Ioanna Papatsouma, Angelos Markos
References
Slonim N, Tishby N (1999). “Agglomerative Information Bottleneck.” Advances in Neural Information Processing Systems, 12.
Silverman BW (1998). Density Estimation for Statistics and Data Analysis (1st Ed.). Routledge.
See Also
Examples
# Generate simulated continuous data
set.seed(123)
X <- matrix(rnorm(1000), ncol = 5) # 200 observations, 5 features
# Run AIBcont with automatic bandwidth selection
result <- AIBcont(X = X, s = -1, scale = TRUE)
# Plot the dendrogram of the cluster hierarchy
plot(result$dendrogram, xlab = "", sub = "")