IBmix {IBclust} | R Documentation |
Information Bottleneck Clustering for Mixed-Type Data
Description
The IBmix function implements the Information Bottleneck (IB) algorithm
for clustering datasets containing mixed-type variables, including categorical (nominal and ordinal)
and continuous variables. This method optimizes an information-theoretic objective to preserve
relevant information in the cluster assignments while achieving effective data compression
(Strouse and Schwab 2019).
Usage
IBmix(X, ncl, beta, catcols, contcols, randinit = NULL,
lambda = -1, s = -1, scale = TRUE,
maxiter = 100, nstart = 100,
verbose = FALSE)
Arguments
X |
A data frame containing the input data to be clustered. It should include categorical variables
( |
ncl |
An integer specifying the number of clusters. |
beta |
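A numeric value specifying the regularisation strength, which controls the trade-off between compression and information preservation. |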
catcols |
A vector indicating the indices of the categorical variables in X. |
contcols |
A vector indicating the indices of the continuous variables in X. |
randinit |
An optional vector specifying the initial cluster assignments. If |
lambda |
A numeric value or vector specifying the bandwidth parameter for categorical variables. The default value is -1, which leaves the bandwidth to be determined automatically (see Details). |
s |
A numeric value or vector specifying the bandwidth parameter(s) for continuous variables. The values must be greater than 0. |
scale |
A logical value indicating whether the continuous variables should be scaled to have unit variance before clustering. Defaults to TRUE. |
maxiter |
The maximum number of iterations allowed for the clustering algorithm. Defaults to 100. |
nstart |
The number of random initializations to run. The best clustering solution is returned. Defaults to 100. |
verbose |
Logical. Defaults to FALSE. |
Details
The IBmix function produces a fuzzy clustering of the data while retaining maximal information about the original variable
distributions. The Information Bottleneck algorithm optimizes an information-theoretic
objective that balances information preservation and compression. Bandwidth parameters for categorical
(nominal, ordinal) and continuous variables are adaptively determined if not provided. This iterative
process identifies stable and interpretable cluster assignments by maximizing mutual information while
controlling complexity. The method is well-suited for datasets with mixed-type variables and integrates
information from all variable types effectively.
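As a rough illustration of the quantities involved, the sketch below computes the mutual information of a joint probability table in base R and combines two such values into the Information Bottleneck Lagrangian as it is commonly written, I(X;T) - beta * I(Y;T). The helper names mutual_information and ib_objective are hypothetical and not part of IBclust, and whether the loss tracked by IBmix takes exactly this form is an assumption.

```r
# Mutual information (in nats) of a joint probability table p_xy.
# Hypothetical helper for illustration; not part of IBclust.
mutual_information <- function(p_xy) {
  p_xy <- p_xy / sum(p_xy)          # normalise defensively
  p_x <- rowSums(p_xy)              # marginal of the row variable
  p_y <- colSums(p_xy)              # marginal of the column variable
  indep <- outer(p_x, p_y)          # product of the marginals
  nz <- p_xy > 0                    # treat 0 * log(0) as 0
  sum(p_xy[nz] * log(p_xy[nz] / indep[nz]))
}

# Commonly written IB Lagrangian: compression term minus beta times
# the relevance term (assumed form, see the text above).
ib_objective <- function(i_xt, i_yt, beta) {
  i_xt - beta * i_yt
}

# Two perfectly dependent binary variables share log(2) nats.
mutual_information(diag(2) / 2)
```

Under this form, a larger beta places more weight on the information retained about Y relative to the compression of X.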
The following kernel functions are used to estimate densities for the clustering procedure:
- Continuous variables: Gaussian kernel
K_c\left(\frac{x-x'}{s}\right) = \frac{1}{\sqrt{2\pi}} \exp\left\{ - \frac{\left(x-x'\right)^2}{2s^2} \right\}, \quad s > 0.
- Nominal categorical variables: Aitchison & Aitken kernel
K_u\left(x = x' ; \lambda\right) = \begin{cases} 1-\lambda & \text{if } x = x' \\ \frac{\lambda}{\ell-1} & \text{otherwise} \end{cases}, \quad 0 \leq \lambda \leq \frac{\ell-1}{\ell}.
- Ordinal categorical variables: Li & Racine kernel
K_o\left(x = x' ; \nu\right) = \begin{cases} 1 & \text{if } x = x' \\ \nu^{|x - x'|} & \text{otherwise} \end{cases}, \quad 0 \leq \nu \leq 1.
Here, s, \lambda, and \nu are bandwidth or smoothing parameters, while \ell is the number of levels of the categorical variable. s and \lambda are automatically determined by the algorithm if not provided by the user. For ordinal variables, the lambda parameter of the function is used to define \nu.
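The three kernels can be written directly in base R. The following is a minimal sketch that transcribes the formulas above term by term; the function names (kernel_gaussian, kernel_nominal, kernel_ordinal) and the n_levels argument are hypothetical and are not the internal implementation used by IBmix. Ordinal values are assumed to be coded as integer ranks.

```r
# Gaussian kernel for continuous values, as written above.
kernel_gaussian <- function(x, x_prime, s) {
  stopifnot(s > 0)
  exp(-(x - x_prime)^2 / (2 * s^2)) / sqrt(2 * pi)
}

# Aitchison & Aitken kernel for a nominal variable with n_levels categories.
kernel_nominal <- function(x, x_prime, lambda, n_levels) {
  stopifnot(lambda >= 0, lambda <= (n_levels - 1) / n_levels)
  ifelse(x == x_prime, 1 - lambda, lambda / (n_levels - 1))
}

# Li & Racine kernel for an ordinal variable coded as integer ranks.
# nu^0 equals 1, so the x == x' case needs no separate branch.
kernel_ordinal <- function(x, x_prime, nu) {
  stopifnot(nu >= 0, nu <= 1)
  nu^abs(x - x_prime)
}

# Kernel weights of level "a" against every level of a 3-level factor.
kernel_nominal("a", c("a", "b", "c"), lambda = 0.3, n_levels = 3)

# Kernel weights of rank 1 against ranks 1 to 3.
kernel_ordinal(1, 1:3, nu = 0.5)
```

For an ordered factor such as ord_var in the example, as.integer() returns the ranks that kernel_ordinal expects. The nominal kernel weights sum to 1 across the levels, since 1 - \lambda + (\ell - 1) \cdot \lambda / (\ell - 1) = 1.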
Value
A list containing the following elements:
Cluster |
A cluster membership matrix. |
InfoXT |
A numeric value representing the mutual information, I(X;T), between the original observation weights (X) and the cluster assignments (T). |
InfoYT |
A numeric value representing the mutual information, I(Y;T), between the original labels (Y) and the cluster assignments (T). |
beta |
A numeric value of the regularisation strength beta used. |
s |
A numeric vector of bandwidth parameters used for the continuous variables. |
lambda |
A numeric vector of bandwidth parameters used for the categorical variables. |
ixt |
A numeric vector tracking the mutual information values between original observation weights and cluster assignments across iterations. |
iyt |
A numeric vector tracking the mutual information values between original labels and cluster assignments across iterations. |
losses |
A numeric vector tracking the final loss values across iterations. |
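Because the clustering is fuzzy, a hard partition is sometimes needed for downstream summaries. The sketch below shows one way to derive it from a membership matrix by assigning each observation to its highest-membership cluster. The helper hard_labels and the toy matrix u are hypothetical, and the orientation of the Cluster element (observations in rows or in columns) is an assumption that should be checked with dim() before use.

```r
# Assign each observation to the cluster with the largest membership.
# obs_margin = 1 if observations are rows, 2 if they are columns.
# Hypothetical helper for illustration; not part of IBclust.
hard_labels <- function(membership, obs_margin = 1) {
  apply(membership, obs_margin, which.max)
}

# Toy membership matrix: 4 observations (rows) by 3 clusters (columns).
u <- rbind(
  c(0.70, 0.20, 0.10),
  c(0.10, 0.80, 0.10),
  c(0.30, 0.30, 0.40),
  c(0.50, 0.25, 0.25)
)

hard_labels(u)                      # observations in rows
hard_labels(t(u), obs_margin = 2)   # same data, observations in columns
```

Ties are resolved in favour of the first cluster, which is the behaviour of which.max.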
Author(s)
Efthymios Costa, Ioanna Papatsouma, Angelos Markos
References
Strouse DJ, Schwab DJ (2019). “The information bottleneck and geometric clustering.” Neural Computation, 31(3), 596–612.
Aitchison J, Aitken CG (1976). “Multivariate binary discrimination by the kernel method.” Biometrika, 63(3), 413–420.
Li Q, Racine J (2003). “Nonparametric estimation of distributions with categorical and continuous data.” Journal of Multivariate Analysis, 86(2), 266–292.
Silverman BW (1998). Density Estimation for Statistics and Data Analysis (1st Ed.). Routledge.
See Also
Examples
# Load the package that provides IBmix
library(IBclust)
# Example dataset with categorical, ordinal, and continuous variables
set.seed(123)
data <- data.frame(
  cat_var = factor(sample(letters[1:3], 100, replace = TRUE)), # Nominal categorical variable
  ord_var = factor(sample(c("low", "medium", "high"), 100, replace = TRUE),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE), # Ordinal variable
  cont_var1 = rnorm(100), # Continuous variable 1
  cont_var2 = runif(100) # Continuous variable 2
)
# Perform Mixed-Type Fuzzy Clustering
result <- IBmix(X = data, ncl = 3, beta = 2, catcols = 1:2, contcols = 3:4, nstart = 20)
# Print clustering results
print(result$Cluster) # Cluster membership matrix
print(result$InfoXT) # Mutual information between X and T
print(result$InfoYT) # Mutual information between Y and T