Icluster {VecDep}    R Documentation

Icluster

Description

This function clusters the columns (variables) of a dataset via agglomerative hierarchical variable clustering, using estimated multivariate similarities (dependence coefficients) between random vectors.

Usage

Icluster(
  data,
  est_method,
  max_dim = Inf,
  norm = NULL,
  link = "average",
  trace = 1
)

Arguments

data

The dataset (n \times q matrix with observations in rows, variables in columns) whose columns need to be clustered.

est_method

A list specifying the method for estimating the similarity between two clusters of variables (see the examples below for the available options).

max_dim

The maximum dimension d_{1} + d_{2} of a pair of clusters up to which their similarity is computed directly on the corresponding random vectors; beyond this dimension, the link function is used instead (default = Inf).

norm

A possible normalization function applied to the dependence measure (default = NULL, meaning no normalization).

link

The link function to be used when max_dim is exceeded (default = "average").

trace

Controls the verbosity of the output (default = 1, printing progress).

Details

Suppose that the q variables (of which we have n observations in data) are \mathcal{S} = \{X_{1}, \dots, X_{q}\}. The key ingredient of hierarchical variable clustering is then the similarity

\mathcal{D}(\mathbb{X},\mathbb{Y})

between two disjoint subsets of variables \mathbb{X}, \mathbb{Y} \subset \mathcal{S}. The main algorithm proceeds agglomeratively: starting from the partition of \mathcal{S} into the q singletons, in each step i = 1,2,\dots,q-1 the two clusters with maximal similarity \mathcal{D} are merged, until a single cluster containing all variables remains.

We call \{\aleph_{1}, \dots, \aleph_{q}\} the hierarchy constructed throughout the algorithm, and define, for i \in \{1, \dots, q\}, \text{Adiam}(\aleph_{i}) = |\aleph_{i}|^{-1} \sum_{\mathbb{X} \in \aleph_{i}} \text{diam}(\mathbb{X}), with

\text{diam}(\mathbb{X}) = \begin{cases} \underset{\{X,Y\} \subset \mathbb{X} }{\min} \mathcal{D}(X,Y) & \mbox{if } |\mathbb{X}| > 1 \\ 1 & \mbox{if } |\mathbb{X}| = 1, \end{cases}

and \text{Msplit}(\aleph_{i}) = \max_{\mathbb{X} \in \aleph_{i}} \text{split}(\mathbb{X}), with

\text{split}(\mathbb{X}) = \underset{Y \in \aleph_{i} \setminus \mathbb{X}}{\underset{X \in \mathbb{X}}{\max}} \mathcal{D}(X,Y) \hspace{0.2cm} \text{for} \hspace{0.2cm} \{\mathbb{X}\} \neq \aleph_{i}.

Adiam stands for the average diameter of a partition (measuring the homogeneity, which we want to be large), while Msplit stands for the maximum split of a partition (measuring the non-separation, which we want to be small).
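As a concrete illustration of these definitions, Adiam and Msplit of a given partition can be computed from a matrix of pairwise similarities between single variables. The helper names and the toy similarity matrix below are hypothetical, not part of the package.

```r
# Sketch (not package code): Adiam and Msplit of a partition, given a
# symmetric matrix D of pairwise similarities D(X, Y) between variables.

diam_one <- function(cluster, D) {
  if (length(cluster) == 1) return(1)        # singleton: diameter 1
  pairs <- t(combn(cluster, 2))
  min(D[pairs])                              # least similar pair inside
}

split_one <- function(k, partition, D) {
  inside <- partition[[k]]
  outside <- unlist(partition[-k])
  max(D[inside, outside, drop = FALSE])      # most similar outside pair
}

adiam <- function(partition, D) mean(sapply(partition, diam_one, D = D))

msplit <- function(partition, D)
  max(sapply(seq_along(partition), split_one, partition = partition, D = D))

# Toy similarity matrix for 4 variables and the partition {{1,2},{3,4}}
D <- matrix(c(1.0, 0.9, 0.2, 0.1,
              0.9, 1.0, 0.3, 0.2,
              0.2, 0.3, 1.0, 0.8,
              0.1, 0.2, 0.8, 1.0), nrow = 4)
partition <- list(c(1, 2), c(3, 4))
adiam(partition, D)   # 0.85: high within-cluster homogeneity
msplit(partition, D)  # 0.3: low between-cluster similarity
```

For this toy partition, the large Adiam and small Msplit indicate a good clustering, which is exactly the trade-off the hierarchy diagnostics "diam" and "split" in the output track.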

For measuring the similarity \mathcal{D}(\mathbb{X},\mathbb{Y}), we regard \mathbb{X} and \mathbb{Y} as two random vectors (say of dimensions d_{1} and d_{2}, respectively). For \mathcal{D}, we take an estimated dependence measure between (two) random vectors, as specified via est_method; see the examples below for the available options.

When d_{1} + d_{2} > max_dim, the specified link function (say L) is used for computing the similarity between \mathbb{X} and \mathbb{Y}, i.e.,

\mathcal{D} \left ( \mathbb{X}, \mathbb{Y} \right ) = L \left (\left \{\mathcal{D}(X,Y) : X \in \mathbb{X}, Y \in \mathbb{Y} \right \} \right ),

which by default is the average of all inter-pairwise similarities. Other options are "single" for the minimum, and "complete" for the maximum.
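The three link choices thus reduce to the mean, minimum, and maximum over the set of all inter-pairwise similarities. The sketch below mimics this aggregation on a made-up matrix of pairwise similarities; the function link_sim is hypothetical, not a package function.

```r
# Sketch: how the link function L aggregates the pairwise similarities
# {D(X, Y) : X in XX, Y in YY} once d1 + d2 exceeds max_dim.
link_sim <- function(S, link = c("average", "single", "complete")) {
  switch(match.arg(link),
         average  = mean(S),   # default: mean of all inter-pair similarities
         single   = min(S),    # minimum
         complete = max(S))    # maximum
}

# Made-up pairwise similarities between a 3-variable and a 2-variable cluster
S <- matrix(c(0.6, 0.4,
              0.5, 0.3,
              0.7, 0.2), nrow = 3, byrow = TRUE)

link_sim(S, "average")   # 0.45
link_sim(S, "single")    # 0.2
link_sim(S, "complete")  # 0.7
```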

The function norm (say N) is a possible normalization applied to the similarity measure, i.e., instead of computing \mathcal{D} (using the method specified by est_method), the similarity becomes N \circ \mathcal{D}. The default is N(t) = t, meaning that no normalization is applied.
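As an illustration of such a composition, the hypothetical normalization below maps an unbounded measure into [0, 1); it is merely an example of a function that could be supplied via norm, not a recommendation from the package.

```r
# Sketch: applying a normalization N to a raw dependence value, so that
# the similarity used by the algorithm becomes N(D(XX, YY)).
# N below (t -> t / (1 + t)) is purely illustrative; the default NULL
# corresponds to the identity N(t) = t.
N <- function(t) t / (1 + t)   # maps [0, Inf) into [0, 1)

D_raw <- 1.5                   # made-up unnormalized similarity value
N(D_raw)                       # 0.6
```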

Value

A list with the following elements:

"hierarchy": the hierarchy constructed throughout the algorithm (a hash object),

"all": all similarities computed throughout the algorithm (a hash object),

"diam": the average diameters of all partitions created throughout the algorithm (a vector),

"split": the maximum splits of all partitions created throughout the algorithm (a vector).

References

De Keyser, S. & Gijbels, I. (2024). Hierarchical variable clustering via copula-based divergence measures between random vectors. International Journal of Approximate Reasoning 165:109090. doi: https://doi.org/10.1016/j.ijar.2023.109090.

De Keyser, S. & Gijbels, I. (2024). Parametric dependence between random vectors via copula-based divergence measures. Journal of Multivariate Analysis 203:105336. doi: https://doi.org/10.1016/j.jmva.2024.105336.

De Keyser, S. & Gijbels, I. (2024). High-dimensional copula-based Wasserstein dependence. doi: https://doi.org/10.48550/arXiv.2404.07141.

See Also

minormal for the computation of the Gaussian copula mutual information, Helnormal for the computation of the Gaussian copula Hellinger distance, estphi for several approaches to estimating the \Phi-dependence between k random vectors, bwd1 for the computation of the first Bures-Wasserstein dependence coefficient \mathcal{D}_{1}, bwd2 for the computation of the second Bures-Wasserstein dependence coefficient \mathcal{D}_{2}.

Examples


q = 20

# We will impose a clustering
# {{X1,X2},{X3,X4,X5},{X6,X7,X8},{X9,X10,X11,X12,X13},{X14,X15,X16,X17,X18,X19,X20}}
dim = c(2,3,3,5,7)

# Sample size
n = 200

# Twenty dimensional hierarchical Gumbel copula with parameters
# (theta_0,theta_1,theta_2,theta_3,theta_4,theta_5) = (2,3,4,5,6,7)
hac = gethac(dim,c(2,3,4,5,6,7),type = 1)

# So, strongest cluster is {X14,X15,X16,X17,X18,X19,X20}, then {X9,X10,X11,X12,X13},
# then {X6,X7,X8}, then {X3,X4,X5}, and finally {X1,X2}

# Sample
sample =  suppressWarnings(HAC::rHAC(n,hac))

# Cluster using different methods

# Gaussian copula based methods

Clustering1 =  Icluster(data = sample,
                        est_method = list("phi", "mi", "Gauss", omegas = 1))

# 5-cluster partition
Clustering1$hierarchy$Aleph_16

Clustering2 =  Icluster(data = sample,
                        est_method = list("phi", "mi", "Gauss",
                                          omegas = seq(0.01,0.999,len = 50)))

# 5-cluster partition
Clustering2$hierarchy$Aleph_16

Clustering3 =  Icluster(data = sample,
                        est_method = list("phi", "mi", "Gauss", omegas = 1),
                        max_dim = 2)

# 5-cluster partition
Clustering3$hierarchy$Aleph_16

Clustering4 =  Icluster(data = sample,
                        est_method = list("phi", "Hel", "Gauss", omegas = 1))

# 5-cluster partition
Clustering4$hierarchy$Aleph_16

Clustering5 =  Icluster(data = sample,
                        est_method = list("ot", coef = 1, omegas = 1))

# 5-cluster partition
Clustering5$hierarchy$Aleph_16

Clustering6 =  Icluster(data = sample,
                        est_method = list("ot", coef = 2, omegas = 1))

# 5-cluster partition
Clustering6$hierarchy$Aleph_16

Clustering7 =  Icluster(data = sample,
                        est_method = list("ot", coef = 2, omegas = 1), max_dim = 4)

# 5-cluster partition
Clustering7$hierarchy$Aleph_16

# Parametric hierarchical Archimedean copula approach

Clustering8 = Icluster(data = sample,
                       est_method = list("phi", function(t){t * log(t)}, "hac",
                                         type = 1, M = 1000), max_dim = 4)

# 5-cluster partition
Clustering8$hierarchy$Aleph_16

Clustering9 = Icluster(data = sample,
                       est_method = list("phi", function(t){(sqrt(t)-1)^2}, "hac",
                                         type = 1, M = 1000), max_dim = 2)

# 5-cluster partition
Clustering9$hierarchy$Aleph_16

# Non-parametric approaches

Clustering10 = Icluster(data = sample,
                        est_method = list("phi", function(t){t * log(t)}, "nphac",
                                       estimator = "beta", type = 1), max_dim = 3)

# 5-cluster partition
Clustering10$hierarchy$Aleph_16

Clustering11 = Icluster(data = sample,
                        est_method = list("phi", function(t){t * log(t)}, "nphac",
                                     estimator = "trans", type = 1), max_dim = 3)

# 5-cluster partition
Clustering11$hierarchy$Aleph_16

Clustering12 = Icluster(data = sample,
                        est_method = list("phi", function(t){t * log(t)}, "np",
                                     estimator = "beta", bw_method = 1), max_dim = 3)

# 5-cluster partition
Clustering12$hierarchy$Aleph_16

Clustering13 = Icluster(data = sample,
                        est_method = list("phi", function(t){t * log(t)}, "np",
                                     estimator = "trans", bw_method = 2), max_dim = 3)

# 5-cluster partition
Clustering13$hierarchy$Aleph_16

Clustering14 = Icluster(data = sample,
                        est_method = list("phi", function(t){(sqrt(t)-1)^2}, "np",
                                      estimator = "trans", bw_method = 1), max_dim = 2)

# 5-cluster partition
Clustering14$hierarchy$Aleph_16

# Semi-parametric meta-elliptical copula approach
# Uncomment to run (takes a long time)

# Clustering15 = Icluster(data = sample,
#                        est_method = list("phi", function(t){t * log(t)}, "ellip",
#                                     grid = seq(0.005,100,by = 0.005)), max_dim = 2)

# 5-cluster partition
# Clustering15$hierarchy$Aleph_16




[Package VecDep version 0.1.3 Index]