compare_bioregionalizations {bioregion} | R Documentation
Compare cluster memberships among multiple bioregionalizations
Description
This function computes pairwise comparisons for several
bioregionalizations, usually outputs from the netclu_, hclu_, or nhclu_
functions. It also provides the confusion matrix from pairwise comparisons,
enabling the user to compute additional comparison metrics.
Usage
compare_bioregionalizations(
bioregionalizations,
indices = c("rand", "jaccard"),
cor_frequency = FALSE,
store_pairwise_membership = TRUE,
store_confusion_matrix = TRUE
)
Arguments
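bioregionalizations | A data.frame in which each row is an item (for example, a site) and each column is a bioregionalization.
indices | A character vector naming the indices to compute for each pair of bioregionalizations. The implemented indices are "rand" and "jaccard".
cor_frequency | A boolean. If TRUE, computes the correlation between the pairwise membership of each bioregionalization and the total frequency of pairwise membership across all bioregionalizations.
store_pairwise_membership | A boolean. If TRUE, stores the pairwise membership of items in the output.
store_confusion_matrix | A boolean. If TRUE, stores the confusion matrices of the pairwise comparisons in the output.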
Details
This function operates in two main steps:
1. Within each bioregionalization, the function compares all pairs of items and documents whether they are clustered together (TRUE) or separately (FALSE). For example, if site 1 and site 2 are clustered in the same cluster in bioregionalization 1, their pairwise membership site1_site2 will be TRUE. This output is stored in the pairwise_membership slot if store_pairwise_membership = TRUE.
2. Across all bioregionalizations, the function compares their pairwise memberships to determine similarity. For each pair of bioregionalizations, it computes a confusion matrix with the following elements:
- a: Number of item pairs grouped in both bioregionalizations.
- b: Number of item pairs grouped in the first but not in the second bioregionalization.
- c: Number of item pairs grouped in the second but not in the first bioregionalization.
- d: Number of item pairs not grouped in either bioregionalization.
The confusion matrix is stored in confusion_matrix if store_confusion_matrix = TRUE.
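The two steps above can be sketched in base R on a toy input. This is only an illustration of the logic under stated assumptions, not the package's internal code: the data.frame `clusters`, its columns `part1` and `part2`, and the helper `confusion()` are hypothetical names introduced here.

```r
# Toy input (hypothetical): 5 sites assigned to clusters by two bioregionalizations
clusters <- data.frame(
  part1 = c(1, 1, 2, 2, 3),
  part2 = c(1, 1, 1, 2, 2),
  row.names = paste0("Site", 1:5)
)

# Step 1: for every pair of items, record whether both fall in the same cluster
item_pairs <- combn(nrow(clusters), 2)  # 2 x 10 matrix of row indices
pairwise_membership <- sapply(clusters, function(cl) {
  cl[item_pairs[1, ]] == cl[item_pairs[2, ]]
})
rownames(pairwise_membership) <- paste(
  rownames(clusters)[item_pairs[1, ]],
  rownames(clusters)[item_pairs[2, ]],
  sep = "_"
)

# Step 2: count the four possible outcomes for a pair of bioregionalizations
# (confusion() is a hypothetical helper, not a bioregion function)
confusion <- function(x, y) {
  c(a = sum(x & y),    # grouped in both
    b = sum(x & !y),   # grouped in the first only
    c = sum(!x & y),   # grouped in the second only
    d = sum(!x & !y))  # grouped in neither
}

cm <- confusion(pairwise_membership[, "part1"],
                pairwise_membership[, "part2"])
cm
# a = 1, b = 1, c = 3, d = 5
```

With 5 items there are 10 item pairs, so the four counts always sum to 10 here.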
Based on these confusion matrices, various indices can be computed to measure agreement among bioregionalizations. The currently implemented indices are:
- Rand index: (a + d) / (a + b + c + d). Measures agreement by considering both grouped and ungrouped item pairs.
- Jaccard index: a / (a + b + c). Measures agreement based only on grouped item pairs.
These indices are complementary: the Jaccard index evaluates clustering similarity, while the Rand index considers both clustering and separation. For example, if two bioregionalizations never group the same pairs, their Jaccard index will be 0, but their Rand index may still be greater than 0, because pairs left ungrouped in both bioregionalizations (d) count as agreement.
Users can compute additional indices manually using the list of confusion matrices.
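As a minimal sketch of such a manual computation, the snippet below derives the two implemented indices from one set of counts and then adds a third. The counts in `cm` are hypothetical, the way a confusion matrix is stored in the function's output may differ from this named vector, and the Fowlkes-Mallows index is used only as an example of an additional index; it is not part of the function's output.

```r
# Hypothetical confusion matrix counts for one pair of bioregionalizations
cm <- c(a = 1, b = 1, c = 3, d = 5)

a <- cm[["a"]]; b <- cm[["b"]]; c_ <- cm[["c"]]; d <- cm[["d"]]

# The two indices described above
rand    <- (a + d) / (a + b + c_ + d)  # 0.6
jaccard <- a / (a + b + c_)            # 0.2

# An additional index computed by hand: Fowlkes-Mallows,
# the geometric mean of a / (a + b) and a / (a + c)
fowlkes_mallows <- a / sqrt((a + b) * (a + c_))

round(c(rand = rand, jaccard = jaccard, fowlkes_mallows = fowlkes_mallows), 3)
```

The variable is named `c_` rather than `c` only to avoid masking the base R function `c()`.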
To identify which bioregionalization is most representative of the others,
the function can compute the correlation between the pairwise membership of
each bioregionalization and the total frequency of pairwise membership across
all bioregionalizations. This is enabled by setting cor_frequency = TRUE.
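The idea can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: the toy data.frame `clusters` is hypothetical, and the default Pearson correlation of `cor()` is an assumption, since the correlation method used by the function is not stated here.

```r
# Toy input (hypothetical): 5 items, three bioregionalizations
clusters <- data.frame(
  part1 = c(1, 1, 2, 2, 3),
  part2 = c(1, 1, 1, 2, 2),
  part3 = c(1, 1, 2, 2, 2)
)

# Pairwise membership: one row per item pair, one column per bioregionalization
item_pairs <- combn(nrow(clusters), 2)
pairwise_membership <- sapply(clusters, function(cl) {
  cl[item_pairs[1, ]] == cl[item_pairs[2, ]]
})

# Total frequency: how many bioregionalizations group each item pair together
freq <- rowSums(pairwise_membership)

# Correlation of each bioregionalization's membership with the total frequency
# (Pearson is an assumption made for this sketch)
freq_cor <- apply(pairwise_membership, 2, function(x) {
  cor(as.numeric(x), freq)
})

# Bioregionalizations sorted from most to least representative
sort(freq_cor, decreasing = TRUE)
```

A bioregionalization with a high correlation groups the item pairs that most of the other bioregionalizations also group.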
Value
A list containing 4 to 7 elements:
- args: A list of user-provided arguments.
- inputs: A list containing information on the input bioregionalizations, such as the number of items clustered.
- pairwise_membership (optional): If store_pairwise_membership = TRUE, a boolean matrix where TRUE indicates two items are in the same cluster, and FALSE indicates they are not.
- freq_item_pw_membership: A numeric vector containing the number of times each item pair is clustered together, corresponding to the sum of rows in pairwise_membership.
- bioregionalization_freq_cor (optional): If cor_frequency = TRUE, a numeric vector of correlations between individual bioregionalizations and the total frequency of pairwise membership.
- confusion_matrix (optional): If store_confusion_matrix = TRUE, a list of confusion matrices for each pair of bioregionalizations.
- bioregionalization_comparison: A data.frame containing comparison results, where the first column indicates the bioregionalizations compared, and the remaining columns contain the requested indices.
Author(s)
Boris Leroy (leroy.boris@gmail.com)
Maxime Lenormand (maxime.lenormand@inrae.fr)
Pierre Denelle (pierre.denelle@gmail.com)
See Also
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a5_2_compare_bioregionalizations.html.
Associated functions: bioregionalization_metrics
Examples
# Here we compare three different bioregionalizations
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
                20, 25)
rownames(comat) <- paste0("Site", 1:20)
colnames(comat) <- paste0("Species", 1:25)
dissim <- dissimilarity(comat, metric = "Simpson")
bioregion1 <- nhclu_kmeans(dissim, n_clust = 3, index = "Simpson")
net <- similarity(comat, metric = "Simpson")
bioregion2 <- netclu_greedy(net)
bioregion3 <- netclu_walktrap(net)
# Make one single data.frame with the bioregionalizations to compare
compare_df <- merge(bioregion1$clusters, bioregion2$clusters, by = "ID")
compare_df <- merge(compare_df, bioregion3$clusters, by = "ID")
colnames(compare_df) <- c("Site", "Hclu", "Greedy", "Walktrap")
rownames(compare_df) <- compare_df$Site
compare_df <- compare_df[, c("Hclu", "Greedy", "Walktrap")]
# Running the function
compare_bioregionalizations(compare_df)
# Find out which bioregionalizations are most representative
compare_bioregionalizations(compare_df,
                            cor_frequency = TRUE)