fast_anticlustering {anticlust} | R Documentation |
The most efficient way to solve anticlustering problems: it optimizes the k-means variance criterion using an exchange method. Can be used for very large data sets.
fast_anticlustering(x, K, k_neighbours = Inf, categories = NULL)
x |
A numeric vector, matrix or data.frame of data points. Rows correspond to elements and columns correspond to features. A vector represents a single numeric feature. |
K |
How many anticlusters should be created. |
k_neighbours |
The number of neighbours that serve as exchange partner for each element. Defaults to Inf, i.e., each element is exchanged with each element in other groups. |
categories |
A vector, data.frame or matrix representing one or several categorical constraints. |
This function was created to make anticlustering applicable
to large data sets (e.g., 100,000 elements). It optimizes the k-means
variance objective because computing all pairwise distances is not
feasible for many elements. Additionally, this function employs a
speed-optimized exchange method. For each element, the potential
exchange partners are generated using a nearest neighbor search with
the function nn2 from the RANN package. The nearest neighbors then
serve as exchange partners. This approach is inspired by the
preclustering heuristic, according to which good solutions are found
when similar elements are in different sets; swapping nearest
neighbors will often produce this outcome. The number of exchange
partners per element is set using the argument k_neighbours; by
default, it is Inf, meaning that all possible swaps are tested. For
large data sets, the user must change this default. More exchange
partners generally improve the output, but also increase run time.
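The two ideas above, the k-means variance criterion and nearest-neighbor exchange partners, can be illustrated with a minimal base R sketch. It is not the package's implementation: it builds a full distance matrix, so it only suits small data, whereas the text above describes the package as using nn2 from RANN for the search. The helper names variance_criterion and nearest_partners are hypothetical and are not part of anticlust.

```r
# Hypothetical helper: k-means variance criterion, i.e., the sum of squared
# Euclidean distances between each element and its group centroid.
variance_criterion <- function(x, groups) {
  x <- as.matrix(x)
  # rowsum() adds up the rows within each group; dividing by the group
  # sizes yields one centroid per group (rows named by group label)
  sizes <- as.vector(table(groups))
  centroids <- rowsum(x, groups) / sizes
  # Look up each element's centroid via its group label
  sum((x - centroids[as.character(groups), , drop = FALSE])^2)
}

# Hypothetical helper: indices of the k nearest other elements per element.
# Brute force via dist(); assumes k_neighbours < nrow(x).
nearest_partners <- function(x, k_neighbours) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf  # an element is not its own exchange partner
  idx <- apply(d, 1, function(row) order(row)[seq_len(k_neighbours)])
  # One row per element, one column per exchange partner
  matrix(idx, ncol = k_neighbours, byrow = TRUE)
}

features <- iris[, -5]
set.seed(1)
# Arbitrary balanced assignment to three groups, for illustration only
groups <- sample(rep(1:3, length.out = nrow(features)))
variance_criterion(features, groups)
partners <- nearest_partners(features, k_neighbours = 5)
dim(partners)
```

An exchange method would then try swapping each element only with those listed partners that currently sit in a different group, keeping a swap when it increases the criterion.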
When setting the categories argument, exchange partners will
be generated from the same category. Note that when categories
has multiple columns (i.e., each element is described by several
categorical variables), each combination of categories is
treated as a distinct category by the exchange method.
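The following base R sketch shows one way to picture how several categorical columns collapse into a single combined category. It is an illustration only: the columns species and wide_sepal are made up for this example, and the package may combine categories differently internally.

```r
# Hypothetical example data: two categorical constraints per element
categories <- data.frame(
  species = iris$Species,
  wide_sepal = iris$Sepal.Width > 3
)

# Each observed combination of the columns becomes one combined category
combined <- interaction(categories, drop = TRUE)
table(combined)

# Candidate exchange partners for element 1: all other elements that
# share its combined category
partners_1 <- setdiff(which(combined == combined[1]), 1)
```

Restricting swaps to partners_1 leaves the number of elements per combined category in each group unchanged, which is why the constraint survives the exchange process.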
Martin Papenberg martin.papenberg@hhu.de
features <- iris[, -5]

start <- Sys.time()
ac_exchange <- fast_anticlustering(features, K = 3)
Sys.time() - start

## The following call is equivalent to the call above:
start <- Sys.time()
ac_exchange <- anticlustering(features, K = 3, objective = "variance")
Sys.time() - start

## Improve run time by using fewer exchange partners:
start <- Sys.time()
ac_fast <- fast_anticlustering(features, K = 3, k_neighbours = 10)
Sys.time() - start

by(features, ac_exchange, function(x) round(colMeans(x), 2))
by(features, ac_fast, function(x) round(colMeans(x), 2))