anticlustering {anticlust} (R Documentation)
Create groups of elements (anticlusters) that are as similar as possible to each other, by maximizing the heterogeneity within groups. Implements anticlustering algorithms as described in Papenberg and Klau (2020; <doi:10.1037/met0000301>).
anticlustering(
  x,
  K,
  objective = "diversity",
  method = "exchange",
  preclustering = FALSE,
  categories = NULL,
  repetitions = NULL
)
x: The data input. Can be one of two structures: (1) A feature matrix where rows correspond to elements and columns correspond to variables (a single numeric variable can be passed as a vector). (2) An N x N dissimilarity matrix; can be an object of class dist.
K: How many anticlusters should be created. Alternatively: (a) a vector describing the size of each group, or (b) a vector of length nrow(x) describing the initial assignment of elements to groups (see Details).
objective: The objective to be maximized. The option "diversity" (default; previously called "distance", which is still supported) maximizes the cluster editing objective function; the option "variance" maximizes the k-means objective function; "kplus" is an extension of k-means anticlustering. See Details.
method: One of "exchange" (default), "local-maximum", or "ilp". See Details.
preclustering: Boolean. Should a preclustering be conducted before anticlusters are created? Defaults to FALSE. See Details.
categories: A vector, data.frame, or matrix representing one or several categorical constraints. See Details.
repetitions: The number of times the exchange procedure is restarted when method = "exchange" or method = "local-maximum"; the best solution found across the repetitions is returned.
This function is used to solve anticlustering: K groups are created in such a way that all groups are as similar as possible (this usually corresponds to creating groups with high within-group heterogeneity). This is accomplished by maximizing instead of minimizing a clustering objective function. The maximization of three clustering objective functions is natively supported (other functions can also be defined by the user as described below):
- the cluster editing "diversity" objective, setting objective = "diversity" (default)
- the k-means "variance" objective, setting objective = "variance"
- the k-plus objective, an extension of the k-means variance criterion, setting objective = "kplus"
The k-means objective is the within-group variance, that is, the sum of the squared distances between each element and its cluster center (see variance_objective). K-means anticlustering focuses on minimizing differences with regard to the means of the input variables x; k-plus anticlustering (objective = "kplus") is an extension of this criterion that also tries to minimize differences with regard to the standard deviations between groups (see kplus_objective).
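The k-means criterion can be written out in a few lines of base R. The following is an illustrative sketch with a made-up helper name (`variance_by_hand`), not the package's variance_objective implementation:

```r
# Sketch of the k-means "variance" objective: the sum of squared
# Euclidean distances between each element and its cluster centroid.
variance_by_hand <- function(data, clusters) {
  data <- as.matrix(data)
  total <- 0
  for (k in unique(clusters)) {
    members <- data[clusters == k, , drop = FALSE]
    centroid <- colMeans(members)
    # Add the squared deviations of all members from their centroid
    total <- total + sum(sweep(members, 2, centroid)^2)
  }
  total
}

# Four points at the corners of a 2 x 2 square, split into two pairs
dat <- matrix(c(0, 0, 2, 0, 0, 2, 2, 2), ncol = 2, byrow = TRUE)
variance_by_hand(dat, c(1, 1, 2, 2))
```

Anticlustering with objective = "variance" searches for an assignment that maximizes this quantity, which tends to equalize the group means.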
The cluster editing "diversity" objective is the sum of pairwise distances within groups (see diversity_objective). Anticluster editing is also known as the "maximum diverse grouping problem" because it maximizes group diversity as measured by the sum of pairwise distances. Hence, anticlustering maximizes between-group similarity by maximizing within-group heterogeneity. In previous versions of this package, objective = "distance" was used (and is still supported) to request anticluster editing, but now objective = "diversity" is preferred because there are several clustering objectives based on pairwise distances (e.g., see dispersion_objective).
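The diversity objective is likewise simple to compute by hand. The following base-R sketch (with the hypothetical helper name `diversity_by_hand`) is for illustration only; see diversity_objective for the package's version:

```r
# Sketch of the "diversity" objective: the sum of pairwise Euclidean
# distances between elements that belong to the same group.
diversity_by_hand <- function(data, clusters) {
  d <- as.matrix(dist(data))  # full symmetric distance matrix
  total <- 0
  for (k in unique(clusters)) {
    idx <- which(clusters == k)
    # Each within-group pair appears twice in the symmetric matrix
    total <- total + sum(d[idx, idx]) / 2
  }
  total
}

# Two groups of two points each: distances 5 and 10 within groups
dat <- matrix(c(0, 0, 3, 4, 0, 0, 6, 8), ncol = 2, byrow = TRUE)
diversity_by_hand(dat, c(1, 1, 2, 2))
```

Maximizing this sum over all admissible assignments is what anticluster editing does.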
If the data input x is a feature matrix (that is: each row is a "case" and each column is a "variable") and the option objective = "diversity" is used, the Euclidean distance is computed as the basic unit of the anticluster editing objective. If a different measure of dissimilarity is preferred, you may pass a self-generated dissimilarity matrix via the argument x.
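For example, a Manhattan dissimilarity matrix can be precomputed with base R and then passed as x. A sketch, assuming the anticlust package is installed for the final (commented-out) call:

```r
# Build a custom dissimilarity matrix to replace the default
# Euclidean distances in the diversity objective
set.seed(42)
features <- matrix(rnorm(20), ncol = 2)  # 10 elements, 2 variables
manhattan <- as.matrix(dist(features, method = "manhattan"))

# A valid dissimilarity input is symmetric with a zero diagonal
isSymmetric(manhattan)
all(diag(manhattan) == 0)

# anticlustering(manhattan, K = 2)  # uses the custom dissimilarities
```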
In the standard case, groups of equal size are generated. Adjust the argument K to create groups of different size (see Examples).
Heuristic anticlustering
By default, a heuristic method is employed for anticlustering: the exchange method (method = "exchange"). Building on an initial assignment of elements to anticlusters, elements are sequentially swapped between anticlusters in such a way that each swap improves set similarity by the largest amount possible. In the default case, elements are randomly assigned to anticlusters before the exchange procedure starts; however, it is also possible to explicitly specify the initial assignment using the argument K (in this case, K has length nrow(x)). The exchange procedure is repeated for each element. Because each possible swap is investigated for each element, the total number of exchanges grows quadratically with input size, rendering the exchange method unsuitable for large N.
When using method = "local-maximum", the exchange method is repeated until a local maximum is reached. That means that after the exchange process has been conducted once for each data point, the algorithm restarts with the first element and continues to conduct exchanges until the objective cannot be improved.
When setting preclustering = TRUE
, only the K - 1
most similar elements serve as exchange partners, which can
dramatically speed up the optimization (more information on the
preclustering option is included below). This option is recommended
for larger N. For very large N, check out the function
fast_anticlustering
that was specifically implemented
to process very large data sets.
Exact anticlustering
An optimal anticluster editing objective can be found via integer linear programming (the integer linear program implemented here can be found in Papenberg & Klau, 2020, (8) - (12)). To this end, set method = "ilp". To obtain an optimal solution, the open source GNU linear programming kit (available from https://www.gnu.org/software/glpk/glpk.html) and the R package Rglpk must be installed. The optimal solution is retrieved by setting objective = "diversity", method = "ilp", and preclustering = FALSE. Use this combination of arguments only for small problem sizes.
To relax the optimality requirement, it is possible to set the
argument preclustering = TRUE
. In this case, the anticluster
editing objective is still optimized using integer linear
programming, but a preprocessing step prevents very similar elements
from being assigned to the same anticluster. The preclustering reduces the
size of the solution space, making the integer linear programming
approach applicable for larger problem instances. With
preclustering, optimality is no longer guaranteed, but the solution
is usually optimal or very close to optimal.
The variance criterion cannot be optimized to optimality using
integer linear programming because the k-means objective function
is not linear. However, it is possible to employ the function
generate_partitions
to obtain optimal solutions for
small problem instances.
Preclustering
A useful heuristic for anticlustering is to form small groups of very similar elements and assign these to different groups. This logic is used as a preprocessing when setting preclustering = TRUE. That is, before the anticlustering objective is optimized, a cluster analysis identifies small groups of similar elements (pairs if K = 2, triplets if K = 3, and so forth). The optimization of the anticlustering objective is then conducted under the constraint that these matched elements cannot be assigned to the same group. When using the exchange algorithm, preclustering is conducted using a call to matching. When using method = "ilp", the preclustering optimally finds groups of minimum pairwise distance by solving the integer linear program described in Papenberg and Klau (2020; (8) - (10), (12) - (13)).
Categorical constraints
The argument categories may induce categorical constraints. The grouping variables indicated by categories will be balanced out across anticlusters. Currently, this functionality is only available in combination with the heuristic methods, but not with the exact integer linear programming approach.
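To illustrate what "balanced out" means, the following sketch distributes each category's members round-robin across the K groups. The helper name `balance_categories` is made up for this example and does not reflect how the package implements the constraint:

```r
# Naive illustration of categorical balance: within each category,
# elements are spread as evenly as possible across the K groups.
balance_categories <- function(categories, K) {
  groups <- integer(length(categories))
  for (cat in unique(categories)) {
    idx <- which(categories == cat)
    groups[idx] <- rep_len(1:K, length(idx))  # round-robin assignment
  }
  groups
}

# Six "kitchen" and six "bathroom" items split into 3 groups:
# each group receives 2 items per category
room <- rep(c("kitchen", "bathroom"), each = 6)
table(balance_categories(room, K = 3), room)
```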
Optimize a custom objective function
It is possible to pass a function to the argument objective. See dispersion_objective for an example. If objective is a function, the exchange method assigns elements to anticlusters in such a way that the return value of the custom function is maximized (hence, the function should return larger values when the between-group similarity is higher). The custom function has to take two arguments: the first is the data argument, the second is the clustering assignment. That is, the argument x will be passed down to the user-defined function as the first argument, but only after as.matrix has been called on x. This implies that in the function body, columns of the data set cannot be accessed using data.frame operations such as $. Objects of class dist will be converted to matrix as well.
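As an illustration, the dispersion (the minimum distance between any two elements within the same group) can be written as such a two-argument function. This sketch is written from scratch and is not the package's dispersion_objective:

```r
# Custom objective: the smallest within-group pairwise distance.
# First argument: the data (already coerced via as.matrix);
# second argument: the cluster assignment. Larger return values
# mean that no two very similar elements share a group.
min_within_distance <- function(data, clusters) {
  d <- as.matrix(dist(data))
  diag(d) <- Inf  # ignore distances of elements to themselves
  min(sapply(unique(clusters), function(k) {
    idx <- which(clusters == k)
    if (length(idx) < 2) return(Inf)  # singletons impose no bound
    min(d[idx, idx])
  }))
}

# Group 1: points at distance 5; group 2: points farther apart
dat <- matrix(c(0, 0, 3, 4, 0, 1, 10, 10), ncol = 2, byrow = TRUE)
min_within_distance(dat, c(1, 1, 2, 2))
```

Passing this function via objective = min_within_distance would make the exchange method maximize the dispersion.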
A vector of length N that assigns a group (i.e., a number between 1 and K) to each input element.
Martin Papenberg martin.papenberg@hhu.de
Grötschel, M., & Wakabayashi, Y. (1989). A cutting plane algorithm for a clustering problem. Mathematical Programming, 45, 59-96.
Papenberg, M., & Klau, G. W. (2020). Using anticlustering to partition data sets into equivalent parts. Psychological Methods. Advance Online Publication. https://doi.org/10.1037/met0000301.
Späth, H. (1986). Anticlustering: Maximizing the variance criterion. Control and Cybernetics, 15, 213-218.
# Optimize the cluster editing (diversity) criterion
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  K = 3,
  categories = schaper2019$room
)
# Compare feature means by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(colMeans(x), 2))
# Compare standard deviations by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(apply(x, 2, sd), 2))
# Check that the "room" is balanced across anticlusters:
table(anticlusters, schaper2019$room)

# Use multiple starts of the algorithm to improve the objective and
# optimize the k-means criterion ("variance")
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  objective = "variance",
  K = 3,
  categories = schaper2019$room,
  method = "local-maximum",
  repetitions = 2
)
# Compare means and standard deviations by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(colMeans(x), 2))
by(schaper2019[, 3:6], anticlusters, function(x) round(apply(x, 2, sd), 2))

# Use different group sizes and optimize the extended k-means
# criterion ("kplus")
anticlusters <- anticlustering(
  schaper2019[, 3:6],
  objective = "kplus",
  K = c(24, 24, 48),
  categories = schaper2019$room,
  repetitions = 2,
  method = "local-maximum"
)
table(anticlusters, schaper2019$room)
# Compare means and standard deviations by anticluster
by(schaper2019[, 3:6], anticlusters, function(x) round(colMeans(x), 2))
by(schaper2019[, 3:6], anticlusters, function(x) round(apply(x, 2, sd), 2))

## Use preclustering and variance (k-means) criterion on large data set
N <- 1000
K <- 2
lds <- data.frame(f1 = rnorm(N), f2 = rnorm(N))
ac <- anticlustering(
  lds,
  K = K,
  objective = "variance",
  preclustering = TRUE
)
# The following is equivalent to setting `preclustering = TRUE`:
cl <- balanced_clustering(lds, K = N / K)
ac <- anticlustering(
  lds,
  K = K,
  objective = "variance",
  categories = cl
)