cdist {manydist} | R Documentation |
Calculation of Pairwise Distances for Categorical Data
Description
Computes a distance matrix for categorical variables, with support for validation data, multiple distance metrics, and variable weighting. The function implements the distance calculation approaches described in van de Velden et al. (2024), including commensurable distances and supervised options when a response variable is provided.
Usage
cdist(x, response = NULL, validate_x = NULL, method = "tot_var_dist",
      commensurable = FALSE, weights = 1)
Arguments
x |
A data frame or matrix of categorical variables (factors). |
response |
Optional response variable for supervised distance calculations. Default is NULL. |
validate_x |
Optional validation data frame or matrix. If provided, distances are computed between observations in validate_x and observations in x. Default is NULL. |
method |
Character string or vector specifying the distance metric(s). Options include:
Can be a single string, or a vector giving a different method for each variable. |
commensurable |
Logical. If TRUE, commensurable distances are computed. Default is FALSE. |
weights |
Numeric vector or matrix of weights. If a vector, its length must equal the number of variables. If a matrix, its dimensions must match those of the level-wise distances. Default is 1 (equal weighting). |
Details
The cdist function provides a comprehensive framework for categorical distance calculations:
Supports multiple distance calculation methods that can be specified globally or per variable
Handles validation data through the validate_x parameter
Implements supervised distances when a response variable is provided
Supports commensurable distances for better comparability across variables
Provides flexible weighting schemes at variable and level granularity
Important notes:
Input variables are automatically converted to factors, and unused levels are dropped
Specifying different methods per variable is not supported for "none", "st_dev", "HL", "cat_dis", "HLeucl", and "mca"
Weight vector length must match the number of variables when specified as a vector
Variables should be factors; numeric variables will cause errors
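The notes above can be checked before calling the function. The following is a minimal base-R sketch of such a pre-check, not part of the package: the toy data frame and the names raw, cat_only and w are hypothetical, and the sketch assumes that dropping numeric columns is acceptable for the analysis at hand.

```r
# Toy input (hypothetical): a character column, a factor with an unused
# level, and a numeric column
raw <- data.frame(
  island = c("Biscoe", "Dream", "Biscoe"),
  sex    = factor(c("male", "female", "male"),
                  levels = c("female", "male", "unknown")),
  mass   = c(3750, 3800, 3250),
  stringsAsFactors = FALSE
)

# Keep only non-numeric columns, since numeric variables are documented
# to cause errors
is_num   <- vapply(raw, is.numeric, logical(1))
cat_only <- raw[, !is_num, drop = FALSE]

# Convert every remaining column to a factor and drop unused levels
cat_only[] <- lapply(cat_only, function(col) droplevels(as.factor(col)))

# A weight vector needs one entry per variable
w <- c(2, 1)
stopifnot(length(w) == ncol(cat_only))
```

After these steps, cat_only holds only factors without unused levels, and w has one weight per remaining variable.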
Value
A list containing:
distance_mat |
Matrix of pairwise distances. If |
delta |
Matrix or list of matrices containing level-wise distances for each variable. |
delta_names |
Vector of level names used in the delta matrices. |
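To make the relation between level-wise distances and the pairwise distance matrix more concrete, here is a minimal base-R sketch. It assumes simple matching as the level-wise distance and an additive, weighted combination across variables. It illustrates the idea only and is not the package's implementation; the helpers matching_delta and pairwise_from_delta and the toy data are hypothetical.

```r
# Toy data (hypothetical): two categorical variables, four observations
toy <- data.frame(
  colour = factor(c("red", "blue", "red", "green")),
  size   = factor(c("S", "S", "L", "L"))
)

# Level-wise simple-matching distances: 0 for identical levels, 1 otherwise
matching_delta <- function(f) {
  d <- 1 - diag(nlevels(f))
  dimnames(d) <- list(levels(f), levels(f))
  d
}

# Pairwise distances for one variable: look up the level-wise distance
# for the levels observed in each pair of rows
pairwise_from_delta <- function(f, delta) {
  idx <- as.integer(f)
  unname(delta[idx, idx, drop = FALSE])
}

# Hypothetical per-variable weights, one per column of toy
w <- c(2, 1)

# Weighted sum of the per-variable pairwise distance matrices
per_var  <- Map(function(f, wj) wj * pairwise_from_delta(f, matching_delta(f)),
                toy, w)
dist_mat <- Reduce(`+`, per_var)
```

In this sketch, each matching_delta(f) plays the role of one level-wise matrix and dist_mat plays the role of the pairwise distance matrix. Other methods would change only how the level-wise matrices are filled in.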
References
van de Velden, M., Iodice D'Enza, A., Markos, A., Cavicchia, C. (2024). (Un)biased distances for mixed-type data. arXiv preprint. Retrieved from https://arxiv.org/abs/2411.00429.
See Also
mdist for mixed-type data distances, ndist for continuous data distances
Examples
library(palmerpenguins)
library(rsample)
# Prepare data with complete cases for both categorical variables and response
complete_vars <- c("species", "island", "sex", "body_mass_g")
penguins_complete <- penguins[complete.cases(penguins[, complete_vars]), ]
penguins_cat <- penguins_complete[, c("species", "island", "sex")]
response <- penguins_complete$body_mass_g
# Create training-test split
set.seed(123)
penguins_split <- initial_split(penguins_cat, prop = 0.8)
tr_penguins <- training(penguins_split)
ts_penguins <- testing(penguins_split)
response_tr <- response[penguins_split$in_id]
response_ts <- response[-penguins_split$in_id]
# Basic usage
result <- cdist(tr_penguins)
# With validation data
val_result <- cdist(x = tr_penguins,
                    validate_x = ts_penguins,
                    method = "tot_var_dist")
# ...and commensurability
val_result_COMM <- cdist(x = tr_penguins,
                         validate_x = ts_penguins,
                         method = "tot_var_dist",
                         commensurable = TRUE)
# Supervised distance with response variable
sup_result <- cdist(x = tr_penguins,
                    response = response_tr,
                    method = "supervised")
# Supervised with validation data
sup_val_result <- cdist(x = tr_penguins,
                        validate_x = ts_penguins,
                        response = response_tr,
                        method = "supervised")
# Commensurable distances with custom weights
comm_result <- cdist(tr_penguins,
                     commensurable = TRUE,
                     weights = c(2, 1, 1))
# Different methods per variable
multi_method <- cdist(tr_penguins,
                      method = c("matching", "goodall_3", "tot_var_dist"))