cdist {manydist}    R Documentation

Calculation of Pairwise Distances for Categorical Data

Description

Computes a distance matrix for categorical variables, with support for validation data, multiple distance metrics, and variable weighting. The function implements the distance calculation approaches described in van de Velden et al. (2024), including commensurable distances and supervised options when a response variable is provided.

Usage

cdist(x, response = NULL, validate_x = NULL, method = "tot_var_dist",
      commensurable = FALSE, weights = 1)

Arguments

x

A data frame or matrix of categorical variables (factors).

response

Optional response variable, used by the supervised distance methods ("supervised", "supervised_full"). Default is NULL.

validate_x

Optional validation data frame or matrix. If provided, distances are computed between observations in validate_x and x. Default is NULL.

method

Character string or vector specifying the distance metric(s). Options include:

  • "tot_var_dist": Total variation distance (default)

  • "HL", "HLeucl": Hennig-Liao distance

  • "cat_dis": Category-based dissimilarity

  • "mca": Multiple correspondence analysis based

  • "st_dev": Standard deviation based

  • "matching", "eskin", "iof", "of": Simple matching, Eskin, inverse occurrence frequency, and occurrence frequency coefficients

  • "goodall_3", "goodall_4": Goodall-based distances

  • "gifi_chi2": Gifi chi-square distance

  • "lin": Lin's similarity measure

  • "var_entropy", "var_mutability": Variability-based measures

  • "supervised", "supervised_full": Response-guided distances

  • "le_and_ho": Le and Ho distance

  • Additional methods from the philentropy package

Supply a single string to apply one method to every variable, or a vector with one method per variable.

commensurable

Logical. If TRUE, standardizes each variable's distance matrix by dividing by its mean distance. Default is FALSE.
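
The scaling described for commensurable can be illustrated in base R. This is a minimal sketch, not the package's implementation: the toy data, the helper names matching_dist and make_commensurable, and the choice to average over distinct pairs (rather than all matrix entries) are assumptions made for illustration.

```r
# Toy data: two categorical variables, four observations (hypothetical)
x <- data.frame(
  colour = factor(c("red", "red", "blue", "green")),
  size   = factor(c("S", "L", "L", "L"))
)

# Simple matching distance for one variable: 0 if levels agree, 1 otherwise
matching_dist <- function(v) {
  outer(as.character(v), as.character(v), FUN = "!=") * 1
}

# Divide a distance matrix by its mean distance over distinct pairs
# (assumption: the mean is taken over the upper triangle)
make_commensurable <- function(d) {
  d / mean(d[upper.tri(d)])
}

d_list <- lapply(x, matching_dist)
d_comm <- lapply(d_list, make_commensurable)

# Before scaling the variables contribute unequally on average;
# after scaling each has mean pairwise distance 1
mean_before <- sapply(d_list, function(d) mean(d[upper.tri(d)]))
mean_after  <- sapply(d_comm, function(d) mean(d[upper.tri(d)]))
```

Under these assumptions, a variable with many distinct levels no longer dominates the total distance simply because its raw distances are larger on average.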

weights

Numeric vector or matrix of weights. If vector, must have length equal to number of variables. If matrix, must match the dimension of level-wise distances. Default is 1 (equal weighting).
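
One plausible reading of a weight vector is that each variable's distance matrix is multiplied by its weight before the matrices are summed. The sketch below shows that reading in base R; the matrices d_colour and d_size are made-up inputs, and the weighted-sum rule itself is an assumption, not something this page confirms about cdist.

```r
# Hypothetical per-variable distance matrices for three observations
d_colour <- matrix(c(0, 1, 1,
                     1, 0, 1,
                     1, 1, 0), nrow = 3, ncol = 3)
d_size   <- matrix(c(0, 0, 1,
                     0, 0, 1,
                     1, 1, 0), nrow = 3, ncol = 3)

# One weight per variable (assumed to act multiplicatively)
weights <- c(2, 1)

# Weighted sum of the per-variable distance matrices
d_total <- Reduce(`+`, Map(`*`, list(d_colour, d_size), weights))
```

With weights = c(2, 1), a mismatch on the first variable counts twice as much as a mismatch on the second.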

Details

The cdist function provides a comprehensive framework for categorical distance calculations:

Important notes:

Value

A list containing:

distance_mat

Matrix of pairwise distances. If validate_x is provided, rows correspond to validation observations and columns to training observations.

delta

Matrix or list of matrices containing level-wise distances for each variable.

delta_names

Vector of level names used in the delta matrices.
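
To make the relationship between level-wise distances and observation-level distances concrete, here is a minimal base R sketch for a single variable. It uses simple matching as a stand-in for the level-wise distance; the objects v, v_new, and delta are illustrative and are not taken from cdist output.

```r
# One categorical variable observed on four training observations
v   <- factor(c("a", "b", "c", "a"))
lev <- levels(v)

# Level-wise distance matrix: simple matching (0 on the diagonal, 1 elsewhere)
delta <- 1 - diag(length(lev))
dimnames(delta) <- list(lev, lev)

# Observation-level distances: look up each pair of observed levels in delta
idx   <- as.character(v)
d_obs <- delta[idx, idx]

# Validation observations: rows are validation, columns are training
v_new <- factor(c("c", "a"), levels = lev)
d_val <- delta[as.character(v_new), idx, drop = FALSE]
```

Here d_obs is 4 x 4 and d_val is 2 x 4, matching the row and column convention described for distance_mat.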

References

van de Velden, M., Iodice D'Enza, A., Markos, A., Cavicchia, C. (2024). (Un)biased distances for mixed-type data. arXiv preprint. Retrieved from https://arxiv.org/abs/2411.00429.

See Also

mdist for mixed-type data distances, ndist for continuous data distances

Examples

library(palmerpenguins)
library(rsample)

# Prepare data with complete cases for both categorical variables and response
complete_vars <- c("species", "island", "sex", "body_mass_g")
penguins_complete <- penguins[complete.cases(penguins[, complete_vars]), ]
penguins_cat <- penguins_complete[, c("species", "island", "sex")]
response <- penguins_complete$body_mass_g

# Create training-test split
set.seed(123)
penguins_split <- initial_split(penguins_cat, prop = 0.8)
tr_penguins <- training(penguins_split)
ts_penguins <- testing(penguins_split)
response_tr <- response[penguins_split$in_id]
response_ts <- response[-penguins_split$in_id]

# Basic usage
result <- cdist(tr_penguins)

# With validation data
val_result <- cdist(x = tr_penguins,
                    validate_x = ts_penguins,
                    method = "tot_var_dist")

# With validation data and commensurable distances
val_result_COMM <- cdist(x = tr_penguins,
                         validate_x = ts_penguins,
                         method = "tot_var_dist",
                         commensurable = TRUE)

# Supervised distance with response variable
sup_result <- cdist(x = tr_penguins,
                    response = response_tr,
                    method = "supervised")

# Supervised with validation data
sup_val_result <- cdist(x = tr_penguins,
                        validate_x = ts_penguins,
                        response = response_tr,
                        method = "supervised")

# Commensurable distances with custom weights
comm_result <- cdist(tr_penguins,
                     commensurable = TRUE,
                     weights = c(2, 1, 1))

# Different methods per variable
multi_method <- cdist(tr_penguins,
                      method = c("matching", "goodall_3", "tot_var_dist"))


[Package manydist version 0.4.8 Index]