mdist {manydist} | R Documentation |
Calculation of Pairwise Distances for Mixed-Type Data
Description
Computes pairwise distances between observations described by numeric and/or categorical attributes, optionally between a validation set and a training set. The function supports the independent, dependent, and practice-based distances defined in van de Velden et al. (2024), and offers a choice of continuous and categorical distance metrics, scaling methods, and commensurability adjustments.
Usage
mdist(x, validate_x = NULL, response = NULL, distance_cont = "manhattan",
      distance_cat = "tot_var_dist", commensurable = FALSE, scaling_cont = "none",
      ncomp = ncol(x), threshold = NULL, preset = "custom")
Arguments
x |
A dataframe or tibble containing continuous (coded as numeric), categorical (coded as factors), or mixed-type variables. |
validate_x |
Optional validation data with the same structure as |
response |
An optional factor for supervised distance calculation in categorical variables, applied only if |
distance_cont |
Character string specifying the distance metric for continuous variables. Options include |
distance_cat |
Character string specifying the distance metric for categorical variables. Options include |
commensurable |
Logical. If |
scaling_cont |
Character string specifying the scaling method for continuous variables. Options include |
ncomp |
Integer specifying the number of components to retain when |
threshold |
Numeric value specifying the percentage of variance explained by retained components when |
preset |
Character string specifying pre-defined combinations of arguments. Options include:
|
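The description of x asks for continuous variables coded as numeric and categorical variables coded as factors. The following is a minimal base R sketch of one way to bring a data frame into that form before calling mdist; the data frame toy and its columns are hypothetical and serve only as an illustration.

```r
# Hypothetical toy data: two categorical columns are stored as character.
toy <- data.frame(
  height = c(1.2, 3.4, 2.2),
  colour = c("red", "blue", "red"),
  size   = c("S", "M", "S"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are untouched.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# Confirm that every column is now either numeric or a factor.
col_ok <- vapply(toy, function(col) is.numeric(col) || is.factor(col), logical(1))
stopifnot(all(col_ok))
```

The same conversion would presumably be applied to the validation data so that both data frames share one structure.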
Value
A matrix of pairwise distances. If validate_x is provided, rows correspond to validation observations and columns to training observations.
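To make the row and column layout concrete, the sketch below builds a validation-by-training Manhattan distance matrix in base R. It does not reproduce how mdist computes its result; the matrices train and valid are hypothetical toy data, and stats::dist stands in for the package's own routines.

```r
# Hypothetical toy data: 3 training and 2 validation observations.
train <- matrix(c(0, 0,
                  1, 1,
                  2, 4), ncol = 2, byrow = TRUE)
valid <- matrix(c(0, 1,
                  2, 2), ncol = 2, byrow = TRUE)

# dist() works on a single matrix, so stack both sets and keep only the
# block with validation observations in rows and training ones in columns.
n_valid <- nrow(valid)
n_train <- nrow(train)
full  <- as.matrix(dist(rbind(valid, train), method = "manhattan"))
cross <- full[seq_len(n_valid), n_valid + seq_len(n_train), drop = FALSE]

dim(cross)  # 2 rows (validation) by 3 columns (training)
```

Entry cross[i, j] is the distance from validation observation i to training observation j, which is the orientation described above.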
References
van de Velden, M., Iodice D'Enza, A., Markos, A., Cavicchia, C. (2024). (Un)biased distances for mixed-type data. arXiv preprint. Retrieved from https://arxiv.org/abs/2411.00429.
See Also
cdist for categorical-only distances, ndist for continuous-only distances.
Examples
library(palmerpenguins)
library(rsample)
# Prepare complete data
pengmix <- palmerpenguins::penguins[complete.cases(palmerpenguins::penguins), ]
# Create training-test split
set.seed(123)
pengmix_split <- initial_split(pengmix, prop = 0.8)
tr_pengmix <- training(pengmix_split)
ts_pengmix <- testing(pengmix_split)
# Example 1: Basic usage with validation data
dist_matrix <- mdist(x = tr_pengmix,
                     validate_x = ts_pengmix)
# Example 2: Gower preset with validation
dist_gower <- mdist(x = tr_pengmix,
                    validate_x = ts_pengmix,
                    preset = "gower",
                    commensurable = TRUE)
# Example 3: Euclidean one-hot preset with validation
dist_onehot <- mdist(x = tr_pengmix,
                     validate_x = ts_pengmix,
                     preset = "euclidean_onehot")
# Example 4: Custom preset with standardization
dist_custom <- mdist(x = tr_pengmix,
                     validate_x = ts_pengmix,
                     preset = "custom",
                     distance_cont = "manhattan",
                     distance_cat = "matching",
                     commensurable = TRUE,
                     scaling_cont = "std")
# Example 5: PCA-based scaling with threshold
dist_pca <- mdist(x = tr_pengmix,
                  validate_x = ts_pengmix,
                  distance_cont = "euclidean",
                  scaling_cont = "pc_scores",
                  threshold = 0.85)
# Example 6: Categorical variables only
cat_vars <- c("species", "island", "sex")
dist_cat <- mdist(tr_pengmix[, cat_vars],
                  validate_x = ts_pengmix[, cat_vars],
                  distance_cat = "tot_var_dist")
# Example 7: Continuous variables only
num_vars <- c("bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g")
dist_cont <- mdist(tr_pengmix[, num_vars],
                   validate_x = ts_pengmix[, num_vars],
                   distance_cont = "manhattan",
                   scaling_cont = "std")
# Example 8: Supervised distance with response
response_tr <- tr_pengmix$body_mass_g
dist_sup <- mdist(tr_pengmix,
                  validate_x = ts_pengmix,
                  response = response_tr,
                  distance_cat = "supervised")
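A validation-by-training distance matrix such as the ones returned in the examples above can be used directly, for instance to look up the nearest training observation for each validation observation. The sketch below assumes the layout described under Value (rows are validation observations, columns are training observations) and uses a small hypothetical matrix d in place of an actual mdist result.

```r
# Hypothetical 2 x 3 distance matrix: rows = validation, columns = training.
d <- matrix(c(1.0, 0.4, 5.0,
              4.0, 2.5, 2.0), nrow = 2, byrow = TRUE)

# Column index of the closest training observation for each validation row.
nearest <- apply(d, 1, which.min)

# Distance to that nearest training observation, picked by (row, column) pairs.
nearest_dist <- d[cbind(seq_len(nrow(d)), nearest)]
```

With a real result, d would be replaced by an object such as dist_matrix, and nearest would index rows of the training data.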