ndist {manydist} | R Documentation
Calculation of Pairwise Distances for Continuous Data
Description
Computes a distance matrix for continuous data with support for multiple distance metrics, scaling methods, dimensionality reduction, and validation data. The function implements various distance calculation approaches as described in van de Velden et al. (2024), including options for commensurable distances and variable weighting.
Usage
ndist(x, validate_x = NULL, commensurable = FALSE, method = "manhattan",
      sig = NULL, scaling = "none", ncomp = ncol(x), threshold = NULL,
      weights = rep(1, ncol(x)))
Arguments
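x
A data frame or matrix of continuous input variables.
validate_x
Optional data frame or matrix for validation data. If provided, distances are computed between observations in validate_x and x. Default is NULL.
commensurable
Logical. If TRUE, commensurable distances are computed. Default is FALSE.
method
Character string specifying the distance metric. Options include "manhattan" (the default), "euclidean", and "mahalanobis".
sig
Covariance matrix to be used when method = "mahalanobis". Default is NULL.
scaling
Character string specifying the scaling method. Values used on this page are "none", "std", "pc_scores", and "robust". Default is "none".
ncomp
Number of principal components to retain when scaling = "pc_scores". Default is ncol(x).
threshold
Proportion of variance to retain when scaling = "pc_scores". Default is NULL.
weights
Numeric vector of weights for each variable. Must have length equal to the number of variables in x. Default is rep(1, ncol(x)).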
Details
The ndist function provides a comprehensive framework for distance calculations in continuous data:
When validate_x is provided, computes distances between observations in validate_x and x.
Supports multiple scaling methods that can be applied before distance calculation.
PCA-based dimensionality reduction can be controlled either by number of components or variance threshold.
For Mahalanobis distance, handles singular covariance matrices with appropriate error messages.
Implements commensurable distances for better comparability across variables.
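The two ways of controlling PCA-based reduction can be pictured with a short base R sketch. This is an illustration only and is not taken from the manydist source: the helper n_components() is hypothetical, the data are simulated, and standardizing before PCA is an assumption. It shows how a variance threshold translates into a number of components, and how validation rows can be projected onto components estimated from x.

```r
# Illustrative sketch only; not the manydist implementation.
set.seed(1)
x <- matrix(rnorm(200), nrow = 50, ncol = 4)           # simulated "training" data
validate_x <- matrix(rnorm(40), nrow = 10, ncol = 4)   # simulated "validation" data
colnames(x) <- colnames(validate_x) <- paste0("v", 1:4)

# Hypothetical helper: pick the number of components either directly (ncomp)
# or as the smallest number whose cumulative variance share reaches threshold.
n_components <- function(pca, ncomp = NULL, threshold = NULL) {
  if (!is.null(threshold)) {
    cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    return(which(cum_var >= threshold - 1e-12)[1])
  }
  if (is.null(ncomp)) length(pca$sdev) else ncomp
}

# Assumption: variables are centered and scaled before PCA.
pca <- prcomp(x, center = TRUE, scale. = TRUE)
k <- n_components(pca, threshold = 0.95)

# Scores for x, and validation rows projected onto the same components.
scores_x <- pca$x[, seq_len(k), drop = FALSE]
scores_val <- predict(pca, newdata = validate_x)[, seq_len(k), drop = FALSE]

# Distances can then be computed in the reduced space.
d_reduced <- dist(scores_x, method = "manhattan")
```

Estimating the components on x alone and only projecting validate_x keeps the validation rows from influencing the reduced space; whether ndist proceeds exactly this way should be checked against the package source.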
Warning: The function checks that:
the weight vector length matches the number of variables in x
the covariance matrix is non-singular when Mahalanobis distance is requested
the dimensions of x and validate_x are compatible
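As a rough picture of what such checks involve, the following base R sketch performs the same three tests. The function check_inputs() and its error messages are hypothetical and are not manydist's own; in particular, using rcond() to detect singularity and reading "compatible" as "same number of columns" are assumptions.

```r
# Hypothetical helper; names and messages are illustrative, not manydist's.
check_inputs <- function(x, validate_x = NULL, weights = rep(1, ncol(x)),
                         method = "manhattan", sig = NULL) {
  x <- as.matrix(x)
  # One weight per variable
  if (length(weights) != ncol(x)) {
    stop("weights must have one entry per column of x")
  }
  # Validation data must describe the same variables as x
  if (!is.null(validate_x) && ncol(as.matrix(validate_x)) != ncol(x)) {
    stop("x and validate_x must have the same number of columns")
  }
  if (method == "mahalanobis") {
    # Assumption: fall back to the sample covariance of x when sig is NULL
    if (is.null(sig)) sig <- stats::cov(x)
    # A reciprocal condition number near zero signals a singular matrix
    if (rcond(sig) < .Machine$double.eps) {
      stop("covariance matrix is singular")
    }
  }
  invisible(TRUE)
}

set.seed(1)
x <- matrix(rnorm(40), nrow = 10, ncol = 4)
check_inputs(x)                                   # passes silently

# A duplicated column makes the covariance matrix singular
x_dup <- cbind(x, x[, 1])
msg <- tryCatch(check_inputs(x_dup, method = "mahalanobis"),
                error = function(e) conditionMessage(e))
```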
Value
A distance matrix where element [i,j] represents the distance between:
observations i and j of x if validate_x is NULL
observation i of validate_x and observation j of x if validate_x is provided
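The orientation of the returned matrix can be illustrated with a small base R sketch. The helper cross_manhattan() is hypothetical and is not part of manydist; it assumes a weighted Manhattan distance in which each weight multiplies the absolute difference of its variable, which may differ from how ndist applies weights.

```r
# Hypothetical helper: weighted Manhattan distances between rows of a and rows of b.
# Rows of the result index a, columns index b.
cross_manhattan <- function(a, b, weights = rep(1, ncol(a))) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  out <- matrix(0, nrow = nrow(a), ncol = nrow(b))
  for (i in seq_len(nrow(a))) {
    # Absolute differences between row i of a and every row of b
    diffs <- abs(sweep(b, 2, a[i, ]))
    out[i, ] <- as.vector(diffs %*% weights)
  }
  out
}

x <- rbind(c(0, 0), c(1, 2), c(3, 1))      # 3 "training" observations
validate_x <- rbind(c(1, 1), c(0, 2))      # 2 "validation" observations

# validate_x supplied: 2 x 3 matrix, element [i, j] compares
# validation row i with training row j
d_val <- cross_manhattan(validate_x, x)

# validate_x absent: square 3 x 3 matrix over the rows of x
d_x <- cross_manhattan(x, x)
```

With validation data, the result therefore has one row per observation in validate_x and one column per observation in x, so it is generally not square or symmetric.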
References
van de Velden, M., Iodice D'Enza, A., Markos, A., Cavicchia, C. (2024). (Un)biased distances for mixed-type data. arXiv preprint. Retrieved from https://arxiv.org/abs/2411.00429.
See Also
mdist for mixed-type data distances, cdist for categorical data distances.
Examples
library(manydist)
library(palmerpenguins)
library(rsample)
# Keep the four continuous measurements and drop rows with missing values
penguins_cont <- palmerpenguins::penguins[, c("bill_length_mm",
    "bill_depth_mm", "flipper_length_mm", "body_mass_g")]
penguins_cont <- penguins_cont[complete.cases(penguins_cont), ]
# Basic usage
dist_matrix <- ndist(penguins_cont)
# Commensurable distances with standardization
dist_matrix <- ndist(penguins_cont,
                     commensurable = TRUE,
                     scaling = "std")
# PCA-based dimensionality reduction
dist_matrix <- ndist(penguins_cont,
                     scaling = "pc_scores",
                     threshold = 0.95)
# Mahalanobis distance
dist_matrix <- ndist(penguins_cont,
                     method = "mahalanobis")
# Weighted Euclidean distance
dist_matrix <- ndist(penguins_cont,
                     method = "euclidean",
                     weights = c(1, 0.5, 2, 1))
# Training-test split example with validation data
set.seed(123)
# Create training-test split using rsample
penguins_split <- initial_split(penguins_cont, prop = 0.8)
tr_penguins <- training(penguins_split)
ts_penguins <- testing(penguins_split)
# Basic usage with training data only
dist_matrix <- ndist(tr_penguins)
# Computing distances between test and training sets
val_dist_matrix <- ndist(x = tr_penguins,
                         validate_x = ts_penguins,
                         method = "euclidean")
# Using validation data with standardization
val_dist_matrix_std <- ndist(x = tr_penguins,
                             validate_x = ts_penguins,
                             scaling = "std",
                             method = "manhattan")
# Validation with PCA and commensurability
val_dist_matrix_pca <- ndist(x = tr_penguins,
                             validate_x = ts_penguins,
                             scaling = "pc_scores",
                             ncomp = 2,
                             commensurable = TRUE)
# Validation with robust scaling and custom weights
val_dist_matrix_robust <- ndist(x = tr_penguins,
                                validate_x = ts_penguins,
                                scaling = "robust",
                                weights = c(1, 0.5, 2, 1))
# Mahalanobis distance with validation data
val_dist_matrix_mahal <- ndist(x = tr_penguins,
                               validate_x = ts_penguins,
                               method = "mahalanobis")