ndist {manydist}                                        R Documentation

Calculation of Pairwise Distances for Continuous Data

Description

Computes a distance matrix for continuous data with support for multiple distance metrics, scaling methods, dimensionality reduction, and validation data. The function implements various distance calculation approaches as described in van de Velden et al. (2024), including options for commensurable distances and variable weighting.

Usage

ndist(x, validate_x = NULL, commensurable = FALSE, method = "manhattan",
      sig = NULL, scaling = "none", ncomp = ncol(x), threshold = NULL,
      weights = rep(1, ncol(x)))

Arguments

x

A data frame or matrix of continuous input variables.

validate_x

Optional data frame or matrix for validation data. If provided, distances are computed between observations in validate_x and x. Default is NULL.
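As a rough base-R illustration of what "distances between observations in validate_x and x" means, the sketch below builds a cross-distance matrix by hand. The helper cross_manhattan is hypothetical and not part of manydist, and placing validation observations in rows and training observations in columns is an assumption made for the example only.

```r
# Hypothetical helper (not part of manydist): Manhattan distances between
# every row of new_x and every row of x.
cross_manhattan <- function(new_x, x) {
  new_x <- as.matrix(new_x)
  x <- as.matrix(x)
  out <- matrix(0, nrow = nrow(new_x), ncol = nrow(x))
  for (i in seq_len(nrow(new_x))) {
    # sweep() subtracts row i of new_x from every row of x
    out[i, ] <- rowSums(abs(sweep(x, 2, new_x[i, ])))
  }
  out
}

train <- matrix(c(0, 0, 1, 1, 2, 2), ncol = 2, byrow = TRUE)
valid <- matrix(c(1, 0), ncol = 2)

# One row per validation observation, one column per training observation
d <- cross_manhattan(valid, train)
```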

commensurable

Logical. If TRUE, standardizes each variable's distance matrix by dividing by its mean distance, making distances comparable across variables. Default is FALSE.
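The idea behind commensurable distances can be sketched in base R: compute one distance matrix per variable, divide each by its own mean distance, and add the results. This is an illustration only, not the package's code; in particular, taking the mean over distinct pairs of observations is an assumption.

```r
x <- cbind(a = c(1, 2, 4), b = c(10, 30, 20))

# One Manhattan distance matrix per variable, each divided by its mean
# pairwise distance (assumption: mean over distinct pairs only)
per_var <- lapply(seq_len(ncol(x)), function(j) {
  d <- as.matrix(dist(x[, j], method = "manhattan"))
  d / mean(d[upper.tri(d)])
})

# After rescaling, every variable contributes a mean distance of 1
d_comm <- Reduce(`+`, per_var)
```

Without the rescaling, variable b would dominate the total simply because it is measured on a larger scale.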

method

Character string specifying the distance metric. Options include "manhattan", "euclidean", and "mahalanobis". Default is "manhattan".

sig

Covariance matrix to be used when method = "mahalanobis". If NULL, computed from the data. Default is NULL.
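For reference, a pairwise Mahalanobis distance matrix can be written with base R's stats::mahalanobis(), as in the sketch below. It uses simulated data and a covariance matrix estimated from that data; it shows the role a matrix such as sig plays and is not taken from the package's internals.

```r
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)

# Covariance matrix estimated from the data (the role played by sig)
sig <- cov(x)

# mahalanobis() returns squared distances from every row to one center,
# so apply it once per row and take square roots
d_mahal <- t(apply(x, 1, function(center) {
  sqrt(pmax(mahalanobis(x, center, sig), 0))
}))
```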

scaling

Character string specifying the scaling method. Options:

  • "none": No scaling

  • "std": Standardization (zero mean, unit variance)

  • "range": Min-max scaling to [0,1]

  • "pc_scores": PCA-based dimensionality reduction

  • "robust": Robust scaling using median and IQR

Default is "none".
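The column-wise scaling options can be approximated in base R as follows. The exact formulas used by ndist are not stated here, so treat these as plausible readings of the option names: "std" as centering and dividing by the standard deviation, "range" as min-max scaling, and "robust" as centering on the median and dividing by the IQR.

```r
x <- cbind(a = c(1, 2, 3, 10), b = c(5, 7, 9, 11))

# "std": zero mean, unit variance per column
x_std <- scale(x)

# "range": min-max scaling of each column to [0, 1]
x_range <- apply(x, 2, function(v) (v - min(v)) / (max(v) - min(v)))

# "robust" (assumed form): center on the median, divide by the IQR
x_robust <- apply(x, 2, function(v) (v - median(v)) / IQR(v))
```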

ncomp

Number of principal components to retain when scaling = "pc_scores". Default is the number of columns in x.

threshold

Proportion of variance to retain when scaling = "pc_scores". If specified, overrides ncomp. Default is NULL.
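One way to turn a variance threshold into a number of components is sketched below with stats::prcomp(). Choosing the smallest number of components whose cumulative variance reaches the threshold, and standardizing the variables before the PCA, are both assumptions of this sketch rather than documented behavior of ndist.

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 4)
x[, 2] <- x[, 1] + 0.1 * x[, 2]  # make two columns strongly correlated

# PCA on standardized variables (assumption)
pca <- prcomp(x, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching the requested proportion
threshold <- 0.95
ncomp <- which(cum_var >= threshold)[1]

# Distances are then computed on the retained component scores
scores <- pca$x[, seq_len(ncomp), drop = FALSE]
d_pc <- dist(scores, method = "manhattan")
```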

weights

Numeric vector of weights for each variable. Must have length equal to the number of variables in x. Default is a vector of ones.
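A common convention for variable weights in a Manhattan distance is to multiply each variable's absolute difference by its weight. The sketch below follows that convention in base R; whether ndist applies weights in exactly this way is an assumption.

```r
x <- cbind(a = c(0, 1, 3), b = c(0, 2, 2))
w <- c(1, 0.5)

# The weights vector needs one entry per variable
stopifnot(length(w) == ncol(x))

# Scaling each column by a non-negative weight and taking Manhattan
# distances gives sum_j w_j * |x_ij - x_kj|
d_w <- as.matrix(dist(sweep(x, 2, w, `*`), method = "manhattan"))
```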

Details

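The ndist function computes pairwise distances for continuous data. It combines the choice of distance metric (method) with optional scaling or PCA-based reduction (scaling, ncomp, threshold), variable weighting (weights), and commensurable standardization (commensurable).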

Warning: The function validates its inputs. For example, weights must have length equal to the number of variables in x.

Value

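A distance matrix where element [i,j] represents the distance between two observations of x or, when validate_x is supplied, between an observation of validate_x and an observation of x.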

References

van de Velden, M., Iodice D'Enza, A., Markos, A., Cavicchia, C. (2024). (Un)biased distances for mixed-type data. arXiv preprint. Retrieved from https://arxiv.org/abs/2411.00429.

See Also

mdist for mixed-type data distances, cdist for categorical data distances.

Examples

library(palmerpenguins)
library(rsample)

# Keep the four continuous measurements and drop incomplete rows
penguins_cont <- palmerpenguins::penguins[, c("bill_length_mm", "bill_depth_mm",
                                              "flipper_length_mm", "body_mass_g")]
penguins_cont <- penguins_cont[complete.cases(penguins_cont), ]

# Basic usage
dist_matrix <- ndist(penguins_cont)

# Commensurable distances with standardization
dist_matrix <- ndist(penguins_cont, 
                    commensurable = TRUE, 
                    scaling = "std")

# PCA-based dimensionality reduction
dist_matrix <- ndist(penguins_cont, 
                    scaling = "pc_scores", 
                    threshold = 0.95)

# Mahalanobis distance
dist_matrix <- ndist(penguins_cont, 
                    method = "mahalanobis")

# Weighted Euclidean distance
dist_matrix <- ndist(penguins_cont, 
                    method = "euclidean",
                    weights = c(1, 0.5, 2, 1))
                    
# Training-test split example with validation data
set.seed(123)
# Create training-test split using rsample
penguins_split <- initial_split(penguins_cont, prop = 0.8)
tr_penguins <- training(penguins_split)
ts_penguins <- testing(penguins_split)

# Basic usage with training data only
dist_matrix <- ndist(tr_penguins)

# Computing distances between test and training sets
val_dist_matrix <- ndist(x = tr_penguins, 
                        validate_x = ts_penguins,
                        method = "euclidean")

# Using validation data with standardization
val_dist_matrix_std <- ndist(x = tr_penguins,
                            validate_x = ts_penguins,
                            scaling = "std",
                            method = "manhattan")

# Validation with PCA and commensurability
val_dist_matrix_pca <- ndist(x = tr_penguins,
                            validate_x = ts_penguins,
                            scaling = "pc_scores",
                            ncomp = 2,
                            commensurable = TRUE)

# Validation with robust scaling and custom weights
val_dist_matrix_robust <- ndist(x = tr_penguins,
                               validate_x = ts_penguins,
                               scaling = "robust",
                               weights = c(1, 0.5, 2, 1))

# Mahalanobis distance with validation data
val_dist_matrix_mahal <- ndist(x = tr_penguins,
                              validate_x = ts_penguins,
                              method = "mahalanobis")


[Package manydist version 0.4.8 Index]