cpfa {cpfa} | R Documentation |
Classification with Parallel Factor Analysis
Description
Fits Richard A. Harshman's Parallel Factor Analysis-1 (Parafac) model or Parallel Factor Analysis-2 (Parafac2) model to a three-way or four-way data array. Allows for different constraint options on multiple tensor modes. Uses Parafac component weights from a single mode of this model as predictors to tune parameters for one or more classification methods via a k-fold cross-validation procedure. Predicts class labels and calculates multiple performance measures for binary or multiclass classification across multiple replications with different train-test splits. Provides descriptive statistics to pool output across replications.
Usage
cpfa(x, y, model = c("parafac", "parafac2"), nfac = 1, nrep = 5, ratio = 0.8,
nfolds = 10, method = c("PLR", "SVM", "RF", "NN", "RDA", "GBM"),
family = c("binomial", "multinomial"), parameters = list(),
type.out = c("measures", "descriptives"), foldid = NULL,
prior = NULL, cmode = NULL, seeds = NULL, plot.out = FALSE,
plot.measures = NULL, parallel = FALSE, cl = NULL, verbose = TRUE, ...)
Arguments
x |
A three-way or four-way data array. For Parafac2, can be a list where each element is a matrix or three-way array. Array or list must contain only real numbers. See note below. |
y |
A vector containing at least two unique class labels. Should be a factor that contains two or more levels. For binary case, ensure the order of factor levels (left to right) is such that negative class is first and positive class is second. |
model |
Character designating the Parafac model to use, either
|
nfac |
Number of components for each Parafac or Parafac2 model to fit. Default is
|
nrep |
Number of replications to repeat the procedure. Default is |
ratio |
Split ratio for dividing data into train and test sets. Default is
|
nfolds |
Numeric value specifying number of folds for k-fold cross-validation. Must
be 2 or greater. Default is |
method |
Character vector indicating classification methods to use. Possible methods
include penalized logistic regression (PLR); support vector machine (SVM);
random forest (RF); feed-forward neural network (NN); regularized
discriminant analysis (RDA); and gradient boosting machine (GBM). If none
are selected, default is to use all methods with |
family |
Character value specifying binary classification ( |
parameters |
List containing arguments related to classification methods. When specified, must contain one or more of the following:
|
type.out |
Type of output desired: |
foldid |
Integer vector containing fold IDs for k-fold cross-validation. If not
provided, fold IDs are generated randomly for number of folds |
prior |
Prior probabilities of class membership. If unspecified, the class
proportions for input |
cmode |
Integer value of 1, 2, or 3 (or 4 if |
seeds |
Random seeds to be associated with each replication. Default is
|
plot.out |
Logical indicating whether to output one or more box plots of classification performance measures that are plotted across classification methods and number of components. |
plot.measures |
Character vector containing values that specify for plotting one or more of
11 possible classification performance measures. Only relevant when
|
parallel |
Logical indicating if parallel computing should be implemented. If TRUE, the package parallel is used for parallel computing. For all classification methods except penalized logistic regression, the doParallel package is used as a wrapper. Defaults to FALSE, which implements sequential computing. |
cl |
Cluster for parallel computing, which is used when |
verbose |
If TRUE, progress is printed. |
... |
Additional arguments to be passed to function |
Details
Data are split into a training set and a testing set. After fitting a Parafac
or Parafac2 model with the training set using package multiway
(see parafac
or parafac2
in multiway for details), the
estimated classification mode weight matrix is passed to one or more
classification methods. The methods include: penalized logistic regression
(PLR); support vector machine (SVM); random forest (RF); feed-forward neural
network (NN); regularized discriminant analysis (RDA); and gradient boosting
machine (GBM).
Package glmnet fits models for PLR. PLR tunes penalty parameter lambda
while the elastic net parameter alpha is set by the user (see the help file
for function cv.glmnet
in package glmnet). For SVM, package
e1071 is used with a radial basis kernel. Penalty parameter cost and
radial basis parameter gamma are used (see svm
in package e1071).
For RF, package randomForest is used and implements Breiman's random
forest algorithm. The number of predictors sampled at each node split is set
at the default of sqrt(R), where R is the number of Parafac or Parafac2
components. Two tuning parameters allowed are ntree, the number of trees to be
grown, and nodesize, the minimum size of terminal nodes (see
randomForest
in package randomForest). For NN, package
nnet fits a single-hidden-layer, feed-forward neural network model.
Penalty parameters size (i.e., number of hidden layer units) and decay (i.e.,
weight decay) are used (see nnet). For RDA, package rda fits a
shrunken centroids regularized discriminant analysis model. Tuning parameters
include rda.alpha, the shrinkage penalty for the within-class covariance matrix,
and delta, the shrinkage penalty of class centroids towards the overall dataset
centroid. For GBM, package xgboost fits a gradient boosting machine
model. Four tuning parameters are allowed: (1) eta, the learning rate; (2)
max.depth, the maximum tree depth; (3) subsample, the fraction of samples per
tree; and (4) nrounds, the number of boosting trees to build.
For all six methods, k-fold cross-validation is implemented to tune
classification parameters where the number of folds is set by argument
nfolds
. Separately, the trained Parafac or Parafac2 model is used to
predict the classification mode's component weights using the testing set data.
The predicted component weights and the optimized classification method are
then used to predict class labels. Finally, classification performance measures
are calculated. The process is repeated over a number of replications with
different random splits of the input array and of the class labels at each
replication.
Value
Returns an object of class wrapcpfa
either with a three-way array
with classification performance measures for each model and for each
replication, or with a list containing matrices with descriptive statistics
for performance measures calculated across all replications. Specify
type.out = "measures"
to output the array of performance measures.
Specify type.out = "descriptives"
to output descriptive statistics
across replications. In addition, for both options, the following are also
provided:
predweights |
List of predicted classification weights for each Parafac or Parafac2 model and for each replication. |
train.weights |
List of lists of training weights for each Parafac or Parafac2 model and for each replication. |
opt.tune |
List of optimal tuning parameters for classification methods for each Parafac or Parafac2 model and for each replication. |
mean.opt.tune |
Mean across all replications of optimal tuning parameters for classification methods for each Parafac or Parafac2 model. |
X |
Three-way or four-way data array or list used in argument |
y |
Vector of class labels used in input argument |
nfac |
Number of components used to fit each Parafac or Parafac2 model. |
model |
Character designating the Parafac model that was used, either
|
method |
Classification methods used. |
const |
Constraints used in fitting Parafac or Parafac2 models. |
cmode |
Integer value used to specify the mode whose component weights were predictors for classification. |
Note
If argument cmode
is not null, input array x
is reshaped with
function aperm
such that the cmode
dimension of x
is
ordered last. Estimated mode A and B (and mode C for a four-way array) weights
that are outputted as Aweights
and Bweights
(and Cweights
)
reflect this permutation. For example, if x
is a four-way array
and cmode = 2
, the original input modes 1, 2, 3, and 4 will correspond
to output modes 1, 3, 4, 2. Here, output A = input 1; B = 3, and C = 4
(i.e., the second mode specified by cmode
has been moved to the D
mode/last mode). For model = "parafac2"
, classification mode is
assumed to be the last mode (i.e., mode C for three-way array and mode D for
four-way array).
In addition, note that the following combination of arguments will give an
error: nfac = 1, family = "multinomial", method = "PLR"
. The issue
arises from providing glmnet::cv.glmnet
input x
an input matrix
that has a single column. The issue is resolved for family = "binomial"
because a column of 0s is appended to the single column, but this solution
does not appear to work for the multiclass case. As such, this combination of
arguments is not currently allowed. Future updates are planned to resolve this
issue.
Author(s)
Matthew A. Asisgress <mattgress@protonmail.ch>
References
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., Chen, K., Mitchell, R., Cano, I., Zhou, T., Li, M., Xie, J., Lin, M., Geng, Y., Li, Y., Yuan, J. (2025). xgboost: Extreme gradient boosting. R Package Version 1.7.9.1.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.
Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84(405), 165-175.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.
Guo, Y., Hastie, T., and Tibshirani, R. (2007). Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8(1), 86-100.
Guo, Y., Hastie, T., and Tibshirani, R. (2023). rda: Shrunken centroids regularized discriminant analysis. R Package Version 1.2-1.
Harshman, R. (1970). Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 16, 1-84.
Harshman, R. (1972). PARAFAC2: Mathematical and technical notes. UCLA Working Papers in Phonetics, 22, 30-44.
Harshman, R. and Lundy, M. (1994). PARAFAC: Parallel factor analysis. Computational Statistics and Data Analysis, 18, 39-72.
Helwig, N. (2017). Estimating latent trends in multivariate longitudinal data via Parafac2 with functional and structural constraints. Biometrical Journal, 59(4), 783-803.
Helwig, N. (2025). multiway: Component models for multi-way data. R Package Version 1.0-7.
Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News 2(3), 18–22.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2024). e1071: Misc functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R Package Version 1.7-16.
Ripley, B. (1994). Neural networks and related methods for classification. Journal of the Royal Statistical Society: Series B (Methodological), 56(3), 409-437.
Venables, W. and Ripley, B. (2002). Modern applied statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
Examples
########## Parafac2 example with 4-way array and multiclass response ##########
## Not run:
# set seed
set.seed(5)
# define correlation matrix for columns of classification mode weight matrix
cormat <- matrix(c(1, .35, .35, .35, 1, .35, .35, .35, 1), nrow = 3, ncol = 3)
# simulate a four-way ragged array connected to a multiclass response
data <- simcpfa(arraydim = c(10, 11, 12, 100), model = "parafac2", nfac = 3,
nclass = 3, nreps = 1e2, onreps = 10, corresp = rep(.75, 3),
meanpred = rep(2, 3), modes = 4, corrpred = cormat)
# initialize
alpha <- seq(0, 1, length = 2)
gamma <- c(0, 1)
cost <- c(0.1, 5)
ntree <- c(200, 300)
nodesize <- c(1, 2)
size <- c(1, 2)
decay <- c(0, 1)
rda.alpha <- seq(0.1, 0.9, length = 2)
delta <- c(0.1, 2)
eta <- c(0.3, 0.7)
max.depth <- c(1, 2)
subsample <- c(0.75)
nrounds <- c(100)
method <- c("PLR", "SVM", "RF", "NN", "RDA", "GBM")
family <- "multinomial"
parameters <- list(alpha = alpha, gamma = gamma, cost = cost, ntree = ntree,
nodesize = nodesize, size = size, decay = decay,
rda.alpha = rda.alpha, delta = delta, eta = eta,
max.depth = max.depth, subsample = subsample,
nrounds = nrounds)
model <- "parafac2"
nfolds <- 3
nstart <- 3
# constrain first mode weights to be orthogonal, fourth mode to be nonnegative
const <- c("orthog", "uncons", "uncons", "nonneg")
# fit Parafac2 model and use fourth mode weights to tune classification
# methods, to predict class labels, and to return classification
# performance measures pooled across multiple train-test splits
output <- cpfa(x = data$X, y = as.factor(data$y), model = model, nfac = 3,
nrep = 5, ratio = 0.8, nfolds = nfolds, method = method,
family = family, parameters = parameters,
type.out = "descriptives", seeds = NULL, plot.out = TRUE,
parallel = FALSE, const = const, nstart = nstart)
# print performance measure means across train-test splits
output$descriptive$mean
## End(Not run)