simcpfa {cpfa} | R Documentation |
Simulate Data for Classification with Parallel Factor Analysis
Description
Simulates a three-way or four-way data array and a set of class labels that are related to the simulated array through one mode of the array. The data array is simulated using either a Parafac or Parafac2 model with no constraints. Weights for each mode's weight matrix can be drawn from 12 common probability distributions. Alternatively, custom weights can be provided for any mode.
Usage
simcpfa(arraydim = NULL, model = "parafac", nfac = 2, nclass = 2, nreps = 100,
        onreps = 10, corresp = c(0.3, -0.3), meanpred = c(0, 0), modes = 3,
        corrpred = matrix(c(1, 0.2, 0.2, 1), nrow = 2), pf2num = NULL,
        Amat = NULL, Bmat = NULL, Cmat = NULL, Dmat = NULL, Gmat = NULL,
        Emat = NULL, technical = list())
Arguments
arraydim |
Numeric vector containing the number of dimensions for each mode of the simulated data array. Must contain integers greater than or equal to 2. |
model |
Character specifying the model to use for simulating the data array. Must be either 'parafac' or 'parafac2'. |
nfac |
Number of components in the Parafac or Parafac2 model. Must be an integer greater than or equal to 1. |
nclass |
Number of classes in simulated class labels. Must be an integer greater than or equal to 2. |
nreps |
Number of replications for simulating class labels for a given set of classification mode component weights. |
onreps |
Number of replications for simulating a set of classification mode component weights. |
corresp |
Numeric vector of target correlations between simulated class labels and columns of the classification mode component weight matrix. Must have length equal to 'nfac'. |
meanpred |
Numeric vector of means used to generate the classification mode component
weights. Must be real numbers. Operates as the mean vector parameterizing a
multivariate normal distribution from which classification mode component
weights are generated. Length must be equal to input |
modes |
Single integer of either 3 or 4, indicating whether to simulate a three-way or four-way data array, respectively. |
corrpred |
A positive definite correlation matrix containing the target correlations for the classification mode component weights. Must have number of rows and columns equal to input 'nfac'. Operates as the covariance matrix parameterizing a multivariate normal distribution from which classification mode component weights are generated. |
pf2num |
When |
Amat |
When |
Bmat |
A matrix of B mode weights with number of rows equal to the second element of
input 'arraydim' and with number of columns equal to the input 'nfac'. When
provided, replaces a simulated |
Cmat |
A matrix of C mode weights with number of rows equal to the third element of
input 'arraydim' and with number of columns equal to the input 'nfac'. When
provided, replaces a simulated |
Dmat |
A matrix of D mode weights with number of rows equal to the fourth element of
input 'arraydim' and with number of columns equal to the input 'nfac'. When
|
Gmat |
When |
Emat |
When |
technical |
List containing arguments related to distributions from which to simulate data. When specified, must contain one or more of the following:
|
Details
Data array simulation consists of two steps. First, a Monte Carlo simulation
is conducted to simulate class labels using a binomial logistic (i.e.,
in the binary case) or multinomial logistic (i.e., in the multiclass case)
regression model. Specifically, columns of the classification mode weights
matrix (e.g., 'Cmat' when 'modes = 3') are generated from a multivariate
normal distribution with mean vector 'meanpred' and with covariance matrix
'corrpred'. Values are then drawn randomly from a uniform or a normal
distribution and serve as beta coefficients. A linear combination of these
beta coefficients and the generated classification weights produces a linear
systematic part, which is passed through the logistic function (i.e., the
sigmoid) in the binary case or through the softmax function in the multiclass
case. Resulting probabilities are used to assign class labels. The simulation
repeats classification weights generation 'onreps' times and repeats class
label generation, within each 'onreps' iteration, a total of 'nreps' times.
The generated class labels that correlate best with the generated
classification weights (i.e., with correlations closest to 'corresp') are
retained as the final class labels with corresponding final classification
weights. An adaptive sampling technique is used during the simulation such
that optimal beta coefficients from previous iterations are used to
parameterize a normal distribution, from which new coefficients are drawn in
subsequent iterations. Note that, if any simulation replicate produces a set
of class labels where all labels are the same (i.e., have no variance), that
replicate is discarded. Note also that 'onreps' is ignored when the
classification mode weight matrix (i.e., 'Cmat' when 'modes = 3' or 'Dmat'
when 'modes = 4') is provided; in this case, class labels are simulated with
respect to the provided classification mode weight matrix.
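The first step can be illustrated with a minimal base R sketch for the binary case. This is not the internal code of simcpfa: the uniform range for the beta coefficients, the squared-distance criterion for "closest to 'corresp'", and the variable names are assumptions made for illustration, and the adaptive sampling of coefficients is left out.

```r
# Illustrative sketch only; not the internal code of simcpfa.
set.seed(1)
n        <- 100                              # levels of the classification mode
nfac     <- 2
meanpred <- c(0, 0)
corrpred <- matrix(c(1, 0.2, 0.2, 1), nrow = 2)
corresp  <- c(0.3, -0.3)
onreps   <- 10
nreps    <- 100

best <- list(dist = Inf)
for (o in seq_len(onreps)) {
  # draw classification weights from N(meanpred, corrpred) via a Cholesky factor
  Z <- matrix(rnorm(n * nfac), nrow = n, ncol = nfac)
  W <- sweep(Z %*% chol(corrpred), 2, meanpred, "+")
  for (r in seq_len(nreps)) {
    beta <- runif(nfac, min = -2, max = 2)   # assumed range for the coefficients
    prob <- plogis(drop(W %*% beta))         # logistic (sigmoid) function
    y    <- rbinom(n, size = 1, prob = prob) # assign binary class labels
    if (length(unique(y)) < 2) next          # discard labels with no variance
    # assumed criterion: squared distance between achieved and target correlations
    dist <- sum((drop(cor(W, y)) - corresp)^2)
    if (dist < best$dist) best <- list(dist = dist, W = W, y = y, beta = beta)
  }
}

# achieved correlations between weight columns and retained labels
round(drop(cor(best$W, best$y)), 2)
```

Larger values of 'onreps' and 'nreps' give the search more candidates, which tends to bring the retained correlations closer to the targets at the cost of run time.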
Second, depending on the chosen model (i.e., Parafac or Parafac2) specified
via 'model', and depending on the number of modes specified via 'modes',
component matrices are randomly generated for each mode of the data array.
A data array is then constructed using a Parafac or Parafac2 structure from
these weight matrices, including the generated classification mode weight
matrix (i.e., 'Cmat' or 'Dmat') from the first step. Alternatively, weight
matrices can be provided to override random generation for any weight matrix
with the exception of the classification mode. When provided, weight matrices
are used to form the final data array. Finally, random noise is added to each
value in the array.
The resulting output is a synthetic multiway data array
paired, through one mode of the array, with a simulated binary or multiclass
response.
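For the three-way Parafac case, the second step can be sketched in base R as a sum of rank-one arrays plus noise. This is an illustration under assumed settings (standard normal weights, normal noise with standard deviation 0.1), not the package's own construction, and it does not cover the Parafac2 structure.

```r
# Illustrative sketch only; not the internal code of simcpfa.
set.seed(2)
arraydim <- c(4, 5, 6)
nfac     <- 2

# randomly generated weight matrices, one per mode
A <- matrix(rnorm(arraydim[1] * nfac), nrow = arraydim[1], ncol = nfac)
B <- matrix(rnorm(arraydim[2] * nfac), nrow = arraydim[2], ncol = nfac)
C <- matrix(rnorm(arraydim[3] * nfac), nrow = arraydim[3], ncol = nfac)  # stands in for the classification mode weights

# Parafac structure: X[i, j, k] = sum over r of A[i, r] * B[j, r] * C[k, r]
X <- array(0, dim = arraydim)
for (r in seq_len(nfac)) {
  X <- X + outer(outer(A[, r], B[, r]), C[, r])
}

# add random noise to each value in the array
E      <- array(rnorm(prod(arraydim), sd = 0.1), dim = arraydim)
Xnoisy <- X + E
dim(Xnoisy)
```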
The 'technical' argument controls the probability distributions used to
simulate weights for different modes. Currently, 'technical' is highly
structured. In particular, 'technical' must be provided as a named list
whose elements must be one of 'distA', 'distB', 'distC', 'distG', or 'distE',
with the last letter of each name designating a mode or, in the case of
'distE', designating error. Each element provided must itself be a list where
the first inner list element is named 'dname', specifying the distribution to
be used to generate weights for a given mode or for error. There are 12
'dname' options: 'normal', 'uniform', 'gamma', 'beta', 'binomial', 'poisson',
'exponential', 'geometric', 'negbinomial', 'hypergeo', 'lognormal', and
'cauchy'. Additional arguments can be added to each inner list to parameterize
the probability distribution being used. These arguments can be one of the
following, for each distribution allowed:
For dname = 'normal', allowed arguments are mean or sd (i.e., function rnorm is called).
For dname = 'uniform', allowed arguments are min or max (i.e., function runif is called).
For dname = 'gamma', allowed arguments are shape or scale (i.e., function rgamma is called).
For dname = 'beta', allowed arguments are shape1 or shape2 (i.e., function rbeta is called).
For dname = 'binomial', allowed arguments are size or prob (i.e., function rbinom is called).
For dname = 'poisson', allowed argument is lambda (i.e., function rpois is called).
For dname = 'exponential', allowed argument is rate (i.e., function rexp is called).
For dname = 'geometric', allowed argument is prob (i.e., function rgeom is called).
For dname = 'negbinomial', allowed arguments are size or prob (i.e., function rnbinom is called).
For dname = 'hypergeo', allowed arguments are m, n, or k (i.e., function rhyper is called).
For dname = 'lognormal', allowed arguments are meanlog or sdlog (i.e., function rlnorm is called).
For dname = 'cauchy', allowed arguments are location or scale (i.e., function rcauchy is called).
Note that if a weight matrix and technical information are both provided
for a given mode (or for error), the weight matrix is used while technical
information is ignored. See Examples below for an example of how to set up
'technical'.
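To make the mapping from 'dname' to a base R generator concrete, the sketch below draws a weight matrix from one inner list of 'technical'. The helper 'draw_weights' is hypothetical and is not part of the cpfa package; it only shows how the documented 'dname' values line up with the generator functions named above.

```r
# Hypothetical helper for illustration; 'draw_weights' is not part of cpfa.
draw_weights <- function(spec, nrow, ncol) {
  generators <- list(normal = rnorm, uniform = runif, gamma = rgamma,
                     beta = rbeta, binomial = rbinom, poisson = rpois,
                     exponential = rexp, geometric = rgeom,
                     negbinomial = rnbinom, hypergeo = rhyper,
                     lognormal = rlnorm, cauchy = rcauchy)
  gen <- generators[[spec$dname]]
  if (is.null(gen)) stop("unsupported dname: ", spec$dname)
  # everything except 'dname' parameterizes the distribution
  params <- spec[names(spec) != "dname"]
  count  <- nrow * ncol
  # rhyper names its count argument 'nn'; the other generators use 'n'
  args <- if (spec$dname == "hypergeo") {
    c(list(nn = count), params)
  } else {
    c(list(n = count), params)
  }
  matrix(do.call(gen, args), nrow = nrow, ncol = ncol)
}

# same structure as a 'technical' list passed to simcpfa
techlist <- list(distA = list(dname = "poisson", lambda = 3),
                 distG = list(dname = "gamma", shape = 2, scale = 4))

set.seed(3)
Amat <- draw_weights(techlist$distA, nrow = 10, ncol = 3)
Gmat <- draw_weights(techlist$distG, nrow = 3, ncol = 3)
```

Parameters omitted from an inner list fall back to the generator's defaults in this sketch, where such defaults exist; how simcpfa itself handles omitted parameters is not stated here.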
Value
X |
Simulated data array with dimensions specified by |
y |
Simulated class labels provided as an object of class 'matrix', with number
of rows equal to the last element of |
model |
Character value indicating whether Parafac or Parafac2 model was used to simulate the data array. |
Amat |
Simulated A mode weights. When |
Bmat |
Simulated B mode weights provided as a matrix with number of rows equal to
the second element of |
Cmat |
Simulated C mode weights provided as a matrix with number of rows equal to
the third element of |
Dmat |
Simulated D mode weights provided when |
Gmat |
Simulated G weights provided when |
Emat |
Error array or list containing noise added to corresponding elements of
simulated data array. Output has dimensions specified by
|
Note
This simulation implementation contains at least two limitations. First, there is currently no argument to control the proportions of generated class labels. Second, the covariance matrix parameterizing the multivariate normal distribution generating classification mode weights is restricted to a correlation matrix. Future updates are planned to address these limitations.
In addition, the simulation could be expanded in at least two ways. First,
the Monte Carlo simulation, as a brute-force strategy, is simple but
not optimal and could be replaced with a more efficient approach. Note that
the correlations between simulated class labels and classification mode
weights, while ideally close to 'corresp', represent a best-case scenario
given the values of 'onreps', 'nreps', and 'corrpred'.
Second, class labels are currently connected to the data array through only one
mode but could be simulated such that they are connected through two or more
modes. Future updates are planned to implement these enhancements.
Author(s)
Matthew Asisgress <mattgress@protonmail.ch>
References
See the help file for function 'cpfa' for a list of references.
Examples
########## Parafac2 example with 4-way array and multiclass response ##########
## Not run:
# set seed for reproducibility
set.seed(5)
# define list of arguments specifying distributions for A and G weights
techlist <- list(distA = list(dname = "poisson",
                              lambda = 3),                # for A weights
                 distG = list(dname = "gamma", shape = 2,
                              scale = 4))                 # for G weights
# define target correlation matrix for columns of D mode weights matrix
cormat <- matrix(c(1, .35, .35, .35, 1, .35, .35, .35, 1), nrow = 3, ncol = 3)
# simulate a four-way ragged array connected to a response
data <- simcpfa(arraydim = c(10, 11, 12, 100), model = "parafac2", nfac = 3,
                nclass = 3, nreps = 1e2, onreps = 10, corresp = rep(.75, 3),
                meanpred = rep(2, 3), modes = 4, corrpred = cormat,
                technical = techlist)
# examine correlations among columns of classification mode matrix Dmat
cor(data$Dmat)
# examine correlations between columns of classification mode matrix Dmat and
# simulated class labels
cor(data$Dmat, data$y)
## End(Not run)