SimMixPrInDT {PrInDT} | R Documentation |
Interdependent estimation for classification-regression mixtures
Description
The function SimMixPrInDT applies structured subsampling for finding an optimal subsample to model
the relationship between the dependent variables specified in the sublist 'targets' of the list 'datalist' and all other factor and numerical variables
in the corresponding data frame specified in the sublist 'datanames' of the list 'datalist' in the same order as 'targets'.
The function is prepared to handle classification tasks with 2 or more classes and regression tasks. At the first stage, the targets are estimated based on
the full sample of all exogenous variables and on summaries of the observed endogenous variables.
For generating summaries, the variables representing the substructure have to be specified in the sublist 'datastruc' of 'datalist'.
The sublist 'summ' of 'datalist' includes, for each discrete target, the classes for which the summary percentages are to be calculated and, for each
continuous target, just NA (in the same order as in the sublist 'targets'). For a discrete target in this list, a sublist of classes to be combined can be provided
(see the example below).
At the second stage, structured subsampling is used to improve the models from the first stage.
The substructure of the observations used for structured subsampling is specified by the list 'Struc', which here consists only of
the name 'check' of the variable with the information about the categories of the substructure (without the dataset names, which are already specified
in 'datanames'; see the example below), and the matrix 'labs', which specifies the values of 'check' corresponding to two categories in its rows, i.e. in 'labs[1,]' and 'labs[2,]'.
The names of the categories have to be specified by rownames(labs).
In structured subsampling, first 'M' repetitions of subsampling of the variable 'name' with 'nsub' different elements of the substructure are realized. If 'nsub' is a list, each entry is employed individually. Then,
for each of the subsamples, 'N' repetitions of subsampling in classification or regression with the specified percentages of classes, observations, and predictors are carried out.
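The two-level scheme described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy data frame 'toy', the substructure variable 'child', and the observation fraction 'pobs = 0.8' are assumptions made for illustration, and no trees are fitted. The sketch only shows how 'M' draws of 'nsub' substructure elements are each followed by 'N' inner subsamples.

```r
set.seed(1)
# Toy data (hypothetical): 10 substructure elements with 5 observations each
toy <- data.frame(child = factor(rep(paste0("P", 1:10), each = 5)),
                  x = rnorm(50))
M <- 3; nsub <- 4; N <- 2; pobs <- 0.8   # illustrative values only

elements <- levels(toy$child)
draws <- lapply(seq_len(M), function(m) {
  # outer level: draw 'nsub' different elements of the substructure
  chosen <- sample(elements, nsub)
  sub <- toy[toy$child %in% chosen, , drop = FALSE]
  # inner level: 'N' repetitions of subsampling observations within 'sub'
  inner <- lapply(seq_len(N), function(n) {
    sub[sample(nrow(sub), floor(pobs * nrow(sub))), , drop = FALSE]
  })
  list(chosen = chosen, inner = inner)
})
```

In the actual function, a model would be estimated on every inner subsample and the best one retained; here the subsamples are simply stored in 'draws'.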
These percentages are specified in the matrix 'percent', one row per estimation task. For binary classification tasks,
percentages 'percl' and 'percs' for the larger and the smaller class have to be specified. For multilevel classification tasks, NA is specified (see the example below)
since the percentages are generated automatically. For regression tasks, 'pobs' and 'ppre' have to be specified for observations and predictors, respectively.
The optimization criterion is balanced accuracy for classification and goodness of fit R2 for regression, each evaluated on the full sample.
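For orientation, the two criteria named above can be written down with their standard textbook definitions. The helper names 'balanced_accuracy' and 'r_squared' are hypothetical and not part of PrInDT; the package's internal computation may differ in detail.

```r
# Balanced accuracy: mean of the per-class accuracies (recalls)
balanced_accuracy <- function(observed, predicted) {
  observed <- factor(observed)
  predicted <- factor(predicted, levels = levels(observed))
  tab <- table(observed, predicted)        # rows: observed, columns: predicted
  mean(diag(tab) / rowSums(tab))
}

# R2: 1 - residual sum of squares / total sum of squares
r_squared <- function(observed, fitted) {
  1 - sum((observed - fitted)^2) / sum((observed - mean(observed))^2)
}

obs  <- c("zero", "zero", "real", "real", "real", "real")
pred <- c("zero", "real", "real", "real", "real", "real")
balanced_accuracy(obs, pred)   # (4/4 + 1/2) / 2 = 0.75
```

Unlike plain accuracy, balanced accuracy gives each class the same weight, which matters when one class is much smaller than the other.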
At stage 2, the models are optimized individually. At stage 3, the models are optimized on the maximum of joint elements in their substructures.
The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
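The restriction by 'ctestv' mentioned above can be pictured as a literal string match against the split descriptions of a candidate tree. The following is only a sketch under that assumption: the helper 'has_forbidden_split' and the split descriptions in 'tree_a' and 'tree_b' (including the variables AGE and SEX) are hypothetical, and the function's actual check may work differently.

```r
# Hypothetical helper (not part of PrInDT): TRUE if any forbidden split
# string occurs literally in the textual description of a tree.
has_forbidden_split <- function(tree_text, ctestv) {
  if (all(is.na(ctestv))) return(FALSE)    # ctestv = NA: no restrictions
  any(vapply(ctestv,
             function(s) any(grepl(s, tree_text, fixed = TRUE)),
             logical(1)))
}

ctestv <- c("ETH == {C2a, C1a}", "MLU == {1, 3}")
# Hypothetical split descriptions of two candidate trees
tree_a <- c("AGE <= 4", "ETH == {C2a, C1a}")
tree_b <- c("AGE <= 4", "SEX == {female}")
has_forbidden_split(tree_a, ctestv)   # TRUE: tree_a would be rejected
has_forbidden_split(tree_b, ctestv)   # FALSE: tree_b would be kept
```

Matching with 'fixed = TRUE' avoids interpreting the braces and other characters of the split strings as regular expression syntax.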
Usage
SimMixPrInDT(datalist,ctestv=NA,Struc,M=12,N=99,nsub,percent=NA,conf.level=0.95,
minsplit=NA,minbucket=NA)
Arguments
datalist |
list(datanames,targets,datastruc,summ) Input data: For specification see the above description |
ctestv |
Vector of character strings of forbidden split results; default = NA |
Struc |
list(name,check,labs) Parameters for structured subsampling, as explained in the description above. |
M |
Number of repetitions of subsampling of elements of substructure; default = 12 |
N |
Number of repetitions of subsampling for predictors (integer); default = 99 |
nsub |
(List of) numbers of different elements of substructure per subsample |
percent |
Matrix of percent specifications: For specification see the above description; default: 'percent = NA' meaning default values for percentages. |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be split; default = 20 |
minbucket |
Minimum number of elements in a node; default = 7 |
Details
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.2, for further information.
Standard output can be produced by means of print(name) or just name, as well as plot(name), where 'name' is the output data frame of the function.
Value
- modelsF
Best trees at stage 1 (Full sample)
- modelsI
Best trees at stage 2 (Individual optimization)
- modelsJ
Best trees at stage 3 (Joint optimization)
- depnames
names of dependent variables
- nmod
number of models of tasks
- nlev
levels of tasks
- accAll
accuracies of best trees at both stages
Examples
# zero data
datazero <- PrInDT::data_zero
datazero <- na.omit(datazero) # cleaned full data: no NAs
names(datazero)[names(datazero)=="real"] <- "zero"
CHILDzero <- PrInDT::participant_zero
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
##
# multi-level data
datastrat <- PrInDT::data_zero
datamult <- na.omit(datastrat)
# ctestv <- NA
datamult$mult[datamult$ETH %in% c("C1a","C1b","C1c") & datamult$real == "zero"] <- "zero1"
datamult$mult[datamult$ETH %in% c("C2a","C2b","C2c") & datamult$real == "zero"] <- "zero2"
datamult$mult[datamult$real == "realized"] <- "real"
datamult$mult <- as.factor(datamult$mult) # mult is new class variable
datamult$real <- NULL # remove old class variable
CHILDmult <- CHILDzero
##
# vowel data
data <- PrInDT::data_vowel
data <- na.omit(data)
CHILDvowel <- data$Nickname
CHILDvowel <- as.factor(gsub("Nick","P",CHILDvowel))
data$Nickname <- NULL
syllable <- 3 - data$syllables
data$syllables <- NULL
data$syllables <- syllable
data$speed <- data$word_duration / data$syllables
names(data)[names(data) == "target"] <- "vowel"
datavowel <- data
##
# function preparation and call
# datalist and percent
datanames <- list("datazero","datamult","datavowel")
targets <- c("zero","mult","vowel")
datastruc <- list(CHILDzero,CHILDmult,CHILDvowel)
summult <- paste("zero1","zero2",sep=",")
summ <- c("zero",summult,NA)
datalist <- list(datanames=datanames,targets=targets,datastruc=datastruc,summ=summ)
percent <- matrix(NA,nrow=3,ncol=2)
percent[1,] <- c("percl=0.075","percs=0.9") # percentages for datazero
# no percentages needed for datamult
percent[3,] <- c("pobs=0.9","ppre=c(0.9,0.8)") # percentages for datavowel
# substructures
labs <- matrix(1:6,nrow=2,ncol=3)
labs[1,] <- c("C1a","C1b","C1c")
labs[2,] <- c("C2a","C2b","C2c")
rownames(labs) <- c("children 1","children 2")
Struc <- list(check="ETH",labs=labs)
outSimMix <- SimMixPrInDT(datalist,ctestv=ctestv,Struc=Struc,M=2,N=9,nsub=c(19,20),
percent=percent,conf.level=0.99)
outSimMix
plot(outSimMix)