PrInDTCstruc {PrInDT} | R Documentation |
Structured subsampling for classification
Description
The function PrInDTCstruc applies structured subsampling for finding an optimal subsample to model
the relationship between the two-class factor variable 'classname' and all other factor and numerical variables
in the data frame 'datain' by means of 'N' repetitions of subsampling from a substructure and 'Ni' repetitions of subsampling from the predictors.
The optimization criterion is the balanced accuracy on the validation sample 'valdat' (default is the full input sample 'datain').
Other criteria are possible (cf. parameter description of 'crit'). The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
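As an illustration of the default criterion, the following minimal sketch computes balanced accuracy as the mean of the two per-class accuracies. The helper 'balanced_accuracy' and the toy vectors are hypothetical and not part of PrInDT; the package's internal computation may differ in detail.

```r
# Balanced accuracy: mean of the per-class accuracies.
# 'truth' and 'pred' are hypothetical factors sharing the same two levels.
balanced_accuracy <- function(truth, pred) {
  tab <- table(truth, pred)              # rows: true class, columns: predicted class
  per_class <- diag(tab) / rowSums(tab)  # accuracy within each true class
  mean(per_class)
}

lev   <- c("zero", "real")
truth <- factor(c("zero", "zero", "zero", "zero", "real", "real"), levels = lev)
pred  <- factor(c("zero", "zero", "zero", "real", "real", "zero"), levels = lev)

ba <- balanced_accuracy(truth, pred)     # (3/4 + 1/2) / 2
ba
```

Unlike plain accuracy, this value is not dominated by the larger class, which is why it suits unbalanced two-class problems.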
The substructure of the observations used for subsampling is specified by the list 'Struc' which consists of the 'name' of the variable representing the substructure,
the name 'check' of the variable with the information about the categories of the substructure, and the matrix 'labs' which specifies the values of 'check'
corresponding to two categories in its rows, i.e. in 'labs[1,]' and 'labs[2,]'. The names of the categories have to be specified by rownames(labs).
See parameter description of 'Struc' for its specification for 'vers="b"' and 'indrep=1'.
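The correspondence between 'check', 'labs', and the elements of the substructure can be pictured with a small sketch. The toy data frame and its column names are hypothetical, and the code only illustrates the intended mapping, not how PrInDTCstruc processes 'Struc' internally.

```r
# Toy data; column names and values are hypothetical.
toy <- data.frame(
  participant = c("p1", "p1", "p2", "p3", "p3", "p4"),
  ETH         = c("C1a", "C1a", "C2b", "C1c", "C1c", "C2a"),
  stringsAsFactors = FALSE
)

# Each row of 'labs' lists the 'check' values belonging to one category.
labs <- matrix(c("C1a", "C1b", "C1c",
                 "C2a", "C2b", "C2c"), nrow = 2, byrow = TRUE)
rownames(labs) <- c("Children 1", "Children 2")

# Category of each observation: the row of 'labs' containing its 'check' value.
category <- ifelse(toy$ETH %in% labs[1, ], rownames(labs)[1], rownames(labs)[2])

# Elements (here: participants) available for subsampling in each category.
elements <- lapply(split(toy$participant, category), unique)
elements
```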
The number of predictors 'Pit' to be included in the model and the number of elements of the substructure 'Eit' have to be specified (lists allowed), and
undersampling of the categories of 'classname' can be controlled by 'undersamp=TRUE/FALSE'.
Four different versions of structured subsampling exist:
a) just of the elements in the substructure (possibly with additional undersampling) with parameters 'N' and 'Eit',
b) just of the predictors with parameters 'Ni' and 'Pit',
c) of the predictors and for each subset of predictors subsampling of the elements of the substructure (possibly with additional undersampling)
with parameters 'N', 'Ni', 'Eit', and 'Pit', and
d) of the elements of the substructure (possibly with additional undersampling) and for each of these subsets subsampling of the predictors
with the same parameters as version c).
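The nesting in version c) can be sketched as two loops: 'Ni' draws of 'Pit' predictors, and for each of them 'N' draws of 'Eit' elements. The names 'predictors', 'elements', and 'draws' are hypothetical, and the sketch only enumerates the subsamples; it assumes simple random draws and omits tree fitting, undersampling, and the handling of the two categories.

```r
set.seed(1)                            # reproducible draws for the illustration
predictors <- paste0("x", 1:8)         # hypothetical predictor names
elements   <- paste0("e", 1:10)        # hypothetical elements of the substructure
Ni <- 3; N <- 2; Pit <- 5; Eit <- 4

draws <- list()
for (i in seq_len(Ni)) {
  pred_sub <- sample(predictors, Pit)  # outer loop: subsample the predictors
  for (j in seq_len(N)) {
    elem_sub <- sample(elements, Eit)  # inner loop: subsample the elements
    # A tree would be fitted and evaluated on this combination here.
    draws[[length(draws) + 1]] <- list(predictors = pred_sub, elements = elem_sub)
  }
}
length(draws)                          # Ni * N candidate subsamples
```

Version d) would swap the two loops, so that predictors are redrawn for each subset of elements.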
Sampling of the elements of the substructure can be influenced by using weights of the elements ('weight=TRUE') according to the number of appearances of the smaller
class of 'classname'. This way, elements with more realisations in the smaller class are preferred.
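One plausible reading of this weighting is sampling with probabilities proportional to each element's count in the smaller class. The sketch below uses hypothetical toy data and base R's sample(); the exact weighting scheme used by PrInDTCstruc is an assumption here.

```r
set.seed(42)
# Toy data; 'element' and 'class' are hypothetical column names.
toy <- data.frame(
  element = c("e1", "e1", "e1", "e2", "e2", "e3", "e3", "e3", "e3"),
  class   = c("small", "small", "large", "large", "large",
              "small", "small", "small", "large")
)

# Weight of an element: number of its observations in the smaller class.
w <- tapply(toy$class == "small", toy$element, sum)   # e1: 2, e2: 0, e3: 3

# Draw two elements without replacement, favouring large weights.
picked <- sample(names(w), size = 2, prob = w)
picked
```

In this toy case 'e2' has weight 0 and can never be drawn, so the two selected elements are always 'e1' and 'e3'.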
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
The parameter 'indrep' indicates repeated measurement situations ('indrep=1', only implemented for 'crit="ba"'); default = 0.
Repeated measurements are multiple measurements of the same variable taken on the same subjects (or objects) either under different conditions or over two or more time periods.
The name of the variable with the repeatedly observed subjects (or objects) has to be specified by 'name' in 'Struc'. Only one value of 'classname' is allowed for each value of 'Struc$name'.
With 'indrep=1', it is automatically assumed that the same number of
subjects (or objects) from each of the two classes has to be used for model building. Possible values for this number can be specified by 'Eit'.
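Before using 'indrep=1', it may help to verify that each subject carries only one class value. The following check is a minimal sketch on hypothetical toy data; it is not a function of the package.

```r
# Toy repeated-measurement data; column names are hypothetical.
toy <- data.frame(
  subject = c("s1", "s1", "s2", "s2", "s3"),
  class   = c("a", "a", "b", "b", "a")
)

# Number of distinct class values observed for each subject.
n_classes <- tapply(toy$class, toy$subject, function(x) length(unique(x)))

# TRUE only if every subject has exactly one class value.
ok <- all(n_classes == 1)
ok
```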
Usage
PrInDTCstruc(datain,classname,ctestv=NA,Struc=NA,vers="d",weight=FALSE,
Eit=NA,Pit=NA,N=99,Ni=99,undersamp=TRUE,crit="ba",ktest=0,
stest=integer(length=0),conf.level=0.95,indrep=0,
minsplit=NA,minbucket=NA,valdat=datain,thr=0.5)
Arguments
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
Struc |
= list(name,check,labs), cf. description for explanations; Struc not needed for vers="b"; for indrep=1, Struc = list(name); |
vers |
Version of structured subsampling: "a", "b", "c", "d", cf. description; |
weight |
Weights to be used for subsampling of elements of substructure (logical, TRUE or FALSE); default = FALSE |
Eit |
List of number of elements of substructure (integers); |
Pit |
List of number of predictors (integers) |
N |
Number of repetitions of subsampling from substructure (integer) |
Ni |
Number of repetitions of subsampling from predictors |
undersamp |
Undersampling of categories of 'classname' to be used (logical, TRUE or FALSE) |
crit |
Optimisation criterion: "ba" for balanced accuracy, "bat" for balanced accuracy on test sets, "ta" for test accuracy,
"tal" for test accuracy of continuing parts of length 'ktest' in substructure elements 'stest'; |
ktest |
Length of continuing parts to be tested (for crit="tal"); |
stest |
Part of substructure to be tested (for crit="tal") (integer vector); |
conf.level |
(1 - significance level) in function |
indrep |
Indicator for repeated measurements, i.e. more than one observation with the same class for each element; |
minsplit |
Minimum number of elements in a node to be split; |
minbucket |
Minimum number of elements in a node; |
valdat |
validation data; default = datain |
thr |
threshold for element classification: minimum percentage of correct class entries; default = 0.5 |
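The role of 'thr' can be pictured as follows: an element counts as correctly classified if the share of its correctly classified observations reaches the threshold. The sketch uses hypothetical toy data and assumes a "greater than or equal" comparison; the package's exact rule may differ.

```r
# Toy per-observation results; 'element' and 'correct' are hypothetical names.
toy <- data.frame(
  element = c("e1", "e1", "e1", "e2", "e2", "e3", "e3", "e3", "e3"),
  correct = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)
thr <- 0.5

# Share of correctly classified observations per element.
share <- tapply(toy$correct, toy$element, mean)   # e1: 2/3, e2: 0, e3: 1/2

# Element-level classification under the assumed rule share >= thr.
element_ok <- share >= thr
element_ok
```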
Details
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.1, for further information.
Standard output can be produced by means of print(name$besttree) or just name$besttree as well as plot(name$besttree), where 'name' is the output data frame of the function.
Value
- modbest
Best tree
- interp
Number of interpretable trees, overall number of trees
- dmax
Number of predictors in training set for best tree
- ntmax
Size of training set for best tree
- acc1
Accuracy of best tree on large class
- acc2
Accuracy of best tree on small class
- bamax
Balanced accuracy of best tree
- tamax
Test accuracy of best tree
- kumin
Number of elements with misclassified parts longer than 'ktest' for best tree
- elems
Elements with long misclassified parts for best tree
- mindlong
Indices of long misclassified parts for best tree
- ind1max
Elements of 1st category of substructure used by best tree
- ind2max
Elements of 2nd category of substructure used by best tree
- indmax
Predictors used by best tree
- bestTrain
Training set for best tree
- bestTest
Test set for best tree
- labs
labs from Struc
- lablarge
Label of large class
- labsmall
Label of small class
- vers
Version used for structured subsampling
- acc1E
Accuracy of large class on Elements of best tree
- acc2E
Accuracy of small class on Elements of best tree
- bamaxE
Balanced accuracy of best tree on Elements
- nam1
Names of misclassified Elements of large class
- nam2
Names of misclassified Elements of small class
- thr
Threshold for element classification
Examples
data <- PrInDT::data_zero
data <- na.omit(data) # cleaned full data: no NAs
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
# substructure
name <- PrInDT::participant_zero
check <- "data$ETH"
labs <- matrix(1:6,nrow=2,ncol=3)
labs[1,] <- c("C1a","C1b","C1c")
labs[2,] <- c("C2a","C2b","C2c")
rownames(labs) <- c("Children 1","Children 2")
Struc <- list(name=name,check=check,labs=labs)
out <- PrInDTCstruc(data,"real",ctestv,Struc,vers="c",weight=TRUE,N=5,Pit=5,conf.level=0.99)
out
plot(out)
# indrep = 1
Struc <- list(name=name)
out <- PrInDTCstruc(data,"real",Struc=Struc,vers="c",Pit=5,Eit=5,N=5,crit="ba",indrep=1,
conf.level=0.99)