PrInDT {PrInDT} | R Documentation |
The basic undersampling loop for classification
Description
The function PrInDT uses ctrees (conditional inference trees from the package "party") for optimal modeling of
the relationship between the two-class factor variable 'classname' and all other factor and numerical variables
in the data frame 'datain' by means of 'N' repetitions of undersampling. The optimization criterion is the balanced accuracy
on the validation sample 'valdat' (default = full input sample 'datain'). The trees generated from undersampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.
The undersampling percentages are 'percl' for the larger class and 'percs' for the smaller class (default = 1).
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
Undersampling may be stratified in two ways by the feature 'strat'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
In the case of repeated measurements ('indrep=1'), the values of the substructure variable have to be given in 'repvar'.
Only one value of 'classname' is allowed for each value of 'repvar'.
If, for a value of 'repvar', the number of predictions of the observed value of 'classname' does not reach the percentage 'thr' of its observed occurrences, a misclassification is detected.
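The interplay of the probability threshold 'thres' and the balanced accuracy described above can be illustrated with a minimal base-R sketch. The helper 'balanced_accuracy' and the toy vectors are hypothetical and not part of PrInDT; whether PrInDT compares the probability with '>' or '>=' is an assumption here.

```r
# Hypothetical helper, not part of PrInDT: class-wise and balanced accuracy
# for a two-class problem, given predicted probabilities of the smaller class.
balanced_accuracy <- function(truth, prob_small, small_label, large_label,
                              thres = 0.5) {
  # Assumption: the smaller class is predicted when its probability exceeds 'thres'
  pred <- ifelse(prob_small > thres, small_label, large_label)
  acc_large <- mean(pred[truth == large_label] == large_label)
  acc_small <- mean(pred[truth == small_label] == small_label)
  c(largeClass = acc_large, smallClass = acc_small,
    balanced = (acc_large + acc_small) / 2)
}

# Toy data: four observations of the larger class, two of the smaller class
truth      <- c("no", "no", "no", "no", "yes", "yes")
prob_small <- c(0.1, 0.2, 0.45, 0.7, 0.42, 0.9)

ba_default <- balanced_accuracy(truth, prob_small, "yes", "no")              # thres = 0.5
ba_lower   <- balanced_accuracy(truth, prob_small, "yes", "no", thres = 0.4) # thres = 0.4
```

In this toy case, lowering the threshold raises the accuracy on the smaller class at the cost of the larger class, which is the trade-off 'thres' is meant to control.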
Usage
PrInDT(datain,classname,ctestv=NA,N,percl,percs=1,conf.level=0.95,thres=0.5,
stratvers=0,strat=NA,seedl=TRUE,minsplit=NA,minbucket=NA,repvar=NA,indrep=0,
valdat=datain,thr=0.5)
Arguments
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number (> 2) of repetitions (integer) |
percl |
Undersampling percentage of larger class (numerical, > 0 and <= 1); |
percs |
Undersampling percentage of smaller class (numerical, > 0 and <= 1); |
conf.level |
(1 - significance level) in function |
thres |
Probability threshold for prediction of smaller class (numerical, >= 0 and < 1); default = 0.5 |
stratvers |
Version of stratification; |
strat |
Name of one (!) stratification variable for undersampling (character); |
seedl |
Should the seed for random numbers be set (TRUE / FALSE)? |
minsplit |
Minimum number of elements in a node to be split; |
minbucket |
Minimum number of elements in a node; |
repvar |
Values of variable defining the substructure in the case of repeated measurements; default = NA |
indrep |
Indicator of repeated measurements ('indrep=1'); default = 0 |
valdat |
Validation data; default = datain |
thr |
Threshold for element classification: minimum percentage of correct class entries; default = 0.5 |
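The role of 'thr' for repeated measurements can be sketched in base R as follows. This is one reading of the rule stated in the Description, not the package's internal code: the helper 'misclassified_elements' is hypothetical, and treating a share exactly equal to 'thr' as correct is an assumption.

```r
# Hypothetical helper, not part of PrInDT: an element (a value of 'repvar')
# counts as misclassified if the share of its observations predicted as its
# observed class stays below 'thr'.
misclassified_elements <- function(repvar, truth, pred, thr = 0.5) {
  # Share of correctly predicted observations per element
  share_correct <- tapply(pred == truth, repvar, mean)
  names(share_correct)[share_correct < thr]
}

# Toy data: three elements, each with a single observed class
repvar <- c("A", "A", "A", "B", "B", "C", "C", "C", "C")
truth  <- c("yes", "yes", "yes", "no", "no", "no", "no", "no", "no")
pred   <- c("yes", "no", "no", "no", "no", "no", "yes", "yes", "no")

bad_default <- misclassified_elements(repvar, truth, pred)            # thr = 0.5
bad_strict  <- misclassified_elements(repvar, truth, pred, thr = 0.6) # stricter
```

Element "A" has only one of three observations predicted correctly and is flagged at the default threshold; element "C" reaches exactly one half and is flagged only under the stricter setting.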
Details
For the optimization of the trees, we employ a method we call Sumping (Subsampling umbrella of model parameters), a variant of Bumping (Bootstrap umbrella of model parameters) (Tibshirani & Knight, 1999) which uses subsampling instead of bootstrapping. The aim of the optimization is to identify conditional inference trees with maximum predictive power on the full sample under interpretability restrictions.
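The idea of Sumping can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the PrInDT implementation: a logistic regression ('glm') stands in for the conditional inference tree, the data are simulated, the interpretability check via 'ctestv' is omitted, and all object names except 'N', 'percl', 'percs', and 'thres' are hypothetical.

```r
set.seed(1)  # reproducible subsamples

# Simulated unbalanced two-class data (assumption, for illustration only)
n_large <- 200
n_small <- 40
dat <- data.frame(
  x = c(rnorm(n_large, mean = 0), rnorm(n_small, mean = 1.5)),
  y = factor(rep(c("large", "small"), c(n_large, n_small)))
)

# Balanced accuracy: mean of the two class-wise accuracies
bal_acc <- function(truth, pred) {
  mean(c(mean(pred[truth == "large"] == "large"),
         mean(pred[truth == "small"] == "small")))
}

N <- 20        # number of undersampling repetitions
percl <- 0.2   # undersampling percentage of the larger class
percs <- 1     # undersampling percentage of the smaller class
thres <- 0.5   # probability threshold for predicting the smaller class

idx_large <- which(dat$y == "large")
idx_small <- which(dat$y == "small")

best <- list(ba = -Inf, model = NULL)
ba_all <- numeric(N)
for (i in seq_len(N)) {
  # Draw an undersample of each class without replacement
  sub <- c(sample(idx_large, round(percl * length(idx_large))),
           sample(idx_small, round(percs * length(idx_small))))
  # Stand-in model; PrInDT fits a ctree at this step
  fit <- glm(y ~ x, data = dat[sub, ], family = binomial)
  # Evaluate every candidate on the full sample, not on its subsample
  prob_small <- predict(fit, newdata = dat, type = "response")
  pred <- ifelse(prob_small > thres, "small", "large")
  ba_all[i] <- bal_acc(dat$y, pred)
  # Keep the model with the highest balanced accuracy so far
  if (ba_all[i] > best$ba) best <- list(ba = ba_all[i], model = fit)
}
```

The key design point is that each candidate is trained on a subsample but scored on the full (or validation) sample, so the selected model is the one that generalizes best beyond its own undersample.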
References
– Tibshirani, R., Knight, K. 1999. Model Search and Inference By Bootstrap "bumping".
Journal of Computational and Graphical Statistics, Vol. 8, No. 4 (Dec., 1999), pp. 671-686
– Weihs, C., Buschfeld, S. 2021a. Combining Prediction and Interpretation in Decision Trees (PrInDT) -
a Linguistic Example. arXiv:2103.02336
Standard output can be produced by means of print(name) or just name, as well as plot(name), where 'name' is the output data frame of the function.
The plot function will produce a series of more than one plot. If you use R, you might want to specify windows(record=TRUE) before plot(name) to save the whole series of plots. In RStudio this functionality is provided automatically.
Value
- tree1st
best tree on validation sample
- tree2nd
2nd-best tree on validation sample
- tree3rd
3rd-best tree on validation sample
- treet1st
best tree on test sample
- treet2nd
2nd-best tree on test sample
- treet3rd
3rd-best tree on test sample
- ba1st
accuracies: largeClass, smallClass, balanced of 'tree1st', both for validation and test sample
- ba2nd
accuracies: largeClass, smallClass, balanced of 'tree2nd', both for validation and test sample
- ba3rd
accuracies: largeClass, smallClass, balanced of 'tree3rd', both for validation and test sample
- baen
accuracies: largeClass, smallClass, balanced of ensemble of all interpretable, 3 best acceptable, and all acceptable trees on validation sample
- bafull
vector of balanced accuracies of all trees from undersampling
- batest
vector of test accuracies of all trees from undersampling
- treeAll
tree based on all observations
- baAll
balanced accuracy of 'treeAll'
- interpAll
criterion of interpretability of 'treeAll' (TRUE / FALSE)
- confAll
confusion matrix of 'treeAll'
- acc1AE
Accuracy of full sample tree on Elements of large class
- acc2AE
Accuracy of full sample tree on Elements of small class
- bamaxAE
balanced accuracy of full sample tree on Elements
- namA1
Names of misclassified Elements by full sample tree of large class
- namA2
Names of misclassified Elements by full sample tree of small class
- acc1E
Accuracy of best tree on Elements of large class
- acc2E
Accuracy of best tree on Elements of small class
- bamaxE
balanced accuracy of best tree on Elements
- nam1
Names of misclassified Elements by best tree of large class
- nam2
Names of misclassified Elements by best tree of small class
- lablarge
Label of large class
- labsmall
Label of small class
- valdat
Validation set after edit
- thr
Threshold for repeated measurements
Examples
datastrat <- PrInDT::data_zero
data <- na.omit(datastrat) # cleaned full data: no NAs
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
N <- 41 # no. of repetitions
conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree
percl <- 0.08 # undersampling percentage of the larger class
percs <- 0.95 # undersampling percentage of the smaller class
# calls of PrInDT
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level) # unstratified
out # print best model and ensembles as well as all observations
plot(out)
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,stratvers=1,
strat="SEX") # percentage stratification
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,stratvers=50,
strat="SEX") # stratification with minimum no. of tokens
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,thres=0.4) # threshold = 0.4