PrInDT {PrInDT}    R Documentation

The basic undersampling loop for classification

Description

The function PrInDT uses ctrees (conditional inference trees from the package "party") to model the relationship between the two-class factor variable 'classname' and all other factor and numerical variables in the data frame 'datain' by means of 'N' repetitions of undersampling. The optimization criterion is the balanced accuracy on the validation sample 'valdat' (default = full input sample 'datain'). The trees generated from undersampling can be restricted by rejecting trees that include split results specified in the character strings of the vector 'ctestv'.
The undersampling percentages are 'percl' for the larger class and 'percs' for the smaller class (default = 1).
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
Undersampling may be stratified in two ways by the feature 'strat'; the way is selected by 'stratvers'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
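
The optimization criterion named above, the balanced accuracy, can be illustrated with a minimal base R sketch. It assumes the common definition (mean of the per-class accuracies); the helper name 'balanced_accuracy' and the toy vectors are hypothetical and not taken from the PrInDT source code.

```r
# Balanced accuracy as the mean of the per-class accuracies
# (assumed definition; hypothetical helper, not part of PrInDT).
balanced_accuracy <- function(observed, predicted) {
  observed <- factor(observed)
  predicted <- factor(predicted, levels = levels(observed))
  per_class <- sapply(levels(observed), function(cl) {
    mean(predicted[observed == cl] == cl)  # share of class 'cl' predicted correctly
  })
  mean(per_class)
}

# Toy data: 4 observations of class "a", 2 of class "b"
observed  <- factor(c("a", "a", "a", "a", "b", "b"))
predicted <- factor(c("a", "a", "a", "b", "b", "a"))
ba <- balanced_accuracy(observed, predicted)
ba  # class "a": 3/4, class "b": 1/2, mean = 0.625
```

Unlike the overall accuracy, this measure gives both classes the same weight, which matters when one class is much smaller than the other.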

In the case of repeated measurements ('indrep=1'), the values of the substructure variable have to be given in 'repvar'. Only one value of 'classname' is allowed for each value of 'repvar'. If, for a value of 'repvar', the number of predictions of the observed value of 'classname' does not reach the percentage 'thr' of the observed occurrences of that value, a misclassification is detected.
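
The rule for repeated measurements can be sketched as follows. This is one possible reading of the description, not the package's implementation: an element (one value of 'repvar') counts as misclassified when the share of its observations predicted as the observed class is below 'thr'. The helper name and the toy data are hypothetical.

```r
# Element-wise misclassification under repeated measurements
# (sketch of one reading of the rule; hypothetical helper).
element_misclassified <- function(observed, predicted, repvar, thr = 0.5) {
  correct <- as.character(predicted) == as.character(observed)
  # share of correct predictions per value of 'repvar'
  share_correct <- vapply(split(correct, repvar), mean, numeric(1))
  share_correct < thr
}

observed  <- c("yes", "yes", "yes", "no", "no", "no", "no")
predicted <- c("yes", "no",  "no",  "no", "no", "no", "yes")
repvar    <- c("s1",  "s1",  "s1",  "s2", "s2", "s2", "s2")

mis <- element_misclassified(observed, predicted, repvar, thr = 0.5)
mis  # s1: 1/3 correct, below 0.5, misclassified; s2: 3/4 correct, not misclassified
```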

Usage

PrInDT(datain,classname,ctestv=NA,N,percl,percs=1,conf.level=0.95,thres=0.5,
         stratvers=0,strat=NA,seedl=TRUE,minsplit=NA,minbucket=NA,repvar=NA,indrep=0,
         valdat=datain,thr=0.5)

Arguments

datain

Input data frame with class factor variable 'classname' and the
influential variables, which need to be factors or numericals (transform logicals and character variables to factors)

classname

Name of class variable (character)

ctestv

Vector of character strings of forbidden split results;
Example: ctestv <- rbind('variable1 == {value1, value2}','variable2 <= value3'), where the character strings specified in 'value1', 'value2' are not allowed as results of a splitting operation in 'variable1' in a tree.
For restrictions of the type 'variable <= xxx', all split results of the form 'variable <= yyy' with yyy <= xxx are excluded.
Trees with split results specified in 'ctestv' are not accepted during optimization.
A concrete example is: ctestv <- rbind('ETH == {C2a, C1a}','AGE <= 20') for variables 'ETH' and 'AGE' and values 'C2a', 'C1a', and '20'.
If no restrictions exist, the default = NA is used.
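
The rule for restrictions of the type 'variable <= xxx' can be sketched in base R. The helpers 'parse_leq' and 'is_forbidden_leq' are hypothetical, and representing a split as a string such as 'AGE <= 18' is an assumption made for illustration; how PrInDT inspects its trees internally is not shown here.

```r
# Split a string like "AGE <= 20" into variable name and numeric limit
# (hypothetical helper, for illustration only).
parse_leq <- function(s) {
  parts <- strsplit(s, "<=", fixed = TRUE)[[1]]
  list(variable = trimws(parts[1]), value = as.numeric(trimws(parts[2])))
}

# A split 'variable <= yyy' is forbidden by 'variable <= xxx' if yyy <= xxx.
is_forbidden_leq <- function(split, restriction) {
  s <- parse_leq(split)
  r <- parse_leq(restriction)
  s$variable == r$variable && s$value <= r$value
}

forbidden_18 <- is_forbidden_leq("AGE <= 18", "AGE <= 20")  # 18 <= 20: excluded
forbidden_35 <- is_forbidden_leq("AGE <= 35", "AGE <= 20")  # 35 > 20: allowed
```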

N

Number (> 2) of repetitions (integer)

percl

Undersampling percentage of larger class (numerical, > 0 and <= 1);
if percs = default, default of percl = percentage of large class resulting in the same number of observations as in the small class
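
As an illustration of this default, the percentage of the larger class that yields as many observations as the smaller class is the ratio of the two class sizes. The sketch below uses made-up class counts and hypothetical variable names.

```r
# Toy class variable: 200 observations in the large class, 25 in the small class
class_var <- factor(c(rep("large", 200), rep("small", 25)))
counts <- table(class_var)

# Percentage of the large class that matches the size of the small class
percl_balanced <- min(counts) / max(counts)              # 25 / 200 = 0.125
n_large_sampled <- round(percl_balanced * max(counts))   # 25 observations
```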

percs

Undersampling percentage of smaller class (numerical, > 0 and <= 1);
default = 1

conf.level

(1 - significance level) in function ctree (numerical, > 0 and <= 1);
default = 0.95

thres

Probability threshold for prediction of smaller class (numerical, >= 0 and < 1); default = 0.5
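
The effect of 'thres' can be sketched with made-up probabilities. The helper below is hypothetical, and the strict comparison (probability > 'thres') is an assumption; the package may treat ties differently.

```r
# Predict the smaller class whenever its predicted probability exceeds 'thres'
# (hypothetical helper; strict inequality is an assumption).
predict_with_threshold <- function(prob_small, thres = 0.5,
                                   labsmall = "small", lablarge = "large") {
  ifelse(prob_small > thres, labsmall, lablarge)
}

prob_small <- c(0.10, 0.35, 0.45, 0.55, 0.90)
pred_05 <- predict_with_threshold(prob_small, thres = 0.5)
pred_04 <- predict_with_threshold(prob_small, thres = 0.4)
# Lowering the threshold from 0.5 to 0.4 moves the third observation
# (probability 0.45) into the smaller class.
```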

stratvers

Version of stratification;
= 0: none (default),
= 1: stratification according to the percentages of the values of the factor variable 'strat',
> 1: stratification with minimum number 'stratvers' of observations per value of 'strat'

strat

Name of one (!) stratification variable for undersampling (character);
default = NA (no stratification)
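
The two stratification versions can be pictured with a base R sketch of stratified sampling of row indices. Everything here is illustrative: the helper 'stratified_sample', its argument 'min_n', and the way the minimum is applied are assumptions, not the package's implementation.

```r
# Draw a fraction 'perc' of the indices within each stratum, but at least
# 'min_n' per stratum (capped at the stratum size). Hypothetical helper.
stratified_sample <- function(idx, strat, perc, min_n = 1) {
  unlist(lapply(split(idx, strat), function(i) {
    n_take <- min(length(i), max(min_n, round(perc * length(i))))
    i[sample.int(length(i), n_take)]
  }), use.names = FALSE)
}

set.seed(1)
strat <- rep(c("f", "m"), times = c(60, 40))  # toy stratification variable
idx <- seq_along(strat)

# Proportional stratification: 10% of each stratum (6 "f", 4 "m")
picked_prop <- stratified_sample(idx, strat, perc = 0.1)
table(strat[picked_prop])

# With a minimum of 5 per stratum: 6 "f", 5 "m"
picked_min <- stratified_sample(idx, strat, perc = 0.1, min_n = 5)
table(strat[picked_min])
```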

seedl

Should the seed for random numbers be set (TRUE / FALSE)?
default = TRUE

minsplit

Minimum number of elements in a node for it to be split;
default = 20

minbucket

Minimum number of elements in a node;
default = 7

repvar

Values of variable defining the substructure in the case of repeated measurements; default = NA

indrep

Indicator of repeated measurements ('indrep=1'); default = 0

valdat

Validation data; default = datain

thr

Threshold for element classification: minimum percentage of correct class entries; default = 0.5

Details

For the optimization of the trees, we employ a method we call Sumping (Subsampling umbrella of model parameters), a variant of Bumping (Bootstrap umbrella of model parameters) (Tibshirani & Knight, 1999) which uses subsampling instead of bootstrapping. The aim of the optimization is to identify conditional inference trees with maximum predictive power on the full sample under interpretability restrictions.
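
The idea of Sumping can be sketched in base R as a loop that fits a model on repeated undersamples and keeps the model with the best balanced accuracy on the full sample. This is a simplified illustration under stated assumptions: a logistic regression ('glm') stands in for the conditional inference tree, the data are simulated, interpretability restrictions are omitted, and all variable names are hypothetical.

```r
set.seed(42)

# Simulated two-class data with a large and a small class
n_large <- 300
n_small <- 40
dat <- data.frame(
  y = factor(c(rep("large", n_large), rep("small", n_small))),
  x = c(rnorm(n_large, mean = 0), rnorm(n_small, mean = 1.5))
)

# Mean of the per-class accuracies
bal_acc <- function(observed, predicted) {
  mean(sapply(levels(observed), function(cl) mean(predicted[observed == cl] == cl)))
}

N <- 25        # number of undersampling repetitions
percl <- 0.15  # undersampling percentage of the larger class
percs <- 0.90  # undersampling percentage of the smaller class
idx_large <- which(dat$y == "large")
idx_small <- which(dat$y == "small")

best <- list(ba = -Inf, model = NULL)
ba_all <- numeric(N)
for (k in seq_len(N)) {
  # Undersample both classes without replacement
  sub <- c(sample(idx_large, round(percl * length(idx_large))),
           sample(idx_small, round(percs * length(idx_small))))
  # Stand-in model; PrInDT fits a conditional inference tree at this step
  fit <- glm(y ~ x, data = dat[sub, ], family = binomial)
  # Evaluate on the full sample; the response is P(y == "small")
  prob_small <- predict(fit, newdata = dat, type = "response")
  pred <- ifelse(prob_small > 0.5, "small", "large")
  ba_all[k] <- bal_acc(dat$y, pred)
  if (ba_all[k] > best$ba) best <- list(ba = ba_all[k], model = fit)
}

best$ba  # balanced accuracy of the best model over all repetitions
```

Evaluating every candidate on the same full sample makes the models from different undersamples directly comparable, which is the point of the "umbrella" in the method's name.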

References
– Tibshirani, R., Knight, K. 1999. Model Search and Inference By Bootstrap "bumping". Journal of Computational and Graphical Statistics, Vol. 8, No. 4 (Dec., 1999), pp. 671-686
– Weihs, C., Buschfeld, S. 2021a. Combining Prediction and Interpretation in Decision Trees (PrInDT) - a Linguistic Example. arXiv:2103.02336

Standard output can be produced by means of print(name) or just name as well as plot(name) where 'name' is the output data frame of the function.
The plot function will produce a series of more than one plot. If you use plain R on Windows, you might want to specify windows(record=TRUE) before plot(name) to save the whole series of plots. In RStudio this functionality is provided automatically.

Value

tree1st

best tree on validation sample

tree2nd

2nd-best tree on validation sample

tree3rd

3rd-best tree on validation sample

treet1st

best tree on test sample

treet2nd

2nd-best tree on test sample

treet3rd

3rd-best tree on test sample

ba1st

accuracies: largeClass, smallClass, balanced of 'tree1st', both for validation and test sample

ba2nd

accuracies: largeClass, smallClass, balanced of 'tree2nd', both for validation and test sample

ba3rd

accuracies: largeClass, smallClass, balanced of 'tree3rd', both for validation and test sample

baen

accuracies: largeClass, smallClass, balanced of ensemble of all interpretable, 3 best acceptable, and all acceptable trees on validation sample

bafull

vector of balanced accuracies of all trees from undersampling

batest

vector of test accuracies of all trees from undersampling

treeAll

tree based on all observations

baAll

balanced accuracy of 'treeAll'

interpAll

criterion of interpretability of 'treeAll' (TRUE / FALSE)

confAll

confusion matrix of 'treeAll'

acc1AE

Accuracy of full sample tree on Elements of large class

acc2AE

Accuracy of full sample tree on Elements of small class

bamaxAE

balanced accuracy of full sample tree on Elements

namA1

Names of misclassified Elements by full sample tree of large class

namA2

Names of misclassified Elements by full sample tree of small class

acc1E

Accuracy of best tree on Elements of large class

acc2E

Accuracy of best tree on Elements of small class

bamaxE

balanced accuracy of best tree on Elements

nam1

Names of misclassified Elements by best tree of large class

nam2

Names of misclassified Elements by best tree of small class

lablarge

Label of large class

labsmall

Label of small class

valdat

Validation set after edit

thr

Threshold for repeated measurements

Examples

datastrat <- PrInDT::data_zero
data <- na.omit(datastrat) # cleaned full data: no NAs
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
N <- 41  # no. of repetitions
conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree
percl <- 0.08  # undersampling percentage of the larger class
percs <- 0.95 # undersampling percentage of the smaller class
# calls of PrInDT
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level) # unstratified
out # print best model and ensembles as well as all observations
plot(out)
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,stratvers=1,
              strat="SEX") # percentage stratification
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,stratvers=50,
              strat="SEX") # stratification with minimum no. of tokens
out <- PrInDT(data,"real",ctestv,N,percl,percs,conf.level,thres=0.4) # threshold = 0.4


[Package PrInDT version 2.0.0 Index]