OptPrInDT {PrInDT} | R Documentation |
Optimisation of undersampling percentages for classification
Description
The function OptPrInDT applies an iterative technique to find the optimal undersampling percentages
'percl' for the larger class and 'percs' for the smaller class by means of a nested grid search. These percentages are used by the function PrInDT to model
the relationship between the two-class factor variable 'classname' and all other factor and numerical variables
in the data frame 'data' by means of 'N' repetitions of undersampling. The optimization criterion is the balanced accuracy
on the validation sample 'valdat' (default = full sample 'data'). The trees generated from undersampling can be restricted by rejecting trees
that include split results specified in the character strings of the vector 'ctestv'.
The inputs 'plmax' and 'psmax' determine the maximal values of the percentages, and the inputs 'distl' and 'dists' determine
the distances to the next smaller percentage to be tried.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
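As a rough illustration of the optimization criterion named above, the following base R sketch computes the balanced accuracy of a two-class prediction as the mean of the two per-class recalls. The helper name balanced_accuracy and the toy vectors are hypothetical and are not part of PrInDT; how the package computes the criterion internally may differ in detail.

```r
# Hypothetical helper (not part of PrInDT): balanced accuracy as the
# mean of the per-class recalls.
balanced_accuracy <- function(truth, predicted) {
  truth <- factor(truth)
  predicted <- factor(predicted, levels = levels(truth))
  tab <- table(truth, predicted)        # rows: true class, columns: predicted class
  recall <- diag(tab) / rowSums(tab)    # share of each true class predicted correctly
  mean(recall)
}

# Toy data: 8 observations of class "a", 2 of class "b"
truth     <- c("a", "a", "a", "a", "a", "a", "a", "a", "b", "b")
predicted <- c("a", "a", "a", "a", "a", "a", "a", "b", "b", "a")

ba  <- balanced_accuracy(truth, predicted)  # (7/8 + 1/2) / 2 = 0.6875
acc <- mean(truth == predicted)             # plain accuracy: 8/10 = 0.8
```

With unbalanced classes, the plain accuracy (0.8) is dominated by the larger class, whereas the balanced accuracy (0.6875) weights both classes equally.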
Usage
OptPrInDT(data,classname,ctestv=NA,N=99,plmax=0.09,psmax=0.9,
distl=0.01,dists=0.1,conf.level=0.95,minsplit=NA,minbucket=NA,
valdat=data)
Arguments
data |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number (> 7) of repetitions (integer) |
plmax |
Maximal undersampling percentage of larger class (numerical, > 0 and <= 1); |
psmax |
Maximal undersampling percentage of smaller class (numerical, > 0 and <= 1); |
distl |
Distance to the next lower undersampling percentage of larger class (numerical, > 0 and < 1); |
dists |
Distance to the next lower undersampling percentage of smaller class (numerical, > 0 and < 1); |
conf.level |
(1 - significance level) in function |
minsplit |
Minimum number of elements in a node to be split; |
minbucket |
Minimum number of elements in a node; |
valdat |
validation data; default = data |
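To illustrate how the arguments 'plmax', 'psmax', 'distl', and 'dists' could span a grid of candidate percentages, the following base R sketch builds descending candidates from the default values. The helper candidate_percentages and the number of steps n_steps are assumptions made for illustration only; this page does not state how many percentages OptPrInDT actually tries or in which order.

```r
# Hypothetical helper (not part of PrInDT): descending candidate
# percentages starting at pmax, spaced by dist. n_steps is an assumed value.
candidate_percentages <- function(pmax, dist, n_steps = 3) {
  cand <- pmax - dist * (seq_len(n_steps) - 1)
  cand[cand > 1e-12]                     # keep strictly positive percentages only
}

percl_grid <- candidate_percentages(0.09, 0.01)  # from plmax and distl defaults
percs_grid <- candidate_percentages(0.9, 0.1)    # from psmax and dists defaults

# All combinations of larger-class and smaller-class percentages
grid <- expand.grid(percl = percl_grid, percs = percs_grid)
nrow(grid)
```

Under these assumptions each of the 3 x 3 = 9 combinations would be evaluated with 'N' repetitions of undersampling, so the run time grows with both the grid size and 'N'.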
Details
See help("RePrInDT") and help("PrInDT") for further information.
Standard output can be produced by means of print(name$besttree) or just name$besttree,
as well as plot(name$besttree), where 'name' is the output data frame of the function.
Value
- besttree
best tree on full sample
- bestba
balanced accuracy of best tree on full sample
- percl
undersampling percentage of large class of best tree on full sample
- percs
undersampling percentage of small class of best tree on full sample
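To make the meaning of the returned percentages 'percl' and 'percs' more concrete, the following base R sketch draws a single undersample from toy data. The helper undersample_once and the data frame toy are hypothetical, and the sketch assumes that each percentage is the fraction of its class kept in the undersample, drawn without replacement; the exact sampling scheme used inside PrInDT is not described on this page.

```r
# Hypothetical helper (not part of PrInDT): keep a fraction percl of the
# larger class and a fraction percs of the smaller class.
undersample_once <- function(data, classname, percl, percs) {
  counts <- table(data[[classname]])
  larger <- names(counts)[which.max(counts)]       # label of the larger class
  idx_large <- which(data[[classname]] == larger)
  idx_small <- which(data[[classname]] != larger)
  keep <- c(
    idx_large[sample.int(length(idx_large), round(percl * length(idx_large)))],
    idx_small[sample.int(length(idx_small), round(percs * length(idx_small)))]
  )
  data[keep, , drop = FALSE]
}

set.seed(1)
toy <- data.frame(
  y = factor(rep(c("no", "yes"), times = c(1000, 50))),  # unbalanced classes
  x = rnorm(1050)
)

sub <- undersample_once(toy, "y", percl = 0.09, percs = 0.9)
table(sub$y)  # 90 "no" and 45 "yes"
```

In this toy case the class ratio changes from 20:1 in the full data to 2:1 in the undersample.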
Examples
datastrat <- PrInDT::data_zero
data <- na.omit(datastrat) # cleaned full data: no NAs
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
# call of OptPrInDT
out <- OptPrInDT(data,"real",ctestv,N=24,conf.level=0.995) # unstratified
out # print best model and ensembles as well as all observations
plot(out)