C2SPrInDT {PrInDT} | R Documentation |
Two-stage estimation for classification
Description
The function C2SPrInDT applies two-stage estimation to find an optimal model for the relationships between the two-class factor variables
whose column indices in 'datain' are given in the vector 'inddep' and all other factor and numerical variables in the data frame 'datain'.
The models are found by means of 'N' repetitions of random subsampling with percentages 'percl' for the larger classes and 'percs' for the smaller classes.
For each dependent variable, one percentage has to be specified for the larger class and one for the smaller class. For example, for three dependent variables,
'percl' consists of three percentages specified in the order in which the dependent variables appear in 'inddep'.
The dependent variables have to be specified as dummies, i.e. as 0 for 'property absent' or 1 for 'property present', where the property is the one the dependent variable represents.
The indices of the predictors relevant for 1st stage modeling can be specified in the vector 'indind'. If 'indind' is not specified, then all variables in 'datain'
which are not specified in 'inddep' are used as predictors at the 1st stage. At the 2nd stage, all such variables are used as predictors anyway.
The optimization criterion is the balanced accuracy on the full sample. The trees generated from undersampling can be restricted by rejecting trees
that include split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
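The subsampling scheme and the optimization criterion described above can be illustrated in base R. The sketch below is not the package's internal code: the helper names 'undersample' and 'balanced_accuracy' and the toy data are assumptions made for illustration only. It draws a fraction 'percl' of the larger class and a fraction 'percs' of the smaller class, and computes the balanced accuracy as the mean of the per-class accuracies.

```r
# Hypothetical helpers for illustration only; not part of PrInDT.

# Draw a random subsample: fraction `percl` of the larger class and
# fraction `percs` of the smaller class of the two-class factor column `y`.
undersample <- function(data, y, percl, percs) {
  counts <- table(data[[y]])
  large <- names(counts)[which.max(counts)]
  idx_large <- which(data[[y]] == large)
  idx_small <- which(data[[y]] != large)
  n_large <- round(percl * length(idx_large))
  n_small <- round(percs * length(idx_small))
  keep <- c(idx_large[sample.int(length(idx_large), n_large)],
            idx_small[sample.int(length(idx_small), n_small)])
  data[keep, , drop = FALSE]
}

# Balanced accuracy: mean over classes of the share of correctly
# predicted observations within each class.
balanced_accuracy <- function(truth, pred) {
  truth <- factor(truth)
  pred <- factor(pred, levels = levels(truth))
  per_class <- vapply(levels(truth),
                      function(cl) mean(pred[truth == cl] == cl),
                      numeric(1))
  mean(per_class)
}

# Toy data: 80 observations of class 0, 20 of class 1.
set.seed(1)
toy <- data.frame(y = factor(c(rep(0, 80), rep(1, 20))), x = rnorm(100))

sub <- undersample(toy, "y", percl = 0.25, percs = 0.75)
table(sub$y)  # 20 observations of class 0, 15 of class 1

# Always predicting the larger class is 80% accurate on the toy data,
# but its balanced accuracy is only 0.5.
ba_trivial <- balanced_accuracy(toy$y, rep(0, 100))
```

In this toy setting the trivial classifier illustrates why balanced accuracy, rather than plain accuracy, is a sensible criterion for unbalanced classes.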
Usage
C2SPrInDT(datain,ctestv=NA,conf.level=0.95,percl,percs,N=99,indind=NA,inddep,
minsplit=NA,minbucket=NA)
Arguments
datain |
Input data frame with class factor variables and the influential variables, |
ctestv |
Vector of character strings of forbidden split results; |
conf.level |
(1 - significance level) in function |
percl |
list of undersampling percentages of larger class (numerical, > 0 and <= 1): one per dependent class variable in the same order as in 'inddep' |
percs |
list of undersampling percentages of smaller class (numerical, > 0 and <= 1): one per dependent class variable in the same order as in 'inddep' |
N |
no. of repetitions (integer > 0); default = 99 |
indind |
indices of independent variables at stage 1; default = NA (means all independent variables used) |
inddep |
indices of dependent variables |
minsplit |
Minimum number of elements in a node to be split; default = 20 |
minbucket |
Minimum number of elements in a node; default = 7 |
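Before calling the function, it may help to check that the inputs match the requirements listed above. The sketch below uses a hypothetical helper, 'check_c2s_inputs', which is not part of PrInDT, together with a made-up data frame. It recodes logical columns as 0/1 factors and verifies that 'percl' and 'percs' each hold one value in (0, 1] per dependent variable. The exact checks performed by the package itself may differ.

```r
# Hypothetical helper for illustration only; not part of PrInDT.
check_c2s_inputs <- function(datain, inddep, percl, percs) {
  stopifnot(is.data.frame(datain),
            all(inddep %in% seq_along(datain)),
            length(percl) == length(inddep),
            length(percs) == length(inddep),
            all(percl > 0 & percl <= 1),
            all(percs > 0 & percs <= 1))
  for (j in inddep) {
    # Each dependent variable is expected to be a factor with levels 0 and 1.
    stopifnot(is.factor(datain[[j]]),
              setequal(levels(datain[[j]]), c("0", "1")))
  }
  invisible(TRUE)
}

# Made-up data: one predictor and two dependent variables stored as logicals.
toy <- data.frame(
  age   = c(23, 35, 41, 52, 29, 60),
  propA = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  propB = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
inddep <- c(2, 3)

# Recode each dependent variable as a 0/1 factor (1 = 'property present').
toy[inddep] <- lapply(toy[inddep],
                      function(v) factor(as.integer(v), levels = c(0, 1)))

# One percentage per dependent variable, in the order given by 'inddep'.
ok <- check_c2s_inputs(toy, inddep, percl = c(0.5, 0.4), percs = c(1, 1))
```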
Details
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.1, for further information.
Standard output can be produced by means of print(name) or just name, as well as plot(name), where 'name' is the output data frame of the function.
Value
- models1
Best trees at stage 1
- models2
Best trees at stage 2
- classnames
names of classification variables
- baAll
balanced accuracies of best trees at both stages
Examples
data <- PrInDT::data_land # load data
dataclean <- data[,c(1:7,23:24,11:13,22,8:10)] # only relevant features
indind <- c(1:9) # original predictors
inddep <- c(14:16) # dependent variables
dataland <- na.omit(dataclean)
ctestv <- NA
perc <- c(0.45,0.05,0.25) # percentages of observations of larger class,
# 1 per dependent class variable
perc2 <- c(0.75,0.95,0.75) # percentages of observations of smaller class,
# 1 per dependent class variable
outland <- C2SPrInDT(dataland,percl=perc,percs=perc2,N=19,indind=indind,inddep=inddep)
outland
plot(outland)