R2SPrInDT {PrInDT} | R Documentation |
Two-stage estimation for regression
Description
The function R2SPrInDT applies 'N' repetitions of subsampling to find an optimal subsample for modeling
the relationship between the continuous variables with column indices 'inddep' and all other factor and numerical variables
in the data frame 'data'.
Subsampling of observations and predictors uses the percentages in 'pobs1' and 'ppre1', respectively, at stage 1, and the percentages 'pobs2' and 'ppre2'
at stage 2, accordingly.
The optimization criterion is the goodness of fit R2 on the full sample.
The trees generated from subsampling can be restricted by rejecting trees
that include split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
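The subsampling idea described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's implementation: the helper `one_repetition` is hypothetical, a linear model (`lm`) stands in for the trees, and the built-in `mtcars` data with target `mpg` replaces the user's data frame. Each repetition draws a share `pobs` of the observations and a share `ppre` of the predictors, fits a model on the subsample, and scores it by R2 on the full sample.

```r
# Hypothetical sketch, not part of PrInDT: subsample, fit, score on the full sample.
set.seed(1)
full <- mtcars                                # stand-in data shipped with R
target <- "mpg"                               # assumed continuous target
predictors <- setdiff(names(full), target)
pobs <- 0.90                                  # share of observations kept
ppre <- 0.70                                  # share of predictors kept

one_repetition <- function(full, target, predictors, pobs, ppre) {
  rows <- sample(nrow(full), size = floor(pobs * nrow(full)))
  vars <- sample(predictors, size = max(1, floor(ppre * length(predictors))))
  # lm() is a stand-in for the tree model used by the package
  fit <- lm(reformulate(vars, response = target), data = full[rows, ])
  pred <- predict(fit, newdata = full)        # evaluate on the full sample
  y <- full[[target]]
  r2 <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)
  list(vars = vars, r2 = r2)
}

N <- 9                                        # number of repetitions
runs <- lapply(seq_len(N), function(i) {
  one_repetition(full, target, predictors, pobs, ppre)
})
# keep the repetition with the highest full-sample R2
best <- runs[[which.max(vapply(runs, function(r) r$r2, numeric(1)))]]
best$vars
best$r2
```

Scoring on the full sample rather than on the subsample means that a model is rewarded only if it also fits the observations that were left out of its subsample.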
Usage
R2SPrInDT(data,ctestv=NA,inddep,N=99,pobs1=c(0.90,0.70),ppre1=c(0.90,0.70),
pobs2=pobs1,ppre2=ppre1,conf.level=0.95,minsplit=NA,minbucket=NA)
Arguments
data |
Input data frame with the continuous target variables (columns 'inddep') and the remaining factor and numerical variables used as predictors |
ctestv |
Vector of character strings of forbidden split results; default = NA |
inddep |
Column indices of the target variables in 'data' |
N |
Number of repetitions of subsampling (integer); default = 99 |
pobs1 |
Percentage(s) of observations for subsampling at stage 1; default = c(0.90,0.70) |
ppre1 |
Percentage(s) of predictors for subsampling at stage 1; default = c(0.90,0.70) |
pobs2 |
Percentage(s) of observations for subsampling at stage 2; default = pobs1 |
ppre2 |
Percentage(s) of predictors for subsampling at stage 2; default = ppre1 |
conf.level |
(1 - significance level) used when growing the trees; default = 0.95 |
minsplit |
Minimum number of elements in a node to be split; default = NA |
minbucket |
Minimum number of elements in a node; default = NA |
Details
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.1, for further information.
Standard output can be produced by means of print(name) or just name, as well as plot(name),
where 'name' is the object returned by the function.
Value
- models1
Best trees at stage 1
- models2
Best trees at stage 2
- depnames
Names of dependent variables
- R2both
R2s of best trees at both stages
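The following toy sketch shows how a result with these element names can be inspected. It does not call PrInDT: the class name `toy2S`, the print method, the matrix layout of `R2both`, and the use of linear models on `mtcars` in place of trees are all assumptions made only so that the example runs on its own. The actual structure of each component may differ.

```r
# Toy stand-in for the returned object: same element names as listed above,
# but lm() fits on mtcars instead of PrInDT trees (assumption for illustration).
depnames <- c("mpg", "qsec")
fit_one <- function(dep, vars) {
  lm(reformulate(vars, response = dep), data = mtcars)
}
models1 <- lapply(depnames, fit_one, vars = c("wt", "hp"))
models2 <- lapply(depnames, fit_one, vars = c("wt", "hp", "disp"))

r2 <- function(fit) summary(fit)$r.squared
R2both <- cbind(stage1 = vapply(models1, r2, numeric(1)),
                stage2 = vapply(models2, r2, numeric(1)))
rownames(R2both) <- depnames                  # assumed layout: one row per target

out <- structure(list(models1 = models1, models2 = models2,
                      depnames = depnames, R2both = R2both),
                 class = "toy2S")             # hypothetical class name

# S3 print method: used by print(out) and by typing just 'out'
print.toy2S <- function(x, ...) {
  cat("Dependent variables:", paste(x$depnames, collapse = ", "), "\n")
  print(round(x$R2both, 3))
  invisible(x)
}

print(out)                                    # explicit call
out                                           # autoprinting uses the same method
out$R2both["mpg", "stage2"]                   # access a single value
```

Individual components are reached with `$`, so the best models and their R2 values can be reused in further analysis without re-running the subsampling.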
Examples
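# load the vowel data shipped with PrInDT and drop incomplete cases
data <- PrInDT::data_vowel
data <- na.omit(data)
# keep the child identifier separately and remove it from the data frame
CHILDvowel <- data$Nickname
data$Nickname <- NULL
# recode 'syllables' in place so that the column positions used in 'inddep' stay unchanged
data$syllables <- 3 - data$syllables
data$speed <- data$word_duration / data$syllables
names(data)[names(data) == "target"] <- "vowel_length"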
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
inddep <- c(13,9)
out2SR <- R2SPrInDT(data,ctestv=ctestv,inddep=inddep,N=9,conf.level=0.99)
out2SR
plot(out2SR)