training_data_checker {Rforestry} | R Documentation |
Training data check
Description
Checks the inputs passed to the forestry constructor.
Usage
training_data_checker(
x,
y,
ntree,
replace,
sampsize,
mtry,
nodesizeSpl,
nodesizeAvg,
nodesizeStrictSpl,
nodesizeStrictAvg,
minSplitGain,
maxDepth,
interactionDepth,
splitratio,
nthread,
middleSplit,
doubleTree,
linFeats,
monotonicConstraints,
featureWeights,
deepFeatureWeights,
observationWeights,
linear,
hasNas
)
Arguments
x |
A data frame of all training predictors. |
y |
A vector of all training responses. |
ntree |
The number of trees to grow in the forest. The default value is 500. |
replace |
An indicator of whether sampling of training data is with replacement. The default value is TRUE. |
sampsize |
The total number of samples to draw for the training data. If sampling with replacement, the default value is the length of the training data. If sampling without replacement, the default value is two-thirds of the length of the training data. |
mtry |
The number of variables randomly selected at each split point. The default value is one third of the total number of features in the training data. |
nodesizeSpl |
Minimum observations contained in terminal nodes. The default value is 3. |
nodesizeAvg |
Minimum size of terminal nodes for averaging dataset. The default value is 3. |
nodesizeStrictSpl |
Minimum observations to follow strictly in terminal nodes. The default value is 1. |
nodesizeStrictAvg |
Minimum size of terminal nodes for averaging dataset to follow strictly. The default value is 1. |
minSplitGain |
Minimum loss reduction to split a node further in a tree. |
maxDepth |
Maximum depth of a tree. The default value is 99. |
interactionDepth |
All splits at or above interaction depth must be on variables that are not weighting variables (as provided by the interactionVariables argument). |
splitratio |
Proportion of the training data used as the splitting dataset. It is a ratio between 0 and 1. If the ratio is 1, the splitting dataset is the entire sampled set and the averaging dataset is empty. If the ratio is 0, the splitting dataset is empty and all the data is used for the averaging dataset (this is not recommended, since no data is available for splitting). |
nthread |
Number of threads used to train the forest and to predict with it. The default is 0, which uses all cores. |
middleSplit |
Whether the split value is the average of two feature values. If FALSE, the split value is drawn from a uniform distribution between the two feature values. (Default = FALSE) |
doubleTree |
Whether the number of trees is doubled, since the averaging and splitting data can be exchanged to create decorrelated trees. (Default = FALSE) |
linFeats |
Specifies which features to split linearly on when linear is used (defaults to all numerical features). |
monotonicConstraints |
Specifies monotonic relationships between the continuous features and the outcome. Supplied as a vector of length p with entries in 1, 0, -1, with 1 indicating an increasing monotonic relationship, -1 indicating a decreasing monotonic relationship, and 0 indicating no relationship. Constraints supplied for categorical features will be ignored. |
featureWeights |
Weights used when subsampling features for nodes above or at interactionDepth. |
deepFeatureWeights |
Weights used when subsampling features for nodes below interactionDepth. |
observationWeights |
The weights for each training observation, which determine how likely the observation is to be selected in each bootstrap sample. This option is not allowed when sampling is done without replacement. |
linear |
Whether to fit the model with ridge regression. |
hasNas |
Indicates whether there is any missingness in x. |
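Examples
The sketch below is not the package's implementation. It is a minimal base R illustration of a few defaults and constraints stated in the Arguments section: the sampsize and mtry defaults, the range of splitratio, the form of monotonicConstraints, and the restriction on observationWeights. The helpers default_sampsize, default_mtry and check_args_sketch are hypothetical names introduced here, and the rounding rules (ceiling, floor, and a minimum mtry of 1) are assumptions, since this page does not state them.

```r
# Hypothetical helpers, NOT part of Rforestry. They only mirror the defaults
# and constraints described in the Arguments section.

# Default sampsize: n with replacement, two-thirds of n without.
# Rounding up with ceiling() is an assumption.
default_sampsize <- function(n, replace = TRUE) {
  if (replace) n else ceiling(2 * n / 3)
}

# Default mtry: one third of the number of features.
# floor() and the lower bound of 1 are assumptions.
default_mtry <- function(p) {
  max(floor(p / 3), 1)
}

# Collect a message for each argument that contradicts the documented constraints.
check_args_sketch <- function(x, y, replace = TRUE, splitratio = 1,
                              monotonicConstraints = rep(0, ncol(x)),
                              observationWeights = NULL) {
  problems <- character(0)
  # splitratio is documented as a ratio between 0 and 1.
  if (splitratio < 0 || splitratio > 1) {
    problems <- c(problems, "splitratio must be between 0 and 1")
  }
  # monotonicConstraints is documented as a length-p vector with entries in 1, 0, -1.
  if (length(monotonicConstraints) != ncol(x) ||
      !all(monotonicConstraints %in% c(-1, 0, 1))) {
    problems <- c(problems,
                  "monotonicConstraints must have length p with entries in 1, 0, -1")
  }
  # observationWeights is documented as not allowed without replacement.
  if (!replace && !is.null(observationWeights)) {
    problems <- c(problems,
                  "observationWeights is not allowed when sampling without replacement")
  }
  problems
}

x <- iris[, -1]   # predictors: 150 rows, 4 columns
y <- iris[, 1]    # response: Sepal.Length

# One way a caller might compute the hasNas flag before passing it in.
has_nas <- any(is.na(x))

default_sampsize(nrow(x), replace = FALSE)
default_mtry(ncol(x))
check_args_sketch(x, y, splitratio = 1.5)
```

Returning the messages as a character vector rather than stopping at the first failure is a design choice of this sketch only. It makes no claim about how training_data_checker reports invalid input.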