getChiSqStat {folda} | R Documentation |
Compute Chi-Squared Statistics for Variables
Description
This function calculates the chi-squared statistic for each column of datX
against the response variable response. It supports both numerical and
categorical predictors in datX. Numerical variables are automatically
discretized into factor levels, with cut points based on the mean and
standard deviation; the splitting criterion depends on the sample size.
Usage
getChiSqStat(datX, response)
Arguments
datX |
A matrix or data frame containing predictor variables. It can consist of both numerical and categorical variables. |
response |
A factor representing the class labels. It must have at least two levels for the chi-squared test to be applicable. |
Details
For each variable in datX, the function first checks whether the variable
is numerical. If so, it is discretized into factor levels using either two
or three split points, depending on the sample size and the number of
levels in response. Missing values are assigned to a new factor level of
their own.
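The discretization step described above can be sketched in base R. This is a minimal illustration, not the package's code: the helper name discretize_numeric, the 30-observations-per-class threshold, and the sqrt(3)-scaled cut points are all assumptions chosen for the example.

```r
# Hypothetical helper (not part of folda): bin a numeric vector around its
# mean and standard deviation, then give missing values their own level.
discretize_numeric <- function(x, response) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  n_obs <- sum(!is.na(x))

  # Assumed rule: three split points for larger samples, two otherwise.
  if (n_obs >= 30 * nlevels(response)) {
    splits <- c(m - s * sqrt(3) / 2, m, m + s * sqrt(3) / 2)
  } else {
    splits <- c(m - s * sqrt(3) / 3, m + s * sqrt(3) / 3)
  }

  # unique() guards against duplicated breaks when x is constant.
  binned <- cut(x, breaks = c(-Inf, unique(splits), Inf))

  # Missing values become an explicit factor level.
  if (anyNA(binned)) binned <- addNA(binned)
  binned
}

set.seed(1)
x_large <- c(rnorm(100), NA)
y_large <- factor(rep(c("A", "B"), length.out = 101))
binned_large <- discretize_numeric(x_large, y_large)

x_small <- rnorm(20)
y_small <- factor(rep(c("A", "B"), length.out = 20))
binned_small <- discretize_numeric(x_small, y_small)
```

With these assumed thresholds, the larger sample gets four bins plus a level for the missing value, and the smaller sample gets three bins.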
The chi-squared statistic is then computed between each predictor and
response. If the chi-squared test has more than one degree of freedom,
the Wilson-Hilferty transformation is applied to convert the statistic to
a chi-squared distribution with one degree of freedom.
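As an illustration of the degrees-of-freedom adjustment, the sketch below applies the Wilson-Hilferty approximation twice: once to map the statistic to an approximate normal deviate, and once in reverse to map that deviate back to a 1-degree-of-freedom chi-squared value. This is the form given in Loh (2009); whether the package uses exactly this expression is an assumption, and the helper name wilson_hilferty_1df is hypothetical.

```r
# Hypothetical helper (not part of folda): convert a chi-squared statistic
# with `df` degrees of freedom to an approximately equivalent 1-df statistic.
#   W = max(0, [7/9 + sqrt(df) * ((stat/df)^(1/3) - 1 + 2/(9*df))]^3)
wilson_hilferty_1df <- function(stat, df) {
  # Nothing to adjust when the test already has one degree of freedom.
  if (df <= 1) return(stat)
  w <- 7 / 9 + sqrt(df) * ((stat / df)^(1 / 3) - 1 + 2 / (9 * df))
  # Truncate at zero so the result is a valid chi-squared value.
  pmax(0, w^3)
}

# The 95% quantile with 5 df should map to a value near the 95% quantile
# with 1 df, up to the error of the approximation.
converted <- wilson_hilferty_1df(qchisq(0.95, df = 5), df = 5)
target <- qchisq(0.95, df = 1)
```

Putting every predictor on a common 1-degree-of-freedom scale makes statistics from predictors with different numbers of levels directly comparable.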
Value
A vector of chi-squared statistics, one for each predictor variable in
datX. For numerical variables, the statistic is computed after binning
the variable.
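To make the shape of the return value concrete, here is a base-R stand-in that produces one statistic per column for categorical predictors only. It is a sketch under assumptions: chisq_per_column is a hypothetical name, it uses stats::chisq.test rather than the package's own code, and it skips both the binning of numerical variables and the degrees-of-freedom adjustment.

```r
# Hypothetical stand-in (not part of folda): one Pearson chi-squared
# statistic per categorical column, returned as a named numeric vector.
chisq_per_column <- function(datX, response) {
  sapply(datX, function(x) {
    # Drop unused levels so empty rows do not distort the table.
    tab <- table(droplevels(factor(x)), response)
    # Warnings about small expected counts are silenced for the sketch.
    unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  })
}

set.seed(1)
demo <- data.frame(
  a = factor(sample(letters[1:3], 100, replace = TRUE)),
  b = factor(sample(c("u", "v"), 100, replace = TRUE))
)
demo_response <- factor(sample(c("A", "B"), 100, replace = TRUE))
stats <- chisq_per_column(demo, demo_response)
```

The result is named after the columns of the input, so it can be sorted to rank predictors by their association with the response.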
References
Loh, W.-Y. (2009). Improving the precision of classification trees. The Annals of Applied Statistics, 3(4), 1710–1737.
Examples
library(folda)
set.seed(443)  # make the simulated data reproducible
datX <- data.frame(var1 = rnorm(100), var2 = factor(sample(letters[1:3], 100, replace = TRUE)))
y <- factor(sample(c("A", "B"), 100, replace = TRUE))
getChiSqStat(datX, y)