CASMI.selectFeatures {CASMI}R Documentation

CASMI Feature Selection

Description

Selects features that are associated with an outcome while taking into account a sample coverage penalty and feature redundancy. It automatically determines the number of features to be selected, and the chosen features are ranked. (Synonyms for "feature" in this document: "independent variable," "factor," and "predictor.")
For additional information, please refer to the publication: Shi, J., Zhang, J. and Ge, Y. (2019), "CASMI—An Entropic Feature Selection Method in Turing’s Perspective" <doi:10.3390/e21121179>

Usage

CASMI.selectFeatures(
  data,
  NA.handle = "stepwise",
  alpha = 0.05,
  alpha.ind = 0.1,
  intermediate.steps = FALSE,
  kappa.star.cap = 1,
  feature.num.cap = ncol(data)
)

Arguments

data

data frame with variables as columns and observations as rows. The data MUST include at least one feature (a.k.a., independent variable, predictor, factor) and only one outcome variable (Y). The outcome variable MUST BE THE LAST COLUMN. Both the features and the outcome MUST be categorical or discrete. If variables are not naturally discrete, you may preprocess them using the 'autoBin.binary()' function in the same package.

NA.handle

options for handling NA values in the data. There are three options: 'NA.handle = "stepwise"' (default), 'NA.handle = "na.omit"', and 'NA.handle = "NA as a category"'. (1) 'NA.handle = "stepwise"' excludes NA rows only when a particular variable is being used in a sub-step. For example, suppose we have data (Feature1: A, NA, B; Feature2: C, D, E; Feature3: F, G, H; Outcome: O, P, Q); the second observation will be excluded only when a particular step includes Feature1, but will not be excluded when a step is analyzing only Feature2, Feature3, and the Outcome. This option is designed to take advantage of the maximum possible number of observations. (2) 'NA.handle = "na.omit"' excludes observations with any NA values at the beginning of the analysis. (3) 'NA.handle = "NA as a category"' treats the NA value as a new category. This is designed to be used when NA values in the data have a consistent meaning instead of being missing values. For example, in survey data asking for comments, each NA value might consistently mean "no opinion."

alpha

level of significance for the confidence intervals in final results. By default, 'alpha = 0.05'.

alpha.ind

level of significance for the mutual information test of independence in step 1 of the features selection (for an initial screening). The smaller the 'alpha.ind', the fewer features are sent to step 2 (<doi:10.3390/e21121179>). By default, 'alpha.ind = 0.1'.

intermediate.steps

setting for outputting intermediate steps while awaiting the final results. There are two possible settings: 'intermediate.steps = TRUE' or 'intermediate.steps = FALSE'.

kappa.star.cap

a threshold of 'kappa*' for halting the feature selection process. The program will automatically terminate at the first feature whose cumulative 'kappa*' value exceeds the 'kappa.star.cap' threshold. By default, 'kappa.star.cap = 1.0', which is the maximum possible value. A lower value may result in fewer final features but reduced computing time.

feature.num.cap

the maximum number of features to be selected. A lower value may result in fewer final features but less computing time.

Value

'CASMI.selectFeatures()' returns the following components:

Examples

# ---- Generate a toy dataset for usage examples: "data" ----
set.seed(123)
n <- 200
x1 <- sample(c("A", "B", "C", "D"), size = n, replace = TRUE, prob = c(0.1, 0.2, 0.3, 0.4))
x2 <- sample(c("W", "X", "Y", "Z"), size = n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))
x3 <- sample(c("E", "F", "G", "H", "I"), size = n,
             replace = TRUE, prob = c(0.2, 0.3, 0.2, 0.2, 0.1))
x4 <- sample(c("A", "B", "C", "D"), size = n, replace = TRUE)
x5 <- sample(c("L", "M", "N"), size = n, replace = TRUE)
x6 <- sample(c("E", "F", "G", "H", "I"), size = n, replace = TRUE)

# Generate y variable dependent on x1 to x3
x1_num <- as.numeric(factor(x1, levels = c("A", "B", "C", "D")))
x2_num <- as.numeric(factor(x2, levels = c("W", "X", "Y", "Z")))
x3_num <- as.numeric(factor(x3, levels = c("E", "F", "G", "H", "I")))
# Calculate y with added noise
y_numeric <- 3*x1_num + 2*x2_num - 2*x3_num + rnorm(n,mean=0,sd=2)
# Discretize y into categories
y <- cut(y_numeric, breaks = 10, labels = paste0("Category", 1:10))

# Combine into a dataframe
data <- data.frame(x1, x2, x3, x4, x5, x6, y)

# The outcome of the toy dataset is dependent on x1, x2, and x3
# but is independent of x4, x5, and x6.
head(data)


# ---- Usage Examples ----

## Select features and provide relevant results:
CASMI.selectFeatures(data)

## Adjust 'feature.num.cap' for including fewer features:
## (Note: A lower 'feature.num.cap' value may result in fewer
## final features but less computing time.)
CASMI.selectFeatures(data, feature.num.cap = 2)



[Package CASMI version 2.0.0 Index]