data_process {PIE} | R Documentation |
data_process: process tabular data into the format for the PIE model.
Description
This function takes a tabular dataset and its metadata (which columns are numerical and which are categorical), then outputs k-fold cross-validation datasets with splines applied to the numerical features in order to capture non-linear relationships among them. Within this function, the numerical features and the target variable are normalized, and the columns are reorganized into the order: (numerical features, categorical features, target).
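The normalization and spline expansion described above can be illustrated with a minimal sketch. It is not the package's internal code: it assumes standardization via `scale()` and a natural cubic spline basis from `splines::ns()`, and the feature `x` is made up for the illustration.

```r
library(splines)  # ships with base R

set.seed(1)
x <- runif(100, min = 0, max = 10)   # one hypothetical numerical feature

# Standardize the feature (one common normalization; an assumption here)
x_scaled <- as.numeric(scale(x))

# Expand the feature into a natural spline basis with 'spline_num' columns
spline_num <- 5
x_spl <- ns(x_scaled, df = spline_num)

dim(x_spl)   # 100 rows, spline_num columns
```

Each numerical feature is replaced by 'spline_num' basis columns, which is what lets a linear model on the expanded columns fit a non-linear effect of the original feature.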
Usage
data_process(
X,
y,
num_col,
cat_col,
y_col,
k = 5,
validation_rate = 0.2,
spline_num = 5,
random_seed = 1
)
Arguments
X |
Feature columns in the dataset. |
y |
Target column in the dataset. |
num_col |
Index of the columns that are numerical features. |
cat_col |
Index of the columns that are categorical features. |
y_col |
Index of the column that is the response. |
k |
Number of folds for the cross-validation dataset setup. By default 'k = 5'. |
validation_rate |
Proportion of the training dataset held out for validation. By default 'validation_rate = 0.2'. |
spline_num |
The degrees of freedom for the natural splines. By default 'spline_num = 5'. |
random_seed |
Random seed for the cross-validation data split. By default 'random_seed = 1'. |
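As a hedged illustration of how the index arguments might be prepared, the sketch below builds 'num_col', 'cat_col' and 'y_col' for a small made-up data frame. The data frame `df` and its column names are hypothetical; only the argument names come from this page.

```r
# Hypothetical toy data frame: two numerical features, one categorical, one response
df <- data.frame(
  age    = c(23, 35, 41, 58),
  income = c(30, 52, 61, 48),
  group  = c(0, 1, 1, 0),        # categorical feature, already numerically coded
  y      = c(1.2, 2.3, 2.9, 2.1)
)

y_col   <- ncol(df)                                  # response is the last column
cat_col <- which(names(df) == "group")               # columns treated as categorical
num_col <- setdiff(seq_len(ncol(df) - 1), cat_col)   # remaining feature columns
```

Here 'num_col' is c(1, 2), 'cat_col' is 3 and 'y_col' is 4, mirroring the layout used in the Examples section.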
Details
The function generates a cross-validation dataset suitable for the PIE model. It contains the training, validation and testing datasets, as well as the group indicator for group lasso. When 'k=5', the data are split 80/20 into training and testing sets. When 'validation_rate=0.2', 20% of the training dataset is held out as the validation dataset. Setting 'validation_rate=0' generates only training and testing data, without validation data.
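The split proportions can be checked with a small sketch in base R. This is an illustration under stated assumptions (random fold assignment, validation rows drawn at random from the training pool), not the package's implementation.

```r
set.seed(1)
n <- 200
k <- 5
validation_rate <- 0.2

# Assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# Take fold 1 as the test set; the remaining folds form the training pool
test_idx  <- which(fold_id == 1)
train_idx <- which(fold_id != 1)

# Hold out a share of the training pool for validation
n_val     <- round(validation_rate * length(train_idx))
val_idx   <- sample(train_idx, n_val)
train_idx <- setdiff(train_idx, val_idx)

c(train = length(train_idx), validation = length(val_idx), test = length(test_idx))
# train = 128, validation = 32, test = 40
```

With 200 rows, each fold tests on 40 rows (20%), and the remaining 160 rows are divided into 128 training and 32 validation rows.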
Value
A list containing:
spl_train_X |
A list of splined training datasets in which each numerical feature is expanded into 'spline_num' columns. The number of elements in the list equals 'k', the number of folds. |
orig_train_X |
A list of original training datasets in which the numerical features remain in their original format. The number of elements in the list equals 'k', the number of folds. |
train_y |
A list of vectors representing the target variable for the training datasets. The number of elements in the list equals 'k', the number of folds. |
spl_validation_X |
A list of splined validation datasets in which each numerical feature is expanded into 'spline_num' columns. The number of elements in the list equals 'k', the number of folds. It may be None when 'validation_rate == 0'. |
orig_validation_X |
A list of original validation datasets in which the numerical features remain in their original format. The number of elements in the list equals 'k', the number of folds. It may be None when 'validation_rate == 0'. |
validation_y |
A list of vectors representing the target variable for the validation datasets. The number of elements in the list equals 'k', the number of folds. It may be None when 'validation_rate == 0'. |
spl_test_X |
A list of splined testing datasets in which each numerical feature is expanded into 'spline_num' columns. The number of elements in the list equals 'k', the number of folds. |
orig_test_X |
A list of original testing datasets in which the numerical features remain in their original format. The number of elements in the list equals 'k', the number of folds. |
test_y |
A list of vectors representing the target variable for the testing datasets. The number of elements in the list equals 'k', the number of folds. |
lasso_group |
A vector of consecutive integers describing the grouping of the coefficients |
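To make the idea of a grouping vector concrete, the sketch below constructs one for a hypothetical layout. It assumes that the 'spline_num' columns derived from one numerical feature share a group id and that each categorical column forms its own group; the exact layout returned by 'data_process' may differ, and the feature counts are made up.

```r
n_num      <- 3   # hypothetical number of numerical features
n_cat      <- 2   # hypothetical number of categorical features
spline_num <- 5

# Assumed layout: spline columns of one numerical feature share a group id,
# and each categorical column forms its own group
lasso_group <- c(rep(seq_len(n_num), each = spline_num),
                 n_num + seq_len(n_cat))
lasso_group
# 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 5
```

A group lasso penalty applied with such a vector would keep or drop all spline columns of a feature together.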
Examples
# Load the training data
data("winequality")
# Which columns are numerical?
num_col <- 1:11
# Which columns are categorical?
cat_col <- 12
# Which column is the response?
y_col <- ncol(winequality)
# Data Processing (the first 200 rows are sampled for demonstration)
dat <- data_process(X = as.matrix(winequality[1:200, -y_col]),
                    y = winequality[1:200, y_col],
                    num_col = num_col, cat_col = cat_col, y_col = y_col)