data_process {PIE}    R Documentation

data_process: process tabular data into the format for the PIE model.

Description

This function takes a tabular dataset and its metadata (which columns are numerical and which are categorical) and outputs a k-fold cross-validation dataset. Numerical features are expanded with splines in order to capture non-linear relationships in the numerical features. Within this function, the numerical features and the target variable are normalized, and the columns are reorganized into the order (numerical features, categorical features, target).

Usage

data_process(
  X,
  y,
  num_col,
  cat_col,
  y_col,
  k = 5,
  validation_rate = 0.2,
  spline_num = 5,
  random_seed = 1
)

Arguments

X

Feature columns of the dataset.

y

Target column of the dataset.

num_col

Index of the columns that are numerical features.

cat_col

Index of the columns that are categorical features.

y_col

Index of the column that is the response.

k

Number of folds for the cross-validation dataset setup. By default 'k = 5'.

validation_rate

Validation ratio within the training dataset. By default 'validation_rate = 0.2'.

spline_num

The degrees of freedom for the natural splines. By default 'spline_num = 5'.

random_seed

Random seed for the cross-validation data split. By default 'random_seed = 1'.
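As a rough illustration of what 'spline_num' controls, the sketch below standardizes one numerical column and expands it into a natural spline basis. It assumes z-score normalization and the 'ns' function from the 'splines' package; this page does not state which routines 'data_process' uses internally, and the variable names are hypothetical.

```r
# Sketch only: assumes z-score scaling and splines::ns(),
# which may differ from what data_process() does internally.
library(splines)

set.seed(1)
x <- rnorm(200, mean = 10, sd = 3)   # hypothetical numerical feature

# Normalize the feature to mean 0 and standard deviation 1
x_std <- as.numeric(scale(x))

# Expand the feature into a natural spline basis with
# 'spline_num' degrees of freedom: one column per degree of freedom
spline_num <- 5
basis <- ns(x_std, df = spline_num)

dim(basis)   # 200 rows, spline_num columns
```

With 'spline_num = 5', a single numerical feature becomes five columns, which matches the description of the splined datasets returned by the function.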

Details

The function generates a cross-validation dataset suitable for the PIE model. It contains training, validation, and testing datasets, together with a group indicator for group lasso. When 'k = 5', each fold splits the data 80/20 into training and testing sets. When 'validation_rate = 0.2', 20% of the training data is held out as validation data. Setting 'validation_rate = 0' generates only training and testing data, with no validation data.
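The split proportions described above can be checked with a small base R sketch. It is not the package's implementation: the fold assignment, the rounding with 'floor', and all variable names are assumptions made for illustration.

```r
# Sketch only: reproduces the split arithmetic for one fold,
# not the internals of data_process().
n <- 200                 # number of rows
k <- 5                   # number of folds
validation_rate <- 0.2   # share of the training data held out

set.seed(1)
# Assign each row to one of k folds of (nearly) equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fold 1 is the test set; the remaining folds form the training data
test_idx  <- which(fold_id == 1)
train_idx <- which(fold_id != 1)

# Hold out a fraction of the training rows for validation
n_val     <- floor(validation_rate * length(train_idx))
val_idx   <- sample(train_idx, n_val)
train_idx <- setdiff(train_idx, val_idx)

sizes <- c(train = length(train_idx),
           validation = length(val_idx),
           test = length(test_idx))
sizes   # train 128, validation 32, test 40
```

For 200 rows this gives 40 test rows (20%) and 160 training rows (80%), of which 32 (20% of 160) are set aside for validation.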

Value

A list containing:

spl_train_X

A list of splined training datasets in which each numerical feature is expanded into 'spline_num' columns. The number of elements in the list equals 'k', the number of folds.

orig_train_X

A list of original training datasets in which the numerical features remain in their original format. The number of elements in the list equals 'k', the number of folds.

train_y

A list of vectors representing the target variable for the training datasets. The number of elements in the list equals 'k', the number of folds.

spl_validation_X

A list of splined validation datasets in which each numerical feature is expanded into 'spline_num' columns. The number of elements in the list equals 'k', the number of folds. It could be None when 'validation_rate == 0'.

orig_validation_X

A list of original validation datasets in which the numerical features remain in their original format. The number of elements in the list equals 'k', the number of folds. It could be None when 'validation_rate == 0'.

validation_y

A list of vectors representing the target variable for the validation datasets. The number of elements in the list equals 'k', the number of folds. It could be None when 'validation_rate == 0'.

spl_test_X

A list of splined testing datasets in which each numerical feature is expanded into 'spline_num' columns. The number of elements in the list equals 'k', the number of folds.

orig_test_X

A list of original testing datasets in which the numerical features remain in their original format. The number of elements in the list equals 'k', the number of folds.

test_y

A list of vectors representing the target variable for the testing datasets. The number of elements in the list equals 'k', the number of folds.

lasso_group

A vector of consecutive integers describing the grouping of the coefficients.
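To make the idea of a group indicator concrete, the sketch below builds one plausible grouping vector for 11 numerical features and 1 categorical feature. It assumes that the 'spline_num' spline columns of each numerical feature share one group and that each categorical feature forms its own group; the actual layout of 'lasso_group' is not specified on this page, and 'n_num' and 'n_cat' are hypothetical names.

```r
# Sketch only: one plausible layout of a group indicator,
# assumed for illustration and not confirmed by the package.
n_num      <- 11   # number of numerical features
n_cat      <- 1    # number of categorical features
spline_num <- 5    # spline columns per numerical feature

# Spline columns of the same numerical feature share a group id;
# each categorical feature then gets the next consecutive id
lasso_group <- c(rep(seq_len(n_num), each = spline_num),
                 n_num + seq_len(n_cat))

length(lasso_group)   # 11 * 5 + 1 = 56
head(lasso_group, 10) # 1 1 1 1 1 2 2 2 2 2
```

Grouping the columns this way lets group lasso keep or drop all spline terms of a feature together.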

Examples


# Load the training data
data("winequality")

# Which columns are numerical?
num_col <- 1:11
# Which columns are categorical?
cat_col <- 12
# Which column is the response?
y_col <- ncol(winequality)

# Data Processing (the first 200 rows are sampled for demonstration)
dat <- data_process(X = as.matrix(winequality[1:200, -y_col]), 
  y = winequality[1:200, y_col], 
  num_col = num_col, cat_col = cat_col, y_col = y_col)


[Package PIE version 1.0.0 Index]