extract_basic_description_features {featForge} | R Documentation
Extract Basic Description Features
Description
This function processes a vector of text descriptions (such as transaction descriptions) and computes a set of basic text features. These features include counts of digits, special characters, punctuation, words, characters, unique characters, and letter cases, as well as word length statistics and the Shannon entropy of the text.
Usage
extract_basic_description_features(descriptions)
Arguments
descriptions
A character vector of text descriptions to be processed.
Details
The extracted features are:
- has_digits
A binary indicator (0/1) showing whether the description contains any digit.
- n_digits
The total count of digit characters in the description.
- n_special
The number of special characters (non-alphanumeric and non-whitespace) present.
- n_punct
The count of punctuation marks found in the description.
- n_words
The number of words in the description.
- n_chars
The total number of characters in the description.
- n_unique_chars
The count of unique characters in the description.
- n_upper
The count of uppercase letters in the description.
- n_letters
The total count of alphabetic characters (both uppercase and lowercase) in the description.
- prop_caps
The proportion of letters in the description that are uppercase.
- n_whitespace
The number of whitespace characters (spaces) in the description.
- avg_word_length
The average word length within the description.
- min_word_length
The length of the shortest word in the description.
- max_word_length
The length of the longest word in the description.
- entropy
The Shannon entropy of the description, indicating its character diversity.
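As an illustration of the entropy feature, the following is a minimal sketch of character-level Shannon entropy in base R. The helper name char_entropy is hypothetical and not part of featForge, and the base-2 logarithm is an assumption; the package may use a different base or a different treatment of empty strings.

```r
# Hypothetical helper (not a featForge function): Shannon entropy, in bits,
# of the character distribution of a single string.
char_entropy <- function(x) {
  chars <- strsplit(x, "")[[1]]          # split the string into single characters
  if (length(chars) == 0) return(0)      # treat an empty string as zero entropy
  p <- as.numeric(table(chars)) / length(chars)  # relative frequency of each character
  -sum(p * log2(p))                      # assumption: base-2 logarithm
}

# One repeated character gives 0 bits; four equally frequent characters give 2 bits.
vapply(c("aaaa", "abab", "abcd"), char_entropy, numeric(1), USE.NAMES = FALSE)
```

Higher values indicate that the characters in a description are spread more evenly across more distinct symbols.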
The function uses vectorized string operations (e.g., grepl, gregexpr, and nchar) for efficiency, which makes it suitable for processing large datasets. The resulting numeric features can then be used directly for further statistical analysis or machine learning, or they can be aggregated to higher levels.
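To make the vectorized approach concrete, here is a rough sketch of how a few of these features could be computed with grepl, gregexpr, and nchar in base R. It is not the package's implementation: the helper count_matches is hypothetical, and the POSIX character classes used as patterns are an assumption about how digits and uppercase letters are matched.

```r
descs <- c("KappaCredit#101", "Transferred funds for service fee 990")

# Hypothetical helper: count regex matches per element.
# gregexpr() returns -1 for an element with no match, so only positive
# match positions are counted.
count_matches <- function(pattern, x) {
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

features <- data.frame(
  has_digits = as.integer(grepl("[[:digit:]]", descs)),  # 0/1 indicator
  n_digits   = count_matches("[[:digit:]]", descs),      # digit characters
  n_upper    = count_matches("[[:upper:]]", descs),      # uppercase letters
  n_chars    = nchar(descs)                              # total characters
)
features
```

Each call operates on the whole vector at once, so the result has one row per description without an explicit loop over the input.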
Value
A data frame where each row corresponds to an element in descriptions and each column represents a computed feature.
Examples
# Example 1: Extract features from a vector of sample descriptions.
descs <- c("KappaCredit#101",
           "Transferred funds for service fee 990",
           "Mighty remittance code 99816 casino")
extract_basic_description_features(descs)
# Example 2: Aggregate the maximum word length per application.
# Load the sample transactions data.
data(featForge_transactions)
# Combine the transactions data with extracted basic description features.
trans <- cbind(featForge_transactions,
               extract_basic_description_features(featForge_transactions$description))
# Aggregate the maximum word length on the application level.
aggregated <- aggregate_applications(
  trans,
  id_col = "application_id",
  amount_col = "max_word_length",
  ops = list(max_description_word_length = max),
  period = "all"
)
# Display the aggregated results.
aggregated