bind_tf_idf_dt {tidyfst} | R Documentation
Compute TF–IDF Using data.table with Optional Counting and Grouping
Description
This function computes term frequency–inverse document frequency (tf–idf) on a dataset that has either one row per term occurrence or one row per document-term pair with a pre-computed count. It preserves the original column names and returns these new columns:
- 'n': raw count (computed or user-supplied)
- 'tf': term frequency per document
- 'idf': inverse document frequency per group (or per corpus)
- 'tf_idf': tf × idf

If 'group_col' is 'NULL', all documents are treated as a single group.
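The four output columns can be illustrated with a minimal data.table sketch. It assumes the conventional definitions (tf as a term's share of its document's occurrences, idf as the natural log of the number of documents divided by the number of documents containing the term). The exact formula used inside bind_tf_idf_dt is not stated here, so treat this as an illustration and not as the package's implementation. The toy data and the names `dt`, `counts`, and `n_docs` are made up for the example.

```r
library(data.table)

# Toy corpus: one row per term occurrence (illustrative data, not from the package)
dt <- data.table(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  word   = c("apple", "apple", "banana", "apple", "cherry", "banana")
)

# n: raw count of each term within each document
counts <- dt[, .(n = .N), by = .(doc_id, word)]

# tf: share of the document's term occurrences taken by this term
counts[, tf := n / sum(n), by = doc_id]

# idf: natural log of (number of documents / documents containing the term).
# This formula is an assumption, not taken from the tidyfst source.
n_docs <- uniqueN(counts$doc_id)
counts[, idf := log(n_docs / uniqueN(doc_id)), by = word]

# tf_idf: product of the two
counts[, tf_idf := tf * idf]
counts
```

With a grouping column, the same steps would presumably be carried out within each group, so that the document count behind idf is taken per group instead of over the whole corpus.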
Usage
bind_tf_idf_dt(.data, group_col = NULL, doc_col, term_col, n_col = NULL)
Arguments
.data | A data.frame or data.table of text data.
group_col | Character name of the grouping column, or 'NULL' for no grouping.
doc_col | Character name of the document identifier column.
term_col | Character name of the term/word column.
n_col | (Optional) Character name of a column holding pre-computed term counts. If 'NULL' (default), counts are computed via '.N'.
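As a rough illustration of what a pre-counted input looks like, the sketch below builds one with data.table's '.N', which gives the number of rows in the current group. The table `tokens` and the result `counted` are hypothetical names used only for this example.

```r
library(data.table)

# Hypothetical token table: one row per term occurrence
tokens <- data.table(
  doc_id = c("d1", "d1", "d2", "d2", "d2"),
  word   = c("apple", "apple", "apple", "banana", "banana")
)

# One row per (document, term) pair with its count in column "n";
# .N is data.table's built-in row count for the current group
counted <- tokens[, .(n = .N), by = .(doc_id, word)]
counted
```

A table shaped like `counted` is the kind of input for which `n_col = "n"` would be supplied, together with `doc_col = "doc_id"` and `term_col = "word"`.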
Value
A data.table containing:
- The original grouping, document, and term columns
- 'n', 'tf', 'idf', and 'tf_idf'
Examples
library(tidyfst)

# With groups
df <- data.frame(
  category = rep(c("A", "B"), each = 6),
  doc_id = rep(c("d1", "d2", "d3"), times = 4),
  word = c("apple", "banana", "apple", "banana", "cherry", "apple",
           "dog", "cat", "dog", "mouse", "cat", "dog"),
  stringsAsFactors = FALSE
)
result <- bind_tf_idf_dt(df, "category", "doc_id", "word")
result
# Without groups
df %>%
  filter_dt(category == "A") %>%
  bind_tf_idf_dt(doc_col = "doc_id", term_col = "word")

# With counts provided
df %>%
  filter_dt(category == "A") %>%
  count_dt() %>%
  bind_tf_idf_dt(doc_col = "doc_id", term_col = "word", n_col = "n")

df %>%
  count_dt() %>%
  bind_tf_idf_dt(group_col = "category",
                 doc_col = "doc_id",
                 term_col = "word", n_col = "n")