bind_tf_idf_dt {tidyfst}    R Documentation

Compute TF–IDF Using data.table with Optional Counting and Grouping

Description

Computes term frequency–inverse document frequency (tf–idf) on a dataset that has either one row per term occurrence or pre-counted terms. The original column names are preserved and the following columns are added:

- 'n': raw count (computed or user-supplied)
- 'tf': term frequency per document
- 'idf': inverse document frequency per group (or per corpus)
- 'tf_idf': tf × idf

If 'group_col' is 'NULL', all documents are treated as a single group.
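The four columns can be illustrated with a minimal data.table sketch on a toy corpus. This is not the package's source code. The toy data is hypothetical, and the idf definition used here (natural log of the number of documents divided by the number of documents containing the term, as in the conventional tf–idf formulation) is an assumption that this page does not confirm.

```r
library(data.table)

# Hypothetical toy corpus: one row per term occurrence
dt <- data.table(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  word   = c("apple", "apple", "banana", "apple", "cherry")
)

# n: raw count of each term within each document
counts <- dt[, .(n = .N), by = .(doc_id, word)]

# tf: share of the document's term occurrences taken by this term
counts[, tf := n / sum(n), by = doc_id]

# idf: ASSUMED definition, ln(total documents / documents containing the term)
n_docs <- uniqueN(counts$doc_id)
counts[, idf := log(n_docs / uniqueN(doc_id)), by = word]

# tf_idf: product of tf and idf
counts[, tf_idf := tf * idf]
counts
```

Under this assumed definition, "apple" appears in every document, so its idf (and therefore its tf_idf) is zero, while "banana" and "cherry" each appear in only one document and receive a positive weight. A per-group variant would presumably count documents within each group rather than across the whole table.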

Usage

bind_tf_idf_dt(.data, group_col = NULL, doc_col, term_col, n_col = NULL)

Arguments

.data

A data.frame or data.table of text data.

group_col

Character name of grouping column, or 'NULL' for no grouping.

doc_col

Character name of document identifier column.

term_col

Character name of term/word column.

n_col

(Optional) Character name of a column holding pre-computed term counts. If 'NULL' (default), counts are computed via '.N'.

Value

A data.table containing:

- Original grouping, document, and term columns
- 'n', 'tf', 'idf', and 'tf_idf'

See Also

bind_tf_idf

Examples


library(tidyfst)

# With groups
df <- data.frame(
  category = rep(c("A","B"), each = 6),
  doc_id   = rep(c("d1","d2","d3"), times = 4),
  word     = c("apple","banana","apple","banana","cherry","apple",
               "dog","cat","dog","mouse","cat","dog"),
  stringsAsFactors = FALSE
)
result <- bind_tf_idf_dt(df, "category", "doc_id", "word")
result

# Without groups
df %>%
  filter_dt(category == "A") %>%
  bind_tf_idf_dt(doc_col = "doc_id", term_col = "word")

# With counts provided via n_col, no grouping
df %>%
  filter_dt(category == "A") %>%
  count_dt() %>%
  bind_tf_idf_dt(doc_col = "doc_id", term_col = "word", n_col = "n")

# With counts provided via n_col, grouped by category
df %>%
  count_dt() %>%
  bind_tf_idf_dt(group_col = "category",
                 doc_col = "doc_id",
                 term_col = "word",
                 n_col = "n")


[Package tidyfst version 1.8.2 Index]