pattern2id {quanteda} | R Documentation |
Match patterns against token types
Description
Developer function to match regex, fixed or glob patterns against token types. This allows C++ functions to perform fast searches in a tokens object. The C++ functions use a list of type IDs to construct a hash table, against which sub-vectors of the tokens object are matched. This function constructs an index of glob patterns for faster matching.
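The idea of matching against types rather than tokens can be sketched in base R. This is only an illustration under stated assumptions, not quanteda's C++ implementation: the variable names (`tokens`, `ids`, `matched_ids`) are hypothetical, and `grep()` stands in for the pattern matcher.

```r
# Hypothetical illustration in base R; not quanteda's actual implementation.
tokens <- c("a", "bb", "c", "a", "bbb", "cc")

# Each token is represented by the integer ID of its type.
types <- unique(tokens)        # "a" "bb" "c" "bbb" "cc"
ids <- match(tokens, types)    # 1 2 3 1 4 5

# The pattern is matched once against the small set of types,
# not against every token.
matched_ids <- grep("^b", types, ignore.case = TRUE)   # 2 4

# Tokens are then located by comparing integer IDs only.
which(ids %in% matched_ids)    # 2 5
```

Because the pattern is evaluated once per type, the cost of regex or glob matching does not grow with the number of tokens, only with the number of distinct types.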
pattern2fixed converts regex and glob patterns to fixed patterns.
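What converting a glob pattern to fixed patterns means can be approximated in base R. The sketch below assumes `utils::glob2rx()` as a stand-in for the glob translation and `grep(..., value = TRUE)` for the expansion; it is not how pattern2fixed is implemented, and the names `glob`, `rx` and `fixed` are hypothetical.

```r
# Hypothetical base R approximation; not the pattern2fixed implementation.
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
glob <- "b*"

# Translate the glob into an equivalent regular expression.
rx <- utils::glob2rx(glob)

# Expand the pattern into the literal types it matches, ignoring case.
fixed <- grep(rx, types, ignore.case = TRUE, value = TRUE)
fixed
```

Once a pattern has been expanded into literal types, later lookups need only exact comparison, which is what makes fixed patterns cheap to match.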
Usage
pattern2id(
  pattern,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  keep_nomatch = FALSE,
  use_index = TRUE
)

pattern2fixed(
  pattern,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  keep_nomatch = FALSE,
  use_index = TRUE
)
Arguments
pattern | a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
types | token types against which patterns are matched
valuetype | the type of pattern matching: "glob", "fixed", or "regex"
case_insensitive | logical; if TRUE, ignore case when matching
keep_nomatch | keep patterns that did not match
use_index | construct index of types for quick search
Value
a list of integer vectors containing indices of matched types
pattern2fixed returns a list of character vectors containing types
Examples
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

# regex patterns; "d" matches none of the types
pats_regex <- list(c("^a$", "^b"), c("c"), c("d"))
pattern2id(pats_regex, types, "regex", case_insensitive = TRUE)

# glob patterns
pats_glob <- list(c("a*", "b*"), c("c"), c("d"))
pattern2id(pats_glob, types, "glob", case_insensitive = TRUE)

# convert regex patterns to fixed patterns
pattern <- list(c("^a$", "^b"), c("c"), c("d"))
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
pattern2fixed(pattern, types, "regex", case_insensitive = TRUE)