.wp_tokenize_single_string {wordpiece}    R Documentation
Tokenize an Input Word-by-word
Description
Tokenize an Input Word-by-word
Usage
.wp_tokenize_single_string(words, vocab, unk_token, max_chars)
Arguments
words
    Character; a vector of words (generated by space-tokenizing a single input).
vocab
    Character vector of vocabulary tokens, assumed to be ordered by token index, starting from zero for compatibility with Python implementations.
unk_token
    Token used to represent unknown words.
max_chars
    Maximum length of word recognized; longer words are mapped to unk_token.
Value
A named integer vector of tokenized words.
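Wordpiece tokenization of a single word is a greedy longest-match-first procedure: repeatedly take the longest vocabulary entry that is a prefix of the remaining characters, marking continuation pieces with a "##" prefix, and fall back to unk_token if the word cannot be fully covered. The following is a minimal Python sketch of that algorithm (a hypothetical re-implementation for illustration, not the package's actual code; the vocab, unk_token, and max_chars names mirror the arguments above, with vocab represented as a token-to-index mapping):

```python
def wp_tokenize_word(word, vocab, unk_token="[UNK]", max_chars=100):
    """Greedy longest-match-first wordpiece tokenization of one word.

    `vocab` maps token strings to integer ids (indexed from zero, as
    described for the vocab argument above). Returns a list of
    (token, id) pairs, or [(unk_token, id)] if the word cannot be
    fully covered by vocabulary pieces.
    """
    if len(word) > max_chars:
        return [(unk_token, vocab[unk_token])]
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate substring until it matches a vocab entry.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary piece covers this position: whole word is unknown.
            return [(unk_token, vocab[unk_token])]
        pieces.append((piece, vocab[piece]))
        start = end
    return pieces

# Toy vocabulary; real vocabularies are loaded from a file.
vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3, "##ed": 4}
print(wp_tokenize_word("unaffable", vocab))
# → [('un', 1), ('##aff', 2), ('##able', 3)]
```

The (token, id) pairs correspond to the named integer vector returned by the R function, where the names are the wordpiece tokens and the values are their vocabulary indices.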
[Package wordpiece version 2.1.3 Index]