ragnar_chunk {ragnar} | R Documentation |
Chunk text
Description
These functions are deprecated in favor of markdown_chunk(), which is more
flexible, supports overlapping chunks, enables deoverlapping or rechunking
downstream by ragnar_retrieve(), and automatically builds a context
string of in-scope markdown headings for each chunk instead of requiring
manual string interpolation from extracted headings.
Usage
ragnar_chunk(
  x,
  max_size = 1600L,
  boundaries = c("paragraph", "sentence", "line_break", "word", "character"),
  ...,
  trim = TRUE,
  simplify = TRUE
)
ragnar_segment(x, boundaries = "sentence", ..., trim = FALSE, simplify = TRUE)
ragnar_chunk_segments(x, max_size = 1600L, ..., simplify = TRUE, trim = TRUE)
Arguments
x |
A character vector, list of character vectors, or data frame containing a |
max_size |
Integer. The maximum number of characters in each chunk.
Defaults to |
boundaries |
A sequence of boundary types to use in order until
|
... |
Additional arguments passed to internal functions.
tokenizer to use |
trim |
logical, whether to trim leading and trailing whitespace from
strings. Default |
simplify |
Logical. If |
Details
Functions for chunking text into smaller pieces while preserving meaningful semantics. These functions provide flexible ways to split text based on various boundaries (sentences, words, etc.) while controlling chunk sizes and overlap.
Chunking is the combination of two fundamental operations:
identifying boundaries: finding character positions where it makes sense to split a string.
extracting slices: extracting substrings using the candidate boundaries to produce chunks that match the requested chunk_size and chunk_overlap.
ragnar_chunk() is a higher-level function that does both: it identifies boundaries and extracts slices.
If you need lower-level control, you can instead use ragnar_segment() in combination with ragnar_chunk_segments().
ragnar_segment(): Splits text at semantic boundaries.
ragnar_chunk_segments(): Combines text segments into chunks.
For most use cases, these two are equivalent:
x |> ragnar_chunk()
x |> ragnar_segment() |> ragnar_chunk_segments()
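To make the two operations concrete, here is a minimal base R sketch of "identify boundaries, then extract slices". It is an illustration only and not how ragnar implements chunking: the helpers find_boundaries() and extract_slices() are hypothetical names, the only boundary considered is the literal ". ", and overlap is not handled.

```r
# Step 1 (hypothetical helper): identify candidate boundaries.
# Here a boundary is the position just after each literal ". ",
# plus the end of the string.
find_boundaries <- function(text) {
  hits <- gregexpr(". ", text, fixed = TRUE)[[1]]
  ends <- if (hits[1] == -1L) integer() else hits + 1L
  unique(c(ends, nchar(text)))
}

# Step 2 (hypothetical helper): extract slices, greedily extending
# each chunk to the furthest boundary that still fits in max_size.
extract_slices <- function(text, boundaries, max_size) {
  chunks <- character()
  start <- 1L
  while (start <= nchar(text)) {
    candidates <- boundaries[boundaries >= start]
    fits <- candidates[candidates - start + 1L <= max_size]
    # If no boundary fits, take the next one and accept an oversized chunk.
    end <- if (length(fits)) max(fits) else min(candidates)
    chunks <- c(chunks, substr(text, start, end))
    start <- end + 1L
  }
  chunks
}

text <- "One two. Three four. Five."
chunks <- extract_slices(text, find_boundaries(text), max_size = 12L)
chunks
# "One two. "    "Three four. " "Five."
```

Because the slices are cut only at boundaries and no characters are dropped, pasting the chunks back together reproduces the input. When a single segment is longer than max_size, this sketch emits an oversized chunk rather than splitting mid-sentence.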
When working with data frames, these functions preserve all columns and use
tidyr::unchop() to handle the resulting list-columns when simplify = TRUE.
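The effect of tidyr::unchop() on a list-column can be seen without calling the chunking functions at all. The sketch below assumes the tidyr package is installed and builds a hypothetical data frame, chunked, by hand in the shape described for simplify = FALSE; the chunk strings are made up for illustration.

```r
# Hypothetical chunked result: one row per input document, with `text`
# holding a list of character vectors (one vector of chunks per row).
chunked <- data.frame(id = 1:2)
chunked$text <- list(
  c("First sentence. ", "Second sentence."),
  "Another sentence here."
)

# Expanding the list-column gives one row per chunk; the other
# columns (here `id`) are repeated to match.
expanded <- tidyr::unchop(chunked, text)

nrow(chunked)   # 2 rows: one per document
nrow(expanded)  # 3 rows: one per chunk
```

This is why the simplify = TRUE result can have more rows than the input, while the simplify = FALSE result always has the same number of rows.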
Value
For character input with simplify = FALSE: A list of character vectors.
For character input with simplify = TRUE: A character vector of chunks.
For data frame input with simplify = FALSE: A data frame with the same number of rows as the input, where the text column is transformed into a list of character vectors.
For data frame input with simplify = TRUE: Same as data frame input with simplify = FALSE, with the text column expanded by tidyr::unchop().
Examples
# Basic chunking with max size
text <- "This is a long piece of text. It has multiple sentences.
We want to split it into chunks. Here's another sentence."
ragnar_chunk(text, max_size = 40) # splits at sentences
# smaller chunk size: first splits at sentence boundaries, then word boundaries
ragnar_chunk(text, max_size = 20)
# only split at sentence boundaries. Note, some chunks are oversized
ragnar_chunk(text, max_size = 20, boundaries = c("sentence"))
# only consider word boundaries when splitting:
ragnar_chunk(text, max_size = 20, boundaries = c("word"))
# first split at sentence boundaries, then word boundaries,
# as needed to satisfy `max_size`
ragnar_chunk(text, max_size = 20, boundaries = c("sentence", "word"))
# Use a stringr pattern to find semantic boundaries
ragnar_chunk(text, max_size = 10, boundaries = stringr::fixed(". "))
ragnar_chunk(text, max_size = 10, boundaries = list(stringr::fixed(". "), "word"))
# Working with data frames
df <- data.frame(
  id = 1:2,
  text = c("First sentence. Second sentence.", "Another sentence here.")
)
ragnar_chunk(df, max_size = 20, boundaries = "sentence")
ragnar_chunk(df$text, max_size = 20, boundaries = "sentence")
# Chunking pre-segmented text
segments <- c("First segment. ", "Second segment. ", "Third segment. ", "Fourth segment. ")
ragnar_chunk_segments(segments, max_size = 20)
ragnar_chunk_segments(segments, max_size = 40)
ragnar_chunk_segments(segments, max_size = 60)