ragnar_chunk {ragnar}    R Documentation

Chunk text

Description

[Deprecated]

These functions are deprecated in favor of markdown_chunk(), which is more flexible, supports overlapping chunks, enables deoverlapping or rechunking downstream by ragnar_retrieve(), and automatically builds a context string of in-scope markdown headings for each chunk instead of requiring manual string interpolation from extracted headings.

Usage

ragnar_chunk(
  x,
  max_size = 1600L,
  boundaries = c("paragraph", "sentence", "line_break", "word", "character"),
  ...,
  trim = TRUE,
  simplify = TRUE
)

ragnar_segment(x, boundaries = "sentence", ..., trim = FALSE, simplify = TRUE)

ragnar_chunk_segments(x, max_size = 1600L, ..., simplify = TRUE, trim = TRUE)

Arguments

x

A character vector, list of character vectors, or data frame containing a text column.

max_size

Integer. The maximum number of characters in each chunk. Defaults to 1600, which is typically about 400 tokens, or roughly one page of text.

boundaries

A sequence of boundary types to use in order until max_size is satisfied. Valid values are "sentence", "word", "line_break", "character", "paragraph", or a stringr_pattern object like stringr::fixed().

...

Additional arguments passed to internal functions. For example, tokenizer, to count tokens instead of characters (not fully implemented yet).

trim

Logical. Whether to trim leading and trailing whitespace from strings. Defaults to TRUE for ragnar_chunk() and ragnar_chunk_segments(), and to FALSE for ragnar_segment().

simplify

Logical. If FALSE, returns a vector that has the same length as x. If TRUE, the output is simplified: character strings are unlist()ed, and data frames are tidyr::unchop()ed.
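
The effect of simplify on character input can be pictured with base R alone. This is a minimal sketch, assuming the un-simplified result is a list holding one character vector of chunks per input element; the chunks_per_input object is hypothetical stand-in data, not output produced by ragnar.

```r
# Hypothetical stand-in for un-simplified output: one element per input string,
# each element holding that input's chunks. Not produced by ragnar itself.
chunks_per_input <- list(
  c("First sentence.", "Second sentence."),
  "Another sentence here."
)

# With simplify = FALSE the result keeps one element per input.
n_inputs <- length(chunks_per_input)

# With simplify = TRUE the per-input grouping is flattened, as unlist() does here.
flat <- unlist(chunks_per_input)
n_chunks <- length(flat)
```

Keeping the un-simplified form is useful when each chunk must be traced back to the input it came from.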

Details

Functions for chunking text into smaller pieces while preserving meaningful semantics. These functions provide flexible ways to split text based on various boundaries (sentences, words, etc.) while controlling chunk sizes and overlap.

Chunking is the combination of two fundamental operations: identifying boundaries (positions where the text can sensibly be split) and extracting slices (substrings between those boundaries that fit within max_size).

ragnar_chunk() is a higher-level function that does both.

For finer control, use the lower-level functions ragnar_segment() and ragnar_chunk_segments() in combination.

ragnar_segment(): Splits text at semantic boundaries.

ragnar_chunk_segments(): Combines text segments into chunks.

For most use cases, these two are equivalent:

x |> ragnar_chunk()
x |> ragnar_segment() |> ragnar_chunk_segments()

When working with data frames, these functions preserve all columns and use tidyr::unchop() to handle the resulting list-columns when simplify = TRUE.
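
The two-step idea (segment first, then combine segments into chunks) can be sketched in base R. This is a simplified illustration under stated assumptions, not ragnar's implementation: segment_sentences() and combine_segments() are hypothetical helpers, sentences are detected with a plain regular expression, and segments are joined with a single space rather than preserving the original whitespace.

```r
# Simplified illustration only; not ragnar's actual implementation.

# Step 1 (hypothetical helper): identify boundaries by splitting at whitespace
# that follows sentence-ending punctuation. The lookbehind keeps the punctuation.
segment_sentences <- function(text) {
  strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
}

# Step 2 (hypothetical helper): greedily combine consecutive segments into
# chunks of at most `max_size` characters. A single segment longer than
# `max_size` is kept whole, so chunks can be oversized in that case.
combine_segments <- function(segments, max_size) {
  chunks <- character()
  current <- ""
  for (seg in segments) {
    candidate <- if (nzchar(current)) paste(current, seg) else seg
    if (nchar(candidate) <= max_size || !nzchar(current)) {
      current <- candidate
    } else {
      chunks <- c(chunks, current)
      current <- seg
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

text <- paste(
  "This is a long piece of text. It has multiple sentences.",
  "We want to split it into chunks. Here's another sentence."
)
segments <- segment_sentences(text)
chunks <- combine_segments(segments, max_size = 60)
```

Separating the steps makes it possible to inspect or filter the segments before they are combined, which is the kind of control the lower-level functions are meant to offer.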

Value

Examples

# Basic chunking with max size
text <- "This is a long piece of text. It has multiple sentences.
         We want to split it into chunks. Here's another sentence."
ragnar_chunk(text, max_size = 40) # splits at sentences

# smaller chunk size: first splits at sentence boundaries, then word boundaries
ragnar_chunk(text, max_size = 20)

# only split at sentence boundaries. Note, some chunks are oversized
ragnar_chunk(text, max_size = 20, boundaries = c("sentence"))

# only consider word boundaries when splitting:
ragnar_chunk(text, max_size = 20, boundaries = c("word"))

# first split at sentence boundaries, then word boundaries,
# as needed to satisfy `max_size`
ragnar_chunk(text, max_size = 20, boundaries = c("sentence", "word"))

# Use a stringr pattern to find semantic boundaries
ragnar_chunk(text, max_size = 10, boundaries = stringr::fixed(". "))
ragnar_chunk(text, max_size = 10, boundaries = list(stringr::fixed(". "), "word"))


# Working with data frames
df <- data.frame(
  id = 1:2,
  text = c("First sentence. Second sentence.", "Another sentence here.")
)
ragnar_chunk(df, max_size = 20, boundaries = "sentence")
ragnar_chunk(df$text, max_size = 20, boundaries = "sentence")

# Chunking pre-segmented text
segments <- c("First segment. ", "Second segment. ", "Third segment. ", "Fourth segment. ")
ragnar_chunk_segments(segments, max_size = 20)
ragnar_chunk_segments(segments, max_size = 40)
ragnar_chunk_segments(segments, max_size = 60)


[Package ragnar version 0.2.0 Index]