markdown_chunk {ragnar}  R Documentation

Chunk a Markdown document

Description

markdown_chunk() splits a single Markdown string into shorter, optionally overlapping chunks, nudging each cut point to the nearest sensible boundary (heading, paragraph, sentence, line, word, or character). It returns a tibble recording the character range, heading context, and text of each chunk.

Usage

markdown_chunk(
  md,
  target_size = 1600L,
  target_overlap = 0.5,
  ...,
  max_snap_dist = target_size * (1 - target_overlap)/3,
  segment_by_heading_levels = integer(),
  context = TRUE,
  text = TRUE
)

Arguments

md

A MarkdownDocument, or a length-one character vector containing Markdown.

target_size

Integer. Target chunk size in characters. Default: 1600 (approximately 400 tokens, or 1 page of text). Actual chunk size may differ from the target by up to 2 * max_snap_dist. When set to NULL, NA, or Inf and used with segment_by_heading_levels, chunk size is unbounded and each chunk corresponds to a segment.

target_overlap

Numeric in [0, 1). Fraction of desired overlap between successive chunks. Default: 0.5. Even when 0, some overlap can occur because the last chunk is anchored to the document end.

...

These dots are for future extensions and must be empty.

max_snap_dist

Integer. Furthest distance (in characters) a cut point may move to reach a semantic boundary. Defaults to one third of the stride size between target chunk starts. Chunks that end up on identical boundaries are merged.
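
The default values above can be worked through by hand. The following is a minimal sketch in base R that only evaluates the formulas stated in the Usage and Arguments sections; it does not call ragnar, and the variable names stride and size_range are illustrative, not part of the package.

```r
# Default arguments taken from the Usage section.
target_size    <- 1600L
target_overlap <- 0.5

# Stride: distance between successive target chunk starts.
stride <- target_size * (1 - target_overlap)

# Default max_snap_dist, as written in the function signature.
max_snap_dist <- target_size * (1 - target_overlap) / 3

# Documented bound: actual chunk size may differ from the target
# by up to 2 * max_snap_dist.
size_range <- target_size + c(-2, 2) * max_snap_dist

stride                 # 800
round(max_snap_dist)   # 267
round(size_range)      # 1067 2133
```

Under the defaults, then, cut points may move by roughly 267 characters, and chunk sizes are expected to fall between about 1067 and 2133 characters.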

segment_by_heading_levels

Integer vector with possible values 1:6. Headings at these levels are treated as segment boundaries; chunking is performed independently for each segment. No chunk will overlap a segment boundary, and any future deoverlapping will not combine segments. Each segment will have a chunk that starts at the segment start and a chunk that ends at the segment end (these may be the same chunk or overlap substantially if the segment is short). Default: disabled.

context

Logical. Add a context column containing the Markdown headings in scope at each chunk start. Default: TRUE.

text

Logical. If TRUE, include a text column with the chunk contents. Default: TRUE.

Value

A MarkdownDocumentChunks object, which is a tibble (data.frame) with columns start, end, and optionally context and text. It also has a @document property, which is the input md document (potentially normalized and converted to a MarkdownDocument).

See Also

Use ragnar_chunks_view() to interactively inspect the output of markdown_chunk(). See also MarkdownDocumentChunks() and MarkdownDocument(), where the input and return value of markdown_chunk() are described more fully.

Examples

library(ragnar)

md <- "
# Title

## Section 1

Some text that is long enough to be chunked.

A second paragraph to make the text even longer.

## Section 2

More text here.

### Section 2.1

Some text under a level three heading.

#### Section 2.1.1

Some text under a level four heading.

## Section 3

Even more text here.
"

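# Small chunks with the default 50% overlap
markdown_chunk(md, target_size = 40)

# No requested overlap between successive chunks
markdown_chunk(md, target_size = 40, target_overlap = 0)

# Unbounded chunk size: one chunk per segment delimited by level 1 and 2 headings
markdown_chunk(md, target_size = NA, segment_by_heading_levels = c(1, 2))

# Let cut points move up to 100 characters to reach a semantic boundary
markdown_chunk(md, target_size = 40, max_snap_dist = 100)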

[Package ragnar version 0.2.0 Index]