markdown_chunk {ragnar} | R Documentation |
Chunk a Markdown document
Description
markdown_chunk() splits a single Markdown string into shorter, optionally
overlapping chunks, nudging cut points to the nearest sensible boundary
(heading, paragraph, sentence, line, word, or character). It returns a tibble
recording the character range, heading context, and text of each chunk.
Usage
markdown_chunk(
  md,
  target_size = 1600L,
  target_overlap = 0.5,
  ...,
  max_snap_dist = target_size * (1 - target_overlap) / 3,
  segment_by_heading_levels = integer(),
  context = TRUE,
  text = TRUE
)
Arguments
md |
A |
target_size |
Integer. Target chunk size in characters. Default: 1600
( |
target_overlap |
Numeric in |
... |
These dots are for future extensions and must be empty. |
max_snap_dist |
Integer. Furthest distance (in characters) a cut point may move to reach a semantic boundary. Defaults to one third of the stride size between target chunk starts. Chunks that end up on identical boundaries are merged. |
segment_by_heading_levels |
Integer vector with possible values |
context |
Logical. Add a |
text |
Logical. If |
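The default for max_snap_dist follows from target_size and target_overlap. The short base R sketch below works through that arithmetic using the defaults from the Usage signature. The helper name default_max_snap_dist() is hypothetical and only mirrors the default expression shown there; it is not part of ragnar.

```r
# Hypothetical helper (not part of ragnar): mirrors the default expression
# for max_snap_dist shown in the Usage signature.
default_max_snap_dist <- function(target_size = 1600L, target_overlap = 0.5) {
  # Stride: distance between consecutive target chunk starts.
  stride <- target_size * (1 - target_overlap)
  # One third of the stride.
  stride / 3
}

# With the documented defaults the stride is 1600 * (1 - 0.5) = 800 characters.
stride <- 1600L * (1 - 0.5)
stride                                        # 800

# A cut point may therefore move up to about 267 characters by default.
default_max_snap_dist()                       # 800 / 3

# With no overlap the stride equals target_size, so the limit doubles.
default_max_snap_dist(target_overlap = 0)     # 1600 / 3
```

Passing max_snap_dist explicitly, as in the last call under Examples, overrides this default.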
Value
A MarkdownDocumentChunks object, which is a tibble (data.frame) with
columns start, end, and optionally context and text. It also has a
@document property, which is the input md document (potentially
normalized and converted to a MarkdownDocument).
See Also
ragnar_chunks_view() to interactively inspect the output of
markdown_chunk(). See also MarkdownDocumentChunks() and
MarkdownDocument(), where the input and return value of
markdown_chunk() are described more fully.
Examples
md <- "
# Title
## Section 1
Some text that is long enough to be chunked.
A second paragraph to make the text even longer.
## Section 2
More text here.
### Section 2.1
Some text under a level three heading.
#### Section 2.1.1
Some text under a level four heading.
## Section 3
Even more text here.
"
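# Small chunks with the default overlap (target_overlap = 0.5)
markdown_chunk(md, target_size = 40)
# Same target size, but with no overlap between chunks
markdown_chunk(md, target_size = 40, target_overlap = 0)
# No target size; segment at level 1 and level 2 headings
markdown_chunk(md, target_size = NA, segment_by_heading_levels = c(1, 2))
# Allow cut points to move up to 100 characters to reach a boundary
markdown_chunk(md, target_size = 40, max_snap_dist = 100)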
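As a minimal sketch of working with the return value described under Value, the code below chunks a shortened document and looks at each documented part of the result. It assumes ragnar is installed, and that the context and text columns are present under the default context = TRUE and text = TRUE.

```r
library(ragnar)

# A shortened version of the example document above.
md <- "
# Title

## Section 1

Some text that is long enough to be chunked.

## Section 2

More text here.
"

chunks <- markdown_chunk(md, target_size = 40)

# Character ranges of each chunk.
chunks[, c("start", "end")]

# Optional columns; assumed present here because context and text
# both default to TRUE.
chunks$context
chunks$text

# The input document (potentially normalized) stored alongside the chunks.
chunks@document
```

Because the stored document may be normalized, the start and end columns are best read against chunks@document rather than the original md string.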