read_as_markdown {ragnar} | R Documentation |
Convert files to Markdown
Description
Convert files to Markdown
Usage
read_as_markdown(
path,
...,
html_extract_selectors = c("main"),
html_zap_selectors = c("nav")
)
Arguments
path |
[string] A filepath or URL. Accepts a wide variety of file types, including PDF, PowerPoint, Word, Excel, images (EXIF metadata and OCR), audio (EXIF metadata and speech transcription), HTML, text-based formats (CSV, JSON, XML), ZIP files (iterates over contents), YouTube URLs, and EPUBs. |
... |
Passed on to |
html_extract_selectors |
Character vector of CSS selectors. If a match for a selector is found in the document, only the matched node's contents are converted. Unmatched extract selectors have no effect. |
html_zap_selectors |
Character vector of CSS selectors. Elements matching these selectors are excluded ("zapped") from the HTML document before conversion to Markdown. This is useful for removing navigation bars, sidebars, headers, footers, or other unwanted elements. By default, navigation elements (nav) are excluded. |
Details
Converting HTML
When converting HTML, you might want to omit certain elements, such as sidebars, headers, and footers. Pass CSS selector strings to html_extract_selectors to convert only the matched nodes, or to html_zap_selectors to exclude matched nodes before conversion.
The easiest way to make selectors is to use SelectorGadget: https://rvest.tidyverse.org/articles/selectorgadget.html
You can also right-click on a page and select "Inspect Element" in a browser to better understand an HTML page's structure.
For comprehensive or advanced usage of CSS selectors, consult https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property and https://facelessuser.github.io/soupsieve/selectors/
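As a rough way to preview what a selector matches before passing it to read_as_markdown(), the sketch below tests selectors against a small inline page with rvest and xml2. The page content and selector choices are hypothetical, and this is not how read_as_markdown() itself processes selectors; rvest may also support a different set of advanced selectors than the references above describe.

```r
library(rvest)

# Hypothetical page: a nav bar, main content with a sidebar, and a footer
page <- minimal_html('
  <nav><a href="/">Home</a></nav>
  <main>
    <h1>Title</h1>
    <p>Body text.</p>
    <aside class="sidebar">Related links</aside>
  </main>
  <footer>Site footer</footer>
')

# Candidate zap selectors: report how many nodes each one would match
zap <- c("nav", ".sidebar", "footer")
for (sel in zap) {
  cat(sel, "matches", length(html_elements(page, sel)), "node(s)\n")
}

# Remove the matched nodes (xml2 modifies the document in place)
xml2::xml_remove(html_elements(page, paste(zap, collapse = ", ")))

# Candidate extract selector: keep only the matched node's contents
main <- html_element(page, "main")
kept <- html_text2(main)
cat(kept, "\n")
```

Selectors that behave as expected here can then be tried as html_zap_selectors and html_extract_selectors on the real page.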
Value
A MarkdownDocument object, which is a single string of Markdown with an @origin property.
Examples
## Not run:
# Convert HTML
md <- read_as_markdown("https://r4ds.hadley.nz/base-R.html")
md
cat_head <- \(md, n = 10) writeLines(head(strsplit(md, "\n")[[1L]], n))
cat_head(md)
## Using selector strings
# By default, this output includes the sidebar and other navigational elements
url <- "https://duckdb.org/code_of_conduct"
read_as_markdown(url) |> cat_head(15)
# To extract just the main content, use a selector
read_as_markdown(url, html_extract_selectors = "#main_content_wrap") |>
  cat_head()
# Alternative approach: zap unwanted nodes
read_as_markdown(
  url,
  html_zap_selectors = c(
    "header", # name
    ".sidenavigation", # class
    ".searchoverlay", # class
    "#sidebar" # ID
  )
) |> cat_head()
# Quarto example
read_as_markdown(
  "https://quarto.org/docs/computations/python.html",
  html_extract_selectors = "main",
  html_zap_selectors = c(
    "#quarto-sidebar",
    "#quarto-margin-sidebar",
    "header",
    "footer",
    "nav"
  )
) |> cat_head()
## Convert PDF
pdf <- file.path(R.home("doc"), "NEWS.pdf")
read_as_markdown(pdf) |> cat_head(15)
## Alternative:
# pdftools::pdf_text(pdf) |> cat_head()
# Convert images to markdown descriptions using OpenAI
jpg <- file.path(R.home("doc"), "html", "logo.jpg")
if (Sys.getenv("OPENAI_API_KEY") != "") {
  # if (xfun::is_macos()) system("brew install ffmpeg")
  reticulate::py_require("openai")
  llm_client <- reticulate::import("openai")$OpenAI()
  read_as_markdown(jpg, llm_client = llm_client, llm_model = "gpt-4.1-mini") |>
    writeLines()
  # # Description:
  # The image displays the logo of the R programming language. It features a
  # large, stylized capital letter "R" in blue, positioned prominently in the
  # center. Surrounding the "R" is a gray oval shape that is open on the right
  # side, creating a dynamic and modern appearance. The R logo is commonly
  # associated with statistical computing, data analysis, and graphical
  # representation in various scientific and professional fields.
}
# Alternative approach to image conversion:
if (
  Sys.getenv("OPENAI_API_KEY") != "" &&
    rlang::is_installed("ellmer") &&
    rlang::is_installed("magick")
) {
  chat <- ellmer::chat_openai(echo = TRUE)
  chat$chat("Describe this image", ellmer::content_image_file(jpg))
}
## End(Not run)