standardize_dates {cleanepi} | R Documentation |
Standardize date variables
Description
When the format of the values in a column and/or the target columns are not
defined, we strongly recommend checking a few converted dates manually to
make sure that the dates extracted from a character vector or a factor are
correct.
Usage
standardize_dates(
  data,
  target_columns = NULL,
  format = NULL,
  timeframe = NULL,
  error_tolerance = 0.4,
  orders = list(world_named_months = c("Ybd", "dby"), world_digit_months = c("dmy",
    "Ymd"), US_formats = c("Omdy", "YOmd"))
)
Arguments
data |
The input |
target_columns |
A |
format |
A |
timeframe |
A |
error_tolerance |
A |
orders |
A list( quarter_partial_dates = c("Y", "Ym", "Yq"), world_digit_months = c("Yq", "ymd", "ydm", "dmy", "mdy", "myd", "dym", "Ymd", "Ydm", "dmY", "mdY", "mYd", "dYm"), world_named_months = c("dby", "dyb", "bdy", "byd", "ybd", "ydb", "dbY", "dYb", "bdY", "bYd", "Ybd", "Ydb"), us_format = c("Omdy", "YOmd") ) |
Details
Check the $multi_format_dates element of the report for date values that
could match multiple formats.
Converting ambiguous character strings to dates is difficult for many reasons:
dates may not use the standard Ymd format
within the same variable, dates may follow different formats
dates may be mixed with things that are not dates
the behavior of as.Date in the presence of non-date values is hard to predict, sometimes returning NA, sometimes issuing an error.
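The last point can be seen with a few base R calls. This is a minimal illustration only, not part of cleanepi; the input strings are made up for the example.

```r
# A lone non-date string with no format supplied: as.Date() stops with an error.
res_error <- tryCatch(
  as.Date("not a date"),
  error = function(e) "error"
)

# The same string with an explicit format: as.Date() returns NA instead.
res_na <- as.Date("not a date", format = "%Y-%m-%d")

# A non-date placed after a valid ISO date: the format is inferred from the
# first non-missing element, and the non-date silently becomes NA.
res_mixed <- as.Date(c("2018-09-19", "not a date"))
```

The outcome therefore depends on whether a format is given and on where the non-date value sits in the vector.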
This function tries to address all the above issues. Dates in the following formats should be detected automatically, irrespective of separators (e.g. "-", " ", "/") and surrounding text:
"19 09 2018"
"2018 09 19"
"19 Sep 2018"
"2018 Sep 19"
"Sep 19 2018"
How it works
This function relies heavily on lubridate::parse_date_time(), which is an
extremely flexible date parser that works well for consistent date formats,
but can quickly become unwieldy and may produce spurious results.
standardize_dates() will use a list of formats in the orders argument to
run parse_date_time() with each format vector separately and take the first
correctly parsed date from all the trials.
With the default orders shown above, the dates 03 Jan 2018, 07/03/1982, and
08/20/85 are correctly interpreted as 2018-01-03, 1982-03-07, and 1985-08-20.
The examples section shows how to customize the orders argument for your
situation.
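The "one trial per format vector, keep the first successful parse" idea can be sketched as follows. This is an assumption-laden illustration of the description above, not the actual cleanepi implementation: the helper name first_parsed() is hypothetical, and the real function also applies target_columns, timeframe, and error_tolerance, which are ignored here.

```r
library(lubridate)

x <- c("03 Jan 2018", "07/03/1982", "08/20/85")
orders <- list(
  world_named_months = c("Ybd", "dby"),
  world_digit_months = c("dmy", "Ymd"),
  US_formats = c("Omdy", "YOmd")
)

# Hypothetical helper: run parse_date_time() once per format vector, then
# keep, for each element, the first trial that produced a non-missing date.
first_parsed <- function(x, orders) {
  trials <- lapply(orders, function(o) {
    # quiet = TRUE suppresses the warning about values that fail to parse
    as.Date(parse_date_time(x, orders = o, quiet = TRUE))
  })
  Reduce(function(acc, nxt) {
    missing <- is.na(acc)
    acc[missing] <- nxt[missing]  # fill only the gaps left by earlier trials
    acc
  }, trials)
}

result <- first_parsed(x, orders)
result
```

Because earlier list elements take precedence, the position of each format vector in orders decides how an ambiguous value such as "07/03/1982" is read in this sketch.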
Value
The input dataset where the date columns have been standardized. Date values that fall outside the specified timeframe are listed in the report object, as are date values that comply with multiple formats.
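As a rough illustration of what "out of the specified timeframe" means, the check below flags dates outside a two-element Date vector using base R only. It is a sketch under stated assumptions: the sample dates are invented, and treating both bounds as inclusive is an assumption, not documented cleanepi behaviour.

```r
# Invented sample of already-standardized dates, including a missing value.
dates <- as.Date(c("2020-11-15", "2021-03-02", "2021-12-25", NA))

# Same shape as the timeframe argument: first and last accepted date.
timeframe <- as.Date(c("2021-01-01", "2021-12-01"))

# Assumption: bounds are inclusive; missing dates are not flagged.
outside <- !is.na(dates) & (dates < timeframe[1] | dates > timeframe[2])

# Values that would be candidates for reporting under this assumption.
dates[outside]
```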
Examples
x <- c("03 Jan 2018", "07/03/1982", "08/20/85")
# Convert to Date only the values where the month is written in letters.
as.Date(lubridate::parse_date_time(x, orders = c("Ybd", "dby")))
# Convert to Date the values where the month is written in letters or numbers.
as.Date(lubridate::parse_date_time(x, orders = c("dmy", "Ymd")))
# How to use standardize_dates()
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
# convert values in the 'date.of.admission' column into "%Y-%m-%d" format
dat <- standardize_dates(
  data = data,
  target_columns = "date.of.admission",
  format = NULL,
  timeframe = as.Date(c("2021-01-01", "2021-12-01")),
  error_tolerance = 0.4,
  orders = list(
    world_named_months = c("Ybd", "dby"),
    world_digit_months = c("dmy", "Ymd"),
    US_format = c("Omdy", "YOmd")
  )
)
# print the report
print_report(dat, "date_standardization")