remove_duplicates {cleanepi}    R Documentation

Remove duplicates

Description

When removing duplicates, users can specify a set of columns to consider with the target_columns argument.

Usage

remove_duplicates(data, target_columns = NULL)

Arguments

data

The input <data.frame> or <linelist>.

target_columns

A <vector> of column names to use when looking for duplicates. When the input data is a linelist object, this parameter can be set to "linelist_tags" to look for duplicates on tagged columns only. Default is NULL, in which case duplicates are identified from all columns.

Details

Caveat: In many epidemiological datasets, multiple rows may share the same value in one or more columns without being true duplicates. For example, several individuals might have the same symptom onset date and admission date. Be cautious when using this function—especially when applying it to a single target column—to avoid incorrect identification or removal of valid entries.
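
The caveat can be illustrated with a small sketch. The data frame below is made up for illustration, and base R's duplicated() is used only as a stand-in to show how the choice of columns changes which rows are flagged; it is not a description of how cleanepi detects duplicates internally.

```r
# Hypothetical data: three distinct patients, two of whom share the same dates,
# plus one genuinely repeated row for P3
cases <- data.frame(
  id             = c("P1", "P2", "P3", "P3"),
  date_onset     = as.Date(c("2024-01-05", "2024-01-05",
                             "2024-01-07", "2024-01-07")),
  date_admission = as.Date(c("2024-01-06", "2024-01-06",
                             "2024-01-09", "2024-01-09"))
)

# A single column flags P2 as a duplicate of P1, although they are
# different individuals
dup_one_col <- duplicated(cases[, "date_onset", drop = FALSE])

# All columns together flag only the repeated P3 row
dup_all_cols <- duplicated(cases)

cases[!dup_one_col, ]   # keeps P1 and the first P3 row; P2 is lost
cases[!dup_all_cols, ]  # keeps P1, P2, and the first P3 row
```

Adding an identifying column such as id to the set of target columns is usually enough to avoid dropping valid entries in a case like this.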

Value

The input <data.frame> or <linelist> with the duplicated rows removed. Duplicates are identified from all columns or, when target_columns is provided, from the specified columns only.

Examples

data <- readRDS(
  system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)
no_dups <- remove_duplicates(
  data = data,
  target_columns = "linelist_tags"
)

# print the removed duplicates
print_report(no_dups, "removed_duplicates")

# print the detected duplicates
print_report(no_dups, "found_duplicates")
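
As a further sketch, target_columns can also be given as a character vector of column names on a plain data frame. The data frame and its column names below are hypothetical and are not shipped with the package; the call uses only the arguments listed under Usage.

```r
library(cleanepi)

# Hypothetical data frame (made up for illustration)
cases <- data.frame(
  id         = c("P1", "P2", "P3", "P3"),
  date_onset = c("2024-01-05", "2024-01-05", "2024-01-07", "2024-01-07"),
  outcome    = c("recovered", "recovered", "died", "died")
)

# Look for duplicates on the combination of id and date_onset only
no_dups <- remove_duplicates(
  data = cases,
  target_columns = c("id", "date_onset")
)
```

Combining an identifier with a date column, rather than using the date alone, follows the caution given in Details.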

[Package cleanepi version 1.1.1 Index]