filterHC {doblin}    R Documentation

Filter Hierarchical Clusters Based on Size and Dominance

Description

Filters the results of hierarchical clustering, retaining only clusters that contain at least n_members unique lineages. Because a small cluster can still be dominant, the user may also supply a minimum average frequency threshold (min_freq_ignored_clusters); small clusters that include a dominant member according to this threshold are then retained as well.
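
The size-and-rescue rule described above can be sketched in base R on a toy table. This is only an illustration of one plausible reading of the rule (a small cluster is kept when its mean frequency reaches the threshold), not the package's implementation. The column names lineage, cluster and mean_freq are hypothetical and are not taken from doblin.

```r
# Toy illustration only: the column names (lineage, cluster, mean_freq) are
# hypothetical and are not taken from the doblin package.
toy <- data.frame(
  lineage   = paste0("bc", 1:7),
  cluster   = c(1, 1, 1, 1, 2, 2, 3),
  mean_freq = c(0.001, 0.002, 0.001, 0.003, 0.00001, 0.00002, 0.05)
)

n_members <- 3
min_freq_ignored_clusters <- 0.01

# Count unique lineages per cluster and keep the clusters that are large enough
cluster_sizes <- tapply(toy$lineage, toy$cluster, function(x) length(unique(x)))
large <- names(cluster_sizes)[cluster_sizes >= n_members]

# Small clusters are considered for rescue only when a threshold is supplied
rescued <- character(0)
if (!is.null(min_freq_ignored_clusters)) {
  small <- toy[!(toy$cluster %in% large), ]
  cluster_mean <- tapply(small$mean_freq, small$cluster, mean)
  rescued <- names(cluster_mean)[cluster_mean >= min_freq_ignored_clusters]
}

# Final selection: large clusters plus any rescued small clusters
kept <- toy[toy$cluster %in% c(large, rescued), ]
print(sort(unique(kept$cluster)))
```

In this toy table, cluster 1 passes on size, cluster 3 is a single high-frequency lineage that is rescued by the threshold, and cluster 2 is dropped. Setting min_freq_ignored_clusters to NULL would leave only cluster 1.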

Usage

filterHC(
  series_filtered,
  clusters,
  n_members,
  min_freq_ignored_clusters = NULL
)

Arguments

series_filtered

A data frame preprocessed using filterData(), containing lineage frequencies and metadata.

clusters

A data frame containing hierarchical clustering assignments (e.g., from cutree()), possibly across multiple thresholds.

n_members

An integer specifying the minimum number of members (lineages) required for a cluster to be retained.

min_freq_ignored_clusters

Optional. A numeric value specifying the minimum average frequency required to retain small clusters (i.e., those with fewer than n_members members). If NULL (the default), small clusters are not rescued.
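
For the clusters argument, the following base R sketch shows how cutree() produces assignments across multiple thresholds. It uses simulated data and the stats functions hclust() and cutree() directly. The exact layout returned by performHClustering() is not described here, so the lineage column and the threshold values below are assumptions made for illustration only.

```r
# Toy data: 6 simulated series (rows) observed at 5 time points (columns)
set.seed(1)
toy_series <- matrix(
  rnorm(30),
  nrow = 6,
  dimnames = list(paste0("bc", 1:6), paste0("t", 1:5))
)

# Pearson-correlation distance between series, then average linkage
d <- as.dist(1 - cor(t(toy_series), method = "pearson"))
tree <- hclust(d, method = "average")

# Cutting at several heights returns a matrix with one row per series
# and one column per threshold
thresholds <- c(0.5, 1.0, 1.5)
assignments <- as.data.frame(cutree(tree, h = thresholds))

# Hypothetical identifier column, added here only for readability
assignments$lineage <- rownames(assignments)
print(assignments)
```

Each threshold column holds integer cluster labels, so the same lineage can fall in different clusters depending on where the tree is cut.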

Value

A data frame containing the filtered clusters, including both large clusters and optionally small clusters with at least one dominant member (based on the min_freq_ignored_clusters threshold).

Examples

# Load demo barcode count data (installed with the package)
demo_file <- system.file("extdata", "demo_input.csv", package = "doblin")
input_dataframe <- readr::read_csv(demo_file, show_col_types = FALSE)

# Filter data to retain dominant and persistent barcodes
filtered_df <- filterData(
  input_df = input_dataframe,
  freq_threshold = 0.00005,
  time_threshold = 5,
  output_directory = tempdir(),
  input_name = "demo"
)

# Perform hierarchical clustering using Pearson correlation
cluster_assignments <- performHClustering(
  filtered_data = filtered_df,
  agglomeration_method = "average",
  similarity_metric = "pearson",
  output_directory = tempdir(),
  input_name = "demo",
  missing_values = "pairwise.complete.obs",
  dtw_norm = NULL
)

# Filter clusters: keep clusters with at least 8 members, and rescue
# smaller clusters that meet the 0.0001 minimum average frequency threshold.
filtered_clusters <- filterHC(
  series_filtered = filtered_df,
  clusters = cluster_assignments,
  n_members = 8,
  min_freq_ignored_clusters = 0.0001
)

[Package doblin version 0.1.1 Index]