detect_damage {DamageDetective} | R Documentation |
detect_damage
Description
Quality control function to identify and filter damaged cells from an input count matrix, where 'damage' is defined by the loss of cytoplasmic RNA.
Usage
detect_damage(
count_matrix,
ribosome_penalty = 0.01,
organism = "Hsap",
annotated_celltypes = FALSE,
target_damage = c(0.1, 0.8),
damage_distribution = "right_skewed",
distribution_steepness = "moderate",
beta_shape_parameters = NULL,
damage_levels = 5,
damage_proportion = 0.15,
seed = 7,
mito_quantile = 0.75,
kN = NULL,
generate_plot = TRUE,
display_plot = TRUE,
palette = c("grey", "#7023FD", "#E60006"),
filter_threshold = 0.7,
filter_counts = FALSE,
verbose = TRUE
)
Arguments
count_matrix |
Matrix or dgCMatrix containing the counts from single cell RNA sequencing data. |
ribosome_penalty |
Numeric specifying the factor by which the probability of losing a transcript from a ribosomal gene is multiplied. Values closer to 0 represent a greater penalty.
|
organism |
String specifying the organism of origin of the input data where there are two standard options,
If a user wishes to use a non-standard organism they must input a list containing strings for the patterns to match mitochondrial and ribosomal genes of the organism. If available, nuclear-encoded genes that are likely retained in the nucleus, such as in nuclear speckles, must also be specified. An example for humans is below,
|
annotated_celltypes |
Boolean specifying whether input matrix has cell type information stored.
|
target_damage |
Numeric vector specifying the lower and upper bounds of the level of damage that will be introduced. Here, damage refers to the proportion of cytoplasmic RNA lost by a cell, where values closer to 1 indicate more loss and therefore more heavily damaged cells.
|
damage_distribution |
String specifying whether the distribution of damage levels among the damaged cells should be shifted towards the upper or lower range of damage specified in 'target_damage' or follow a symmetric distribution between them. There are three valid options:
|
distribution_steepness |
String specifying how concentrated the spread of damaged cells is about the mean of the target distribution specified in 'target_damage'. Here, an increase in steepness manifests in a more apparent skewness. There are three valid options:
|
beta_shape_parameters |
Numeric vector that allows the shape parameters of the beta distribution to be defined explicitly. This offers greater flexibility than the 'damage_distribution' and 'distribution_steepness' parameters and will override the defaults they offer.
|
damage_levels |
Numeric specifying the number of distinct sets of artificially damaged cells simulated, each with a defined range of loss. Default options include,
A user can also provide a list specifying sets with their own ranges of loss,
By introducing more sets of damage a user can improve the accuracy of loss estimations (scaled_pANN) as they are found through scaling the pANN within each set according to the lower and upper boundary of the set's damage level. However, introducing more sets increases the computational time for the function.
|
damage_proportion |
Numeric describing what proportion of the input data should be altered to resemble damaged data.
|
seed |
Numeric specifying the random seed to ensure reproducibility of the function's output. Setting a seed ensures that the random sampling and perturbation processes produce the same results when the function is run multiple times with the same input data and parameters.
|
mito_quantile |
Numeric between 0 and 1 specifying the quantile of mitochondrial proportion below which cells are sampled for simulation. This protects against simulating damaged cell profiles from cells that are likely already damaged.
|
kN |
Numeric describing how many nearest neighbours are considered for pANN calculations. kN cannot exceed the total cell number.
|
generate_plot |
Boolean specifying whether the QC plot should be outputted. QC plots will be generated by default as we recommend verifying the perturbed data retains characteristics of true single cell data.
|
display_plot |
Boolean specifying whether the output QC plot should be displayed in the global environment. Naturally, this is only relevant when generate_plot is TRUE.
|
palette |
Character vector specifying the three colours that will be used to create the continuous colour palette for colouring the 'damage_column'.
|
filter_threshold |
Numeric specifying the proportion of RNA loss above which a cell should be considered damaged.
|
filter_counts |
Boolean specifying whether the output matrix should be filtered. If TRUE, the count matrix is returned containing only cells that fall below the filter threshold. If FALSE, a data frame containing cell barcodes and their associated label, either 'damaged' or 'cell', is returned.
|
verbose |
Boolean specifying whether messages and function progress should be displayed in the console.
|
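As a rough illustration of how 'target_damage' and 'beta_shape_parameters' could interact, the base R sketch below draws damage levels from a beta distribution and rescales them to the target range. This is an assumption-laden sketch, not the package's internal code: the helper name sample_damage_levels and the shape values are hypothetical and are not the defaults used by detect_damage().

```r
# Hypothetical helper: draw per-cell damage levels from a beta distribution
# and rescale them from [0, 1] to the requested target range.
sample_damage_levels <- function(n, target_damage = c(0.1, 0.8),
                                 shape = c(2, 5), seed = 7) {
  set.seed(seed)
  raw <- rbeta(n, shape1 = shape[1], shape2 = shape[2])
  target_damage[1] + raw * (target_damage[2] - target_damage[1])
}

# Illustrative shape values only (not package defaults).
levels_low  <- sample_damage_levels(1000, shape = c(2, 5))  # most mass near the lower bound
levels_high <- sample_damage_levels(1000, shape = c(5, 2))  # most mass near the upper bound

range(levels_low)                      # stays within target_damage
mean(levels_low) < mean(levels_high)   # shape parameters shift the bulk of the damage
```

Swapping the two shape values mirrors the distribution, and equal shape values give a symmetric spread between the two bounds.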
Details
Using the simulation framework of simulate_counts(), detect_damage() generates artificially damaged cell profiles by introducing defined levels of RNA loss into the input data. True and artificial cells are then merged and pre-processed to compute the following quality control metrics:
Log-normalized feature count
Log-normalized total counts
Mitochondrial proportion
Ribosomal proportion
Log-normalized MALAT1 gene expression
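The five metrics listed above could be computed along the following lines. This is a minimal base R sketch on a made-up count matrix; it assumes human gene-name patterns ("^MT-" for mitochondrial genes, "^RP[SL]" for ribosomal genes) and uses log1p() as a stand-in for log-normalization, neither of which is confirmed to match the package's internals.

```r
# Toy count matrix: genes in rows, cells in columns (values are made up).
set.seed(7)
genes <- c("MT-CO1", "MT-ND1", "RPS3", "RPL5", "MALAT1", "ACTB", "GAPDH", "CD3E")
counts <- matrix(rpois(length(genes) * 6, lambda = 5),
                 nrow = length(genes),
                 dimnames = list(genes, paste0("cell", 1:6)))

# Assumed human gene-name patterns for mitochondrial and ribosomal genes.
mito_rows <- grepl("^MT-", rownames(counts))
ribo_rows <- grepl("^RP[SL]", rownames(counts))

total <- colSums(counts)
qc <- data.frame(
  features = log1p(colSums(counts > 0)),                          # log feature count
  counts   = log1p(total),                                        # log total counts
  mt_prop  = colSums(counts[mito_rows, , drop = FALSE]) / total,  # mitochondrial proportion
  rb_prop  = colSums(counts[ribo_rows, , drop = FALSE]) / total,  # ribosomal proportion
  malat1   = log1p(counts["MALAT1", ])                            # log MALAT1 expression
)
qc
```

Each row of qc describes one cell, which is the shape of input a downstream PCA would expect.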
Principal component analysis (PCA) is performed on these metrics,
and a Euclidean distance matrix is constructed from the PC embeddings.
For each true cell, the proportion of nearest neighbours that are
artificial cells (pANN) is calculated across all damage levels and the
damage level with the highest pANN is assigned to the true cell.
Finally, cells exceeding a specified damage threshold, filter_threshold, are marked as damaged.
This filtering method is inspired by approaches developed for DoubletFinder (McGinnis et al., 2019) to detect doublets in single-cell data.
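The pANN step described above can be sketched in base R as follows. The example uses made-up metric values and a single set of artificial cells, whereas detect_damage() repeats the calculation across damage levels; the variable names and the choice of k are illustrative assumptions rather than the package's implementation.

```r
set.seed(7)
# Made-up QC metrics: 40 "true" cells and 20 "artificial" cells that are
# shifted to mimic damage (illustrative values only).
n_true <- 40
n_art  <- 20
metrics <- rbind(
  matrix(rnorm(n_true * 5, mean = 0), ncol = 5),
  matrix(rnorm(n_art * 5, mean = 2), ncol = 5)
)
is_artificial <- c(rep(FALSE, n_true), rep(TRUE, n_art))

# PCA on the metrics, then Euclidean distances between PC embeddings.
pcs <- prcomp(metrics, center = TRUE, scale. = TRUE)$x
d <- as.matrix(dist(pcs))
diag(d) <- Inf  # a cell is not its own neighbour

# pANN: share of each true cell's k nearest neighbours that are artificial.
k <- 10  # plays the role of kN; must not exceed the number of cells
pANN <- vapply(which(!is_artificial), function(i) {
  nn <- order(d[i, ])[seq_len(k)]
  mean(is_artificial[nn])
}, numeric(1))

summary(pANN)
```

True cells that sit close to the artificial cells in PC space receive a high pANN, which is the signal used to flag them as likely damaged.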
Value
Filtered matrix or data frame containing damage labels.
References
McGinnis, C. S., Murrow, L. M., & Gartner, Z. J. (2019). DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Systems, 8(4), 329-337.e4. doi:10.1016/j.cels.2019.03.003
Examples
data("test_counts", package = "DamageDetective")
test <- detect_damage(
count_matrix = test_counts,
ribosome_penalty = 0.001,
damage_levels = 3,
damage_proportion = 0.1,
generate_plot = FALSE,
seed = 7
)