rowRmMadOutliers {jamba} | R Documentation
Remove outlier points per row by MAD factor threshold
Description
Remove outlier points per row by MAD factor threshold
Usage
rowRmMadOutliers(
x,
madFactor = 5,
na.rm = TRUE,
minDiff = 0,
minReps = 3,
includeAttributes = FALSE,
rowMadValues = NULL,
verbose = FALSE,
...
)
Arguments
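x | numeric matrix.
madFactor | numeric multiplier applied to each per-row MAD to define the outlier threshold.
na.rm | logical indicating whether NA values are ignored.
minDiff | numeric minimum difference from the row median required for a point to be eligible as an outlier.
minReps |
includeAttributes | logical indicating whether to include the attributes described under Value.
rowMadValues | optional numeric per-row MAD values; the default NULL has them calculated from x.
verbose | logical indicating whether to print verbose output.
... | additional parameters are ignored.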
Details
This function applies outlier detection and removal per row of the input numeric matrix.
It first calculates the MAD (median absolute deviation) for each row.
The MAD threshold for each row is the per-row MAD multiplied by madFactor.
The absolute difference from the row median is then calculated for each point.
A point is defined as an outlier when both conditions are met:
- its absolute difference from the row median is above the MAD threshold, and
- its absolute difference from the row median is at or above minDiff.
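The two criteria above can be sketched in base R. This is a minimal illustration under stated assumptions, not the jamba implementation: flag_row_outliers is a hypothetical helper name, and it uses apply() with stats::mad, which scales the MAD by 1.4826 by default.

```r
# Hypothetical helper (not part of jamba): flag outliers per row using
# the MAD threshold and the minDiff requirement described above.
flag_row_outliers <- function(x, madFactor = 5, minDiff = 0) {
  # per-row median and per-row MAD, ignoring NA values
  row_med <- apply(x, 1, stats::median, na.rm = TRUE)
  row_mad <- apply(x, 1, stats::mad, na.rm = TRUE)
  # absolute difference of each point from its own row median;
  # a vector of length nrow(x) recycles down the columns, so each
  # row is compared with its own median
  abs_diff <- abs(x - row_med)
  # a point is flagged only when both criteria hold
  (abs_diff > row_mad * madFactor) & (abs_diff >= minDiff)
}

m <- rbind(
  c(10.0001, 10.0002, 10.001),
  c(1, 2, 3)
)
flag_row_outliers(m, madFactor = 5)
flag_row_outliers(m, madFactor = 5, minDiff = 0.01)
```

Requiring both conditions amounts to comparing each point against the larger of the two thresholds, apart from the boundary case where a difference equals the threshold exactly.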
The minDiff parameter affects cases such as 3 replicates, where all replicates are well within a known threshold indicating low variance, but where two replicates might be nearly identical. Consider:
- Three numeric values: c(10.0001, 10.0002, 10.001).
- The third value differs from the median by only 0.0008.
- The third value 10.001 is more than 5x MAD away from the median, so it exceeds the default madFactor = 5 threshold.
- minDiff = 0.01 would require the difference from the median to be at least 0.01 for a point to be eligible as an outlier.
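The arithmetic for these three values can be checked directly in base R. This sketch assumes the MAD is computed as stats::mad does by default, with the 1.4826 scaling constant; the variable names are illustrative only.

```r
vals <- c(10.0001, 10.0002, 10.001)

med <- stats::median(vals)     # row median, 10.0002
abs_diff <- abs(vals - med)    # absolute difference from the median
mad_val <- stats::mad(vals)    # scaled MAD, 1.4826 * 0.0001

# the third value sits roughly 5.4 MADs from the median
mad_ratio <- abs_diff[3] / mad_val
mad_ratio

# it exceeds the 5x MAD threshold ...
exceeds_mad <- abs_diff[3] > 5 * mad_val
exceeds_mad

# ... but its difference of about 0.0008 is below minDiff = 0.01,
# so it would not be eligible as an outlier
meets_min_diff <- abs_diff[3] >= 0.01
meets_min_diff
```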
One option to define minDiff from the data is to use:
minDiff <- stats::median(rowMads(x))
In this case, the threshold is the median of the per-row MAD values across all rows. This type of threshold will only be reasonable if the variance across all rows is expected to be fairly similar.
This function is substantially faster when the matrixStats package is installed, but will fall back to apply(x, 1, mad) as a last option.
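The fallback described above, combined with the data-derived minDiff, might look like the following. This is a sketch of the general pattern and an assumption about how such a fallback can be written, not the function's internal code; row_mads is an illustrative variable name.

```r
set.seed(123)
x <- matrix(stats::rnorm(25), ncol = 5) * 5 + 10

# Use matrixStats::rowMads() when the package is available,
# otherwise fall back to the slower apply() form.
row_mads <- if (requireNamespace("matrixStats", quietly = TRUE)) {
  matrixStats::rowMads(x, na.rm = TRUE)
} else {
  apply(x, 1, stats::mad, na.rm = TRUE)
}

# Data-derived minimum difference: the median of the per-row MAD values
minDiff <- stats::median(row_mads)
minDiff
```

Both branches apply the same default scaling constant of 1.4826, so they are expected to give the same per-row values.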
Assumptions
This function assumes the input data is appropriate for the use of MAD as a summary statistic.
Specifically, numeric values per row are expected to be roughly normally distributed.
Outlier points are assumed to be present in less than half of all non-NA data.
Outlier points are assumed to be technical outliers, and therefore not the direct result of the experimental measurements being studied. Technical outliers are often caused by an instrument measurement error, a methodological failure, or another upstream protocol failure.
The default threshold of 5x MAD factor is a fairly lenient criterion, above which the data may even be assumed not to conform to most downstream statistical techniques.
For measurements considered to be more robust, or required to be more robust, a threshold of 2x MAD can be applied. This criterion is usually a reasonable expectation for housekeeping gene expression across replicates within each sample group.
Value
numeric matrix with the same dimensions as the input x matrix. Outliers are replaced with NA.
If includeAttributes=TRUE then attributes will be included:
- outlierDF which is a data.frame with colnames:
  - rowMedians: numeric median on each row
  - rowMadValues: numeric MAD for each row
  - rowThresholds: numeric threshold after applying madFactor and minDiff
  - rowReps: integer number of non-NA values in the input data
  - rowTypes: factor indicating the type of threshold: "madFactor" means the row applied the normal MAD * madFactor threshold; "minDiff" means the row applied the minDiff threshold, which was the larger threshold.
- minDiff with the numeric value supplied
- madFactor with the numeric MAD factor threshold supplied
- outliersRemoved with the integer total number of new NA values produced by the outlier removal process.
See Also
Other jam numeric functions: deg2rad(), noiseFloor(), normScale(), rad2deg(), rowGroupMeans(), warpAroundZero()
Examples
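set.seed(123);
## Simulate a 5x5 numeric matrix with values centered near 10
x <- matrix(ncol=5, stats::rnorm(25))*5 + 10;
## Define some outlier points
x[1:2,3] <- x[1:2,3]*5 + 50;
x[2:3,2] <- x[2:3,2]*5 - 100;
rownames(x) <- head(letters, nrow(x));

## Remove outliers using the default 5x MAD threshold
rowRmMadOutliers(x, madFactor=5);

## Apply a stricter 2x MAD threshold and include the threshold attributes
x2 <- rowRmMadOutliers(x, madFactor=2,
   includeAttributes=TRUE);
x2

## Filter again, re-using the row MAD values from the first pass
x3 <- rowRmMadOutliers(x2,
   madFactor=2,
   rowMadValues=attr(x2, "outlierDF")$rowMadValues,
   includeAttributes=TRUE);
x3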