check_group_variation {performance}    R Documentation

Check variables for within- and/or between-group variation

Description

Checks whether variables vary within and/or between levels of grouping variables. This function can be used to infer the hierarchical design of a given dataset, or to detect predictors that might cause heterogeneity bias (Bell and Jones, 2015). Use summary() on the output if you are mainly interested in whether, and which, predictors are possibly affected by heterogeneity bias.

Usage

check_group_variation(x, ...)

## Default S3 method:
check_group_variation(x, ...)

## S3 method for class 'data.frame'
check_group_variation(
  x,
  select = NULL,
  by = NULL,
  include_by = FALSE,
  numeric_as_factor = FALSE,
  tolerance_numeric = 1e-04,
  tolerance_factor = "crossed",
  ...
)

## S3 method for class 'check_group_variation'
summary(object, flatten = FALSE, ...)

Arguments

x

A data frame or a mixed model. See details and examples.

...

Arguments passed to other methods

select

Character vector (or formula) with names of variables to select that should be checked. If NULL, selects all variables (except those in by).

by

Character vector (or formula) with the name of the variable that indicates the group- or cluster-ID. For cross-classified or nested designs, by can also identify two or more variables as group- or cluster-IDs.

include_by

When there is more than one grouping variable, should they be checked against each other?

numeric_as_factor

Should numeric variables be tested as factors?

tolerance_numeric

The minimal percentage of variation (observed ICC) that is tolerated to indicate no within- or no between-effect.

tolerance_factor

How should a non-numeric variable be identified as varying only "within" a grouping variable? Options are:

  • "crossed" - if all groups have all unique values of X.

  • "balanced" - if all groups have all unique values of X, with equal frequency.

object

Result from check_group_variation().

flatten

Logical. If TRUE, the values are returned as a character vector rather than a list. Duplicated values are removed.

Details

This function attempts to identify the variability of a set of variables (select) with respect to one or more grouping variables (by). If x is a (mixed effect) model, the variability of the fixed effects predictors is checked with respect to the random grouping variables.

Generally, a variable is considered to vary between groups if it is correlated with those groups, and to vary within groups if it is not constant within at least one group.

Numeric variables

Numeric variables are partitioned via datawizard::demean() into their within- and between-group components. Then, the variance of each of these two components is calculated. Variables with a within-group variance larger than tolerance_numeric are labeled as within, variables with a between-group variance larger than tolerance_numeric are labeled as between, and variables with both variances larger than tolerance_numeric are labeled as both.

Setting numeric_as_factor = TRUE causes numeric variables to be tested using the criteria for non-numeric variables described below.
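
The partitioning of a numeric variable can be approximated in base R. The following is a minimal sketch, not the package's internal code: it uses ave() instead of datawizard::demean(), and it compares raw variances against a threshold `tol` that only mirrors the default of tolerance_numeric. How the function scales the variances before that comparison is an assumption left open here.

```r
# Sketch only: base-R approximation of a within/between decomposition.
data(ChickWeight)

x <- ChickWeight$weight
g <- ChickWeight$Chick

# Between component: the group mean, repeated for each observation
x_between <- ave(x, g, FUN = mean)
# Within component: deviation of each observation from its group mean
x_within <- x - x_between

var_between <- var(x_between)
var_within <- var(x_within)

# Hypothetical threshold, set to the default value of tolerance_numeric
tol <- 1e-04

label <- if (var_within > tol && var_between > tol) {
  "both"
} else if (var_within > tol) {
  "within"
} else if (var_between > tol) {
  "between"
} else {
  NA_character_
}
label
```

The two components add up to the original variable, so a variable that is constant within every group has a within component of zero throughout.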

Non-numeric variables

These variables can have one of the following three labels: within, between, or both.

Additionally, the design of non-numeric variables is checked to see whether they are nested within the groups or crossed with them. This is indicated by the Design column.
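
The "crossed" and "balanced" criteria of tolerance_factor can be illustrated with a cross-tabulation. This is a rough sketch under assumptions, not the function's own logic: in particular, "equal frequency" is read here as equal counts within each group, and the names `is_crossed` and `is_balanced` are hypothetical.

```r
# Sketch only: checking whether a factor is crossed or balanced within groups.
data(npk)

# Rows are groups (block), columns are levels of the factor (N)
tab <- table(npk$block, npk$N)

# "crossed": every group contains every level of the factor
is_crossed <- all(tab > 0)

# "balanced" (assumed reading): crossed, and within each group all levels
# occur equally often
equal_within_group <- apply(tab, 1, function(counts) length(unique(counts)) == 1)
is_balanced <- is_crossed && all(equal_within_group)

tab
is_crossed
is_balanced
```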

Heterogeneity bias

Variables that vary both within and between groups can cause a heterogeneity bias (Bell and Jones, 2015). It is recommended to center (person-mean centering) those variables to avoid this bias. See datawizard::demean() for further details. Use summary() to get a short text result that indicates if and which predictors are possibly affected by heterogeneity bias.
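
To see why such variables matter, compare a pooled slope with the slope of the group-centered component. The sketch below uses lm() on iris purely for illustration; it is not a mixed model, ave() stands in for datawizard::demean(), and the names `sl_within` and `sl_between` are hypothetical.

```r
# Sketch only: pooled slope versus separate within and between slopes.
data(iris)

# Group mean (between component) and group-centered values (within component)
sl_between <- ave(iris$Sepal.Length, iris$Species, FUN = mean)
sl_within <- iris$Sepal.Length - sl_between

# Pooled slope: mixes within- and between-species associations
pooled <- coef(lm(Sepal.Width ~ Sepal.Length, data = iris))[["Sepal.Length"]]

# Separate slopes for the within and between components
separated <- coef(lm(iris$Sepal.Width ~ sl_within + sl_between))

pooled
separated
```

In this dataset the pooled slope and the within-species slope have opposite signs, which is the kind of discrepancy that centering is meant to expose.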

Value

A data frame with Group, Variable, Variation and Design columns.

References

See Also

For further details, read the vignette https://easystats.github.io/parameters/articles/demean.html and also see documentation for datawizard::demean().

Examples

data(npk)
check_group_variation(npk, by = "block")

data(iris)
check_group_variation(iris, by = "Species")

data(ChickWeight)
check_group_variation(ChickWeight, by = "Chick")

# A subset of mlmRev::egsingle
egsingle <- data.frame(
  schoolid = factor(rep(c("2020", "2820"), times = c(18, 6))),
  lowinc = rep(c(TRUE, FALSE), times = c(18, 6)),
  childid = factor(rep(
    c("288643371", "292020281", "292020361", "295341521"),
    each = 6
  )),
  female = rep(c(TRUE, FALSE), each = 12),
  year = rep(1:6, times = 4),
  math = c(
    -3.068, -1.13, -0.921, 0.463, 0.021, 2.035,
    -2.732, -2.097, -0.988, 0.227, 0.403, 1.623,
    -2.732, -1.898, -0.921, 0.587, 1.578, 2.3,
    -2.288, -2.162, -1.631, -1.555, -0.725, 0.097
  )
)

result <- check_group_variation(
  egsingle,
  by = c("schoolid", "childid"),
  include_by = TRUE
)
result

summary(result)



data(sleepstudy, package = "lme4")
check_group_variation(sleepstudy, select = "Days", by = "Subject")

# Or
mod <- lme4::lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
result <- check_group_variation(mod)
result

summary(result)


[Package performance version 0.15.0 Index]