distfreereg {distfreereg}R Documentation

Distribution-free parametric regression testing

Description

Conduct distribution-free parametric regression testing using the process introduced in Khmaladze (2021). A parametric model for the conditional mean (specified by test_mean) is checked against the data by fitting the model, transforming the resulting residuals, and then calculating a statistic on the empirical partial sum process of the transformed residuals. The statistic's null distribution can be simulated in a straight-forward way, thereby producing a p-value.

Using f to denote the mean function being tested, the specific test has the following null and alternative hypotheses:

H_0\colon\ \exists\theta\in\Theta\subseteq\mathbb R^p \mathrel{\bigl|} \textrm{E}(Y| X)=f(X;\theta) \quad\hbox{against}\quad H_1\colon\ \forall\theta\in\Theta\subseteq\mathbb R^p \mathrel{\bigl|} \textrm{E}(Y| X)\neq f(X;\theta).

This assumes a known or adequately accurately estimated covariance matrix. See the An Introduction to the distfreereg Package vignette for an introduction.

Usage

	distfreereg(test_mean, ordering = "simplex", group = TRUE,
	stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
	...)

	## Default S3 method:
distfreereg(test_mean = NULL, ordering = "simplex", group = TRUE,
	stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
	..., Y, X = NULL, covariance, J, fitted_values)

	## S3 method for class 'formula'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
	stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
	..., data, method, method_args = NULL)

	## S3 method for class 'function'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
	stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
	..., Y, X = NULL, covariance, theta_init)

	## S3 method for class 'glm'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
	stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
	...)

	## S3 method for class 'lm'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
	stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
	...)

	## S3 method for class 'lmerMod'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
	stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
	...)

	## S3 method for class 'nls'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
	stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
	...)

Arguments

test_mean

A specification of the mean function to be tested. Methods exist for objects of classes function, formula, glm, lm, lmerMod (from the lme4 package), and nls. See details.

covariance

Named list; specifies the covariance structure of the model's error distribution. Valid element names are "Sigma", "SqrtSigma", "P", and "Q", corresponding to the covariance matrix, the square root of the covariance matrix, the precision matrix, and the square root of the precision matrix, respectively. Each element must be one of the following:

  • a numeric matrix.

  • a list of numeric matrices.

  • a numeric vector whose length is the sample size.

  • a numeric vector of length 1.

See details.

ordering

A character string or a list; specifies how to order the residuals to form the empirical partial sum process. Valid character strings are:

  • "asis": leaves the order unchanged (that is, in the order in which the observations appear in the supplied data).

  • "natural": orders residuals using column-wise ordering of the covariates.

  • "optimal": orders the residuals by ordering the observations using optimal transport on the covariates. The solution is estimated using the Hungarian method as implemented by solve_LSAP. This option can be very slow for large sets of covariates.

  • "simplex": (the default) orders residuals in order of increasing row sums of the covariates after each column has been scaled to the interval [0,1].

If ordering is a list, then its elements specify names columns of X or data to use to determine the order.

group

Logical; if TRUE, then residuals with the same rank (as determined by ordering) are grouped when calculating the empirical partial sum process.

J

Numeric matrix; specifies the Jacobian of the function evaluated at the covariates and the estimated parameters.

fitted_values

Numeric vector; specifies the model's fitted values.

stat

Character vector; specifies the names of the functions used to calculate the desired statistics. By default, a Kolmogorov–Smirnov statistic and a Cramer–von Mises-like statistic are calculated:

\hbox{KS} = \max_{i}|a_i|\qquad\hbox{and}\qquad\hbox{CvM} = {1\over n}\sum_{i=1}^na_i^2

where a_i is term i in the empirical partial sum process and n is the sample size.

B

Numeric vector of length one; specifies the Monte Carlo sample size used when simulating statistics. Silently converted to integer.

control

Optional named list of elements that control the details of the algorithm's computations. The following elements are accepted for all methods:

  • symmetric: Optional named list or FALSE; if a named list, its elements are passed as arguments to isSymmetric when testing elements of covariance for symmetry. If FALSE, then this test is skipped.

  • matsqrt_tol: Numeric; specifies the threshold for considering an eigenvalue "too negative" when calculating the square root of a matrix. Must be non-positive. The default value is -.Machine$double.eps^0.25.

  • solve_tol: Numeric; passed as tol argument to solve, used in particular to invert Sigma. The default value is .Machine$double.eps.

  • qr_tol: Numeric; passed as tol argument to qr. The default value is sqrt(.Machine$double.eps). This value might need to be decreased when the dimensions of the Jacobian are sufficiently large.

  • orth_tol: Numeric; passed as tolerance argument to all.equal when testing whether or not r^Tr and \mu^T\mu are the identity matrix. The default value is sqrt(.Machine$double.eps).

  • trans_tol: Numeric; passed as tolerance argument to all.equal to internal transformation function when determining whether the normalizing scalar is non-zero. The default value is sqrt(.Machine$double.eps).

  • sym_tol, sym_tol1: Numeric; passed as tol and tol1 arguments, respectively, to isSymmetric for checking Sigma and P elements of covariance. The default values are sqrt(.Machine$double.eps) and 8*sqrt(.Machine$double.eps), respectively. See Warnings.

  • return_on_error: Logical; determines whether or not partial output is returned when an error is encountered. Default value is TRUE.

The following named elements, all but the first of which control the process of calculating the generalized least squares estimation of the parameter vector, are accepted for the function method:

  • jacobian_args: Optional list; specifies arguments to pass to jacobian.

  • optimization_fun: Optional function; specifies the function used to estimate the parameters. If not specified, optim is used with method = "BFGS".

  • fun_to_optimize_arg: Optional character string, required when optimization_fun is specified; specifies the name of the argument of optimization_fun that is assigned the function to optimize. For example, optim uses "fn".

  • theta_init_arg: Optional character string, required when optimization_fun is specified; specifies the name of the argument of optimization_fun that is assigned the initial parameter values for optimization. For example, optim uses "par".

  • theta_hat_name: Optional character string, required when optimization_fun is specified; specifies the name of the element of the output of optimization_fun that contains the estimated parameters. For example, optim uses "par" (the same character string that specifies the argument containing the initial values). See Warnings.

  • optimization_args: Optional list; specifies additional arguments to pass to optimization_fun.

override

Optional named list of arguments that override internally calculated values. Used primarily by update.distfreereg, but can be accessed directly. The following named elements are accepted:

  • J: Numeric matrix. Overrides the calculation of J by non-default methods.

  • fitted_values: Numeric vector of fitted values.

  • res_order: Integer vector specifying the ordering used to order the residuals for the computation of the empirical partial sum process. Overrides ordering. See details.

  • theta_hat: Numeric vector. Used as the estimated parameter vector when test_mean has class function, overriding internal computation that would have used optimization_fun.

  • r: Orthogonal (numeric) matrix. Overrides the construction of the transformation anchor matrix.

  • mcsim_stats: List. Overrides the creation of the list of simulated statistics.

verbose

Logical; if TRUE, progress messages are printed. Default value is TRUE.

...

Additional arguments to pass to various methods; should be empty except in a call to the generic function.

data

Data frame of covariate values used when test_mean is a formula.

method

Character vector; specifies the function to use for fitting the model when test_mean is a formula. Possible values are "glm", "lm", "lmer", and "nls".

method_args

Optional list of argument values to be passed to the function specified by the method argument.

Y

Numeric vector of observations. A non-vector value is converted to a vector.

X

Optional numeric matrix of covariates. A non-matrix value is converted to a single-column matrix.

theta_init

Numeric vector; specifies the starting parameter values passed to the optimizing function to be used to estimate the parameter vector.

Details

This function implements distribution-free parametric regression testing. The model is specified by a mean structure and a covariance structure.

The mean structure is specified by the argument test_mean. This can be an object of class function, formula, glm, lm, lmerMod, or nls. It can also be NULL.

If test_mean is a function, then it must have one or two arguments: either theta only, or theta and either X (uppercase) or x (lowercase). An uppercase X is interpreted in the function definition as a matrix, while a lowercase x is interpreted as a vector. (See examples and this vignette.) The primary reason to use a lowercase x is to allow for a function definition using an R function that is not vectorized. In general, an uppercase X should be preferred for speed.

If test_mean is a formula, then it must be a formula that can be passed to glm, lm, lmer, or nls, and the data argument must be specified. The appropriate model is created and is then sent to distfreereg() for method dispatch.

The function method estimates parameter values, and then uses those to evaluate the Jacobian of the mean function and to calculate fitted values.

All of these methods create a Jacobian matrix and a vector of fitted values. These, along with the covariance structure, are sent the default method. The default method also allows the user to implement the algorithm even when the mean structure is not specified in R, as long as the Jacobian, fitted values, and covariance structure can be imported. (This is useful if a particularly complicated function is defined in another language and cannot easily be copied into R.)

The covariance structure for Y|X must be specified using the covariance argument for the function and default methods. For other methods, the covariance is estimated automatically.

Any element of covariance can be a numeric matrix, a list of numeric matrices, or a numeric vector. If it is a vector, its length must be either 1 or the sample size. This option is mathematically equivalent to setting a covariance list element to a diagonal matrix with the specified value(s) along the diagonal. Using vectors, when possible, is more efficient than using the corresponding matrix. If an element of covariance is a list of numeric matrices, then these matrices are interpreted as the blocks of a block diagonal matrix.

Internally, distfreereg() only needs Q, so some efficiency can be gained by supplying that directly when available. When Q is not specified, it is calculated using whichever element is specified. When more than one of the other elements are specified, Q is calculated using the least expensive path, with no warning given if the specified elements are incompatible. (For example, if both Sigma and SqrtSigma elements are supplied to covariance, then Q is calculated using SqrtSigma with no attempt to verify that SqrtSigma is the matrix square root ofSigma.)

The override argument is used primarily by update.distfreereg to avoid unnecessary and potentially computationally expensive recomputation. This update method imports appropriate values automatically from a previously created object of class distfreereg, and therefore validation is not always done. Use manually with caution.

The res_order element of override must be a vector of integers from 1 to n (the sample size) that determines the order of the residuals to use when forming the empitical partial sum process. Elements of the vector can be repeated, in which case the residuals corresponding to matching res_order values are grouped when group is TRUE.

Value

An object of class distfreereg with the following components:

call

The matched call.

data

A list containing the data argument value, if present, and Y and X.

test_mean

The value supplied to the argument test_mean.

model

The model built when using the formula method; present if and only if using that method.

covariance

The list of covariance matrices, containing at least Q.

theta_hat

The estimated parameter vector. When the model being tested has class lmerMod, this is the vector of fixed effect parameters as returned by fixef.

optimization_output

The output of optimization_fun or nls from calculating theta_hat.

fitted_values

The vector of fitted values, f\bigl(X,\hat\theta\bigr). When testing a generalized linear model, these are the fitted values on the response scale.

J

The Jacobian matrix.

mu

The mu matrix.

r

The matrix of transformation anchor vectors.

r_tilde

The matrix of modified transformation anchor vectors.

residuals

A named list of three vectors containing raw, sphered, and transformed residuals.

res_order

A numeric vector indicating the ranking of the residuals used to form the empirical partial sum process, in a format to be used as the input of order.

grouping_matrix

The matrix used to group residuals; present if group is TRUE and some covariate combinations are repeated.

epsp

The empirical partial sum process formed by calculating the scaled partial sums of the transformed residuals ordered according to res_order.

observed_stat

A named list of the observed statistic(s) corresponding to the transformed residuals.

mcsim_stats

A named list, each element of which contains the values of a simulated statistic.

p

A named list with two elements: value, which contains the p-values for each observed statistic, and mcse, which contains the Monte Carlo standard errors for the p-values.

Warnings

Methods for model objects (e.g., lm objects) are intended to be used with objects created using a data argument that contains all variables used by the model. This data argument is assumed to be defined in the same environment as the model object, and this is assumed to be the environment in which distfreereg() is called.

Consistency between test_mean and theta_init is verified only indirectly. Uninformative errors can occur when, for example, theta_init does not have the correct length. The most common error message that arises in this case is "f_out cannot have NA values", which occurs when theta_init is too short. To be safe, always define test_mean to use every element of theta.

No verification of consistency is done when multiple elements of coviariance are specified. For example, if P and Sigma are both specified, then the code will use only one of these, and will not verify that P is the inverse of Sigma.

When using the control argument element optimization_fun to specify an optimization function other than optim, the verification that theta_hat_name actually matches the name of an element of the optimization function's output is done only after the optimization has been done. If this optimization will likely take a long time, it is important to verify the value of theta_hat_name before running distfreereg().

The default values of sym_tol and sym_tol1 are intended to check for substantial asymmetry such as would be caused by the user inputting the wrong matrix. Therefore, they do not check for symmetry with high precision. In particular, these default values are orders of magnitude larger than the default values in isSymmetric.

Non-null values for weights argument in glm are not supported.

Author(s)

Jesse Miller

References

Khmaladze, Estate V. Distribution-free testing in linear and parametric regression, 2021-03, Annals of the Institute of Statistical Mathematics, Vol. 73, No. 6, p. 1063–1087. doi:10.1007/s10463-021-00786-3

See Also

coef.distfreereg, confint.distfreereg, fitted.distfreereg, formula.distfreereg, plot.distfreereg, predict.distfreereg, print.distfreereg, residuals.distfreereg, update.distfreereg, vcov.distfreereg

Examples

set.seed(20240218)
n <- 1e2
func <- function(X, theta) X[,1]^theta[1] + theta[2]*X[,2]
Sig <- runif(n, min = 1, max = 3)
theta <- c(2,5)
X <- matrix(runif(2*n, min = 1, max = 5), nrow = n)
Y <- X[,1]^theta[1] + theta[2]*X[,2] + rnorm(n, sd = sqrt(Sig))
(dfr <- distfreereg(Y = Y, X = X, test_mean = func,
                    covariance = list(Sigma = Sig),
                    theta_init = c(1,1)))

# Same test with lowercase "x" for reference; use uppercase whenever possible
func_lower <- function(x, theta) x[1]^theta[1] + theta[2]*x[2]
(dfr_lower <- distfreereg(Y = Y, X = X, test_mean = func_lower,
                          covariance = list(Sigma = Sig),
                          theta_init = c(1,1)))

[Package distfreereg version 1.1 Index]