distfreereg {distfreereg} | R Documentation |
Distribution-free parametric regression testing
Description
Conduct distribution-free parametric regression testing using the process introduced in Khmaladze (2021). A parametric model for the conditional mean (specified by test_mean
) is checked against the data by fitting the model, transforming the resulting residuals, and then calculating a statistic on the empirical partial sum process of the transformed residuals. The statistic's null distribution can be simulated in a straight-forward way, thereby producing a p-value.
Using f
to denote the mean function being tested, the specific test has the following null and alternative hypotheses:
H_0\colon\ \exists\theta\in\Theta\subseteq\mathbb R^p \mathrel{\bigl|} \textrm{E}(Y| X)=f(X;\theta)
\quad\hbox{against}\quad
H_1\colon\ \forall\theta\in\Theta\subseteq\mathbb R^p \mathrel{\bigl|} \textrm{E}(Y| X)\neq f(X;\theta).
This assumes a known or adequately accurately estimated covariance matrix. See the An Introduction to the distfreereg
Package vignette for an introduction.
Usage
distfreereg(test_mean, ordering = "simplex", group = TRUE,
stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
...)
## Default S3 method:
distfreereg(test_mean = NULL, ordering = "simplex", group = TRUE,
stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
..., Y, X = NULL, covariance, J, fitted_values)
## S3 method for class 'formula'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
..., data, method, method_args = NULL)
## S3 method for class 'function'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
..., Y, X = NULL, covariance, theta_init)
## S3 method for class 'glm'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
...)
## S3 method for class 'lm'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
...)
## S3 method for class 'lmerMod'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
...)
## S3 method for class 'nls'
distfreereg(test_mean, ordering = "simplex", group = TRUE,
stat = c("KS", "CvM"), B = 1e4, control = NULL, override = NULL, verbose = TRUE,
...)
Arguments
test_mean |
A specification of the mean function to be tested. Methods exist for objects of classes |
covariance |
Named list; specifies the covariance structure of the model's error distribution. Valid element names are "
See details. |
ordering |
A character string or a list; specifies how to order the residuals to form the empirical partial sum process. Valid character strings are:
If |
group |
Logical; if |
J |
Numeric matrix; specifies the Jacobian of the function evaluated at the covariates and the estimated parameters. |
fitted_values |
Numeric vector; specifies the model's fitted values. |
stat |
Character vector; specifies the names of the functions used to calculate the desired statistics. By default, a Kolmogorov–Smirnov statistic and a Cramer–von Mises-like statistic are calculated:
where |
B |
Numeric vector of length one; specifies the Monte Carlo sample size used when simulating statistics. Silently converted to integer. |
control |
Optional named list of elements that control the details of the algorithm's computations. The following elements are accepted for all methods:
The following named elements, all but the first of which control the process of calculating the generalized least squares estimation of the parameter vector, are accepted for the
|
override |
Optional named list of arguments that override internally calculated values. Used primarily by
|
verbose |
Logical; if |
... |
Additional arguments to pass to various methods; should be empty except in a call to the generic function. |
data |
Data frame of covariate values used when |
method |
Character vector; specifies the function to use for fitting the model when |
method_args |
Optional list of argument values to be passed to the function specified by the |
Y |
Numeric vector of observations. A non-vector value is converted to a vector. |
X |
Optional numeric matrix of covariates. A non-matrix value is converted to a single-column matrix. |
theta_init |
Numeric vector; specifies the starting parameter values passed to the optimizing function to be used to estimate the parameter vector. |
Details
This function implements distribution-free parametric regression testing. The model is specified by a mean structure and a covariance structure.
The mean structure is specified by the argument test_mean
. This can be an object of class function
, formula
, glm
, lm
, lmerMod
, or nls
. It can also be NULL
.
If test_mean
is a function, then it must have one or two arguments: either theta
only, or theta
and either X
(uppercase) or x
(lowercase). An uppercase X
is interpreted in the function definition as a matrix, while a lowercase x
is interpreted as a vector. (See examples and this vignette.) The primary reason to use a lowercase x
is to allow for a function definition using an R
function that is not vectorized. In general, an uppercase X
should be preferred for speed.
If test_mean
is a formula, then it must be a formula that can be passed to glm
, lm
, lmer
, or nls
, and the data
argument must be specified. The appropriate model is created and is then sent to distfreereg()
for method dispatch.
The function
method estimates parameter values, and then uses those to evaluate the Jacobian of the mean function and to calculate fitted values.
All of these methods create a Jacobian matrix and a vector of fitted values. These, along with the covariance structure, are sent the default method. The default method also allows the user to implement the algorithm even when the mean structure is not specified in R
, as long as the Jacobian, fitted values, and covariance structure can be imported. (This is useful if a particularly complicated function is defined in another language and cannot easily be copied into R
.)
The covariance structure for Y|X
must be specified using the covariance
argument for the function
and default methods. For other methods, the covariance is estimated automatically.
Any element of covariance
can be a numeric matrix, a list of numeric matrices, or a numeric vector. If it is a vector, its length must be either 1 or the sample size. This option is mathematically equivalent to setting a covariance list element to a diagonal matrix with the specified value(s) along the diagonal. Using vectors, when possible, is more efficient than using the corresponding matrix. If an element of covariance
is a list of numeric matrices, then these matrices are interpreted as the blocks of a block diagonal matrix.
Internally, distfreereg()
only needs Q
, so some efficiency can be gained by supplying that directly when available. When Q
is not specified, it is calculated using whichever element is specified. When more than one of the other elements are specified, Q
is calculated using the least expensive path, with no warning given if the specified elements are incompatible. (For example, if both Sigma
and SqrtSigma
elements are supplied to covariance
, then Q
is calculated using SqrtSigma
with no attempt to verify that SqrtSigma
is the matrix square root ofSigma
.)
The override
argument is used primarily by update.distfreereg
to avoid unnecessary and potentially computationally expensive recomputation. This update
method imports appropriate values automatically from a previously created object of class distfreereg
, and therefore validation is not always done. Use manually with caution.
The res_order
element of override
must be a vector of integers from 1 to n
(the sample size) that determines the order of the residuals to use when forming the empitical partial sum process. Elements of the vector can be repeated, in which case the residuals corresponding to matching res_order
values are grouped when group
is TRUE
.
Value
An object of class distfreereg
with the following components:
call |
The matched call. |
data |
A list containing the |
test_mean |
The value supplied to the argument |
model |
The model built when using the |
covariance |
The list of covariance matrices, containing at least |
theta_hat |
The estimated parameter vector. When the model being tested has class |
optimization_output |
The output of |
fitted_values |
The vector of fitted values, |
J |
The Jacobian matrix. |
mu |
The mu matrix. |
r |
The matrix of transformation anchor vectors. |
r_tilde |
The matrix of modified transformation anchor vectors. |
residuals |
A named list of three vectors containing raw, sphered, and transformed residuals. |
res_order |
A numeric vector indicating the ranking of the residuals used to form the empirical partial sum process, in a format to be used as the input of |
grouping_matrix |
The matrix used to group residuals; present if |
epsp |
The empirical partial sum process formed by calculating the scaled
partial sums of the transformed residuals ordered according to |
observed_stat |
A named list of the observed statistic(s) corresponding to the transformed residuals. |
mcsim_stats |
A named list, each element of which contains the values of a simulated statistic. |
p |
A named list with two elements: |
Warnings
Methods for model objects (e.g., lm
objects) are intended to be used with
objects created using a data
argument that contains all variables used by
the model. This data argument is assumed to be defined in the same environment
as the model object, and this is assumed to be the environment in which
distfreereg()
is called.
Consistency between test_mean
and theta_init
is verified only
indirectly. Uninformative errors can occur when, for example, theta_init
does not have the correct length. The most common error message that arises
in this case is "f_out cannot have NA values
", which occurs when
theta_init
is too short. To be safe, always define test_mean
to
use every element of theta
.
No verification of consistency is done when multiple elements of
coviariance
are specified. For example, if P
and Sigma
are
both specified, then the code will use only one of these, and will not verify
that P
is the inverse of Sigma
.
When using the control
argument element optimization_fun
to
specify an optimization function other than optim
, the
verification that theta_hat_name
actually matches the name of an element
of the optimization function's output is done only after the optimization has
been done. If this optimization will likely take a long time, it is important to
verify the value of theta_hat_name
before running distfreereg()
.
The default values of sym_tol
and sym_tol1
are intended to check
for substantial asymmetry such as would be caused by the user inputting the
wrong matrix. Therefore, they do not check for symmetry with high precision. In
particular, these default values are orders of magnitude larger than the default
values in isSymmetric
.
Non-null values for weights
argument in glm
are not
supported.
Author(s)
Jesse Miller
References
Khmaladze, Estate V. Distribution-free testing in linear and parametric regression, 2021-03, Annals of the Institute of Statistical Mathematics, Vol. 73, No. 6, p. 1063–1087. doi:10.1007/s10463-021-00786-3
See Also
coef.distfreereg
, confint.distfreereg
, fitted.distfreereg
, formula.distfreereg
, plot.distfreereg
, predict.distfreereg
, print.distfreereg
, residuals.distfreereg
, update.distfreereg
, vcov.distfreereg
Examples
set.seed(20240218)
n <- 1e2
func <- function(X, theta) X[,1]^theta[1] + theta[2]*X[,2]
Sig <- runif(n, min = 1, max = 3)
theta <- c(2,5)
X <- matrix(runif(2*n, min = 1, max = 5), nrow = n)
Y <- X[,1]^theta[1] + theta[2]*X[,2] + rnorm(n, sd = sqrt(Sig))
(dfr <- distfreereg(Y = Y, X = X, test_mean = func,
covariance = list(Sigma = Sig),
theta_init = c(1,1)))
# Same test with lowercase "x" for reference; use uppercase whenever possible
func_lower <- function(x, theta) x[1]^theta[1] + theta[2]*x[2]
(dfr_lower <- distfreereg(Y = Y, X = X, test_mean = func_lower,
covariance = list(Sigma = Sig),
theta_init = c(1,1)))