gkwreg {gkwreg} | R Documentation |
Fit Generalized Kumaraswamy Regression Models
Description
Fits regression models using the Generalized Kumaraswamy (GKw) family of
distributions for response variables strictly bounded in the interval (0, 1).
The function allows modeling parameters from all seven submodels of the GKw
family as functions of predictors using appropriate link functions. Estimation
is performed using Maximum Likelihood via the TMB (Template Model Builder) package.
Requires the Formula
and TMB
packages.
Usage
gkwreg(
formula,
data,
family = c("gkw", "bkw", "kkw", "ekw", "mc", "kw", "beta"),
link = NULL,
link_scale = NULL,
start = NULL,
fixed = NULL,
method = c("nlminb", "BFGS", "Nelder-Mead", "CG", "SANN", "L-BFGS-B"),
hessian = TRUE,
plot = TRUE,
conf.level = 0.95,
optimizer.control = list(),
subset = NULL,
weights = NULL,
offset = NULL,
na.action = getOption("na.action"),
contrasts = NULL,
x = FALSE,
y = TRUE,
model = TRUE,
silent = TRUE,
...
)
Arguments
formula |
An object of class |
data |
A data frame containing the variables specified in the |
family |
A character string specifying the desired distribution family.
Defaults to
|
link |
Either a single character string specifying the same link function
for all relevant parameters, or a named list specifying the link function for each
modeled parameter (e.g.,
|
link_scale |
Either a single numeric value specifying the same scale for all
link functions, or a named list specifying the scale for each parameter's link
function (e.g., |
start |
An optional named list providing initial values for the regression
coefficients. Parameter names should match the distribution parameters (alpha,
beta, etc.), and values should be vectors corresponding to the coefficients
in the respective linear predictors (including intercept). If |
fixed |
An optional named list specifying parameters or coefficients to be held fixed at specific values during estimation. Currently not fully implemented. |
method |
Character string specifying the optimization algorithm to use.
Options are |
hessian |
Logical. If |
plot |
Logical. If |
conf.level |
Numeric. The confidence level (1 - alpha) for constructing
confidence intervals for the parameters. Default is 0.95. Used only if
|
optimizer.control |
A list of control parameters passed directly to the
chosen optimizer ( |
subset |
An optional vector specifying a subset of observations from |
weights |
An optional vector of prior weights (e.g., frequency weights)
to be used in the fitting process. Should be |
offset |
An optional numeric vector or matrix specifying an a priori known
component to be included on the scale of the linear predictor for each parameter.
If a vector, it's applied to the predictor of the first parameter in the standard
order ( |
na.action |
A function which indicates what should happen when the data
contain |
contrasts |
An optional list specifying the contrasts to be used for factor
variables in the model. See the |
x |
Logical. If |
y |
Logical. If |
model |
Logical. If |
silent |
Logical. If |
... |
Additional arguments, currently unused or passed down to internal methods (potentially). |
Details
The gkwreg
function provides a regression framework for the Generalized
Kumaraswamy (GKw) family and its submodels, extending density estimation
to include covariates. The response variable must be strictly bounded in the
(0, 1) interval.
Model Specification:
The extended Formula
syntax is crucial for specifying
potentially different linear predictors for each relevant distribution parameter.
The parameters (\alpha, \beta, \gamma, \delta, \lambda
) correspond sequentially
to the parts of the formula separated by |
. For subfamilies where some
parameters are fixed by definition (see family
argument), the corresponding
parts of the formula are automatically ignored. For example, in a family = "kw"
model, only the first two parts (for \alpha
and \beta
) are relevant.
Parameter Constraints and Link Functions:
The parameters \alpha, \beta, \gamma, \lambda
are constrained to be positive,
while \delta
is constrained to the interval (0, 1). The default link functions
("log"
for positive parameters, "logit"
for \delta
) ensure these
constraints during estimation. Users can specify alternative link functions suitable
for the parameter's domain via the link
argument.
Link Scales:
The link_scale
parameter allows users to control how aggressively the link
function transforms the linear predictor. For probability-type links (logit, probit,
cloglog, cauchy), smaller values (e.g., 1) produce more extreme transformations,
while larger values (e.g., 10) produce more gradual transformations. For continuous
parameters, scale values control the sensitivity of the transformation.
Families and Parameters:
The function automatically handles parameter fixing based on the chosen family
:
-
GKw: All 5 parameters (
\alpha, \beta, \gamma, \delta, \lambda
) modeled. -
BKw: Models
\alpha, \beta, \gamma, \delta
; fixes\lambda = 1
. -
KKw: Models
\alpha, \beta, \delta, \lambda
; fixes\gamma = 1
. -
EKw: Models
\alpha, \beta, \lambda
; fixes\gamma = 1, \delta = 0
. -
Mc (McDonald): Models
\gamma, \delta, \lambda
; fixes\alpha = 1, \beta = 1
. -
Kw (Kumaraswamy): Models
\alpha, \beta
; fixes\gamma = 1, \delta = 0, \lambda = 1
. -
Beta: Models
\gamma, \delta
; fixes\alpha = 1, \beta = 1, \lambda = 1
. This parameterization corresponds to the standard Beta distribution with shape1 =\gamma
and shape2 =\delta
.
Estimation Engine:
Maximum Likelihood Estimation (MLE) is performed using C++ templates via the
TMB
package, which provides automatic differentiation and efficient
optimization capabilities. The specific TMB template used depends on the chosen family
.
Optimizer Method (method
argument):
-
"nlminb"
: Uses R's built-instats::nlminb
optimizer. Good for problems with box constraints. Default option. -
"Nelder-Mead"
: Uses R'sstats::optim
with the Nelder-Mead simplex algorithm, which doesn't require derivatives. -
"BFGS"
: Uses R'sstats::optim
with the BFGS quasi-Newton method for unconstrained optimization. -
"CG"
: Uses R'sstats::optim
with conjugate gradients method for unconstrained optimization. -
"SANN"
: Uses R'sstats::optim
with simulated annealing, a global optimization method useful for problems with multiple local minima. -
"L-BFGS-B"
: Uses R'sstats::optim
with the limited-memory BFGS method with box constraints.
Value
An object of class gkwreg
. This is a list containing the
following components:
call |
The matched function call. |
family |
The specified distribution family string. |
formula |
The |
coefficients |
A named vector of estimated regression coefficients. |
fitted.values |
Vector of estimated means (expected values) of the response. |
residuals |
Vector of response residuals (observed - fitted mean). |
fitted_parameters |
List containing the estimated mean value for each distribution parameter ( |
parameter_vectors |
List containing vectors of the estimated parameters ( |
link |
List of link functions used for each parameter. |
param_names |
Character vector of names of the parameters modeled by the family. |
fixed_params |
Named list indicating which parameters are fixed by the family definition. |
loglik |
The maximized log-likelihood value. |
aic |
Akaike Information Criterion. |
bic |
Bayesian Information Criterion. |
deviance |
The deviance ( -2 * loglik ). |
df.residual |
Residual degrees of freedom (nobs - npar). |
nobs |
Number of observations used in the fit. |
npar |
Total number of estimated parameters (coefficients). |
vcov |
The variance-covariance matrix of the coefficients (if |
se |
Standard errors of the coefficients (if |
convergence |
Convergence code from the optimizer (0 typically indicates success). |
message |
Convergence message from the optimizer. |
iterations |
Number of iterations used by the optimizer. |
rmse |
Root Mean Squared Error of response residuals. |
efron_r2 |
Efron's pseudo R-squared. |
mean_absolute_error |
Mean Absolute Error of response residuals. |
x |
List of model matrices (if |
y |
The response vector (if |
model |
The model frame (if |
tmb_object |
The raw object returned by |
Author(s)
Lopes, J. E.
References
Kumaraswamy, P. (1980). A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2), 79-88.
Cordeiro, G. M., & de Castro, M. (2011). A new family of generalized distributions. Journal of Statistical Computation and Simulation, 81(7), 883-898.
Ferrari, S. L. P., & Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31(7), 799-815.
Kristensen, K., Nielsen, A., Berg, C. W., Skaug, H., & Bell, B. M. (2016). TMB: Automatic Differentiation and Laplace Approximation. Journal of Statistical Software, 70(5), 1-21. (Underlying TMB package)
Zeileis, A., Kleiber, C., Jackman, S. (2008). Regression Models for Count Data in R. Journal of Statistical Software, 27(8), 1-25.
Smithson, M., & Verkuilen, J. (2006). A Better Lemon Squeezer? Maximum-Likelihood Regression with Beta-Distributed Dependent Variables. Psychological Methods, 11(1), 54–71.
See Also
summary.gkwreg
, predict.gkwreg
,
plot.gkwreg
, coef.gkwreg
, vcov.gkwreg
,
logLik
, AIC
,
Formula
, MakeADFun
,
sdreport
Examples
## -------------------------------------------------------------------------
## 1. Real-world Case Studies
## -------------------------------------------------------------------------
## Example 1: Food Expenditure Data
# Load required package
require(gkwreg)
# Get FoodExpenditure data and create response variable 'y' as proportion of income spent on food
food_data <- get_bounded_datasets("FoodExpenditure")
food_data <- within(food_data, {
y <- food / income
})
# Define formula: y depends on 'persons' with 'income' as predictor for second parameter
formu_fe <- y ~ persons | income
# Fit Kumaraswamy model with log link for both parameters
kw_model <- gkwreg(formu_fe, food_data,
family = "kw",
link = rep("log", 2), method = "nlminb"
)
# Display model summary and diagnostics
summary(kw_model)
plot(kw_model, use_ggplot = TRUE, arrange_plots = TRUE, sub.caption = "")
## Example 2: Gasoline Yield Data
# Load GasolineYield dataset
gasoline_data <- get_bounded_datasets("GasolineYield")
# Formula: yield depends on batch and temperature
# First part (for alpha/gamma) includes batch and temp
# Second part (for beta/delta/phi) includes only temp
formu_gy <- yield ~ batch + temp | temp
# Fit Kumaraswamy model with log link and BFGS optimization
kw_model_gas <- gkwreg(formu_gy, gasoline_data,
family = "kw",
link = rep("log", 2), method = "BFGS"
)
# Display results
summary(kw_model_gas)
plot(kw_model_gas, use_ggplot = TRUE, arrange_plots = TRUE, sub.caption = "")
## Example 3: SDAC Cancer Data
# Load cancer survival dataset
sdac_data <- get_bounded_datasets("sdac")
# Formula: relative cumulative density ~ age adjustment + chemotherapy
formu_sd <- rcd ~ ageadj + chemo
# Fit Extended Kumaraswamy model
ekw_model_gas <- gkwreg(formu_sd, sdac_data, family = "ekw", method = "BFGS")
summary(ekw_model_gas)
plot(ekw_model_gas, use_ggplot = TRUE, arrange_plots = TRUE, sub.caption = "")
## Example 4: Retinal Data
# Load retinal dataset
retinal_data <- get_bounded_datasets("retinal")
# Formula for three parameters with different predictors
# alpha ~ LogT + LogT2 + Level
# beta ~ LogT + Level
# gamma ~ Time
formu_rt <- Gas ~ LogT + LogT2 + Level | LogT + Level | Time
# Fit Extended Kumaraswamy model
ekw_model_ret <- gkwreg(formu_rt, retinal_data, family = "ekw", method = "nlminb")
summary(ekw_model_ret)
plot(ekw_model_ret, use_ggplot = TRUE, arrange_plots = TRUE, sub.caption = "")
## Example 5: Weather Task Agreement Data
# Load the WeatherTask dataset
df_weather <- get_bounded_datasets("WeatherTask")
# Fit all seven distribution families to the 'agreement' variable
fitall_weather <- gkwfitall(df_weather$agreement, method = "BFGS")
# Compare model performance
summary(fitall_weather) # Displays the comparison table
# Identify the best family based on AIC
best_family_code <- fitall_weather$comparison$Family[1]
# Refit the best model for detailed analysis
fit_best_weather <- gkwfit(
df_weather$agreement,
family = best_family_code,
method = "BFGS", profile = TRUE, plot = TRUE, silent = TRUE
)
# Generate Goodness-of-Fit report
gof_report <- gkwgof(
fit_best_weather,
theme = ggplot2::theme_classic(),
plot = TRUE, print_summary = FALSE, verbose = FALSE
)
summary(gof_report) # Display GoF statistics
# Extract fit statistics for all families
results_weathertask_df <- do.call(rbind, lapply(fitall_weather$fits, function(f) {
extract_gof_stats(gkwgof(f,
plot = FALSE,
print_summary = FALSE, verbose = FALSE
))
}))
results_weathertask_df <- results_weathertask_df[order(results_weathertask_df$AIC), ]
row.names(results_weathertask_df) <- NULL
# Generate diagnostic plots for best model
plot(gkwgof(fit_best_weather, theme = ggplot2::theme_classic()), title = "")
# Display formatted comparison table
results_weathertask_df[
,
c(
"family", "n_params", "logLik", "AIC", "BIC",
"KS", "AD", "RMSE", "pseudo_R2"
)
]
## -------------------------------------------------------------------------
## 2. Simulation Studies
## -------------------------------------------------------------------------
## Example 1: Simple Kumaraswamy Regression Model
# Set seed for reproducibility
set.seed(123)
n <- 1000
x1 <- runif(n, -2, 2)
x2 <- rnorm(n)
# Define true regression coefficients
alpha_coef <- c(0.8, 0.3, -0.2) # Intercept, x1, x2
beta_coef <- c(1.2, -0.4, 0.1) # Intercept, x1, x2
# Generate linear predictors and transform using exponential link
eta_alpha <- alpha_coef[1] + alpha_coef[2] * x1 + alpha_coef[3] * x2
eta_beta <- beta_coef[1] + beta_coef[2] * x1 + beta_coef[3] * x2
alpha_true <- exp(eta_alpha)
beta_true <- exp(eta_beta)
# Generate responses from Kumaraswamy distribution
y <- rkw(n, alpha = alpha_true, beta = beta_true)
df1 <- data.frame(y = y, x1 = x1, x2 = x2)
# Fit Kumaraswamy regression model with formula notation
# Model: alpha ~ x1 + x2 and beta ~ x1 + x2
kw_reg <- gkwreg(y ~ x1 + x2 | x1 + x2, data = df1, family = "kw", silent = TRUE)
# Alternative model with custom link scales
kw_reg2 <- gkwreg(y ~ x1 + x2 | x1 + x2,
data = df1, family = "kw",
link_scale = list(alpha = 5, beta = 8), silent = TRUE
)
# Display model summary
summary(kw_reg)
## Example 2: Generalized Kumaraswamy Regression
# Set seed for reproducibility
set.seed(456)
n <- 1000
x1 <- runif(n, -1, 1)
x2 <- rnorm(n)
x3 <- factor(rbinom(n, 1, 0.5), labels = c("A", "B")) # Factor variable
# Define true regression coefficients for all parameters
alpha_coef <- c(0.5, 0.2) # Intercept, x1
beta_coef <- c(0.8, -0.3, 0.1) # Intercept, x1, x2
gamma_coef <- c(0.6, 0.4) # Intercept, x3B
delta_coef <- c(0.0, 0.2) # Intercept, x3B (logit scale)
lambda_coef <- c(-0.2, 0.1) # Intercept, x2
# Create design matrices
X_alpha <- model.matrix(~x1, data = data.frame(x1 = x1))
X_beta <- model.matrix(~ x1 + x2, data = data.frame(x1 = x1, x2 = x2))
X_gamma <- model.matrix(~x3, data = data.frame(x3 = x3))
X_delta <- model.matrix(~x3, data = data.frame(x3 = x3))
X_lambda <- model.matrix(~x2, data = data.frame(x2 = x2))
# Generate parameters through linear predictors and appropriate link functions
alpha <- exp(X_alpha %*% alpha_coef)
beta <- exp(X_beta %*% beta_coef)
gamma <- exp(X_gamma %*% gamma_coef)
delta <- plogis(X_delta %*% delta_coef) # logit link for delta
lambda <- exp(X_lambda %*% lambda_coef)
# Generate response from Generalized Kumaraswamy distribution
y <- rgkw(n, alpha = alpha, beta = beta, gamma = gamma, delta = delta, lambda = lambda)
df2 <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
# Fit GKw regression with parameter-specific formulas
gkw_reg <- gkwreg(y ~ x1 | x1 + x2 | x3 | x3 | x2, data = df2, family = "gkw")
# Alternative model with custom link scales
gkw_reg2 <- gkwreg(y ~ x1 | x1 + x2 | x3 | x3 | x2,
data = df2, family = "gkw",
link_scale = list(
alpha = 12, beta = 12, gamma = 12,
delta = 0.8, lambda = 12
)
)
# Compare true vs. estimated coefficients
print("Estimated Coefficients (GKw):")
print(coef(gkw_reg))
print("True Coefficients (approx):")
print(list(
alpha = alpha_coef, beta = beta_coef, gamma = gamma_coef,
delta = delta_coef, lambda = lambda_coef
))
## Example 3: Beta Regression for Comparison
# Set seed for reproducibility
set.seed(789)
n <- 1000
x1 <- runif(n, -1, 1)
# True coefficients for Beta parameters (gamma = shape1, delta = shape2)
gamma_coef <- c(1.0, 0.5) # Intercept, x1 (log scale)
delta_coef <- c(1.5, -0.7) # Intercept, x1 (log scale)
# Generate parameters through linear predictors and log link
X_beta_eg <- model.matrix(~x1, data.frame(x1 = x1))
gamma_true <- exp(X_beta_eg %*% gamma_coef)
delta_true <- exp(X_beta_eg %*% delta_coef)
# Generate response from Beta distribution
y <- rbeta_(n, gamma_true, delta_true)
df_beta <- data.frame(y = y, x1 = x1)
# Fit Beta regression model using gkwreg
beta_reg <- gkwreg(y ~ x1 | x1,
data = df_beta, family = "beta",
link = list(gamma = "log", delta = "log")
)
## Example 4: Model Comparison using AIC/BIC
# Fit an alternative model (Kumaraswamy) to the same beta-generated data
kw_reg2 <- try(gkwreg(y ~ x1 | x1, data = df_beta, family = "kw"))
# Compare models using information criteria
print("AIC Comparison (Beta vs Kw):")
c(AIC(beta_reg), AIC(kw_reg2))
print("BIC Comparison (Beta vs Kw):")
c(BIC(beta_reg), BIC(kw_reg2))
## Example 5: Prediction with Fitted Models
# Create new data for predictions
newdata <- data.frame(x1 = seq(-1, 1, length.out = 20))
# Predict expected response (mean of the Beta distribution)
pred_response <- predict(beta_reg, newdata = newdata, type = "response")
# Predict parameters on the scale of the link function
pred_link <- predict(beta_reg, newdata = newdata, type = "link")
# Predict parameters on the original scale
pred_params <- predict(beta_reg, newdata = newdata, type = "parameter")
# Visualize fitted model and data
plot(df_beta$x1, df_beta$y,
pch = 20, col = "grey", xlab = "x1", ylab = "y",
main = "Beta Regression Fit (using gkwreg)"
)
lines(newdata$x1, pred_response, col = "red", lwd = 2)
legend("topright", legend = "Predicted Mean", col = "red", lty = 1, lwd = 2)