gkwreg {gkwreg}R Documentation

Fit Generalized Kumaraswamy Regression Models

Description

Fits regression models using the Generalized Kumaraswamy (GKw) family of distributions for response variables strictly bounded in the interval (0, 1). The function allows modeling parameters from all seven submodels of the GKw family as functions of predictors using appropriate link functions. Estimation is performed using Maximum Likelihood via the TMB (Template Model Builder) package. Requires the Formula and TMB packages.

Usage

gkwreg(
  formula,
  data,
  family = c("gkw", "bkw", "kkw", "ekw", "mc", "kw", "beta"),
  link = NULL,
  link_scale = NULL,
  start = NULL,
  fixed = NULL,
  method = c("nlminb", "BFGS", "Nelder-Mead", "CG", "SANN", "L-BFGS-B"),
  hessian = TRUE,
  plot = TRUE,
  conf.level = 0.95,
  optimizer.control = list(),
  subset = NULL,
  weights = NULL,
  offset = NULL,
  na.action = getOption("na.action"),
  contrasts = NULL,
  x = FALSE,
  y = TRUE,
  model = TRUE,
  silent = TRUE,
  ...
)

Arguments

formula

An object of class Formula (or one that can be coerced to that class). It should be structured as y ~ model_alpha | model_beta | model_gamma | model_delta | model_lambda, where y is the response variable and each model_* part specifies the linear predictor for the corresponding parameter (\alpha, \beta, \gamma, \delta, \lambda). If a part is omitted or specified as ~ 1 or ., an intercept-only model is used for that parameter. See Details for parameter correspondence in subfamilies.

data

A data frame containing the variables specified in the formula.

family

A character string specifying the desired distribution family. Defaults to "gkw". Supported families are:

  • "gkw": Generalized Kumaraswamy (5 parameters: \alpha, \beta, \gamma, \delta, \lambda)

  • "bkw": Beta-Kumaraswamy (4 parameters: \alpha, \beta, \gamma, \delta; \lambda = 1 fixed)

  • "kkw": Kumaraswamy-Kumaraswamy (4 parameters: \alpha, \beta, \delta, \lambda; \gamma = 1 fixed)

  • "ekw": Exponentiated Kumaraswamy (3 parameters: \alpha, \beta, \lambda; \gamma = 1, \delta = 0 fixed)

  • "mc": McDonald / Beta Power (3 parameters: \gamma, \delta, \lambda; \alpha = 1, \beta = 1 fixed)

  • "kw": Kumaraswamy (2 parameters: \alpha, \beta; \gamma = 1, \delta = 0, \lambda = 1 fixed)

  • "beta": Beta distribution (2 parameters: \gamma, \delta; \alpha = 1, \beta = 1, \lambda = 1 fixed)

link

Either a single character string specifying the same link function for all relevant parameters, or a named list specifying the link function for each modeled parameter (e.g., list(alpha = "log", beta = "log", delta = "logit")). Defaults are "log" for \alpha, \beta, \gamma, \lambda (parameters > 0) and "logit" for \delta (parameter in (0, 1)). Supported link functions are:

  • "log": logarithmic link, maps (0, \infty) to (-\infty, \infty)

  • "identity": identity link, no transformation

  • "inverse": inverse link, maps x to 1/x

  • "sqrt": square root link, maps x to \sqrt{x}

  • "inverse-square": inverse squared link, maps x to 1/x^2

  • "logit": logistic link, maps (0, 1) to (-\infty, \infty)

  • "probit": probit link, using normal CDF

  • "cloglog": complementary log-log

  • "cauchy": Cauchy link, using Cauchy CDF

link_scale

Either a single numeric value specifying the same scale for all link functions, or a named list specifying the scale for each parameter's link function (e.g., list(alpha = 10, beta = 5, delta = 1)). The scale affects how the link function transforms the linear predictor. Default is 10 for most parameters and 1 for parameters using probability-type links (such as delta). For probability-type links (logit, probit, cloglog, cauchy), smaller values produce more extreme transformations.

start

An optional named list providing initial values for the regression coefficients. Parameter names should match the distribution parameters (alpha, beta, etc.), and values should be vectors corresponding to the coefficients in the respective linear predictors (including intercept). If NULL (default), suitable starting values are automatically determined based on global parameter estimates.

fixed

An optional named list specifying parameters or coefficients to be held fixed at specific values during estimation. Currently not fully implemented.

method

Character string specifying the optimization algorithm to use. Options are "nlminb" (default, using nlminb), "BFGS", "Nelder-Mead", "CG", "SANN", or "L-BFGS-B". If "nlminb" is selected, R's nlminb function is used; otherwise, R's optim function is used with the specified method.

hessian

Logical. If TRUE (default), the Hessian matrix is computed via sdreport to obtain standard errors and the covariance matrix of the estimated coefficients. Setting to FALSE speeds up fitting but prevents calculation of standard errors and confidence intervals.

plot

Logical. If TRUE (default), enables the generation of diagnostic plots when calling the generic plot() function on the fitted object. Actual plotting is handled by the plot.gkwreg method.

conf.level

Numeric. The confidence level (1 - alpha) for constructing confidence intervals for the parameters. Default is 0.95. Used only if hessian = TRUE.

optimizer.control

A list of control parameters passed directly to the chosen optimizer (nlminb or optim). See their respective documentation for details.

subset

An optional vector specifying a subset of observations from data to be used in the fitting process.

weights

An optional vector of prior weights (e.g., frequency weights) to be used in the fitting process. Should be NULL or a numeric vector of non-negative values.

offset

An optional numeric vector or matrix specifying an a priori known component to be included on the scale of the linear predictor for each parameter. If a vector, it's applied to the predictor of the first parameter in the standard order (\alpha). If a matrix, columns must correspond to parameters in the order \alpha, \beta, \gamma, \delta, \lambda.

na.action

A function which indicates what should happen when the data contain NAs. The default (na.fail) stops if NAs are present. Other options include na.omit or na.exclude.

contrasts

An optional list specifying the contrasts to be used for factor variables in the model. See the contrasts.arg of model.matrix.default.

x

Logical. If TRUE, the list of model matrices (one for each modeled parameter) is returned as component x of the fitted object. Default FALSE.

y

Logical. If TRUE (default), the response variable (after processing by na.action, subset) is returned as component y.

model

Logical. If TRUE (default), the model frame (containing all variables used from data) is returned as component model.

silent

Logical. If TRUE (default), suppresses progress messages from TMB compilation and optimization. Set to FALSE for verbose output.

...

Additional arguments, currently unused or passed down to internal methods (potentially).

Details

The gkwreg function provides a regression framework for the Generalized Kumaraswamy (GKw) family and its submodels, extending density estimation to include covariates. The response variable must be strictly bounded in the (0, 1) interval.

Model Specification: The extended Formula syntax is crucial for specifying potentially different linear predictors for each relevant distribution parameter. The parameters (\alpha, \beta, \gamma, \delta, \lambda) correspond sequentially to the parts of the formula separated by |. For subfamilies where some parameters are fixed by definition (see family argument), the corresponding parts of the formula are automatically ignored. For example, in a family = "kw" model, only the first two parts (for \alpha and \beta) are relevant.

Parameter Constraints and Link Functions: The parameters \alpha, \beta, \gamma, \lambda are constrained to be positive, while \delta is constrained to the interval (0, 1). The default link functions ("log" for positive parameters, "logit" for \delta) ensure these constraints during estimation. Users can specify alternative link functions suitable for the parameter's domain via the link argument.

Link Scales: The link_scale parameter allows users to control how aggressively the link function transforms the linear predictor. For probability-type links (logit, probit, cloglog, cauchy), smaller values (e.g., 1) produce more extreme transformations, while larger values (e.g., 10) produce more gradual transformations. For continuous parameters, scale values control the sensitivity of the transformation.

Families and Parameters: The function automatically handles parameter fixing based on the chosen family:

Estimation Engine: Maximum Likelihood Estimation (MLE) is performed using C++ templates via the TMB package, which provides automatic differentiation and efficient optimization capabilities. The specific TMB template used depends on the chosen family.

Optimizer Method (method argument):

Value

An object of class gkwreg. This is a list containing the following components:

call

The matched function call.

family

The specified distribution family string.

formula

The Formula object used.

coefficients

A named vector of estimated regression coefficients.

fitted.values

Vector of estimated means (expected values) of the response.

residuals

Vector of response residuals (observed - fitted mean).

fitted_parameters

List containing the estimated mean value for each distribution parameter (\alpha, \beta, \gamma, \delta, \lambda).

parameter_vectors

List containing vectors of the estimated parameters (\alpha, \beta, \gamma, \delta, \lambda) for each observation, evaluated on the link scale.

link

List of link functions used for each parameter.

param_names

Character vector of names of the parameters modeled by the family.

fixed_params

Named list indicating which parameters are fixed by the family definition.

loglik

The maximized log-likelihood value.

aic

Akaike Information Criterion.

bic

Bayesian Information Criterion.

deviance

The deviance ( -2 * loglik ).

df.residual

Residual degrees of freedom (nobs - npar).

nobs

Number of observations used in the fit.

npar

Total number of estimated parameters (coefficients).

vcov

The variance-covariance matrix of the coefficients (if hessian = TRUE).

se

Standard errors of the coefficients (if hessian = TRUE).

convergence

Convergence code from the optimizer (0 typically indicates success).

message

Convergence message from the optimizer.

iterations

Number of iterations used by the optimizer.

rmse

Root Mean Squared Error of response residuals.

efron_r2

Efron's pseudo R-squared.

mean_absolute_error

Mean Absolute Error of response residuals.

x

List of model matrices (if x = TRUE).

y

The response vector (if y = TRUE).

model

The model frame (if model = TRUE).

tmb_object

The raw object returned by MakeADFun.

Author(s)

Lopes, J. E.

References

Kumaraswamy, P. (1980). A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2), 79-88.

Cordeiro, G. M., & de Castro, M. (2011). A new family of generalized distributions. Journal of Statistical Computation and Simulation, 81(7), 883-898.

Ferrari, S. L. P., & Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31(7), 799-815.

Kristensen, K., Nielsen, A., Berg, C. W., Skaug, H., & Bell, B. M. (2016). TMB: Automatic Differentiation and Laplace Approximation. Journal of Statistical Software, 70(5), 1-21. (Underlying TMB package)

Zeileis, A., Kleiber, C., Jackman, S. (2008). Regression Models for Count Data in R. Journal of Statistical Software, 27(8), 1-25.

Smithson, M., & Verkuilen, J. (2006). A Better Lemon Squeezer? Maximum-Likelihood Regression with Beta-Distributed Dependent Variables. Psychological Methods, 11(1), 54–71.

See Also

summary.gkwreg, predict.gkwreg, plot.gkwreg, coef.gkwreg, vcov.gkwreg, logLik, AIC, Formula, MakeADFun, sdreport

Examples


## -------------------------------------------------------------------------
## 1. Real-world Case Studies
## -------------------------------------------------------------------------

## Example 1: Food Expenditure Data
# Load required package
require(gkwreg)

# Get FoodExpenditure data and create response variable 'y' as proportion of income spent on food
food_data <- get_bounded_datasets("FoodExpenditure")
food_data <- within(food_data, {
  y <- food / income
})

# Define formula: y depends on 'persons' with 'income' as predictor for second parameter
formu_fe <- y ~ persons | income

# Fit Kumaraswamy model with log link for both parameters
kw_model <- gkwreg(formu_fe, food_data,
  family = "kw",
  link = rep("log", 2), method = "nlminb"
)

# Display model summary and diagnostics
summary(kw_model)
plot(kw_model, use_ggplot = TRUE, arrange_plots = TRUE, sub.caption = "")

## Example 2: Gasoline Yield Data
# Load GasolineYield dataset
gasoline_data <- get_bounded_datasets("GasolineYield")

# Formula: yield depends on batch and temperature
# First part (for alpha/gamma) includes batch and temp
# Second part (for beta/delta/phi) includes only temp
formu_gy <- yield ~ batch + temp | temp

# Fit Kumaraswamy model with log link and BFGS optimization
kw_model_gas <- gkwreg(formu_gy, gasoline_data,
  family = "kw",
  link = rep("log", 2), method = "BFGS"
)

# Display results
summary(kw_model_gas)
plot(kw_model_gas, use_ggplot = TRUE, arrange_plots = TRUE, sub.caption = "")

## Example 3: SDAC Cancer Data
# Load cancer survival dataset
sdac_data <- get_bounded_datasets("sdac")

# Formula: relative cumulative density ~ age adjustment + chemotherapy
formu_sd <- rcd ~ ageadj + chemo

# Fit Extended Kumaraswamy model
ekw_model_gas <- gkwreg(formu_sd, sdac_data, family = "ekw", method = "BFGS")
summary(ekw_model_gas)
plot(ekw_model_gas, use_ggplot = TRUE, arrange_plots = TRUE, sub.caption = "")

## Example 4: Retinal Data
# Load retinal dataset
retinal_data <- get_bounded_datasets("retinal")

# Formula for three parameters with different predictors
# alpha ~ LogT + LogT2 + Level
# beta  ~ LogT + Level
# gamma ~ Time
formu_rt <- Gas ~ LogT + LogT2 + Level | LogT + Level | Time

# Fit Extended Kumaraswamy model
ekw_model_ret <- gkwreg(formu_rt, retinal_data, family = "ekw", method = "nlminb")
summary(ekw_model_ret)
plot(ekw_model_ret, use_ggplot = TRUE, arrange_plots = TRUE, sub.caption = "")

## Example 5: Weather Task Agreement Data
# Load the WeatherTask dataset
df_weather <- get_bounded_datasets("WeatherTask")

# Fit all seven distribution families to the 'agreement' variable
fitall_weather <- gkwfitall(df_weather$agreement, method = "BFGS")

# Compare model performance
summary(fitall_weather) # Displays the comparison table

# Identify the best family based on AIC
best_family_code <- fitall_weather$comparison$Family[1]

# Refit the best model for detailed analysis
fit_best_weather <- gkwfit(
  df_weather$agreement,
  family = best_family_code,
  method = "BFGS", profile = TRUE, plot = TRUE, silent = TRUE
)

# Generate Goodness-of-Fit report
gof_report <- gkwgof(
  fit_best_weather,
  theme = ggplot2::theme_classic(),
  plot = TRUE, print_summary = FALSE, verbose = FALSE
)
summary(gof_report) # Display GoF statistics

# Extract fit statistics for all families
results_weathertask_df <- do.call(rbind, lapply(fitall_weather$fits, function(f) {
  extract_gof_stats(gkwgof(f,
    plot = FALSE,
    print_summary = FALSE, verbose = FALSE
  ))
}))
results_weathertask_df <- results_weathertask_df[order(results_weathertask_df$AIC), ]
row.names(results_weathertask_df) <- NULL

# Generate diagnostic plots for best model
plot(gkwgof(fit_best_weather, theme = ggplot2::theme_classic()), title = "")

# Display formatted comparison table
results_weathertask_df[
  ,
  c(
    "family", "n_params", "logLik", "AIC", "BIC",
    "KS", "AD", "RMSE", "pseudo_R2"
  )
]

## -------------------------------------------------------------------------
## 2. Simulation Studies
## -------------------------------------------------------------------------

## Example 1: Simple Kumaraswamy Regression Model
# Set seed for reproducibility
set.seed(123)
n <- 1000
x1 <- runif(n, -2, 2)
x2 <- rnorm(n)

# Define true regression coefficients
alpha_coef <- c(0.8, 0.3, -0.2) # Intercept, x1, x2
beta_coef <- c(1.2, -0.4, 0.1) # Intercept, x1, x2

# Generate linear predictors and transform using exponential link
eta_alpha <- alpha_coef[1] + alpha_coef[2] * x1 + alpha_coef[3] * x2
eta_beta <- beta_coef[1] + beta_coef[2] * x1 + beta_coef[3] * x2
alpha_true <- exp(eta_alpha)
beta_true <- exp(eta_beta)

# Generate responses from Kumaraswamy distribution
y <- rkw(n, alpha = alpha_true, beta = beta_true)
df1 <- data.frame(y = y, x1 = x1, x2 = x2)

# Fit Kumaraswamy regression model with formula notation
# Model: alpha ~ x1 + x2 and beta ~ x1 + x2
kw_reg <- gkwreg(y ~ x1 + x2 | x1 + x2, data = df1, family = "kw", silent = TRUE)

# Alternative model with custom link scales
kw_reg2 <- gkwreg(y ~ x1 + x2 | x1 + x2,
  data = df1, family = "kw",
  link_scale = list(alpha = 5, beta = 8), silent = TRUE
)

# Display model summary
summary(kw_reg)

## Example 2: Generalized Kumaraswamy Regression
# Set seed for reproducibility
set.seed(456)
n <- 1000
x1 <- runif(n, -1, 1)
x2 <- rnorm(n)
x3 <- factor(rbinom(n, 1, 0.5), labels = c("A", "B")) # Factor variable

# Define true regression coefficients for all parameters
alpha_coef <- c(0.5, 0.2) # Intercept, x1
beta_coef <- c(0.8, -0.3, 0.1) # Intercept, x1, x2
gamma_coef <- c(0.6, 0.4) # Intercept, x3B
delta_coef <- c(0.0, 0.2) # Intercept, x3B (logit scale)
lambda_coef <- c(-0.2, 0.1) # Intercept, x2

# Create design matrices
X_alpha <- model.matrix(~x1, data = data.frame(x1 = x1))
X_beta <- model.matrix(~ x1 + x2, data = data.frame(x1 = x1, x2 = x2))
X_gamma <- model.matrix(~x3, data = data.frame(x3 = x3))
X_delta <- model.matrix(~x3, data = data.frame(x3 = x3))
X_lambda <- model.matrix(~x2, data = data.frame(x2 = x2))

# Generate parameters through linear predictors and appropriate link functions
alpha <- exp(X_alpha %*% alpha_coef)
beta <- exp(X_beta %*% beta_coef)
gamma <- exp(X_gamma %*% gamma_coef)
delta <- plogis(X_delta %*% delta_coef) # logit link for delta
lambda <- exp(X_lambda %*% lambda_coef)

# Generate response from Generalized Kumaraswamy distribution
y <- rgkw(n, alpha = alpha, beta = beta, gamma = gamma, delta = delta, lambda = lambda)
df2 <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)

# Fit GKw regression with parameter-specific formulas
gkw_reg <- gkwreg(y ~ x1 | x1 + x2 | x3 | x3 | x2, data = df2, family = "gkw")

# Alternative model with custom link scales
gkw_reg2 <- gkwreg(y ~ x1 | x1 + x2 | x3 | x3 | x2,
  data = df2, family = "gkw",
  link_scale = list(
    alpha = 12, beta = 12, gamma = 12,
    delta = 0.8, lambda = 12
  )
)

# Compare true vs. estimated coefficients
print("Estimated Coefficients (GKw):")
print(coef(gkw_reg))
print("True Coefficients (approx):")
print(list(
  alpha = alpha_coef, beta = beta_coef, gamma = gamma_coef,
  delta = delta_coef, lambda = lambda_coef
))

## Example 3: Beta Regression for Comparison
# Set seed for reproducibility
set.seed(789)
n <- 1000
x1 <- runif(n, -1, 1)

# True coefficients for Beta parameters (gamma = shape1, delta = shape2)
gamma_coef <- c(1.0, 0.5) # Intercept, x1 (log scale)
delta_coef <- c(1.5, -0.7) # Intercept, x1 (log scale)

# Generate parameters through linear predictors and log link
X_beta_eg <- model.matrix(~x1, data.frame(x1 = x1))
gamma_true <- exp(X_beta_eg %*% gamma_coef)
delta_true <- exp(X_beta_eg %*% delta_coef)

# Generate response from Beta distribution
y <- rbeta_(n, gamma_true, delta_true)
df_beta <- data.frame(y = y, x1 = x1)

# Fit Beta regression model using gkwreg
beta_reg <- gkwreg(y ~ x1 | x1,
  data = df_beta, family = "beta",
  link = list(gamma = "log", delta = "log")
)

## Example 4: Model Comparison using AIC/BIC
# Fit an alternative model (Kumaraswamy) to the same beta-generated data
kw_reg2 <- try(gkwreg(y ~ x1 | x1, data = df_beta, family = "kw"))

# Compare models using information criteria
print("AIC Comparison (Beta vs Kw):")
c(AIC(beta_reg), AIC(kw_reg2))
print("BIC Comparison (Beta vs Kw):")
c(BIC(beta_reg), BIC(kw_reg2))

## Example 5: Prediction with Fitted Models
# Create new data for predictions
newdata <- data.frame(x1 = seq(-1, 1, length.out = 20))

# Predict expected response (mean of the Beta distribution)
pred_response <- predict(beta_reg, newdata = newdata, type = "response")

# Predict parameters on the scale of the link function
pred_link <- predict(beta_reg, newdata = newdata, type = "link")

# Predict parameters on the original scale
pred_params <- predict(beta_reg, newdata = newdata, type = "parameter")

# Visualize fitted model and data
plot(df_beta$x1, df_beta$y,
  pch = 20, col = "grey", xlab = "x1", ylab = "y",
  main = "Beta Regression Fit (using gkwreg)"
)
lines(newdata$x1, pred_response, col = "red", lwd = 2)
legend("topright", legend = "Predicted Mean", col = "red", lty = 1, lwd = 2)



[Package gkwreg version 1.0.10 Index]