gen.dat {BayesPIM} | R Documentation |
gen.dat: Simulate Screening Data for a Prevalence-Incidence Mixture Model
Description
Generates synthetic data according to the Bayesian prevalence-incidence mixture (PIM) framework of Klausch et al. (2025) with interval-censored screening outcomes. The function simulates continuous or discrete baseline covariates, event times from one of several parametric families, and irregular screening schedules, yielding interval-censored observations suitable for testing or demonstrating PIM-based or other interval-censored survival methods.
Usage
gen.dat(
  kappa = 0.7,
  n = 1000,
  p = 2,
  p.discrete = 0,
  r = 0,
  s = 1,
  sigma.X = 1/2,
  mu.X = 4,
  beta.X = NULL,
  beta.W = NULL,
  theta = 0.15,
  v.min = 1,
  v.max = 6,
  mean.rc = 40,
  dist.X = "weibull",
  k = 1,
  sel.mod = "probit",
  prob.r = 0
)
Arguments
kappa |
Numeric. Test sensitivity parameter \kappa. |
n |
Integer. Sample size. |
p |
Integer. Number of continuous baseline covariates to simulate. |
p.discrete |
Integer. If p.discrete = 1, a single discrete covariate drawn from \mathrm{Bernoulli}(0.5) is added. |
r |
Numeric. Correlation coefficient(s) used to build the covariance matrix of continuous covariates. If |
s |
Numeric. Standard deviation(s) of the continuous covariates. If |
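sigma.X |
Numeric. Scale parameter \sigma_X of the AFT model for event times. |
mu.X |
Numeric. Intercept \beta_{x0} of the AFT model for event times. |
beta.X |
Numeric vector. The coefficients \beta_{x} of the AFT model for event times. |
beta.W |
Numeric vector. The coefficients \beta_{w} of the prevalence model. |
theta |
Numeric. Baseline prevalence parameter on the probability scale. Under sel.mod = "probit", \beta_{w0} = \mathrm{qnorm}(\theta); under sel.mod = "logit", \beta_{w0} = \log(\theta / (1-\theta)). |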
v.min |
Numeric. Minimum spacing for irregular screening intervals. |
v.max |
Numeric. Maximum spacing for irregular screening intervals. |
mean.rc |
Numeric. Mean of the exponential distribution controlling a random right-censoring time |
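dist.X |
Character. Distribution for survival times: one of "weibull", "lognormal", "loglog" (log-logistic), or "gengamma" (generalized gamma). |
k |
Numeric. Shape parameter for dist.X = "gengamma". |
sel.mod |
Character. Either "probit" or "logit"; selects the link used in the prevalence model. |
prob.r |
Numeric. Probability that a baseline test is performed (r_i = 1). |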
Details
The data-generating process includes:
- Covariates Z: Continuous covariates are simulated using a correlation structure specified by r and a common standard deviation s. If p.discrete = 1, a single discrete covariate is added, drawn from \mathrm{Bernoulli}(0.5).
- Event Times X: An Accelerated Failure Time (AFT) model is used:
  \log(x_i) = \beta_{x0} + \beta_{x}^\top z_{xi} + \sigma_X \,\epsilon_i,
  where \beta_{x0} is the intercept (set by mu.X), \beta_{x} are the other regression coefficients (provided via beta.X), and \sigma_X is the scale parameter (set by sigma.X). The error term \epsilon_i is drawn from the distribution chosen by dist.X: "weibull", "lognormal", "loglog" (log-logistic), or "gengamma" (generalized gamma). For "gengamma", the shape parameter k is additionally used.
- Irregular Screening Schedules V_i: Each individual has multiple screening times, with the spacing between successive screenings generated randomly between v.min and v.max. The schedule ends in right censoring or at the time of detection. These screening times (including a 0 for baseline and Inf for censoring) are returned in Vobs.
- Prevalence Indicator g_i: Baseline prevalence is modeled via either a probit or logit link, consistent with:
  w_i = \beta_{w0} + \beta_{w}^\top z_{wi} + \psi_i,
  where \beta_{w0} is determined by theta, and \beta_{w} by beta.W. Specifically:
  - If sel.mod = "probit", then \beta_{w0} = \mathrm{qnorm}(\theta).
  - If sel.mod = "logit", then \beta_{w0} = \log(\theta / (1-\theta)).
  We set g_i = 1 if w_i > 0, and g_i = 0 otherwise.
- Baseline Test Missingness r_i: A baseline test indicator r_i \in \{0,1\} is generated via \mathrm{Bernoulli}(\text{prob.r}), so r_i = 1 means the baseline test is performed and r_i = 0 means it is missing.
- Test Sensitivity \kappa: A misclassification parameter \kappa (test sensitivity) can be specified via kappa. If \kappa < 1, some truly positive cases are missed.
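The steps above can be sketched in base R without calling the package. This is an illustrative sketch only, not the internal code of gen.dat. Several choices are assumptions not stated on this page: zero-mean normal covariates, the Weibull error taken as the log of a standard exponential draw, uniform visit spacing, and omission of baseline testing for prevalent cases. The helper simulate_visits and the names rho, r.ind and cens are hypothetical.

```r
set.seed(1)
n <- 500; p <- 2

# Covariates Z: correlation rho (argument r) and common standard deviation s.
# Zero means and normality are assumptions of this sketch.
rho <- 0.3; s <- 1
Sigma <- s^2 * (diag(1 - rho, p) + rho)      # s^2 on the diagonal, rho * s^2 elsewhere
Z <- matrix(rnorm(n * p), ncol = p) %*% chol(Sigma)

# Event times X: log(x) = mu.X + Z beta.X + sigma.X * eps.
# eps = log(Exp(1)) is the standard extreme-value error that yields Weibull times.
mu.X <- 4; sigma.X <- 0.5; beta.X <- c(0.2, 0.2)
eps <- log(rexp(n))
x <- exp(mu.X + drop(Z %*% beta.X) + sigma.X * eps)

# Prevalence indicator g under the probit link: g = 1 if w > 0.
theta <- 0.15; beta.W <- c(0.5, -0.2)
lin.pred <- qnorm(theta) + drop(Z %*% beta.W)
g <- as.integer(lin.pred + rnorm(n) > 0)
p.W <- pnorm(lin.pred)                       # implied P(g_i = 1)

# Baseline test indicator: 1 = performed, 0 = missing.
prob.r <- 0.5
r.ind <- rbinom(n, 1, prob.r)

# Screening schedule: uniform spacing on [v.min, v.max] is an assumption.
# A test at or after the event time is positive with probability kappa.
simulate_visits <- function(x, cens, v.min = 1, v.max = 6, kappa = 0.7) {
  v <- 0
  repeat {
    nxt <- v[length(v)] + runif(1, v.min, v.max)
    if (nxt > cens) return(c(v, Inf))            # right censored before next visit
    v <- c(v, nxt)
    if (nxt >= x && runif(1) < kappa) return(v)  # event detected at this visit
  }
}
cens <- rexp(n, rate = 1 / 40)               # right-censoring times with mean 40
Vobs <- mapply(simulate_visits, x, cens, SIMPLIFY = FALSE)
```

With kappa < 1, a schedule can continue past the first visit after the true event time, which is how missed positives show up in this sketch.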
Value
A list with the following elements:
Vobs
A list of length n, each entry containing screening times. The first element is 0 (baseline), and Inf may indicate right censoring.
X.true
Numeric vector of length n giving the true (latent) event times x_i.
Z
Numeric matrix of dimension n \times p (plus an extra column if p.discrete = 1) containing the covariates.
C
Binary vector of length n, indicating whether an individual is truly positive at baseline (g_i = 1).
r
Binary vector of length n, indicating whether the baseline test was performed (r_i = 1) or missing (r_i = 0).
p.W
Numeric vector of length n giving the true prevalence probabilities, P(g_i = 1).
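Because Vobs is a list of vectors of varying length, a per-individual summary is often the first thing needed. The sketch below uses a small hand-written list that mimics the structure described above (first element 0, a trailing Inf for right censoring). The three entries are hypothetical and were not produced by gen.dat.

```r
# Hypothetical Vobs-like list, written by hand to mimic the documented structure
Vobs <- list(
  c(0, 2.1, 5.4, Inf),  # screened at 2.1 and 5.4, then right censored
  c(0, 3.0, 6.2),       # schedule ends at a finite time (detection at 6.2)
  c(0, Inf)             # censored before any follow-up screening
)

# TRUE if the schedule ends in right censoring
censored <- vapply(Vobs, function(v) is.infinite(v[length(v)]), logical(1))

# Last finite screening time for each individual
last.finite <- vapply(Vobs, function(v) max(v[is.finite(v)]), numeric(1))

# Number of follow-up screenings after baseline
n.visits <- vapply(Vobs, function(v) sum(is.finite(v) & v > 0), integer(1))

data.frame(censored, last.finite, n.visits)
```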
References
T. Klausch, B. I. Lissenberg-Witte, and V. M. Coupé, “A Bayesian prevalence-incidence mixture model for screening outcomes with misclassification,” arXiv:2412.16065.
Examples
# Generate a small dataset for testing
set.seed(2025)
sim_data <- gen.dat(n = 100, p = 1, p.discrete = 1,
                    sigma.X = 0.5, mu.X = 2,
                    beta.X = c(0.2, 0.2), beta.W = c(0.5, -0.2),
                    theta = 0.2,
                    dist.X = "weibull", sel.mod = "probit")
str(sim_data)
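A possible follow-up check, sketched here on the assumption that the BayesPIM package is installed and exports gen.dat, compares the simulated prevalence with the true prevalence probabilities. It uses only the list elements named under Value. The guard skips the code when the package is unavailable.

```r
if (requireNamespace("BayesPIM", quietly = TRUE)) {
  set.seed(2025)
  sim_data <- BayesPIM::gen.dat(n = 100, p = 1, p.discrete = 1,
                                sigma.X = 0.5, mu.X = 2,
                                beta.X = c(0.2, 0.2), beta.W = c(0.5, -0.2),
                                theta = 0.2,
                                dist.X = "weibull", sel.mod = "probit")

  # Observed share of truly prevalent cases vs. mean true prevalence probability
  c(observed = mean(sim_data$C), expected = mean(sim_data$p.W))

  # Share of schedules that end in right censoring (last element Inf)
  mean(vapply(sim_data$Vobs, function(v) is.infinite(v[length(v)]), logical(1)))
}
```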