cachar_sample {riskdiff} | R Documentation
Synthetic Cancer Risk Factor Study Data
Description
A synthetic dataset modeled on cancer screening and risk factor patterns observed during an opportunistic screening program at the Cachar Cancer Hospital and Research Centre in Northeast India. It is designed to reflect realistic epidemiological relationships and contains no real patient data.
Usage
cachar_sample
Format
A data frame with 2,500 rows and 12 variables:
- id
Participant identifier (1 to 2500)
- age
Age in years (continuous, range 18-84)
- sex
Biological sex: "male" or "female"
- residence
Residence type: "rural", "urban", or "urban slum"
- smoking
Current smoking status: "No" or "Yes"
- tobacco_chewing
Current tobacco chewing: "No" or "Yes"
- areca_nut
Current areca nut use: "No" or "Yes"
- alcohol
Current alcohol use: "No" or "Yes"
- abnormal_screen
Binary outcome: 1 = abnormal screening (precancerous lesions or cancer), 0 = normal
- head_neck_abnormal
Binary outcome: 1 = head/neck abnormality detected, 0 = normal
- age_group
Age categories: "Under 40", "40-60", "Over 60"
- tobacco_areca_both
Combined exposure: "Yes" if both tobacco_chewing and areca_nut are "Yes", "No" otherwise
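The two derived variables (age_group and tobacco_areca_both) can be rebuilt from the base columns. The sketch below runs on a small made-up data frame rather than the packaged data. The age cut points (under 40, 40 to 60 inclusive, over 60) are an assumption read off the category labels, not a documented rule.

```r
# Toy data frame with the same column names as cachar_sample (values are invented).
toy <- data.frame(
  age = c(25, 40, 60, 72),
  tobacco_chewing = c("Yes", "Yes", "No", "No"),
  areca_nut = c("Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Assumed boundaries: < 40, 40 to 60 inclusive, > 60.
toy$age_group <- factor(
  ifelse(toy$age < 40, "Under 40",
         ifelse(toy$age <= 60, "40-60", "Over 60")),
  levels = c("Under 40", "40-60", "Over 60")
)

# "Yes" only when both chewing exposures are "Yes", as described above.
toy$tobacco_areca_both <- ifelse(
  toy$tobacco_chewing == "Yes" & toy$areca_nut == "Yes", "Yes", "No"
)

print(toy)
```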
Details
This synthetic dataset was designed to reflect authentic epidemiological patterns observed in Northeast India, particularly the distinctive tobacco and areca nut use patterns of the region. All data points are mathematically generated rather than collected from real individuals.
Key epidemiological features modeled:
- Areca nut use: Very high prevalence (~69%) reflecting regional cultural practices
- Tobacco chewing: Moderate to high prevalence (~53%), often used with areca nut
- Smoking: Lower prevalence (~13%) with strong male predominance
- Cancer outcomes: Realistic prevalence (~3.5%) for population-based screening, including both precancerous lesions and invasive cancers
- Geographic patterns: Predominantly rural population (~87%)
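The approximate prevalences listed above can be checked directly against the data. The sketch below is one way to do that; prevalence_pct is a hypothetical helper written for this illustration, not a riskdiff function, and the check on the packaged data runs only if riskdiff is installed.

```r
# Hypothetical helper: percentage of observations equal to `level`.
prevalence_pct <- function(x, level = "Yes") {
  100 * mean(x == level, na.rm = TRUE)
}

# Toy check on invented values: 3 of 4 are "Yes".
prevalence_pct(c("Yes", "No", "Yes", "Yes"))  # 75

# Apply to the packaged data when riskdiff is available.
if (requireNamespace("riskdiff", quietly = TRUE)) {
  data("cachar_sample", package = "riskdiff")
  exposures <- c("areca_nut", "tobacco_chewing", "smoking")
  print(round(sapply(cachar_sample[exposures], prevalence_pct), 1))
  print(round(prevalence_pct(cachar_sample$residence, level = "rural"), 1))
  print(round(100 * mean(cachar_sample$abnormal_screen == 1), 1))
}
```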
Synthetic Data Advantages: The synthetic approach preserves authentic statistical relationships while:
- Avoiding any privacy or ethical concerns
- Ensuring reproducible examples and tests
- Providing controlled demonstration scenarios
- Maintaining cultural authenticity for educational purposes
Risk Factor Relationships: The data models realistic dose-response relationships between multiple tobacco exposures and cancer outcomes, with particularly strong associations for areca nut use and head/neck abnormalities, reflecting authentic epidemiological patterns from this region.
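For orientation, an unadjusted risk difference is the risk in the exposed group minus the risk in the unexposed group. The sketch below computes it with a Wald confidence interval from 2x2 counts. The function risk_difference and the counts are invented for illustration; this is the textbook calculation and not necessarily how calc_risk_diff estimates it.

```r
# Hypothetical helper: unadjusted risk difference with a Wald interval.
risk_difference <- function(events_exposed, n_exposed,
                            events_unexposed, n_unexposed,
                            conf_level = 0.95) {
  p1 <- events_exposed / n_exposed      # risk among exposed
  p0 <- events_unexposed / n_unexposed  # risk among unexposed
  rd <- p1 - p0
  se <- sqrt(p1 * (1 - p1) / n_exposed + p0 * (1 - p0) / n_unexposed)
  z <- qnorm(1 - (1 - conf_level) / 2)
  c(rd = rd, lower = rd - z * se, upper = rd + z * se)
}

# Invented counts: 8 of 100 exposed and 3 of 100 unexposed have the outcome.
res <- risk_difference(8, 100, 3, 100)
print(round(res, 3))  # rd is 0.08 - 0.03 = 0.05
```

The Wald interval can behave poorly when risks are close to 0, which is relevant for rare outcomes such as those in this dataset.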
Note
This synthetic dataset is designed for educational and software demonstration purposes. While the statistical relationships reflect authentic epidemiological patterns, the data should not be used for research conclusions about real populations. The cultural patterns represented (high areca nut use, specific tobacco consumption practices) are authentic to Northeast India.
Source
Synthetic dataset created for the riskdiff package. It is inspired by cancer screening patterns observed in Northeast India but contains no real patient data. The statistical relationships were designed to reflect authentic epidemiological patterns from this region for educational and methodological purposes.
References
Epidemiological patterns modeled after studies of tobacco use and cancer risk in Northeast India. For research involving actual populations from this region, consult published literature on areca nut and tobacco-related cancer risks in South Asian populations.
Warnakulasuriya S, Trivedy C, Peters TJ (2002). "Areca nut use: an independent risk factor for oral cancer." BMJ, 324(7341), 799-800.
Gupta PC, Ray CS (2004). "Epidemiology of betel quid usage." Annals of the Academy of Medicine, Singapore, 33(4 Suppl), 31-36.
Examples
data(cachar_sample)
head(cachar_sample)
# Basic descriptive statistics
table(cachar_sample$areca_nut, cachar_sample$abnormal_screen)
# Regional tobacco use patterns
with(cachar_sample, table(areca_nut, tobacco_chewing))
# Simple risk difference for areca nut and abnormal screening
rd_areca <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut"
)
print(rd_areca)
# Age-adjusted analysis
rd_adjusted <- calc_risk_diff(
  data = cachar_sample,
  outcome = "abnormal_screen",
  exposure = "areca_nut",
  adjust_vars = "age"
)
print(rd_adjusted)
# Stratified by sex
rd_stratified <- calc_risk_diff(
  data = cachar_sample,
  outcome = "head_neck_abnormal",
  exposure = "smoking",
  strata = "sex"
)
print(rd_stratified)
# Multiple tobacco exposures comparison
rd_smoking <- calc_risk_diff(cachar_sample, "abnormal_screen", "smoking")
rd_chewing <- calc_risk_diff(cachar_sample, "abnormal_screen", "tobacco_chewing")
rd_areca <- calc_risk_diff(cachar_sample, "abnormal_screen", "areca_nut")
# Compare risk differences
cat("Risk differences for abnormal screening:\n")
cat("Smoking:", sprintf("%.1f%%", rd_smoking$rd * 100), "\n")
cat("Tobacco chewing:", sprintf("%.1f%%", rd_chewing$rd * 100), "\n")
cat("Areca nut:", sprintf("%.1f%%", rd_areca$rd * 100), "\n")
# Create summary table
cat(create_simple_table(rd_areca, "Abnormal Screening Risk by Areca Nut Use"))