Diabetes Risk from Survey Data

The Diabetes Health Indicators Dataset is a phone survey of ~254K Americans covering BMI, smoking, exercise, income, and education, but no blood tests or biomarkers. After removing 24K duplicate rows and collapsing the near-empty prediabetes class (2% of samples), I had 229,712 records with a 4.79:1 class imbalance, split 70/15/15 into train/calibration/test sets.

The calibration set is kept deliberately separate from both the training and test sets: it's where I'd later fit the probability calibrator without contaminating the held-out test evaluation.
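A 70/15/15 stratified split comes down to two chained calls to scikit-learn's train_test_split; the data below is a synthetic stand-in for the survey, not the real schema:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the survey features and binary diabetes label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.binomial(1, 0.17, size=1000)  # roughly a 4.79:1 imbalance

# First carve off 70% for training, stratified on the label...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
# ...then split the remainder 50/50 into calibration and test (15% each).
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42
)
```

Stratifying both splits keeps the 4.79:1 class ratio consistent across all three sets.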

Feature Engineering

EDA showed that individuals with both high blood pressure and high cholesterol had 31.4% diabetes prevalence vs. 6.2% for those with neither. The effect is multiplicative rather than additive, which led me to create two interaction features:

Feature               Formula                          MI Score
cvd_bmi_interaction   cardiovascular_risk × (BMI/30)   0.0601
bmi_age_interaction   (BMI/30) × (Age/13)              0.0470
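Both features are simple column products. A sketch with pandas and scikit-learn's mutual_info_classif; the cardiovascular_risk composite and all data here are illustrative assumptions, not the dataset's actual encoding:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "HighBP": rng.binomial(1, 0.4, 2000),
    "HighChol": rng.binomial(1, 0.4, 2000),
    "BMI": rng.normal(30, 6, 2000),
    "Age": rng.integers(1, 14, 2000),  # BRFSS age is a 13-level ordinal code
})
y = rng.binomial(1, 0.17, 2000)

# Hypothetical composite CVD flag, then the two scaled interactions.
df["cardiovascular_risk"] = df["HighBP"] + df["HighChol"]
df["cvd_bmi_interaction"] = df["cardiovascular_risk"] * (df["BMI"] / 30)
df["bmi_age_interaction"] = (df["BMI"] / 30) * (df["Age"] / 13)

# Mutual information of the new features against the label.
mi = mutual_info_classif(
    df[["cvd_bmi_interaction", "bmi_age_interaction"]], y, random_state=0
)
```

Dividing by 30 and 13 keeps both interactions near the unit scale, so neither term dominates the product.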

SHAP later confirmed their importance: they ranked 2nd and 3rd globally (behind GenHlth). A side benefit: continuous interaction features give SMOTEENN's synthetic samples somewhere meaningful to land in feature space, since interpolation between purely binary features is nearly meaningless.

SHAP's fourth-ranked feature was Income: not a direct health variable but a proxy for healthcare access, diet quality, and a dozen mediating factors. The model is partially using socioeconomic position as a risk signal, which has direct fairness implications.

Model Selection and the 0.81 Ceiling

Six models, five-fold CV, with SMOTEENN applied inside each fold (applying it before splitting would leak synthetic samples into validation folds). All six landed in a tight 0.80–0.81 ROC-AUC band regardless of architecture. CatBoost won on F1 (0.491) and Average Precision (0.455). I chose it over logistic regression despite the latter's higher recall (0.83 vs. 0.65), because AP better captures the precision-recall tradeoff under imbalance, and flooding patients with false positives has real costs.
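The key discipline is resampling only the training fold and validating on untouched data. A minimal sketch, substituting plain random oversampling for SMOTEENN so the placement rule stays visible (the model and data are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 6))
y = rng.binomial(1, 0.17, 2000)

aucs = []
for tr_idx, va_idx in StratifiedKFold(
    n_splits=5, shuffle=True, random_state=0
).split(X, y):
    # Resample ONLY the training fold: duplicate minority rows up to parity.
    # (SMOTEENN would synthesize and clean instead; same placement rule.)
    minority = tr_idx[y[tr_idx] == 1]
    extra = rng.choice(minority, size=(y[tr_idx] == 0).sum() - len(minority))
    X_bal = np.vstack([X[tr_idx], X[extra]])
    y_bal = np.concatenate([y[tr_idx], y[extra]])

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    # Validate on the untouched fold: no synthetic rows leak in.
    aucs.append(roc_auc_score(y[va_idx], model.predict_proba(X[va_idx])[:, 1]))
```

Resampling before the split would place near-duplicates of training rows in validation folds and inflate every fold's score.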

The convergence to 0.81 across all architectures is a data signal, not a modeling failure. The 1,542 false negatives in the test set had mean BMI 28.0 vs. 33.2 for true positives, and HighBP prevalence of 37% vs. 86%. The model misses diabetics who don't fit the classic metabolic profile: people whose risk comes from factors like family history and HbA1c, neither of which BRFSS collects.

Post-tuning (Optuna, 50 TPE trials): 0.8135 ROC-AUC on test.

Calibration and Threshold

Isotonic regression fitted on the held-out calibration set dropped the Brier score from 0.157 to 0.115 (a 27% improvement over the uncalibrated model). The decision threshold was set at 0.184 via Youden's J statistic. Final test results: 74% recall, 36% precision, 0.486 F1. Since this is a screening application, the priority was maximizing recall to catch as many true cases as possible.
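Both steps map onto scikit-learn primitives. A sketch on synthetic, deliberately miscalibrated scores, with the calibrator fit on the calibration split only:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss, roc_curve

rng = np.random.default_rng(3)
# Synthetic "raw" scores: informative but miscalibrated.
y_cal = rng.binomial(1, 0.17, 3000)
p_raw_cal = np.clip(y_cal * 0.35 + rng.beta(2, 5, 3000), 0, 1)

# Fit isotonic regression on the calibration set only.
iso = IsotonicRegression(out_of_bounds="clip").fit(p_raw_cal, y_cal)
p_cal = iso.predict(p_raw_cal)

before = brier_score_loss(y_cal, p_raw_cal)
after = brier_score_loss(y_cal, p_cal)

# Youden's J = TPR - FPR; pick the threshold that maximizes it.
fpr, tpr, thresholds = roc_curve(y_cal, p_cal)
threshold = thresholds[np.argmax(tpr - fpr)]
```

In production the calibrated model and threshold would then be evaluated once on the untouched test set.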

Clustering: Four Population Segments

K-Means (k=4, Silhouette 0.533) on PCA-reduced features from a 50K training subset. I chose k=4 over k=2 (which gives a better Silhouette of 0.67) because k=2 collapses the elderly-comorbidity and obesity profiles into one blob.
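Weighing silhouette against interpretability is a loop over candidate k values; a sketch with synthetic blobs standing in for the PCA-reduced survey features:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-reduced 50K training subset.
X, _ = make_blobs(n_samples=3000, centers=4, cluster_std=1.5, random_state=0)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
# The highest silhouette is not automatically the right k: check whether
# the extra clusters separate profiles you care about before collapsing them.
```

Here the tie-breaker was substantive: k=2 merges two clinically distinct high-risk profiles despite its better score.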

Cluster                                Size   Diabetes Rate   Risk
Low-Risk Active Majority               77%    15.0%           LOW
High-Risk Obese with CVD               6.6%   25.3%           HIGH
High-Risk Elderly with Comorbidities   9.2%   29.7%           HIGH
Moderate-Risk Younger Adults           7.2%   18.8%           MODERATE

A 2× difference between the lowest and highest diabetes rates (15% to 29.7%) gives health systems a clear target for intervention. Cluster 0 is 77% of the population and still carries a non-trivial 15% diabetes rate.

The Fairness Problem

Aggregate recall is 74%. Disaggregated by age and income:

Group                  Recall   ROC-AUC
Young adults (18–39)   33.5%    0.852
Middle-aged (40–59)    64.8%    0.821
Seniors (60+)          80.5%    0.761
Low income             86.8%    0.783
Middle income          76.3%    0.797
High income            62.4%    0.815
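Producing numbers like these is a group-by over the test predictions; a sketch with synthetic predictions and hypothetical group labels:

```python
import numpy as np
from sklearn.metrics import recall_score

rng = np.random.default_rng(4)
y_true = rng.binomial(1, 0.17, 5000)
# Noisy predictions: agree with the truth 80% of the time.
y_pred = np.where(rng.random(5000) < 0.8, y_true, 1 - y_true)
age_group = rng.choice(["18-39", "40-59", "60+"], size=5000)

recalls = {}
for g in np.unique(age_group):
    mask = age_group == g
    recalls[g] = recall_score(y_true[mask], y_pred[mask])
# Aggregate recall can look healthy while individual groups lag badly.
```

The same loop over an income column yields the second half of the table.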

The model misses two-thirds of diabetic young adults. These patients may spend decades with unmanaged disease. The complications of undiagnosed diabetes compound over time, and early detection matters most for this group.

Low-income patients have the highest recall (86.8%) but the worst ROC-AUC (0.783). More diabetics caught but also proportionally more false alarms. Each false positive for a low-income patient means time away from work, transportation costs, and the psychological burden of a label that turns out to be wrong.

The model relies on comorbidity features to identify diabetics. Young adults with diabetes typically haven’t developed hypertension or CVD markers yet. Their risk is invisible to BRFSS.

Optimizing for aggregate recall redistributed classification errors onto the most socioeconomically vulnerable populations.

Cluster-Model Integration

Running the trained model within each cluster shows the issue:

Cluster                    Recall   FN Rate
Low-Risk Majority (77%)    0.680    32.0%
High-Risk Obese (6.6%)     0.887    11.3%
High-Risk Elderly (9.2%)   0.881    11.9%
Moderate-Risk (7.2%)       0.765    23.5%

The model works well in high-risk clusters because those patients have the comorbidities the engineered features capture. In Cluster 0 (77% of the population), diabetics are heterogeneous and don't fit the classic profile; one in three gets missed.

The global threshold of 0.184 is doing very different things in different segments. A cluster-aware threshold strategy, with a lower threshold for Cluster 0 to catch atypical cases, could improve equity without degrading overall performance. A 32% false negative rate in the majority population is hard to accept.
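Cluster-aware thresholding is a small change at decision time. A sketch in which the per-cluster thresholds are hypothetical values that would in practice be chosen on the calibration set:

```python
import numpy as np

rng = np.random.default_rng(5)
proba = rng.random(1000)             # calibrated probabilities
cluster = rng.integers(0, 4, 1000)   # K-Means cluster assignments

# Hypothetical per-cluster thresholds: lower for the heterogeneous
# majority cluster (0) to catch atypical cases, the global 0.184 elsewhere.
thresholds = np.array([0.12, 0.184, 0.184, 0.184])
y_pred = (proba >= thresholds[cluster]).astype(int)
```

Indexing the threshold array by cluster label vectorizes the whole decision rule in one comparison.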

Retrospective

  • Nested CV would give more reliable performance estimates; the current pipeline risks overfitting to the validation signal during model selection.
  • Clinical variables are the binding constraint. Family history and HbA1c would likely push past 0.81; BRFSS was never designed as a diagnostic instrument.
  • Longitudinal data would let you predict progression to diabetes rather than current status, which is the more useful clinical question.
  • Per-subgroup calibration is a prerequisite for trustworthy individual risk communication. The aggregate Brier score hides miscalibration at the group level.