Machine Learning applied to cardiovascular health
Systemic arterial hypertension is one of the greatest public health challenges worldwide. Known as the "silent killer," it often shows no symptoms until significant cardiovascular damage occurs. Early detection through predictive Machine Learning models can help health professionals identify patients at risk, enabling preventive interventions that save lives and reduce treatment costs.
Hypertension affects approximately 1.28 billion adults worldwide
Source: WHO Global Report 2023 [6]
Nearly half of people with hypertension are unaware of their condition
Source: WHO/Lancet Study 2021 [7]
Leading preventable risk factor for cardiovascular disease, causing 10.8 million deaths per year
Source: WHO Fact Sheet 2024 [8]
End-to-end Machine Learning pipeline following best practices
Descriptive statistics, correlations, outliers, multicollinearity (VIF)
Imputation, SMOTE (train only), scaling, stratified split
10 models, 5-fold cross-validation, ensemble methods
Grid Search, Random Search, clinical threshold analysis
Feature importance, SHAP values, clinical validation
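The stratified split and 5-fold cross-validation steps listed above can be sketched in plain Python. This is a minimal illustration rather than the project's actual code: the function names `stratified_split` and `stratified_folds`, the 20% test fraction, and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(labels, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs for k stratified folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, val_idx


# Toy labels (assumed) with roughly the 31% prevalence reported for the dataset.
labels = [1] * 31 + [0] * 69
train_idx, test_idx = stratified_split(labels)
```

Any resampling or scaling would then be fitted on `train_idx` rows only, so that no information from the held-out rows leaks into training.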
Dataset used to train and validate the predictive model
A dataset is the collection of information used to "teach" the Machine Learning model.
In this project, we use real data from 4,240 patients with clinical and demographic information.
A 31% prevalence indicates that about 1 in 3 patients in the study has hypertension,
which represents an imbalance handled with specific techniques (SMOTE) during training.
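To make the SMOTE idea concrete, here is a simplified, hand-rolled sketch of its core step: each synthetic minority row is an interpolation between a real minority row and one of its nearest minority neighbours. The function name `smote_sketch` and the toy rows are assumptions for illustration; a real pipeline would more likely rely on a maintained implementation such as the one in imbalanced-learn, applied to the training portion only.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=42):
    """Create synthetic minority rows by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen row (excluding itself).
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical hypertensive rows from the training split only:
# [pressao_sistolica, idade]
train_minority = [[150.0, 61.0], [162.0, 58.0], [148.0, 66.0], [171.0, 70.0]]
new_rows = smote_sketch(train_minority, n_synthetic=3, k=2)
```

Because every synthetic row lies on a segment between two real rows, its values stay within the range already observed in the minority class.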
Dataset: The dataset used in this project is publicly available on Kaggle.
Access the link: Kaggle - hypertension-risk-model-main
| Variable | Type | Description |
|---|---|---|
| pressao_sistolica | Continuous | Systolic blood pressure (mmHg) |
| pressao_diastolica | Continuous | Diastolic blood pressure (mmHg) |
| idade | Continuous | Age in years |
| imc | Continuous | Body Mass Index (kg/m^2) |
| colesterol_total | Continuous | Total cholesterol (mg/dL) |
| glicose | Continuous | Blood glucose (mg/dL) |
| frequencia_cardiaca | Continuous | Heart rate (bpm) |
| cigarros_por_dia | Continuous | Cigarettes per day |
| sexo | Categorical | 0 = Female, 1 = Male |
| fumante_atualmente | Categorical | 0 = No, 1 = Yes |
| medicamento_pressao | Categorical | 0 = No, 1 = Yes |
| diabetes | Categorical | 0 = No, 1 = Yes |
Comparison of 10 models with stratified cross-validation
We tested 10 different Machine Learning algorithms to find the most suitable for predicting hypertension.
The Random Forest was selected as the best model because it offers the best balance between
correctly detecting hypertensive patients (high sensitivity) and not generating unnecessary false alarms
(good specificity).
In a clinical context, the priority is to avoid missing positive cases
(that is, to minimize false negatives), so we use the F2-Score, which weights sensitivity more heavily than precision, as the primary decision metric.
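The F2-Score is the F-beta score with beta = 2: F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). The short sketch below computes it from confusion counts; the function name `f_beta` and the two example models are hypothetical and only illustrate how the metric rewards recall.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion counts; beta > 1 favours recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical models evaluated on 100 hypertensive patients each.
high_recall = f_beta(tp=90, fp=30, fn=10)     # recall 0.90, precision 0.75
high_precision = f_beta(tp=75, fp=10, fn=25)  # recall 0.75, precision ~0.88
```

With beta = 2 the first model scores higher, even though the second is more precise, because it misses fewer hypertensive patients.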
n_estimators=100, max_depth=10, min_samples_split=2
AUC: 95.96%, F2: 84.80%
AUC: 95.34%, F2: 84.30%
AUC: 95.47%, F2: 82.81%
What it shows: Compares the performance of all tested models using different metrics. Use the dropdown to switch metrics. The taller the bar, the better the model performs on that metric.
What it shows: The ROC curve illustrates the model's ability to distinguish between patients with and without hypertension. The closer the curve is to the upper-left corner, the better the model. The area under the curve (AUC) summarizes this ability: values close to 100% indicate excellent discrimination.
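One way to read the AUC is as the probability that a randomly chosen hypertensive patient receives a higher score than a randomly chosen non-hypertensive one. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy scores are assumptions for illustration, and the pairwise loop is only practical for small samples.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: one hypertensive patient (score 0.4) is ranked below
# one non-hypertensive patient (score 0.5).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(y_true, scores)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.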
What it shows: Summarizes the model's correct and incorrect predictions. True Positives (TP): hypertensive patients correctly identified. True Negatives (TN): healthy patients correctly identified. False Negatives (FN): hypertensive patients not detected (most critical in healthcare). False Positives (FP): false alarms in healthy patients.
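The four cells of the confusion matrix, and the sensitivity and specificity derived from them, can be computed as in the sketch below. The function names and the toy predictions are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = hypertensive)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["TP"] += 1
        elif truth == 0 and pred == 0:
            counts["TN"] += 1
        elif truth == 0 and pred == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


def sensitivity(counts):
    """Share of hypertensive patients that were detected."""
    return counts["TP"] / (counts["TP"] + counts["FN"])


def specificity(counts):
    """Share of healthy patients that were correctly cleared."""
    return counts["TN"] / (counts["TN"] + counts["FP"])


# Toy example: 4 hypertensive and 6 healthy patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
```

In this toy case one hypertensive patient is missed (FN = 1) and one healthy patient triggers a false alarm (FP = 1).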
What it shows: Comparative visualization of multiple metrics at once for the main models. Each axis represents a different metric. An ideal model would have all metrics close to 100%, forming a large polygon. Useful to identify strengths and weaknesses of each model.
Understanding model decisions for clinical validation
In medical applications, it is not enough for the model to make accurate predictions; it is essential to understand
why it reached that conclusion. Interpretability allows clinicians to validate whether the model is using
clinically relevant criteria, increasing trust in the system and supporting shared decision-making with the
patient.
"Black-box" models that do not explain their decisions are less
accepted in clinical practice.
What it shows: How much each variable contributes to the model's decision. Larger bars indicate greater influence on the prediction. The colors identify the clinical category of each variable. Hover over each bar to see additional details.
What it shows: Grouping of variables by clinical category. The size of each circle represents the total contribution of that category to the model's predictions. Note how Blood Pressure dominates, which is aligned with established clinical knowledge about hypertension.
What it shows: The threshold (decision cutoff) determines from which probability the model classifies a patient as "at risk." A lower threshold (e.g., 0.30) detects more cases but generates more false alarms. A higher threshold (e.g., 0.80) is more precise but may miss some cases. The choice depends on the clinical context: population screening vs. diagnostic confirmation.
| Recommended use | Sensitivity | Specificity |
|---|---|---|
| Minimize false negatives | 95.5% | 86.5% |
| General use | 91.5% | 91.0% |
| Minimize false positives | 79.7% | 95.7% |
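The threshold trade-off can be reproduced on a small scale by sweeping a few cutoffs over predicted probabilities. The sketch below is illustrative only: the function name `sens_spec_at`, the toy probabilities, and the candidate cutoffs (0.30, 0.50, 0.80) are assumptions and do not reproduce the figures reported for the project.

```python
def sens_spec_at(y_true, probs, threshold):
    """Sensitivity and specificity when 'at risk' means prob >= threshold."""
    tp = fn = tn = fp = 0
    for truth, prob in zip(y_true, probs):
        predicted = 1 if prob >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy predicted probabilities for 5 hypertensive and 5 healthy patients.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.95, 0.85, 0.70, 0.45, 0.35, 0.60, 0.40, 0.25, 0.15, 0.05]

sweep = {t: sens_spec_at(y_true, probs, t) for t in (0.30, 0.50, 0.80)}

# The cutoff whose sensitivity and specificity are closest to each other.
balanced = min(sweep, key=lambda t: abs(sweep[t][0] - sweep[t][1]))
```

On this toy data the lowest cutoff catches every hypertensive patient at the cost of more false alarms, while the highest cutoff produces no false alarms but misses most cases.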
What it shows: Interactive version of the feature importance ranking. Hover over the bars to see details for each variable, including its clinical category and description. Blood pressure variables dominate, validating established clinical knowledge.
What it shows: How sensitivity and specificity change across thresholds. The dashed vertical lines indicate the three recommended clinical scenarios. Note the trade-off: increasing sensitivity reduces specificity and vice versa. The crossing point represents the "optimal" threshold for balancing both.
Production pipeline on AWS with high availability
A Machine Learning model only creates value when it is available for real use. Deploying on AWS allows
the system to be accessed from anywhere, 24/7, with high availability and low latency.
The serverless architecture (no dedicated servers) reduces costs and
scales automatically with demand, making the system viable even for institutions with limited resources.
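The source does not name the AWS services involved. Assuming an AWS Lambda function behind an API Gateway proxy integration, a request handler could look roughly like the sketch below. Everything model-related here is a placeholder: `_DEMO_WEIGHTS`, `_DEMO_BIAS`, `THRESHOLD`, and `_demo_probability` are hypothetical stand-ins for loading and calling the trained model, and only two of the dataset's variables are used.

```python
import json
import math

# Hypothetical coefficients standing in for the trained model; a real
# deployment would load the serialized model instead of using these.
_DEMO_WEIGHTS = {"pressao_sistolica": 0.05, "idade": 0.02}
_DEMO_BIAS = -8.0
THRESHOLD = 0.5  # assumed decision cutoff


def _demo_probability(features):
    """Placeholder scoring function (logistic form), not the project's model."""
    z = _DEMO_BIAS + sum(
        weight * float(features[name]) for name, weight in _DEMO_WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def lambda_handler(event, context):
    """Handle a proxy-style request carrying patient features as a JSON body."""
    try:
        features = json.loads(event.get("body") or "{}")
        probability = _demo_probability(features)
    except (KeyError, TypeError, ValueError) as exc:
        # Missing fields, non-numeric values or malformed JSON.
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(
            {
                "probability": round(probability, 4),
                "at_risk": probability >= THRESHOLD,
            }
        ),
    }
```

The handler can be exercised locally by calling `lambda_handler({"body": '{"pressao_sistolica": 160, "idade": 60}'}, None)` and inspecting the returned status code and body.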
Prof. Dr. Anderson Henrique Rodrigues Ferreira
Advisor and Developer
CEUNSP - Centro Universitario Nossa Senhora do Patrocinio
Scientific articles that support the methodology used
[1] M. Kivrak, U. Avci, H. Uzun, and C. Ardic, "The impact of the SMOTE method on
machine learning and ensemble learning performance results in addressing class imbalance in data
used for predicting total testosterone deficiency in type 2 diabetes patients,"
Diagnostics, vol. 14, no. 23, Art. no. 2634, Nov. 2024.
Access link: https://doi.org/10.3390/diagnostics14232634
[2] A. Fernández, S. García, F. Herrera, and N. V. Chawla, "SMOTE for learning from
imbalanced data: Progress and challenges, marking the 15-year anniversary," J. Artif.
Intell. Res., vol. 61, pp. 863–905, 2018.
Access link: https://doi.org/10.1613/jair.1.11192
[3] M. Talebi Moghaddam, Y. Jahani, Z. Arefzadeh, A. Dehghan, M. Khaleghi, M.
Sharafi, and G. Nikfar, "Predicting diabetes in adults: Identifying important features in
unbalanced data over a 5-year cohort study using machine learning algorithm," BMC Med. Res.
Methodol., vol. 24, Art. no. 220, Sep. 2024.
Access link: https://doi.org/10.1186/s12874-024-02341-z
[4] Y. Li, Y. Yang, P. Song, L. Duan, and R. Ren, "An improved SMOTE algorithm for
enhanced imbalanced data classification by expanding sample generation space," Sci.
Rep., vol. 15, Art. no. 23521, Jul. 2025.
Access link: https://doi.org/10.1038/s41598-025-09506-w
[5] J. Zhu, S. Pu, J. He, D. Su, W. Cai, X. Xu, and H. Liu, "Processing imbalanced
medical data at the data level with assisted-reproduction data as an example," BioData
Mining, vol. 17, Art. no. 29, Sep. 2024.
Access link: https://doi.org/10.1186/s13040-024-00384-y
[6] World Health Organization, "Global report on hypertension: The race against a
silent killer," Geneva: WHO, Sep. 2023.
Access link: https://www.who.int/teams/noncommunicable-diseases/hypertension-report
[7] NCD Risk Factor Collaboration (NCD-RisC), "Worldwide trends in hypertension
prevalence and progress in treatment and control from 1990 to 2019: a pooled analysis of 1201
population-representative studies with 104 million participants," The Lancet, vol. 398,
no. 10304, pp. 957–980, Sep. 2021.
Access link: https://doi.org/10.1016/S0140-6736(21)01330-1
[8] World Health Organization, "Hypertension: Key facts," WHO Fact Sheets, Mar.
2023. [Online].
Access link: https://www.who.int/news-room/fact-sheets/detail/hypertension