Preprint / Version 3

An Intelligent System for Early Prediction of Cardiovascular Disease using Machine Learning


Aarush Kachhawa, Saint Francis High School



machine learning, classification models, cardiovascular disease prediction, supervised machine learning


Cardiovascular disease (CVD) remains the leading cause of death worldwide, responsible for 18.6 million deaths in 2019. Because several effective treatment options are widely available, early diagnosis of CVD is critical for timely intervention and for slowing the progression of the disease. CVD is associated with a multitude of risk markers that interact non-linearly, which makes accurate diagnosis challenging, especially for non-specialized clinicians and under-resourced facilities in developing countries. In recent years, machine-learning-based computational techniques have shown great promise as diagnostic tools. The goal of this research is to apply multiple machine learning methods, including random forest, gradient boosting, logistic regression, and an artificial neural network, and to evaluate their predictive performance. This study also evaluates the feasibility of combining multiple UCI datasets to improve the prediction accuracy of the models. On a merged dataset of over 700 patients from the UCI Machine Learning Repository, the most accurate model was the random forest classifier, with an accuracy and F1 score of 94% each and an AUC of 0.98. Ensemble learning methods, combined with data optimization and hyperparameter tuning, achieved higher accuracy than prior published studies on these datasets. Finally, this study proposes how these machine learning workloads can be incorporated into a distributed, cloud-connected healthcare system so that they are widely accessible to practicing doctors, who can use them to assess the CVD risk of their patients.
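For readers less familiar with the three metrics reported above (accuracy, F1 score, and AUC), the following is a minimal sketch of how each can be computed for a binary classifier. The labels and scores are made-up toy values, not data from this study, and the helper names (`confusion_counts`, `accuracy`, `f1_score`, `roc_auc`) and the 0.5 decision threshold are assumptions chosen for illustration rather than the author's implementation.

```python
# Toy illustration only: the labels and scores below are invented,
# not taken from the study's UCI datasets.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means disease present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and model risk scores for eight patients.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]

# Assumed decision threshold of 0.5 to turn scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy and F1 depend on the chosen threshold, whereas AUC is computed from the raw scores and summarizes ranking quality across all thresholds, which is why the two kinds of figure are usually reported together.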





2022-12-05 — Updated on 2022-12-24