Preprint / Version 1

Exploring the weightage of correlates of Diabetes prediction using Machine learning

Authors

  • Aarush Raheja, Symbiosis International School, Pune

DOI:

https://doi.org/10.58445/rars.3204

Keywords:

Machine Learning, Diabetes prediction

Abstract

The prevalence of Type 1 and Type 2 diabetes is increasing around the world. This alarming trend highlights the need for prediction tools that can help identify risk factors associated with these chronic diseases, so that interventions can be implemented at a much earlier stage in their development. This study investigates two distinct datasets drawn from Kaggle, one focusing on clinical factors and the other on lifestyle factors. We constructed machine learning models (logistic regression with an L1 penalty, logistic regression with an L2 penalty, and a random forest), along with a dummy classifier as a baseline for accuracy, to predict the likelihood of diabetes occurrence given various factors. The models show high predictive accuracy: the L2-penalized logistic regression model achieves 95.297% accuracy, the L1-penalized logistic regression model achieves 96%, and the random forest model achieves 97%. However, the key contribution of this study is the interpretation of these models to determine the most important drivers of their predictions. We find that although well-known factors such as HbA1c level, hypertension, and heart disease have strong associations with diabetes, factors such as mental health, while ranking below BMI and HbA1c level, still have moderate predictive power.
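As an illustration of the workflow the abstract describes (a trivial baseline, a penalized logistic regression, and a look at the fitted weights), the following is a minimal sketch and not the study's code. It assumes synthetic data with two hypothetical standardized features, `strong` and `weak`, in place of the Kaggle datasets, and it uses a plain gradient-descent fit written with the Python standard library rather than any particular machine learning package. Only the L2-penalized model and a majority-class baseline are shown; the L1-penalized and random forest models are omitted.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic(n, seed=0):
    """Hypothetical stand-in data: two standardized features, binary label."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        strong = rng.gauss(0.0, 1.0)  # assumed strong correlate
        weak = rng.gauss(0.0, 1.0)    # assumed weaker correlate
        p = sigmoid(2.5 * strong + 0.5 * weak - 1.0)
        X.append([strong, weak])
        y.append(1 if rng.random() < p else 0)
    return X, y


def fit_logistic_l2(X, y, lam=0.01, lr=0.5, epochs=300):
    """Batch gradient descent on log loss plus (lam / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The penalty shrinks the weights only; the intercept is unpenalized.
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(preds, labels):
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


X, y = make_synthetic(1000)
X_train, y_train = X[:750], y[:750]
X_test, y_test = X[750:], y[750:]

# Baseline: always predict the most frequent training label.
majority = 1 if sum(y_train) * 2 >= len(y_train) else 0
baseline_acc = accuracy([majority] * len(y_test), y_test)

weights, bias = fit_logistic_l2(X_train, y_train)
preds = [
    1 if sigmoid(sum(w * x for w, x in zip(weights, xi)) + bias) >= 0.5 else 0
    for xi in X_test
]
model_acc = accuracy(preds, y_test)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"L2 logistic regression accuracy: {model_acc:.3f}")
print(f"weights (strong, weak): {weights}")
```

Because both synthetic features are on the same scale, the absolute size of each fitted weight gives a rough sense of how much that feature drives the prediction, which is the kind of comparison the abstract refers to when ranking correlates. Real data would need feature scaling before weights could be compared this way.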

 

Posted

2025-10-11