Preprint / Version 1

Improving Diabetes Prediction Accuracy Using Ensemble Machine Learning Models

Authors

  • Aadit Singh, Monta Vista High School, Independent Researcher

DOI:

https://doi.org/10.58445/rars.3258

Keywords:

HbA1c, diabetes prediction, machine learning, Random Forest, Voting Classifier, Kaggle, classification, glycemic control, ensemble learning, predictive modeling

Abstract

This study investigates the prediction of HbA1c level, a principal biomarker of diabetes control, from patient biographical and health data in a publicly accessible dataset [1]. I first trained regression models, namely Linear Regression [2], Decision Tree Regressor [3], and Random Forest Regressor [4], to predict HbA1c values directly. Because these models performed poorly, most likely owing to data bias and insufficient features, I reframed the task as a classification problem by grouping HbA1c values into meaningful categories. I then implemented a Random Forest Classifier [5], a Decision Tree Classifier [6], K-Nearest Neighbors [7], and an ensemble Voting Classifier [8]. The Voting Classifier achieved the best accuracy, 72.5%, compared with 68.1% for the standalone Random Forest [5]. Model tuning focused on parameters such as the number of trees and the maximum tree depth. A Variance Inflation Factor analysis of the features indicated that multicollinearity was not a major issue. The results show that classification models are better suited to this dataset than regression models and underline the importance of feature engineering and hyperparameter tuning. They also show how predictive tools can help medical personnel approximate HbA1c values without relying solely on costly or time-consuming laboratory testing.
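The workflow summarized in the abstract (binning HbA1c into categories, checking multicollinearity, and combining three classifiers by vote) can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's actual code: the data are synthetic, the cut points 5.7 and 6.5 are assumed because the abstract does not state the category boundaries, the hyperparameter values are placeholders, and hard (majority) voting is assumed because the abstract does not say whether hard or soft voting was used. The Variance Inflation Factor is computed here from the diagonal of the inverse feature correlation matrix, which is equivalent to 1 / (1 - R²) for each feature regressed on the others.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient table (NOT the Kaggle dataset):
# five numeric features and a continuous target rescaled to look like HbA1c (%).
X, y_raw = make_regression(n_samples=600, n_features=5, noise=10.0, random_state=0)
hba1c = 6.1 + (y_raw - y_raw.mean()) / y_raw.std()

# Assumed cut points (hypothetical, not taken from the study):
# class 0: < 5.7, class 1: 5.7 to < 6.5, class 2: >= 6.5.
bin_edges = [5.7, 6.5]
y = np.digitize(hba1c, bin_edges)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Variance Inflation Factor per feature: diagonal of the inverse
# correlation matrix. Values close to 1 suggest little multicollinearity.
vif = np.diag(np.linalg.inv(np.corrcoef(X_train, rowvar=False)))

# Standalone Random Forest baseline; n_estimators and max_depth are the
# kinds of parameters the abstract mentions tuning (values are placeholders).
forest = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=0)
forest.fit(X_train, y_train)
forest_accuracy = accuracy_score(y_test, forest.predict(X_test))

# Hard-voting ensemble of the three classifier families named in the abstract.
# KNN is distance-based, so it is wrapped with a scaler.
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, max_depth=8, random_state=0)),
        ("dt", DecisionTreeClassifier(max_depth=6, random_state=0)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)
ensemble_predictions = voter.predict(X_test)
ensemble_accuracy = accuracy_score(y_test, ensemble_predictions)

print("VIF per feature:", np.round(vif, 2))
print("Random Forest accuracy:", round(forest_accuracy, 3))
print("Voting ensemble accuracy:", round(ensemble_accuracy, 3))
```

On real data the two printed accuracies would be the quantities to compare, as the abstract does for the Random Forest and the Voting Classifier. The numbers produced by this synthetic example say nothing about the reported 68.1% and 72.5%.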

References

AravindPCoder. (2023, November 18). Diabetes dataset. Kaggle.

https://www.kaggle.com/datasets/aravindpcoder/diabetes-dataset?resource=download

Scikit-Learn Developers. (2025). Linear Regression documentation. Scikit-Learn.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Scikit-Learn Developers. (2025). Decision Tree Regressor documentation. Scikit-Learn.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

Scikit-Learn Developers. (2025). Random Forest Regressor documentation. Scikit-Learn.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

Scikit-Learn Developers. (2025). Random Forest Classifier documentation. Scikit-Learn.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Scikit-Learn Developers. (2025). Decision Tree Classifier documentation. Scikit-Learn.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Kartik. (2025, August 23). K-nearest neighbors (KNN). GeeksforGeeks.

https://www.geeksforgeeks.org/machine-learning/k-nearest-neighbours/

Scikit-Learn Developers. (2025). Voting Classifier documentation. Scikit-Learn.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

Alhassan, Z., et al. (2021, May 24). Improving current glycated hemoglobin prediction in adults: Use of machine learning algorithms with electronic health records. JMIR Medical Informatics.

https://pmc.ncbi.nlm.nih.gov/articles/PMC8185616/

Tao, X., Jiang, M., Liu, Y., Hu, Q., Zhu, B., Hu, J., et al. (2023, September 30). Predicting three-month fasting blood glucose and glycated hemoglobin changes in patients with Type 2 diabetes mellitus based on multiple machine learning algorithms. Scientific Reports.

https://doi.org/10.1038/s41598-023-43240-5

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

https://doi.org/10.48550/arXiv.1201.0490

GraphPad by Dotmatics. (n.d.). Linear regression calculator.

https://www.graphpad.com/quickcalcs/linear1/

Tablas-Mejia, I. (2025). Conclusion section for research papers. San José State University Writing Center.

https://www.sjsu.edu/writingcenter/docs/handouts/Conclusion%20Section%20for%20Research%20Papers.pdf

Posted

2025-10-17