Multi-Model Machine Learning Identifies MAPT, CHEK1, AURKA as Breast Cancer Prognostic Markers
DOI:
https://doi.org/10.58445/rars.3867
Keywords:
machine learning, breast cancer, computational biology
Abstract
Breast cancer is the most common cancer among women globally and the second leading cause of cancer death. While early detection improves survival rates, personalizing treatment based on genomic biomarkers could further increase survival duration. This project applies a multi-model machine learning approach to identify key gene biomarkers that correlate with breast cancer survival duration, with the aim of helping physicians personalize treatment. I hypothesized that different models contribute complementary findings: linear regression captures proportional gene-survival relationships, random forest reveals non-linear interactions, and a neural network can model more complex patterns on larger datasets but was anticipated to underperform on the small METABRIC dataset (n=1904 patients). Using the METABRIC dataset, I preprocessed gene expression and clinical data by imputing missing values and encoding categorical variables. I trained three models (ElasticNet linear regression, a random forest, and a PyTorch neural network) on a 70/15/15 train/validation/test split to predict overall survival time. Random forest achieved the best test performance (r²=0.147), followed by linear regression (r²=0.138), while the neural network underperformed (r²=0.052) as anticipated. The models identified MAPT, CHEK1, and AURKA as top gene biomarkers strongly associated with survival duration, consistent with published cancer genomics research. The analysis focused on genomic factors, yet survival duration also depends on age, comorbidities, lifestyle, and unrelated causes of death, which introduce variance that gene expression alone cannot explain. The consistent under-prediction of survival durations aligns with this limitation and validates the integrity of this genomic-focused approach. These findings demonstrate that complementary models can uncover actionable genomic biomarkers, offering a pathway toward personalized breast cancer treatment and improved prognosis.
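The evaluation setup described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the project's actual code: the function names `split_indices` and `r_squared`, the shuffling seed, and the rounding rule for the split sizes are all assumptions made for illustration. It shows how 1904 patients might be divided 70/15/15 and how the r² metric reported above is computed from true and predicted survival times.

```python
import random


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    Hypothetical helper: the fractions follow the 70/15/15 split named in the
    abstract; the seed and the floor-based rounding are assumptions.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, roughly 15%
    return train, val, test


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Split the 1904 METABRIC patients by index.
train, val, test = split_indices(1904)
print(len(train), len(val), len(test))
```

Under this definition, an r² of 0 corresponds to always predicting the mean survival time, so a test r² of about 0.15 means the model accounts for roughly 15% of the variance in held-out survival durations.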
References
Ali, H. R., Dawson, S.-J., Blows, F. M., Provenzano, E., Pharoah, P. D., & Caldas, C. (2012). Aurora kinase A outperforms Ki67 as a prognostic marker in ER-positive breast cancer. British Journal of Cancer, 106(11), 1798–1806. https://doi.org/10.1038/bjc.2012.167
Al-kaabi, M. M., Alshareeda, A. T., Jerjees, D. A., Muftah, A. A., Green, A. R., Alsubhi, N. H., Nolan, C. C., Chan, S., Cornford, E., Madhusudan, S., Ellis, I. O., & Rakha, E. A. (2015). Checkpoint kinase1 (CHK1) is an important biomarker in breast cancer having a role in chemotherapy response. British Journal of Cancer, 112(5), 901–911. https://doi.org/10.1038/bjc.2014.576
Bonneau, C., Gurard-Levin, Z. A., Andre, F., Pusztai, L., & Rouzier, R. (2015). Predictive and Prognostic Value of the Tau Protein in Breast Cancer. Anticancer Research, 35(10), 5179–5184.
Callari, M., Sola, M., Magrin, C., Rinaldi, A., Bolis, M., Paganetti, P., Colnaghi, L., & Papin, S. (2023). Cancer-specific association between Tau (MAPT) and cellular pathways, clinical outcome, and drug response. Scientific Data, 10(1), 637. https://doi.org/10.1038/s41597-023-02543-y
Cancer Facts & Figures. (2026). American Cancer Society. https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annual-cancer-facts-and-figures/2026/2026-cancer-facts-and-figures.pdf
Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 34(2), 187–202. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x
Fong, Y., Evans, J., Brook, D., Kenkre, J., Jarvis, P., & Gower-Thomas, K. (2015). The Nottingham Prognostic Index: Five- and ten-year data for all-cause Survival within a Screened Population. The Annals of The Royal College of Surgeons of England, 97(2), 137–139. https://doi.org/10.1308/003588414X14055925060514
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. The MIT Press.
METABRIC Group, Curtis, C., Shah, S. P., Chin, S.-F., Turashvili, G., Rueda, O. M., Dunning, M. J., Speed, D., Lynch, A. G., Samarajiwa, S., Yuan, Y., Gräf, S., Ha, G., Haffari, G., Bashashati, A., Russell, R., McKinney, S., Langerød, A., Green, A., … Aparicio, S. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403), 346–352. https://doi.org/10.1038/nature10983
Pereira, B., Chin, S.-F., Rueda, O. M., Vollan, H.-K. M., Provenzano, E., Bardwell, H. A., Pugh, M., Jones, L., Russell, R., Sammut, S.-J., Tsui, D. W. Y., Liu, B., Dawson, S.-J., Abraham, J., Northen, H., Peden, J. F., Mukherjee, A., Turashvili, G., Green, A. R., … Caldas, C. (2016). The somatic mutation profiles of 2,433 breast cancers refine their genomic and transcriptomic landscapes. Nature Communications, 7(1), 11479. https://doi.org/10.1038/ncomms11479
Siegel, R. L., Kratzer, T. B., Wagle, N. S., Sung, H., & Jemal, A. (2026). Cancer statistics, 2026. CA: A Cancer Journal for Clinicians, 76(1), e70043. https://doi.org/10.3322/caac.70043
The Cancer Genome Atlas Research Network, Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., Shmulevich, I., Sander, C., & Stuart, J. M. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10), 1113–1120. https://doi.org/10.1038/ng.2764
Wu, M., Pang, J.-S., Sun, Q., Huang, Y., Hou, J.-Y., Chen, G., Zeng, J.-J., & Feng, Z.-B. (2019). The clinical significance of CHEK1 in breast cancer: A high-throughput data analysis and immunohistochemical study. International Journal of Clinical and Experimental Pathology, 12(1), 1–20.
License
Copyright (c) 2026 Research Archive of Rising Scholars

This work is licensed under a Creative Commons Attribution 4.0 International License.