Preprint / Version 1

Multi-Model Machine Learning Identifies MAPT, CHEK1, AURKA as Breast Cancer Prognostic Markers

Authors

  • Jasmine Chan, The Bishop's School

DOI:

https://doi.org/10.58445/rars.3867

Keywords:

machine learning, breast cancer, computational biology

Abstract

Breast cancer is the most common cancer among women globally and the second leading cause of cancer death. While early detection improves survival rates, personalizing treatment based on genomic biomarkers could further extend survival. This project applies a multi-model machine learning approach to identify gene biomarkers that correlate with breast cancer survival duration, with the aim of helping physicians personalize treatment. I hypothesized that different models would contribute complementary findings: linear regression captures proportional gene-survival relationships, random forest reveals non-linear interactions, and a neural network can model more complex patterns on larger datasets but was anticipated to underperform on the small METABRIC dataset (n=1904 patients). I preprocessed the METABRIC gene expression and clinical data by imputing missing values and encoding categorical variables. I then trained three models (ElasticNet linear regression, random forest, and a PyTorch neural network) on a 70/15/15 train/validation/test split to predict overall survival time. Random forest achieved the best test performance (r²=0.147), followed by linear regression (r²=0.138), while the neural network underperformed (r²=0.052), as anticipated. The models identified MAPT, CHEK1, and AURKA as the top gene biomarkers strongly associated with survival duration, consistent with published cancer genomics research. The analysis focused on genomic factors, yet survival duration also depends on age, comorbidities, lifestyle, and unrelated causes of death, which introduce variance that gene expression cannot explain. The models' consistent under-prediction of survival durations is in line with this limitation of a genomics-focused approach. These findings suggest that complementary models can uncover actionable genomic biomarkers, offering a pathway toward personalized breast cancer treatment and improved prognosis.
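To make the evaluation setup concrete, the following is a minimal sketch of a 70/15/15 split and of the r² metric, assuming the reported r² is the coefficient of determination (1 minus the ratio of residual to total sum of squares). The function names `split_indices` and `r_squared`, the random seed, the rounding rule for split sizes, and the toy survival times are illustrative assumptions, not taken from the preprint; only the 70/15/15 proportions and n=1904 come from the abstract.

```python
import random


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    Sizes are rounded down for train and validation; the test set takes
    the remainder (an assumption; the preprint does not state its rounding).
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Split sizes for a cohort of 1904 patients.
train, val, test = split_indices(1904)
print(len(train), len(val), len(test))  # 1332 285 287

# Toy survival times in months (made up for illustration only).
y = [10.0, 20.0, 30.0, 40.0]
baseline = [sum(y) / len(y)] * len(y)  # always predict the mean
print(r_squared(y, baseline))  # 0.0
print(r_squared(y, y))         # 1.0
```

Under this definition, a model that always predicts the mean survival time scores 0 and a perfect model scores 1, so a test value of 0.147 would mean the model explains roughly 15% of the variance in survival time on held-out patients.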

References

Ali, H. R., Dawson, S.-J., Blows, F. M., Provenzano, E., Pharoah, P. D., & Caldas, C. (2012). Aurora kinase A outperforms Ki67 as a prognostic marker in ER-positive breast cancer. British Journal of Cancer, 106(11), 1798–1806. https://doi.org/10.1038/bjc.2012.167

Al-kaabi, M. M., Alshareeda, A. T., Jerjees, D. A., Muftah, A. A., Green, A. R., Alsubhi, N. H., Nolan, C. C., Chan, S., Cornford, E., Madhusudan, S., Ellis, I. O., & Rakha, E. A. (2015). Checkpoint kinase1 (CHK1) is an important biomarker in breast cancer having a role in chemotherapy response. British Journal of Cancer, 112(5), 901–911. https://doi.org/10.1038/bjc.2014.576

Bonneau, C., Gurard-Levin, Z. A., Andre, F., Pusztai, L., & Rouzier, R. (2015). Predictive and Prognostic Value of the Tau Protein in Breast Cancer. Anticancer Research, 35(10), 5179–5184.

Callari, M., Sola, M., Magrin, C., Rinaldi, A., Bolis, M., Paganetti, P., Colnaghi, L., & Papin, S. (2023). Cancer-specific association between Tau (MAPT) and cellular pathways, clinical outcome, and drug response. Scientific Data, 10(1), 637. https://doi.org/10.1038/s41597-023-02543-y

Cancer Facts & Figures. (2026). American Cancer Society. https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annual-cancer-facts-and-figures/2026/2026-cancer-facts-and-figures.pdf

Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 34(2), 187–202. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x

Fong, Y., Evans, J., Brook, D., Kenkre, J., Jarvis, P., & Gower-Thomas, K. (2015). The Nottingham Prognostic Index: Five- and ten-year data for all-cause Survival within a Screened Population. The Annals of The Royal College of Surgeons of England, 97(2), 137–139. https://doi.org/10.1308/003588414X14055925060514

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. The MIT Press.

METABRIC Group, Curtis, C., Shah, S. P., Chin, S.-F., Turashvili, G., Rueda, O. M., Dunning, M. J., Speed, D., Lynch, A. G., Samarajiwa, S., Yuan, Y., Gräf, S., Ha, G., Haffari, G., Bashashati, A., Russell, R., McKinney, S., Langerød, A., Green, A., … Aparicio, S. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403), 346–352. https://doi.org/10.1038/nature10983

Pereira, B., Chin, S.-F., Rueda, O. M., Vollan, H.-K. M., Provenzano, E., Bardwell, H. A., Pugh, M., Jones, L., Russell, R., Sammut, S.-J., Tsui, D. W. Y., Liu, B., Dawson, S.-J., Abraham, J., Northen, H., Peden, J. F., Mukherjee, A., Turashvili, G., Green, A. R., … Caldas, C. (2016). The somatic mutation profiles of 2,433 breast cancers refine their genomic and transcriptomic landscapes. Nature Communications, 7(1), 11479. https://doi.org/10.1038/ncomms11479

Siegel, R. L., Kratzer, T. B., Wagle, N. S., Sung, H., & Jemal, A. (2026). Cancer statistics, 2026. CA: A Cancer Journal for Clinicians, 76(1), e70043. https://doi.org/10.3322/caac.70043

The Cancer Genome Atlas Research Network, Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., Shmulevich, I., Sander, C., & Stuart, J. M. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10), 1113–1120. https://doi.org/10.1038/ng.2764

Wu, M., Pang, J.-S., Sun, Q., Huang, Y., Hou, J.-Y., Chen, G., Zeng, J.-J., & Feng, Z.-B. (2019). The clinical significance of CHEK1 in breast cancer: A high-throughput data analysis and immunohistochemical study. International Journal of Clinical and Experimental Pathology, 12(1), 1–20.

Posted

2026-06-07