Preprint / Version 1

A Machine Learning Framework for Predicting Protein-Protein Interactions from Sequence-Derived Physicochemical Features

Authors

  • Krithik Alluri, Lenape High School

DOI:

https://doi.org/10.58445/rars.2672

Keywords:

protein-protein interactions (PPIs), machine learning

Abstract

Predicting protein-protein interactions (PPIs) from primary sequence data remains a fundamental challenge in systems biology. Experimental methods are resource-intensive, whereas computational approaches offer a scalable way to map the complex human interactome. This study develops a machine learning framework that both predicts PPIs with high accuracy and uncovers the biochemical principles governing these associations. We constructed a balanced dataset of approximately 1.8 million human protein pairs derived from the BioGRID database. For each protein, we engineered a feature set based on its amino acid composition (AAC) and dipeptide composition (DPC) to represent its global and local physicochemical properties. An Extreme Gradient Boosting (XGBoost) classifier was trained on these features to distinguish interacting from non-interacting pairs. On a large held-out test set, the final model achieved an accuracy of 78.1% and an area under the curve (AUC) of 0.865. To interpret the model’s predictions, we employed SHAP (SHapley Additive exPlanations). This analysis revealed that the predictions were driven overwhelmingly by AAC features. Specifically, a high abundance of hydrophobic residues (e.g., phenylalanine, isoleucine) increased the predicted interaction likelihood, while a high abundance of polar residues (e.g., serine) decreased it. Our work validates an accurate and, critically, interpretable model for PPI prediction. By demonstrating that a machine learning model can independently learn fundamental principles of biophysics, such as the hydrophobic effect, from sequence data alone, we highlight the power of interpretable AI to generate new biological insights from large-scale genomic data.
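
The AAC and DPC features described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: AAC is taken as the fraction of each of the 20 standard residues, DPC as the fraction of each of the 400 ordered overlapping residue pairs, and a protein pair is encoded by concatenating both proteins' vectors. The function names (aac, dpc, pair_features) are hypothetical, and the study's actual preprocessing may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order ("AA", "AC", ..., "YY").
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aac(seq):
    """Amino acid composition: fraction of each standard residue in seq."""
    seq = seq.upper()
    # Non-standard symbols (e.g. X) are ignored in the denominator.
    n = sum(seq.count(a) for a in AMINO_ACIDS)
    return [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: fraction of each overlapping residue pair."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def pair_features(seq_a, seq_b):
    """Assumed pair encoding: concatenate AAC and DPC of both proteins."""
    return aac(seq_a) + dpc(seq_a) + aac(seq_b) + dpc(seq_b)


if __name__ == "__main__":
    # Toy sequences for illustration only, not real proteins.
    features = pair_features("MKFLIS", "ACDA")
    print(len(features))  # 2 * (20 + 400) values per pair
```

Under this assumed encoding, each protein pair yields 840 values, and a matrix of such rows is the kind of input a gradient-boosted classifier such as XGBoost could be trained on. Because every feature corresponds to a named residue or dipeptide, per-feature attributions such as SHAP values can be read directly in biochemical terms.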

Posted

2025-06-27