Preprint / Version 1

Machine Learning to Detect Conflicting Clinical Classifications of Genetic Variants

Authors

  • Vallerie Cheng

DOI:

https://doi.org/10.58445/rars.564

Keywords:

machine learning, genetic variants, clinical classification

Abstract

Identifying variants with conflicting clinical classifications of pathogenicity is important because diagnoses directly affect patients' treatment plans. Machine learning models can effectively categorize and analyze multidimensional data, and incorporating feature selection algorithms into these models allows us to identify relationships between clinical features that may contribute to conflicting classifications. To address this need, we use the ClinVar dataset, a public archive of annotations of human genetic variants. In ClinVar, researchers manually categorize variants into five classes: benign, likely benign, uncertain significance, likely pathogenic, and pathogenic. However, annotations are inconsistent across clinical laboratories, which can create confusion when assessing the impact of a variant on a patient's condition. This project proposes a machine learning model trained on the ClinVar dataset to address this issue, together with feature selection to identify the variant properties that are most predictive of conflicting pathogenicity labels. The model leverages variant annotations such as genetic features, clinical data, and other critical information to identify patterns and relationships that can help harmonize conflicting classifications. We trained a random forest model and studied the importance of the input features using both tree-based and lasso feature selection. The five most significant features according to tree-based feature selection, which inherently handles nonlinear relationships between features, are (1) the variant deleteriousness score, (2) allele frequencies from ExAC, (3) the Phred-scaled score, (4) LoFtool’s gene intolerance score, and (5) allele frequencies from GO-ESP.
Using machine learning to identify conflicting clinical classifications of genetic variants helps ensure precise and consistent variant interpretation. This, in turn, plays a crucial role in improving clinical genomics, making diagnoses more accurate, and enabling personalized treatment options.
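
As a rough illustration of the two feature selection approaches named in the abstract, the sketch below ranks features with random forest importances and, separately, keeps the features that survive an L1 (lasso) penalty. It is not the author's pipeline: the data are synthetic, the feature names are hypothetical stand-ins for the annotations mentioned above, and the lasso step regresses the 0/1 label directly as a simplifying assumption. It relies on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names (assumption): stand-ins for the kinds of
# annotations named in the abstract, plus two pure-noise columns.
FEATURES = ["deleteriousness_raw", "af_exac", "deleteriousness_phred",
            "loftool", "af_esp", "noise_1", "noise_2"]


def make_synthetic_variants(n=2000, seed=0):
    """Build synthetic data; this is NOT ClinVar data.

    The label depends linearly on the first feature and nonlinearly
    (through an absolute value) on the second; all other columns are noise.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, len(FEATURES)))
    signal = 2.0 * X[:, 0] + 1.5 * (np.abs(X[:, 1]) - 0.8)
    y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y


def rank_features(X, y, seed=0, alpha=0.05):
    """Return (tree-based ranking, features kept by the lasso)."""
    # Tree-based selection: impurity-based importances from a random forest.
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    tree_rank = sorted(zip(FEATURES, forest.feature_importances_),
                       key=lambda pair: -pair[1])

    # Lasso selection: an L1-penalized linear model on standardized inputs;
    # features whose coefficient is shrunk to zero are dropped.
    X_scaled = StandardScaler().fit_transform(X)
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_scaled, y)
    lasso_kept = [name for name, coef in zip(FEATURES, lasso.coef_)
                  if abs(coef) > 1e-8]
    return tree_rank, lasso_kept


if __name__ == "__main__":
    X, y = make_synthetic_variants()
    tree_rank, lasso_kept = rank_features(X, y)
    for name, importance in tree_rank:
        print(f"{name:24s} {importance:.3f}")
    print("kept by lasso:", lasso_kept)
```

On data constructed this way, the forest can pick up the second feature's nonlinear effect, while a linear lasso model has little linear signal to detect there. That is one reason the two methods can disagree about which features matter.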

References

Academic.oup.com. (n.d.). https://academic.oup.com/bioinformatics/article/33/4/471/2525582

Combined annotation dependent depletion. CADD. (n.d.). https://cadd.gs.washington.edu/

Ensembl variation - calculated variant consequences. Calculated consequences. (n.d.-a). https://useast.ensembl.org/info/genome/variation/prediction/predicted_data.html#consequences

Favalli, V., Tini, G., Bonetti, E., Vozza, G., Guida, A., Gandini, S., ... & Mazzarella, L. (2021). Machine learning-based reclassification of germline variants of unknown significance: The RENOVO algorithm. The American Journal of Human Genetics, 108(4), 682-695.

Glossary. ROSALIND. (n.d.). https://rosalind.info/glossary/blosum62/

Karczewski, K. J., Weisburd, B., Thomas, B., Solomonson, M., Ruderfer, D. M., Kavanagh, D., Hamamsy, T., Lek, M., Samocha, K. E., Cummings, B. B., Birnbaum, D., The Exome Aggregation Consortium, Daly, M. J., & MacArthur, D. G. (2017, January 4). The ExAC browser: Displaying reference data information from over 60 000 exomes. Nucleic acids research. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210650/#B3

Larrea-Sebal, A., Benito-Vicente, A., Fernandez-Higuero, J. A., Jebari-Benslaiman, S., Galicia-Garcia, U., Uribe, K. B., ... & Martín, C. (2021). MLb-LDLr: a machine learning model for predicting the pathogenicity of LDLr missense variants. Basic to Translational Science, 6(11), 815-827.

Leblanc, E., Washington, P., Varma, M., Dunlap, K., Penev, Y., Kline, A., & Wall, D. P. (2020). Feature replacement methods enable reliable home video analysis for machine learning detection of autism. Scientific reports, 10(1), 21245.

Mahecha, D., Nuñez, H., Lattig, M. C., & Duitama, J. (2022). Machine learning models for accurate prioritization of variants of uncertain significance. Human Mutation, 43(4), 449-460.

NHLBI Grand Opportunity Exome Sequencing Project (ESP). (n.d.). https://esp.gs.washington.edu/drupal/

Nicora, G., Zucca, S., Limongelli, I., Bellazzi, R., & Magni, P. (2022). A machine learning approach based on ACMG/AMP guidelines for genomic variant classification and prioritization. Scientific reports, 12(1), 2517.

Tariq, Q., Daniels, J., Schwartz, J. N., Washington, P., Kalantarian, H., & Wall, D. P. (2018). Mobile detection of autism through machine learning on home video: A development and prospective validation study. PLoS medicine, 15(11), e1002705.

Tariq, Q., Fleming, S. L., Schwartz, J. N., Dunlap, K., Corbin, C., Washington, P., ... & Wall, D. P. (2019). Detecting developmental delay and autism through machine learning models using home videos of Bangladeshi children: Development and validation study. Journal of medical Internet research, 21(4), e13822.

U.S. National Library of Medicine. (n.d.). Clinvar. National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/clinvar/

Variant effect predictor other information. Other information. (n.d.). https://useast.ensembl.org/info/docs/tools/vep/script/vep_other.html

Washington, P., Chrisman, B., Leblanc, E., Dunlap, K., Kline, A., Mutlu, C., ... & Wall, D. P. (2022). Crowd annotations can approximate clinical autism impressions from short home videos with privacy protections. Intelligence-based medicine, 6, 100056.

Washington, P., Kalantarian, H., Tariq, Q., Schwartz, J., Dunlap, K., Chrisman, B., ... & Wall, D. P. (2019). Validity of online screening for autism: crowdsourcing study comparing paid and unpaid diagnostic tasks. Journal of medical Internet research, 21(5), e13668.

Washington, P., Leblanc, E., Dunlap, K., Penev, Y., Kline, A., Paskov, K., ... & Wall, D. P. (2020). Precision telemedicine through crowdsourced machine learning: testing variability of crowd workers for video-based autism feature recognition. Journal of personalized medicine, 10(3), 86.

Washington, P., Leblanc, E., Dunlap, K., Penev, Y., Varma, M., Jung, J. Y., ... & Wall, D. P. (2020). Selection of trustworthy crowd workers for telemedical diagnosis of pediatric autism spectrum disorder. In Biocomputing 2021: proceedings of the Pacific symposium (pp. 14-25).

Washington, P., Park, N., Srivastava, P., Voss, C., Kline, A., Varma, M., ... & Wall, D. P. (2020). Data-driven diagnostics and the potential of mobile artificial intelligence for digital therapeutic phenotyping in computational psychiatry. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 5(8), 759-769.

Washington, P., Paskov, K. M., Kalantarian, H., Stockham, N., Voss, C., Kline, A., ... & Wall, D. P. (2019). Feature selection and dimension reduction of social autism data. In Pacific Symposium on Biocomputing 2020 (pp. 707-718).

Washington, P., Tariq, Q., Leblanc, E., Chrisman, B., Dunlap, K., Kline, A., ... & Wall, D. P. (2020). Crowdsourced feature tagging for scalable and privacy-preserved autism diagnosis. medRxiv, 2020-12.

Washington, P., & Wall, D. P. (2023). A Review of and Roadmap for Data Science and Machine Learning for the Neuropsychiatric Phenotype of Autism. Annual Review of Biomedical Data Science, 6.

Posted

2023-10-08