Preprint / Version 1

Utilizing a Multimodal Deep Learning Model to Identify Parkinson’s Disease from Voice Samples

Authors

  • Andrew Oh, Crescenta Valley High School
  • Grace Kim, iANT Education

DOI:

https://doi.org/10.58445/rars.3170

Keywords:

Parkinson’s Disease, voice analysis, multimodal deep learning, acoustic features, log-mel spectrograms, transformer models

Abstract

Dysarthria, a motor speech disorder that impairs control of the speech muscles, is a common early symptom of Parkinson’s Disease (PD), which makes voice analysis a promising tool for early detection. Acoustic biomarkers derived from sustained vowel phonations have shown potential for PD detection. This study develops a multimodal transformer model that integrates engineered acoustic features with log-Mel spectrogram embeddings to classify PD from sustained /a/ phonations. Eight acoustic features were extracted using Librosa and Parselmouth and processed through a numerical branch: mean fundamental frequency (F0), local jitter, local shimmer, detrended fluctuation analysis (DFA) exponent, pitch period entropy (PPE), recurrence period density entropy (RPDE), pitch variability, and harmonics-to-noise ratio (HNR). In parallel, log-Mel spectrograms were encoded with pretrained CNN backbones (ResNet18, EfficientNet_B0, MobileNet_V3_Large) in an image branch. A transformer layer fused the two modalities, and a final classifier predicted PD versus healthy controls (HC). Two datasets were combined: D1 with 81 recordings (41 HC, 40 PD; mean ages 47.7 ± 14.3 and 67.0 ± 9.0 years, respectively) and D2 with 99 recordings (44 HC, 55 PD; mean ages 67.1 ± 5.2 and 67.2 ± 8.7 years, respectively), yielding 180 recordings (85 HC, 95 PD). The data were split into 80% training, 10% validation, and 10% testing. Models were trained for 10 epochs with a learning rate of 1e-4 and a batch size of 32, over 5 independent runs. The best model achieved an accuracy of 0.93 ± 0.07, precision of 0.96 ± 0.05, recall of 0.91 ± 0.08, and F1-score of 0.94 ± 0.06, with stable convergence of training and validation loss. These findings suggest that lightweight multimodal fusion of engineered acoustic and spectrographic features outperforms single-modality baselines and holds promise for scalable, noninvasive voice-based PD screening.
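
The evaluation protocol described in the abstract (an 80/10/10 split of the 180 recordings, then accuracy, precision, recall, and F1 reported as mean ± standard deviation over 5 runs) can be illustrated with a minimal standard-library sketch. This is not the authors' code. The function names, the per-class (stratified) shuffling, the fixed seed, the choice of PD as the positive class, and the use of the sample standard deviation are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
import statistics


def stratified_split(labels, percents=(80, 10, 10), seed=0):
    """Split indices into train/val/test, shuffling each class separately.

    Assumption: the abstract does not say the split was stratified or seeded.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = len(indices) * percents[0] // 100
        n_val = len(indices) * percents[1] // 100
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 with `positive` as the PD class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def mean_and_std(values):
    """Mean and sample standard deviation of one metric across runs."""
    return statistics.mean(values), statistics.stdev(values)


# Class counts from the abstract: 85 HC (label 0) and 95 PD (label 1).
labels = [0] * 85 + [1] * 95
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With these class counts and per-class flooring, the sketch yields 144 training, 17 validation, and 19 test recordings. A test set of this size means each misclassified recording shifts accuracy by roughly five percentage points, which is one reason to report the spread across runs alongside the mean.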

Posted

2025-10-05