Utilizing a Multimodal Deep Learning Model to Identify Parkinson’s Disease from Voice Samples
DOI:
https://doi.org/10.58445/rars.3170
Keywords:
Parkinson’s Disease, voice analysis, multimodal deep learning, acoustic features, log-mel spectrograms, transformer models
Abstract
Dysarthria, a motor speech disorder that impairs control of the speech muscles, is a common early symptom of Parkinson’s Disease (PD), which makes voice analysis a promising tool for early detection. Acoustic biomarkers derived from sustained vowel phonations have shown potential for PD detection. This study develops a multimodal transformer model that integrates engineered acoustic features with log-Mel spectrogram embeddings to classify PD from sustained /a/ phonations. Eight acoustic features were extracted using Librosa and Parselmouth and processed through a numerical branch: mean fundamental frequency (F0), local jitter, local shimmer, detrended fluctuation analysis (DFA) exponent, pitch period entropy (PPE), recurrence period density entropy (RPDE), pitch variability, and harmonics-to-noise ratio (HNR). In parallel, log-Mel spectrograms were encoded with pretrained CNN backbones (ResNet18, EfficientNet_B0, MobileNet_V3_Large) in an image branch. A transformer layer integrated both modalities, and a final classifier predicted PD vs. healthy controls (HC). Two datasets were combined: D1 with 81 recordings (41 HC, 40 PD; mean ages 47.7 ± 14.3 and 67.0 ± 9.0 years, respectively) and D2 with 99 recordings (44 HC, 55 PD; mean ages 67.1 ± 5.2 and 67.2 ± 8.7 years, respectively), yielding 180 recordings (85 HC, 95 PD). Data were split into 80% training, 10% validation, and 10% testing. Models were trained with a learning rate of 1e-4 and a batch size of 32 for 10 epochs, repeated over 5 independent runs. The best model achieved an accuracy of 0.93 ± 0.07, precision of 0.96 ± 0.05, recall of 0.91 ± 0.08, and F1-score of 0.94 ± 0.06, with stable convergence of training and validation loss. These findings suggest that lightweight multimodal fusion of engineered acoustic and spectrographic features outperforms single-modality baselines and holds promise for scalable, noninvasive voice-based PD screening.
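The dataset arithmetic and the 80/10/10 split described above can be checked with a short sketch. It uses only the recording counts given in the abstract. The helper `split_indices`, the fixed seed, and the plain shuffled split are assumptions made for illustration; the abstract does not state whether the authors stratified by class or grouped recordings by speaker.

```python
import random

# Recording counts reported in the abstract: (healthy controls, PD).
DATASETS = {"D1": (41, 40), "D2": (44, 55)}


def split_indices(n, train_pct=80, val_pct=10, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/val/test lists.

    Hypothetical helper: a plain shuffled split. Stratification or
    speaker grouping is not described in the abstract and is not
    implemented here.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],  # remainder goes to the test set
    )


n_hc = sum(hc for hc, _ in DATASETS.values())
n_pd = sum(pd_count for _, pd_count in DATASETS.values())
n_total = n_hc + n_pd

train, val, test = split_indices(n_total)

print(n_hc, n_pd, n_total)              # 85 95 180
print(len(train), len(val), len(test))  # 144 18 18
# Accuracy change, in percentage points, from one test recording.
print(round(100 / len(test), 1))        # 5.6
```

With 18 test recordings under this split, a single misclassified recording shifts accuracy by about 5.6 percentage points, which is useful context when reading the reported run-to-run spread.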
License
Copyright (c) 2025 Andrew Oh, Grace Kim

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.