Speech as an Emotional Load Biomarker in Clinical Applications
DOI: https://doi.org/10.24950/rspmi.2587
Keywords: Biomarkers, Emotions, Machine Learning, Speech
Abstract
Introduction: Healthcare professionals often contend with significant emotional burdens in their work, including negative emotions such as stress and anxiety, which can have profound consequences for both immediate and long-term healthcare delivery. In this paper, a stress estimation algorithm is proposed, based on the classification of negative-valence emotions in speech recordings.
Methods: An end-to-end machine learning pipeline is proposed. Two distinct decision models are considered, VGG-16 and SqueezeNet, both operating on a common constant-Q power spectrogram as the acoustic input representation. The system is trained and evaluated on the RAVDESS and TESS emotional speech datasets.
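The abstract names the building blocks but not the implementation details. The sketch below outlines how such a pipeline could be assembled with librosa and PyTorch/torchvision (both cited in the references); the sample rate, spectrogram scaling, input size, and classifier-head replacement are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch: constant-Q power spectrogram -> SqueezeNet classifier.
# Hyperparameters (sample rate, bins, image size) are illustrative assumptions,
# not the values reported in the paper.
import librosa
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import squeezenet1_1

NUM_CLASSES = 8  # e.g., the eight RAVDESS emotion categories

def cqt_power_spectrogram(path, sr=22050):
    """Load a speech recording and compute a constant-Q power spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    cqt = librosa.cqt(y, sr=sr)                  # complex constant-Q transform
    power = np.abs(cqt) ** 2                     # power spectrogram
    return librosa.power_to_db(power, ref=np.max)  # log scale for the CNN input

def to_model_input(spec, size=224):
    """Scale to [0, 1], replicate to 3 channels, and resize for an ImageNet CNN."""
    x = torch.tensor(spec, dtype=torch.float32)
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    x = x.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)  # shape (1, 3, H, W)
    return nn.functional.interpolate(x, size=(size, size), mode="bilinear",
                                     align_corners=False)

model = squeezenet1_1(weights="DEFAULT")  # ImageNet-pretrained backbone
# Replace the final 1x1 classifier convolution to output emotion classes.
model.classifier[1] = nn.Conv2d(512, NUM_CLASSES, kernel_size=1)
model.eval()

with torch.no_grad():
    logits = model(to_model_input(cqt_power_spectrogram("speech.wav")))
    predicted = logits.argmax(dim=1)
```

Replacing only the final 1x1 convolution keeps SqueezeNet's small parameter count intact, which is the footprint advantage the results section highlights.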
Results: The system was evaluated for individual emotion classification (a multiclass problem) and for negative versus neutral-or-positive emotion classification (a binary problem). The results are comparable to previously reported systems, with the SqueezeNet model offering a significantly smaller footprint that enables versatile deployment. Further exploration of the model's parameter space holds promise for improved performance.
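As an illustration of the binary regrouping, a possible valence-based mapping of the RAVDESS/TESS categories is sketched below; the exact grouping used by the authors (notably where "surprised" and "calm" fall) is an assumption here.

```python
# Hypothetical valence-based regrouping for the binary task.
# Which side "surprised" and "calm" fall on is an assumption, not taken from the paper.
NEGATIVE = {"angry", "fearful", "disgust", "sad"}
NEUTRAL_OR_POSITIVE = {"neutral", "calm", "happy", "surprised"}

def to_binary_label(emotion: str) -> int:
    """Map a categorical emotion label to 1 (negative) or 0 (neutral/positive)."""
    if emotion in NEGATIVE:
        return 1
    if emotion in NEUTRAL_OR_POSITIVE:
        return 0
    raise ValueError(f"unknown emotion label: {emotion!r}")
```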
Conclusion: The proposed system can constitute a feasible approach for estimating a low-cost, non-invasive biomarker of negative emotions. Such a biomarker makes it possible to raise alerts and to develop actions that mitigate the burden of negative emotions, giving healthcare services an additional management tool for maintaining quality and maximizing availability.
References
Cohen S, Kamarck T, Mermelstein R. Perceived Stress Scale [Internet]. Chicago: APA PsycTests; 1983 [cited 2023 Sep 19]. Available from: https://psycnet.apa.org/doiLanding?doi=10.1037%2Ft02889-000
Loera B, Converso D, Viotti S. Evaluating the Psychometric Properties of the Maslach Burnout Inventory-Human Services Survey (MBI-HSS) among Italian Nurses: How Many Factors Must a Researcher Consider? PLoS One. 2014;9:e114987. doi: 10.1371/journal.pone.0114987.
Coelho L, Reis S, Moreira C, Cardoso H, Sequeira M, Coelho R. Benchmarking Computer-Vision Based Facial Emotion Classification Algorithms While Wearing Surgical Masks. Engineering Proceedings. 2023 (in press).
Vieira FMP, Ferreira MA, Dias D, Cunha JPS. VitalSticker: A novel multimodal physiological wearable patch device for health monitoring. In: 2023 IEEE 7th Portuguese Meeting on Bioengineering (ENBENG). 2023. p. 100–3.
Deepa P, Khilar R. Speech technology in healthcare. Measurement: Sensors. 2022;24:100565.
Vigo I, Coelho L, Reis S. Speech- and Language-Based Classification of Alzheimer’s Disease: A Systematic Review. Bioengineering. 2022;9:27.
Vieira H, Costa N, Sousa T, Reis S, Coelho L. Voice-based classification of amyotrophic lateral sclerosis: where are we and where are we going? A systematic review. Neurodegener Dis. 2019;19:163–70. doi: 10.1159/000506259
Braga D, Madureira AM, Coelho L, Abraham A. Neurodegenerative Diseases Detection Through Voice Analysis. In: Abraham A, Muhuri PK, Muda AK, Gandhi N, editors. Hybrid Intelligent Systems. Cham: Springer International Publishing; 2018. p. 213–23.
Lindquist KA. Emotions Emerge from More Basic Psychological Ingredients: A Modern Psychological Constructionist Model. Emotion Rev. 2013;5:356–68.
Livingstone SR, Russo FA. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS One. 2018;13:e0196391. doi: 10.1371/journal.pone.0196391.
Pichora-Fuller MK, Dupuis K. Toronto emotional speech set (TESS) [Internet]. Borealis; 2020 [cited 2023 Sep 19]. Available from: https://borealisdata.ca/citation?persistentId=doi:10.5683/SP2/E8H2MF
McFee B, Raffel C, Liang D, Ellis DPW, McVicar M, Battenberg E, et al. librosa: Audio and Music Signal Analysis in Python. In: Proceedings of the 14th Python in Science Conference. 2015. p. 18–24.
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library [Internet]. arXiv; 2019 [cited 2023 Sep 26]. Available from: http://arxiv.org/abs/1912.01703
Boersma P, Weenink D. Praat: doing phonetics by computer [Internet]. 2018. [cited 2023 Sep 19]. Available from: http://www.praat.org
Eyben F, Scherer KR, Schuller BW, Sundberg J, André E, Busso C, et al. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Trans Affect Comput. 2016;7:190–202.
Eyben F, Wöllmer M, Schuller B. openSMILE: The Munich Versatile and Fast Open-Source Audio Feature Extractor. In: MM'10: Proceedings of the ACM Multimedia 2010 International Conference. 2010. p. 1459–62.
Cabral JP, Oliveira LC. EmoVoice: a system to generate emotions in speech. In: Interspeech 2006 [Internet]. ISCA; 2006 [cited 2023 Sep 26]. paper Wed2BuP.3. Available from: https://www.isca-speech.org/archive/interspeech_2006/cabral06_interspeech.html
Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition [Internet]. arXiv; 2015 [cited 2023 Sep 26]. Available from: http://arxiv.org/abs/1409.1556
de Lope J, Graña M. An ongoing review of speech emotion recognition. Neurocomputing. 2023;528:1–11.
Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Internet]. arXiv; 2016 [cited 2023 Sep 26]. Available from: http://arxiv.org/abs/1602.07360
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition [Internet]. 2009 [cited 2023 Sep 26]. p. 248–55. Available from: https://ieeexplore.ieee.org/document/5206848
License
Copyright (c) 2024 Internal Medicine
This work is licensed under a Creative Commons Attribution 4.0 International License.