Optimizing Emotional Insight through Unimodal and Multimodal Long Short-term Memory Models

Ibrahim, Hemin F. and Loo, Chu K. and Geda, Shreeyash Y. and K. Al-Talabani, Abdulbasit (2024) Optimizing Emotional Insight through Unimodal and Multimodal Long Short-term Memory Models. ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 12 (1). pp. 154-160. ISSN 2410-9355

Text (Research Article): ARO.11477.VOL12.NO1.2024.ISSUE22-PP154-160.pdf - Published Version
Available under License Creative Commons Attribution Non-commercial Share Alike.
Official URL: http://dx.doi.org/10.14500/aro.11477

Abstract

The field of multimodal emotion recognition is gaining increasing popularity as a research area. It involves analyzing human emotions across multiple modalities, such as acoustic, visual, and language. Emotion recognition is more effective as a multimodal learning task than when it relies on a single modality. In this paper, we present unimodal and multimodal long short-term memory (LSTM) models with a class weight parameter technique for emotion recognition on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset. A critical challenge lies in selecting the most effective method for fusing the modalities; to address this, we applied four fusion techniques: early fusion, late fusion, deep fusion, and tensor fusion. These fusion methods improved the performance of multimodal emotion recognition compared with the unimodal approaches. Class imbalance, which can bias a model toward frequent classes, and a poorly chosen fusion method both tend to reduce accuracy on the less frequent emotion classes. Because the number of samples per emotion class in CMU-MOSEI is highly imbalanced, adding a class weight parameter leads our model to outperform the state of the art on all three modalities (acoustic, visual, and language) as well as on all the fusion models. Our proposed model shows a 2-3% performance improvement over state-of-the-art results in the unimodal settings and 2% in the multimodal settings.
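
To make the abstract's two main ideas concrete, the sketch below is a minimal, hypothetical illustration rather than the authors' implementation. It assumes PyTorch, six CMU-MOSEI emotion classes, and illustrative feature dimensions for the acoustic (COVAREP), visual (FACET), and language (GloVe) streams; the names UnimodalLSTM, class_weights, and late_fusion_logits are invented for this example. It shows inverse-frequency class weights feeding a weighted loss (the class weight parameter technique) and contrasts early fusion (concatenating modality features before one LSTM) with late fusion (averaging per-modality logits).

# Minimal sketch (not the authors' code): class-weighted loss for the
# imbalanced CMU-MOSEI labels, plus early vs. late fusion of LSTM encoders.
# Layer sizes and feature dimensions below are illustrative assumptions.

import torch
import torch.nn as nn

NUM_CLASSES = 6  # CMU-MOSEI emotions: happy, sad, angry, fear, disgust, surprise

def class_weights(labels: torch.Tensor) -> torch.Tensor:
    """Inverse-frequency weights so rare emotion classes count more in the loss."""
    counts = torch.bincount(labels, minlength=NUM_CLASSES).float()
    return counts.sum() / (NUM_CLASSES * counts.clamp(min=1))

class UnimodalLSTM(nn.Module):
    """Single-modality encoder: LSTM over the feature sequence, then a classifier."""
    def __init__(self, input_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, NUM_CLASSES)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.lstm(x)      # h: (1, batch, hidden_dim)
        return self.head(h[-1])       # logits per emotion class

# Early fusion: concatenate acoustic, visual, and language features per time
# step and feed a single LSTM. Late fusion: run one LSTM per modality and
# combine the class logits (here by averaging).
acoustic_dim, visual_dim, text_dim = 74, 35, 300   # assumed COVAREP/FACET/GloVe sizes
early_model = UnimodalLSTM(acoustic_dim + visual_dim + text_dim)
late_models = [UnimodalLSTM(d) for d in (acoustic_dim, visual_dim, text_dim)]

def late_fusion_logits(xs):
    return torch.stack([m(x) for m, x in zip(late_models, xs)]).mean(dim=0)

# The weighted loss is what counteracts the class imbalance during training.
labels = torch.randint(0, NUM_CLASSES, (512,))     # placeholder label distribution
criterion = nn.CrossEntropyLoss(weight=class_weights(labels))

Deep fusion and tensor fusion, which typically merge intermediate hidden representations through additional layers and form an outer product of modality embeddings respectively, are omitted from this sketch for brevity.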

Item Type: Article
Uncontrolled Keywords: Multimodal emotion recognition, Long short-term memory model, Class weight technique, Fusion techniques, Imbalanced data handling
Subjects: Q Science > QA Mathematics > QA76 Computer software
Divisions: ARO-The Scientific Journal of Koya University > VOL 12, NO 1 (2024)
Depositing User: Dr Salah Ismaeel Yahya
Date Deposited: 02 Sep 2024 06:58
Last Modified: 02 Sep 2024 06:58
URI: http://eprints.koyauniversity.org/id/eprint/480
