A survey of multimodal emotion recognition based on deep learning
刘颖,艾豪,张伟东
Abstract: This survey briefly introduces three unimodal emotion recognition approaches, covering text, speech, and facial expressions, and summarizes the multimodal emotion datasets in common use. Based on an analysis of current research on deep-learning-based multimodal emotion recognition, it groups existing methods by fusion strategy into four categories, namely early fusion, late fusion, hybrid fusion, and multi-kernel fusion, and compares them. Finally, it discusses open problems in emotion recognition research and future development trends.
Keywords: multimodal; emotion recognition; deep learning; fusion strategy
Foundation: Xi'an Science and Technology Innovation Talent Service Enterprise Project (2020KJRC0110)
DOI: 10.13682/j.issn.2095-6533.2022.01.009
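The fusion taxonomy in the abstract turns on where the modalities are combined. As a minimal illustrative sketch (hypothetical toy features, classifiers, and weights, not taken from any surveyed system), early fusion concatenates unimodal features before a single classifier sees them, while late fusion combines the decisions of per-modality classifiers:

```python
def early_fusion(text_feat, audio_feat, face_feat):
    """Feature-level (early) fusion: concatenate the unimodal feature
    vectors into one joint vector for a single downstream classifier."""
    return text_feat + audio_feat + face_feat  # list concatenation

def late_fusion(text_probs, audio_probs, face_probs, weights=(1/3, 1/3, 1/3)):
    """Decision-level (late) fusion: each modality is classified on its
    own; the per-class probabilities are combined by a weighted average."""
    fused = []
    for p_t, p_a, p_f in zip(text_probs, audio_probs, face_probs):
        fused.append(weights[0] * p_t + weights[1] * p_a + weights[2] * p_f)
    return fused

if __name__ == "__main__":
    # Toy 2-dim features per modality, and 3-class probability outputs.
    joint = early_fusion([0.1, 0.2], [0.3, 0.4], [0.5, 0.6])
    print(len(joint))  # 6-dim joint feature vector
    probs = late_fusion([0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1])
    print(max(range(3), key=lambda i: probs[i]))  # index of the fused class
```

Hybrid fusion mixes both levels, and multi-kernel fusion instead learns a weighted combination of per-modality kernels inside a kernel classifier.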
References:
- [1] FERNANDEZ R,PICARD R W.Modeling drivers’ speech under stress[J].Speech Communication,2003,40(1-2):145-159.
- [2] BHARDWAJ A,NARAYAN Y,DUTTA M.Sentiment analysis for indian stock market prediction using sensex and nifty[J].Procedia Computer Science,2015,70:85-91.
- [3] ZVAREVASHE K,OLUGBARA O O.A framework for sentiment analysis with opinion mining of hotel reviews[C]//Proceedings of the 2018 Conference on Information Communications Technology and Society (ICTAS).Durban:IEEE,2018:1-4.
- [4] WANG X F,GERBER M S,BROWN D E.Automatic crime prediction using events extracted from twitter posts[C]//Proceedings of the International Conference on Social Computing,Behavioral-Cultural Modeling,and Prediction(SBP).Berlin:Springer,2012:231-238.
- [5] AKHTAR M S,CHAUHAN D S,EKBAL A.A deep multi-task contextual attention framework for multi-modal affect analysis[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2020,14(3):1-27.
- [6] CASTELLANO G,KESSOUS L,CARIDAKIS G.Emotion recognition through multiple modalities:Face,body gesture,speech[C]//Proceedings of the Affect and Emotion in Human-Computer Interaction.Berlin:Springer,2008:92-103.
- [7] BÄNZIGER T,PIRKER H,SCHERER K R.GEMEP-GEneva multimodal emotion portrayals:A corpus for the study of multimodal emotional expressions[J].Proceedings of LREC,2006,6:15-19.
- [8] KONONENKO I.On biases in estimating multi-valued attributes[J].International Joint Conference on Artificial Intelligence,1995,95:1034-1040.
- [9] WANG M,CAO D,LI L,et al.Microblog sentiment analysis based on cross-media bag-of-words model[C]//Proceedings of International Conference on Internet Multimedia Computing and Service.New York:Association for Computing Machinery,2014:76-80.
- [10] KRIZHEVSKY A,SUTSKEVER I,HINTON G E.ImageNet classification with deep convolutional neural networks[J].Advances in Neural Information Processing Systems,2012,25:1097-1105.
- [11] SOLEYMANI M,GARCIA D,JOU B,et al.A survey of multimodal sentiment analysis[J].Image and Vision Computing,2017,65:3-14.
- [12] KAUR R,KAUTISH S.Multimodal sentiment analysis:A survey and comparison[J].International Journal of Service Science,Management,Engineering,and Technology(IJSSMET),2019,10(2):38-58.
- [13] YADAV A,VISHWAKARMA D K.Sentiment analysis using deep learning architectures:A review[J].Artificial Intelligence Review,2020,53(6):4335-4385.
- [14] SABOUR S,FROSST N,HINTON G E.Dynamic routing between capsules[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.[S.l.]:ACM,2017:3859-3869.
- [15] VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.[S.l.]:ACM,2017:6000-6010.
- [16] NGUYEN H T,NGUYEN M L.Effective attention networks for aspect-level sentiment classification[C]//Proceedings of the 2018 10th International Conference on Knowledge and Systems Engineering (KSE).Ho Chi Minh City:IEEE,2018:25-30.
- [17] PENNINGTON J,SOCHER R,MANNING C D.Glove:Global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).Doha:SIGDAT,2014:1532-1543.
- [18] ROGERS A,KOVALEVA O,RUMSHISKY A.A primer in BERTology:What we know about how BERT works[J].Transactions of the Association for Computational Linguistics,2020,8:842-866.
- [19] PETERS M E,NEUMANN M,IYYER M,et al.Deep contextualized word representations[EB/OL].[2021-10-10].http://arxiv.org/pdf/1802.05365.
- [20] RADFORD A,NARASIMHAN K,SALIMANS T,et al.Improving language understanding by generative pre-training[EB/OL].[2021-10-10].https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- [21] WANG Y,SUN A,HAN J,et al.Sentiment analysis by capsules[C]//Proceedings of the 2018 World Wide Web Conference.Switzerland:International World Wide Web Conferences Steering Committee,2018:1165-1174.
- [22] DU C,SUN H,WANG J,et al.Capsule network with interactive attention for aspect-level sentiment classification[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.Hong Kong:Association for Computational Linguistics,2019:5492-5501.
- [23] CHEN Z,QIAN T.Transfer capsule network for aspect level sentiment classification[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.Florence:Association for Computational Linguistics,2019:547-556.
- [24] VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[J].Advances in Neural Information Processing Systems,2017,30:1-11.
- [25] LI Q,WU C,WANG Z,et al.Hierarchical transformer network for utterance-level emotion recognition[J].Applied Sciences,2020,10(13):4447-4454.
- [26] PRAVENA D,GOVIND D.Significance of incorporating excitation source parameters for improved emotion recognition from speech and electroglottographic signals[J].International Journal of Speech Technology,2017,20(4):787-797.
- [27]■.SPeech ACoustic (SPAC):A novel tool for speech feature extraction and classification[J].Applied Acoustics,2018,136:1-8.
- [28] BOERSMA P.Praat:Doing phonetics by computer[J].Ear & Hearing,2011,32(2):266.
- [29] EYBEN F,WÖLLMER M,SCHULLER B.openSMILE:The Munich versatile and fast open-source audio feature extractor[C]//Proceedings of the 18th ACM International Conference on Multimedia.New York:Association for Computing Machinery,2010:1459-1462.
- [30] SUN L,HE B,LIU Z,et al.Speech emotion recognition based on information cell[J].Journal of Zhejiang University (Engineering Science),2015,49(6):1001-1009.
- [31] SCHULLER B,BATLINER A,STEIDL S,et al.Recognising realistic emotions and affect in speech:State of the art and lessons learnt from the first challenge[J].Speech Communication,2011,53(9/10):1062-1087.
- [32] WANG Z,LIU G,SONG H.Feature fusion based on multiple kernel learning for speech emotion recognition[J].Computer Engineering,2019,45(8):248-254.
- [33] ZHANG L,LV J,QIANG Y,et al.Emotion recognition based on deep belief network[J].Journal of Taiyuan University of Technology,2019,50(1):101-107.
- [34] WU X,LIU S,CAO Y,et al.Speech emotion recognition using capsule networks[C]//Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).Brighton:IEEE,2019:6695-6699.
- [35] ZHANG S,ZHANG S,HUANG T,et al.Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching[J].IEEE Transactions on Multimedia,2017,20(6):1576-1590.
- [36] OJALA T,PIETIKÄINEN M,HARWOOD D.A comparative study of texture measures with classification based on featured distributions[J].Pattern Recognition,1996,29(1):51-59.
- [37] HERNANDEZ-MATAMOROS A,BONARINI A,ESCAMILLA-HERNANDEZ E,et al.A facial expression recognition with automatic segmentation of face regions[C]//Proceedings of the International Conference on Intelligent Software Methodologies,Tools,and Techniques.Cham:Springer,2015:529-540.
- [38] HINTON G E.Reducing the dimensionality of data with neural networks[J].Science,2006,313(5786):504-507.
- [39] WANG H,LIANG H.Local occlusion facial expression recognition based on improved GAN[J].Computer Engineering and Application,2020,56(5):141-146.
- [40] LAI Y H,LAI S H.Emotion-preserving representation learning via generative adversarial network for multi-view facial expression recognition[C]//Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018).Xi’an:IEEE,2018:263-270.
- [41] AN F,LIU Z.Facial expression recognition algorithm based on parameter adaptive initialization of CNN and LSTM[J].The Visual Computer,2020,36(3):483-498.
- [42] LIU P,HAN S,MENG Z,et al.Facial expression recognition via a boosted deep belief network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Columbus:IEEE,2014:1805-1812.
- [43] ZHANG Y,LAI G,ZHANG M,et al.Explicit factor models for explainable recommendation based on phrase-level sentiment analysis[C]//Proceedings of the 37th international ACM SIGIR Conference on Research & Development in Information Retrieval.New York:Association for Computing Machinery,2014:83-92.
- [44] CAI Y,CAI H,WAN X.Multi-modal sarcasm detection in twitter with hierarchical fusion model[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.Florence:Association for Computational Linguistics,2019:2506-2515.
- [45] XU N,MAO W,CHEN G.Multi-interactive memory network for aspect based multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence.Honolulu:AAAI,2019:371-378.
- [46] ZADEH A A B,LIANG P P,PORIA S,et al.Multimodal language analysis in the wild:Cmu-mosei dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.Melbourne:Association for Computational Linguistics,2018:2236-2246.
- [47] ZADEH A,ZELLERS R,PINCUS E,et al.MOSI:Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos[EB/OL].[2021-10-10].https://arxiv.org/ftp/arxiv/papers/1606/1606.06259.pdf.
- [48] MORENCY L P,MIHALCEA R,DOSHI P.Towards multimodal sentiment analysis:Harvesting opinions from the web[C]//Proceedings of the 13th International Conference on Multimodal Interfaces.New York:Association for Computing Machinery,2011:169-176.
- [49] WÖLLMER M,WENINGER F,KNAUP T,et al.Youtube movie reviews:Sentiment analysis in an audio-visual context[J].IEEE Intelligent Systems,2013,28(3):46-53.
- [50] BUSSO C,BULUT M,LEE C C,et al.IEMOCAP:Interactive emotional dyadic motion capture database[J].Language Resources and Evaluation,2008,42(4):335-359.
- [51] PORIA S,HAZARIKA D,MAJUMDER N,et al.MELD:A multimodal multi-party dataset for emotion recognition in conversations[EB/OL].[2021-10-10].https://arxiv.org/pdf/1810.02508.pdf.
- [52] YOU Q,LUO J,JIN H,et al.Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia[C]//Proceedings of the Ninth ACM International Conference on Web Search and Data Mining.New York:Association for Computing Machinery,2016:13-22.
- [53] CHEN M,WANG S,LIANG P P,et al.Multimodal sentiment analysis with word-level fusion and reinforcement learning[C]//Proceedings of the 19th ACM International Conference on Multimodal Interaction.New York:Association for Computing Machinery,2017:163-171.
- [54] ZADEH A,CHEN M,PORIA S,et al.Tensor fusion network for multimodal sentiment analysis[EB/OL].[2021-10-10].https://arxiv.org/pdf/1707.07250.pdf.
- [55] ZADEH A,LIANG P P,PORIA S,et al.Multi-attention recurrent network for human communication comprehension[C]//Proceedings of the AAAI Conference on Artificial Intelligence.[S.l.]:AAAI,2018:5642-5649.
- [56] ZADEH A,LIANG P P,MAZUMDER N,et al.Memory fusion network for multi-view sequential learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence.[S.l.]:AAAI,2018:5634-5641.
- [57] MAI S,XING S,HU H.Analyzing multimodal sentiment via acoustic-and visual-LSTM with channel-aware temporal convolution network[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2021,29:1424-1437.
- [58] XU N,MAO W,CHEN G.Multi-interactive memory network for aspect based multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence.[S.l.]:AAAI,2019:371-378.
- [59] LIU D,CHEN L,WANG Z,et al.Speech expression multimodal emotion recognition based on deep belief network[J].Journal of Grid Computing,2021,19(2):1-13.
- [60] SIRIWARDHANA S,KALUARACHCHI T,BILLINGHURST M,et al.Multimodal emotion recognition with transformer-based self supervised feature fusion[J].IEEE Access,2020,8:176274-176285.
- [61] NOJAVANASGHARI B,GOPINATH D,KOUSHIK J,et al.Deep multimodal fusion for persuasiveness prediction[C]//Proceedings of the 18th ACM International Conference on Multimodal Interaction.New York:Association for Computing Machinery,2016:284-288.
- [62] YU Y,LIN H,MENG J,et al.Visual and textual sentiment analysis of a microblog using deep convolutional neural networks[J].Algorithms,2016,9(2):41.
- [63] PORIA S,CAMBRIA E,HAZARIKA D,et al.Context-dependent sentiment analysis in user-generated videos[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.Vancouver:Association for Computational Linguistics,2017:873-883.
- [64] TAN Y,SUN Z,DUAN F,et al.A multimodal emotion recognition method based on facial expressions and electroencephalography[J].Biomedical Signal Processing and Control,2021,70:103029.
- [65] PANDEYA Y R,BHATTARAI B,LEE J.Deep-learning-based multimodal emotion classification for music videos[J].Sensors,2021,21(14):4927.
- [66] HUANG F,ZHANG X,ZHAO Z,et al.Image-text sentiment analysis via deep multimodal attentive fusion[J].Knowledge-Based Systems,2019,167:26-37.
- [67] SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[EB/OL].[2021-10-10].https://arxiv.org/pdf/1409.1556.pdf.
- [68] TASHU T M,HAJIYEVA S,HORVATH T.Multimodal emotion recognition from art using sequential co-attention[J].Journal of Imaging,2021,7(8):157.
- [69] LIAN Z,LIU B,TAO J.CTNet:Conversational transformer network for emotion recognition[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2021,29:985-1000.
- [70] PORIA S,PENG H,HUSSAIN A,et al.Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis[J].Neurocomputing,2017,261:217-230.
- [71] PORIA S,CAMBRIA E,GELBUKH A.Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.Lisbon:SIGDAT,2015:2539-2544.
- [72] BALTRUŠAITIS T,ROBINSON P,MORENCY L P.3D constrained local model for rigid and non-rigid facial tracking[C]//Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition.Providence:IEEE,2012:2610-2617.
- [73] PORIA S,CHATURVEDI I,CAMBRIA E,et al.Convolutional MKL based multimodal emotion recognition and sentiment analysis[C]//Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM).Barcelona:IEEE,2016:439-448.
- [74] LE Q,MIKOLOV T.Distributed representations of sentences and documents[C]//Proceedings of the International Conference on Machine Learning.[S.l.]:ACM,2014:1188-1196.
- [75] SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[EB/OL].[2021-10-10].https://arxiv.org/pdf/1409.1556.pdf.
- [76] CHEN Y.Convolutional neural network for sentence classification[D].Waterloo:University of Waterloo,2015.
- [77] PORIA S,CAMBRIA E,HAZARIKA D,et al.Context-dependent sentiment analysis in user-generated videos[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.Vancouver:Association for Computational Linguistics,2017:873-883.
- [78] HAZARIKA D,PORIA S,ZADEH A,et al.Conversational memory network for emotion recognition in dyadic dialogue videos[C]//Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics.New Orleans:Association for Computational Linguistics,2018:2122.
- [79] CIMTAY Y,EKMEKCIOGLU E,CAGLAR-OZHAN S.Cross-subject multimodal emotion recognition based on hybrid fusion[J].IEEE Access,2020,8:168865-168878.