Efficient Low-Rank Multimodal Fusion With Modality-Specific Factors
Figure 1: Overview of our Low-rank Multimodal Fusion (LMF) model structure: LMF first obtains the unimodal representations $z_a$, $z_v$, $z_l$ by passing the unimodal inputs $x_a$, $x_v$, $x_l$ into three sub-embedding networks $f_a$, $f_v$, $f_l$ respectively. LMF then produces the multimodal output representation by performing low-rank multimodal fusion with modality-specific factors. The multimodal representation can then be used for prediction tasks.
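To make this pipeline concrete, below is a minimal sketch of the sub-embedding stage, assuming single-layer networks and illustrative feature sizes (the actual $f_a$, $f_v$, $f_l$ are learned sub-networks whose architectures are not specified in this excerpt):

```python
# Hypothetical sketch of the unimodal embedding stage in Figure 1 (not the
# authors' code): three one-layer sub-embedding networks map the unimodal
# inputs x_a, x_v, x_l to representations z_a, z_v, z_l.
import numpy as np

rng = np.random.default_rng(0)

def sub_embedding(x, w, b):
    # Stand-in for f_a / f_v / f_l; any differentiable network would do here.
    return np.tanh(x @ w + b)

d_a, d_v, d_l, d_z = 74, 35, 300, 32   # illustrative input/output sizes
x_a = rng.normal(size=d_a)             # acoustic features
x_v = rng.normal(size=d_v)             # visual features
x_l = rng.normal(size=d_l)             # language features

z_a = sub_embedding(x_a, rng.normal(size=(d_a, d_z)), np.zeros(d_z))
z_v = sub_embedding(x_v, rng.normal(size=(d_v, d_z)), np.zeros(d_z))
z_l = sub_embedding(x_l, rng.normal(size=(d_l, d_z)), np.zeros(d_z))
```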
Figure 3: Decomposing the weight tensor into low-rank modality-specific factors (see Section 3.2.1 for details).
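Since Figure 3 itself is not reproduced here, the decomposition it depicts can be stated compactly. The following is a reconstruction in standard CP (parallel-factors) form, consistent with the bimodal special case shown in Equation 7 below:

$$\mathcal{W} = \sum_{i=1}^{r} \bigotimes_{m=1}^{M} w_m^{(i)}$$

so the model stores only the modality-specific factor sets $\{w_m^{(i)}\}_{i=1}^{r}$ for $m = 1, \ldots, M$, rather than the full weight tensor $\mathcal{W}$.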
where $\Lambda_{m=1}^{M}$ denotes the element-wise product over a sequence of tensors: $\Lambda_{t=1}^{3}\, x_t = x_1 \circ x_2 \circ x_3$. An illustration of the trimodal case of Equation 6 is shown in Figure 1. We can also derive Equation 6 for a bimodal case to clarify what it does:

$$h = \Big(\sum_{i=1}^{r} w_a^{(i)} \otimes w_v^{(i)}\Big) \cdot \mathcal{Z} = \Big(\sum_{i=1}^{r} w_a^{(i)} \cdot z_a\Big) \circ \Big(\sum_{i=1}^{r} w_v^{(i)} \cdot z_v\Big) \qquad (7)$$

In practice, we can regroup the decomposed factors into $M$ order-3 tensors and swap the order in which we do the element-wise product and the summation:

$$h = \sum_{i=1}^{r} \Big[\, \Lambda_{m=1}^{M} \big[ w_m^{(1)}, w_m^{(2)}, \ldots, w_m^{(r)} \big] \cdot \hat{z}_m \Big]_{i,:} \qquad (8)$$

and now the summation is done along the first dimension of the bracketed matrix. $[\cdot]_{i,:}$ indicates the $i$-th slice of a matrix. In this way, we can parameterize the model with $M$ order-3 tensors, instead of parameterizing with sets of vectors.
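As a concrete illustration of Equation 8, here is a minimal NumPy sketch (a re-implementation under stated assumptions, not the authors' released code): each modality $m$ keeps one order-3 factor tensor of shape $(r, d_m, d_h)$, the inputs $\hat{z}_m$ are taken to be the unimodal representations with a constant 1 appended, and neither the full weight tensor $\mathcal{W}$ nor the outer product $\mathcal{Z}$ is ever materialized:

```python
# Minimal sketch of low-rank fusion as in Equation 8 (shapes, names, and the
# appended-1 convention are illustrative assumptions, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
r, d_h = 4, 16                        # rank and fused-output size (illustrative)
dims = {"a": 33, "v": 17, "l": 65}    # per-modality sizes, incl. the 1-slot

# Modality-specific low-rank factors: M order-3 tensors [w_m^(1), ..., w_m^(r)].
W = {m: rng.normal(size=(r, d, d_h)) for m, d in dims.items()}

def lmf(z_hat):
    """h = sum_i [ Lambda_m ( [w_m^(1), ..., w_m^(r)] . z_m ) ]_{i,:}"""
    # (r, d_m, d_h) . (d_m,) -> (r, d_h), independently for each modality
    projected = [np.einsum("rdh,d->rh", W[m], z_hat[m]) for m in z_hat]
    fused = np.prod(projected, axis=0)  # element-wise product over modalities
    return fused.sum(axis=0)            # summation along the first (rank) axis

# Unimodal representations with a 1 appended (the hat in z-hat_m).
z_hat = {m: np.append(rng.normal(size=d - 1), 1.0) for m, d in dims.items()}
h = lmf(z_hat)                          # multimodal representation, shape (d_h,)
```

Each modality contributes only a small matrix product followed by an element-wise product, so the cost grows linearly with the number of modalities and the rank $r$, in contrast to materializing $\mathcal{Z} = \hat{z}_a \otimes \hat{z}_v \otimes \hat{z}_l$, whose size grows exponentially with $M$.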
Table 2: Results for sentiment analysis on CMU-MOSI, emotion recognition on IEMOCAP and personality
trait recognition on POM. Best results are highlighted in bold.
Figure 4: The impact of different rank settings on model performance: as the rank increases, the results become unstable, and a low rank is enough in terms of the mean absolute error.

References

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation 42(4):335–359. https://doi.org/10.1007/s10579-008-9076-6.

Lawrence S. Chen, Thomas S. Huang, Tsutomu Miyasato, and Ryohei Nakatsu. 1998. Multimodal human emotion/expression recognition. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, pages 366–371.

Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (ICMI 2017). ACM, New York, NY, USA, pages 163–171. https://doi.org/10.1145/3136755.3136801.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20(3):273–297.

Liyanage C. De Silva, Tsutomu Miyasato, and Ryohei Nakatsu. 1997. Facial emotion recognition using multi-modal information. In Proceedings of the 1997 International Conference on Information, Communications and Signal Processing (ICICS). IEEE, volume 1, pages 397–401.

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP: A collaborative voice analysis repository for speech technologies. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pages 960–964.

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.

Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces. ACM, pages 169–176.

Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency. 2016. Deep multimodal fusion for persuasiveness prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, pages 284–288.

Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction (ICMI '14). ACM, New York, NY, USA, pages 50–57. https://doi.org/10.1145/2663204.2663260.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Verónica Pérez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 973–982.

Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, pages 439–448.