MAiVAR-T: Multimodal audio-image and video action recognizer using transformers

Document Type

Conference Proceeding

Publication Title

2023 11th European Workshop on Visual Information Processing (EUVIP)

Publisher

IEEE

School

School of Engineering / School of Science

RAS ID

58304

Funders

Edith Cowan University

Comments

Shaikh, M. B., Chai, D., Islam, S. M. S., & Akhtar, N. (2023, September). MAiVAR-T: Multimodal audio-image and video action recognizer using transformers [Paper presentation]. 2023 11th European Workshop on Visual Information Processing (EUVIP), Gjovik, Norway. https://doi.org/10.1109/EUVIP58404.2023.10323051

Abstract

In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities such as vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). The model combines the audio-image and video modalities in an intuitive way, with the primary aim of improving the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T is the extraction of meaningful representations from the audio modality and their transformation into the image domain. This audio-image representation is then fused with the video modality to form a unified representation. The approach exploits the contextual richness of both the audio and video modalities to improve action recognition. In contrast to existing state-of-the-art strategies that rely solely on audio or video, MAiVAR-T demonstrates superior performance. Extensive empirical evaluation on a benchmark action recognition dataset confirms the model's strong performance and underscores the gains available from integrating audio and video modalities for action recognition. To ensure transparency and reproducibility, the source code is publicly available at https://bit.ly/43do8DH.
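The pipeline outlined in the abstract (map audio to an image-like representation, then fuse it with video features) can be illustrated with a minimal sketch. This is a hypothetical outline, not the authors' implementation: the spectrogram conversion, mean-pooled video features, and concatenation-based fusion below are placeholder assumptions standing in for the paper's learned transformer components.

```python
# Hypothetical sketch of the MAiVAR-T pipeline described in the abstract:
# (1) audio -> image-like representation, (2) video -> features, (3) fusion.
# NOT the authors' implementation; all functions and dimensions are assumptions.
import numpy as np

def audio_to_image(waveform: np.ndarray, n_fft: int = 256, hop: int = 128) -> np.ndarray:
    """Turn a 1-D waveform into a 2-D log-magnitude spectrogram 'image'."""
    frames = [waveform[i:i + n_fft] for i in range(0, len(waveform) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=1))
    return np.log1p(spec).T  # shape: (freq_bins, time_frames)

def video_features(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, H, W, C) clip into one vector (mean pooling as a stand-in
    for a learned video backbone)."""
    return frames.reshape(frames.shape[0], -1).mean(axis=0)

def fuse(audio_img: np.ndarray, video_feat: np.ndarray) -> np.ndarray:
    """Concatenate pooled audio-image and video features into one unified
    representation (a stand-in for the paper's transformer-based fusion)."""
    return np.concatenate([audio_img.mean(axis=1), video_feat])

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)   # 1 second of synthetic audio at 16 kHz
clip = rng.random((8, 16, 16, 3))   # 8 synthetic low-resolution video frames
rep = fuse(audio_to_image(wave), video_features(clip))
print(rep.shape)                    # unified multimodal representation
```

In this toy setup the unified vector would feed a downstream action classifier; the paper instead learns both the audio-image representation and the fusion with transformers.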

DOI

10.1109/EUVIP58404.2023.10323051

Access Rights

subscription content
