Document Type

Journal Article

Publication Title

Neural Computing and Applications

Volume

36

First Page

5499

Last Page

5513

Publisher

Springer

School

School of Engineering / School of Science / Centre for Artificial Intelligence and Machine Learning (CAIML)

RAS ID

62440

Funders

Edith Cowan University / Australia and Higher Education Commission (HEC), Pakistan / Office of National Intelligence National Intelligence Postdoctoral Grant / Australian Government / Open Access funding enabled and organized by CAUL and its Member Institutions.

Comments

Shaikh, M. B., Chai, D., Islam, S. M. S., & Akhtar, N. (2024). Multimodal fusion for audio-image and video action recognition. Neural Computing and Applications, 36, 5499-5513. https://doi.org/10.1007/s00521-023-09186-5

Abstract

Multimodal Human Action Recognition (MHAR) is an important research topic in the computer vision and event recognition fields. In this work, we address the problem of MHAR by developing a novel audio-image and video fusion-based deep learning framework that we call Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We extract temporal information using image representations of audio signals and spatial information from the video modality with the help of Convolutional Neural Network (CNN)-based feature extractors, and fuse these features to recognize the respective action classes. We apply a high-level weight assignment algorithm to improve audio-visual interaction and convergence. The proposed fusion-based framework exploits the influence of audio and video feature maps and uses them to classify an action. Compared with state-of-the-art audio-visual MHAR techniques, the proposed approach features a simpler yet more accurate and more generalizable architecture, one that performs better with different audio-image representations. The system achieves accuracies of 87.9% and 79.0% on the UCF51 and Kinetics Sounds datasets, respectively. All code and models for this paper will be available at https://tinyurl.com/4ps2ux6n.

DOI

10.1007/s00521-023-09186-5

Creative Commons License

Creative Commons Attribution 4.0 License
This work is licensed under a Creative Commons Attribution 4.0 License.
