Abstract

Multimodal Human Action Recognition (MHAR) is an important research topic in the computer vision and event recognition fields. In this work, we address the problem of MHAR by developing a novel audio-image and video fusion-based deep learning framework that we call Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We extract temporal information using image representations of audio signals and spatial information from the video modality with the help of Convolutional Neural Network (CNN)-based feature extractors, and fuse these features to recognize the respective action classes. We apply a high-level weight assignment algorithm to improve audio-visual interaction and convergence. The proposed fusion-based framework exploits the influence of the audio and video feature maps and uses them to classify an action. Compared with state-of-the-art audio-visual MHAR techniques, the proposed approach features a simpler yet more accurate and more generalizable architecture, one that performs better with different audio-image representations. The system achieves accuracies of 87.9% and 79.0% on the UCF51 and Kinetics Sounds datasets, respectively. All code and models for this paper will be available at https://tinyurl.com/4ps2ux6n.
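The fusion step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the feature dimensions, the per-modality weights `w_audio` and `w_video`, and the linear classifier are all hypothetical stand-ins for the CNN extractors and the high-level weight assignment algorithm.

```python
import numpy as np

def fuse_features(audio_feat, video_feat, w_audio=0.5, w_video=0.5):
    """Weighted late fusion of audio-image and video feature vectors.

    The weights are hypothetical; the paper assigns them with a
    high-level weight assignment algorithm not reproduced here.
    """
    # L2-normalize each modality so neither dominates by scale alone.
    a = audio_feat / (np.linalg.norm(audio_feat) + 1e-8)
    v = video_feat / (np.linalg.norm(video_feat) + 1e-8)
    # Concatenate the weighted modality features into one vector.
    return np.concatenate([w_audio * a, w_video * v])

def classify(fused, W, b):
    """Softmax over a linear layer, standing in for the action classifier."""
    logits = fused @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy usage with made-up dimensions (512-d audio, 1024-d video, 51 classes).
rng = np.random.default_rng(0)
audio = rng.standard_normal(512)
video = rng.standard_normal(1024)
fused = fuse_features(audio, video, w_audio=0.4, w_video=0.6)
probs = classify(fused, rng.standard_normal((1536, 51)), np.zeros(51))
```

The design choice sketched here is late fusion: each modality is encoded independently, and interaction happens only at the weighted concatenation, which keeps the architecture simple relative to cross-modal attention schemes.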

Keywords

Deep learning, human action recognition, multimodal, information fusion, CNN

Document Type

Journal Article

Date of Publication

2024

Volume

36

Publication Title

Neural Computing and Applications

Publisher

Springer

School

School of Engineering / School of Science / Centre for Artificial Intelligence and Machine Learning (CAIML)

RAS ID

62440

Funders

Edith Cowan University, Australia / Higher Education Commission (HEC), Pakistan / Office of National Intelligence National Intelligence Postdoctoral Grant / Australian Government / Open Access funding enabled and organized by CAUL and its Member Institutions.

Creative Commons License

Creative Commons Attribution 4.0 License
This work is licensed under a Creative Commons Attribution 4.0 License.

Comments

Shaikh, M. B., Chai, D., Islam, S. M. S., & Akhtar, N. (2024). Multimodal fusion for audio-image and video action recognition. Neural Computing and Applications, 36, 5499-5513. https://doi.org/10.1007/s00521-023-09186-5

First Page

5499

Last Page

5513


Link to publisher version (DOI)

10.1007/s00521-023-09186-5