Research outputs 2022 to 2026

From CNNs to transformers in multimodal human action recognition: A survey

Muhammad Bilal Shaikh, Edith Cowan University
Douglas Chai, Edith Cowan University
Syed Muhammad Shamsul Islam, Edith Cowan University
Naveed Akhtar

Abstract

Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the past decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of ‘fusing’ the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of the MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting the popular trends in the adaptation of CNN and Transformer building blocks for the overall problem. In particular, we emphasize recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss Human Action Recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices to train practicable models. We also provide an overview of the multimodal datasets from the viewpoint of their scale and evaluation. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.

Document Type

Journal Article

Date of Publication

7-9-2024

Volume

20

Issue

8

Publication Title

ACM Transactions on Multimedia Computing, Communications and Applications

Publisher

Association for Computing Machinery

School

School of Engineering / School of Science

RAS ID

71566

Funders

Edith Cowan University / Higher Education Commission of Pakistan (PM/HRDI-UESTPs/UETs-I/Phase-1/Batch-VI/2018) / Australian Government Office of National Intelligence (NIPG-2021-001)

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 License.

Comments

Shaikh, M. B., Chai, D., Islam, S. M. S., & Akhtar, N. (2024). From CNNs to transformers in multimodal human action recognition: A survey. ACM Transactions on Multimedia Computing, Communications and Applications, 20(8). https://doi.org/10.1145/3664815

Included in

Electrical and Computer Engineering Commons

Link to publisher version (DOI)

10.1145/3664815
