Multimodal human action recognition using deep learning

Author Identifier

Muhammad Bilal Shaikh

Date of Award


Document Type

Thesis - ECU Access Only


Edith Cowan University

Degree Name

Doctor of Philosophy


School of Engineering

First Supervisor

Douglas Chai

Second Supervisor

Syed Mohammed Shamsul Islam

Third Supervisor

Naveed Akhtar


This dissertation addresses the increasingly important domain of video-based human action recognition, which is relevant to numerous fields, ranging from video retrieval, human-robot interaction, and visual surveillance to human-computer interaction, sports analysis, healthcare, and entertainment. Growing interest in multimodal action recognition is driven by the advent of increasingly capable and affordable multimodal sensors, accompanied by improved data collection mechanisms.

The first part of this thesis reviews the existing literature on multimodal action recognition. Its opening chapter presents a comprehensive survey of RGB-D data acquisition sensors and of the techniques applied to multimodal action recognition, spanning classical machine learning through modern deep learning methodologies. The following chapter presents a second in-depth survey covering the transition from Convolutional Neural Networks to Transformer-based techniques, viewed through the lens of feature fusion and action learning.

The second part of the thesis concerns dataset preprocessing and the construction of intermediate datasets. We propose lightweight datasets, designed to capture distinctive multimodal features from alternative representations, thereby offering a fresh perspective on data utilization.

The third part of this thesis presents two novel methodologies for multimodal human action recognition. The first method integrates segmented visual features with complex audio features, using convolutional neural networks (CNNs) to learn deep embeddings of the features essential for action recognition. The second approach centers on audiovisual token embeddings, employing transformers for multimodal feature extraction; the extracted features are then forwarded to a fusion module, which learns the intricate spatial and temporal relationships vital to recognizing human actions.
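The token-fusion idea in the second approach can be illustrated with a minimal sketch. This is not the thesis's implementation: the token counts, embedding width, and the single self-attention pass over the concatenated audio and visual token sequences are illustrative assumptions, standing in for the full transformer-based fusion module described above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_audio_visual(visual_tokens, audio_tokens):
    """Concatenate the two modality token sequences and apply one
    scaled dot-product self-attention pass, so every token can attend
    across both modalities, then mean-pool to a clip-level embedding.
    (Illustrative sketch; a real fusion module would use learned
    query/key/value projections and multiple layers.)"""
    tokens = np.concatenate([visual_tokens, audio_tokens], axis=0)  # (Nv+Na, d)
    d = tokens.shape[1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)  # cross-modal attention
    fused = attn @ tokens          # attended token embeddings
    return fused.mean(axis=0)      # pooled clip-level descriptor

rng = np.random.default_rng(0)
visual = rng.standard_normal((16, 64))  # e.g. 16 visual patch tokens, dim 64
audio = rng.standard_normal((8, 64))    # e.g. 8 audio spectrogram tokens
clip_embedding = fuse_audio_visual(visual, audio)
print(clip_embedding.shape)  # (64,)
```

In practice the pooled embedding would feed a classification head that predicts the action label; the key point is that attention weights span both modalities, letting audio tokens modulate visual ones and vice versa.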

Through rigorous experimentation, we provide robust evidence that our proposed methods outperform existing state-of-the-art techniques on benchmark multimodal human action datasets, with a specific emphasis on the audio and visual modalities. Furthermore, we demonstrate the scalability of the proposed methodologies by assessing their performance on datasets of varied sizes and characteristics, showing that our approaches remain effective irrespective of data scale and diversity.



Access Note

Access to this thesis is embargoed until 21st December 2026.
