Multimodal human action recognition using deep learning

Author Identifier

Muhammad Bilal Shaikh

Date of Award


Document Type

Thesis - ECU Access Only


Edith Cowan University

Degree Name

Doctor of Philosophy


School of Engineering

First Supervisor

Douglas Chai

Second Supervisor

Syed Mohammed Shamsul Islam

Third Supervisor

Naveed Akhtar


This dissertation addresses the increasingly important domain of video-based human action recognition, which is relevant to numerous fields, ranging from video retrieval, human-robot interaction, and visual surveillance to human-computer interaction, sports analysis, healthcare, and entertainment. Growing interest in multimodal action recognition is driven by the advent of increasingly capable and affordable multimodal sensors, accompanied by improved data collection mechanisms.

The first part of this thesis reviews the existing literature on multimodal action recognition. Its opening chapter presents a comprehensive survey of RGB-D data acquisition sensors and of the techniques applied to multimodal action recognition, spanning classical machine learning through modern deep learning methodologies. The following chapter presents a second in-depth survey covering the transition from Convolutional Neural Networks to Transformer-based techniques, viewed through the lens of feature fusion and action learning.

The second part of the thesis concerns dataset preprocessing and the construction of intermediate datasets. We propose lightweight datasets, designed to capture distinctive multimodal features from alternative representations, thereby offering a fresh perspective on data utilization.

The third part of this thesis presents two novel methodologies for multimodal human action recognition. The first method integrates segmented visual features with complex audio features, using convolutional neural networks (CNNs) to learn deep embeddings of the features essential for action recognition. The second approach centers on audiovisual token embeddings, employing transformers for multimodal feature extraction; the extracted features are then forwarded to a fusion module, which learns the intricate spatial and temporal relationships vital to recognizing human actions.
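The token-fusion idea in the second approach can be illustrated with a minimal sketch. This is not the thesis's implementation: the token counts, embedding width, and the single self-attention pass over the concatenated audio and visual token sequences are illustrative assumptions, standing in for the full transformer-based fusion module described above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_audio_visual(visual_tokens, audio_tokens):
    """Concatenate the two modality token sequences and apply one
    scaled dot-product self-attention pass, so every token can attend
    across both modalities, then mean-pool to a clip-level embedding.
    (Illustrative sketch; a real fusion module would use learned
    query/key/value projections and multiple layers.)"""
    tokens = np.concatenate([visual_tokens, audio_tokens], axis=0)  # (Nv+Na, d)
    d = tokens.shape[1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)  # cross-modal attention
    fused = attn @ tokens          # attended token embeddings
    return fused.mean(axis=0)      # pooled clip-level descriptor

rng = np.random.default_rng(0)
visual = rng.standard_normal((16, 64))  # e.g. 16 visual patch tokens, dim 64
audio = rng.standard_normal((8, 64))    # e.g. 8 audio spectrogram tokens
clip_embedding = fuse_audio_visual(visual, audio)
print(clip_embedding.shape)  # (64,)
```

In practice the pooled embedding would feed a classification head that predicts the action label; the key point is that attention weights span both modalities, letting audio tokens modulate visual ones and vice versa.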

Through rigorous experimentation, we provide robust evidence that our proposed methods outperform existing state-of-the-art techniques on benchmark multimodal human action datasets, with a specific emphasis on the audio and visual modalities. Furthermore, we demonstrate the scalability of the proposed methodologies by assessing their performance on datasets of varied sizes and characteristics, showing that our approaches remain effective irrespective of data scale and diversity.



Access Note

Access to this thesis is embargoed until 21st December 2026.
