Multimodal human action recognition using deep learning
Date of Award
2023
Document Type
Thesis - ECU Access Only
Publisher
Edith Cowan University
Degree Name
Doctor of Philosophy
School
School of Engineering
First Supervisor
Douglas Chai
Second Supervisor
Syed Mohammed Shamsul Islam
Third Supervisor
Naveed Akhtar
Abstract
This dissertation addresses the increasingly critical domain of human action recognition from video, which is relevant to numerous fields, including video retrieval, human-robot interaction, visual surveillance, human-computer interaction, sports analysis, healthcare, and entertainment. The growing interest in multimodal action recognition is driven by the advent of technologically superior and economically viable multimodal sensors, accompanied by improved data collection mechanisms.
The initial segment of this thesis reviews the existing literature on multimodal action recognition. The first chapter of this section presents a comprehensive survey of RGB-D data acquisition sensors and of the spectrum of techniques, from classical machine learning to modern deep learning, applied to multimodal action recognition. The subsequent chapter presents a second in-depth survey that covers the transition from convolutional neural networks to Transformer-based techniques from the perspective of feature fusion and action learning.
The next part of the thesis focuses on dataset preprocessing and the construction of intermediary datasets. We propose lightweight datasets, designed to capture distinctive multimodal features from alternative representations, thereby establishing a fresh perspective on data utilization.
Following this, the third segment of the thesis presents two novel methodologies for multimodal human action recognition. The first method integrates carefully segmented visual features with complex audio features, using convolutional neural networks (CNNs) to embed the features essential for action recognition. The second approach is centered on audiovisual token embeddings: transformers are used for multimodal feature extraction, and the resulting embeddings are forwarded to a fusion module that learns the intricate spatial and temporal relationships vital to recognizing human actions.
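The abstract describes the second approach only at a high level. Purely as an illustration of the general idea, a minimal sketch in PyTorch is given below; the module names, token dimensions, and the simple concatenation-based fusion transformer are assumptions for illustration and are not taken from the thesis.

# Illustrative sketch only: a generic audio-visual token fusion model,
# not the thesis's actual architecture. Dimensions and module choices are assumed.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_classes=60):
        super().__init__()
        # Per-modality transformer encoders produce token embeddings.
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                               batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.audio_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Fusion module: a transformer over the concatenated token sequence,
        # intended to model cross-modal spatial and temporal relationships.
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (B, Tv, dim), audio_tokens: (B, Ta, dim)
        v = self.visual_encoder(visual_tokens)
        a = self.audio_encoder(audio_tokens)
        cls = self.cls_token.expand(v.size(0), -1, -1)
        fused = self.fusion(torch.cat([cls, v, a], dim=1))
        return self.head(fused[:, 0])  # class logits from the [CLS] token

# Example usage with random token embeddings.
model = AudioVisualFusion()
logits = model(torch.randn(2, 16, 256), torch.randn(2, 8, 256))
print(logits.shape)  # torch.Size([2, 60])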
Through rigorous experimentation, we provide robust evidence that the proposed methods outperform existing state-of-the-art techniques on benchmark multimodal human action datasets, with a specific emphasis on the audio and visual modalities. Furthermore, we demonstrate the scalability of the proposed methodologies by assessing their performance on datasets of varied sizes and characteristics, ensuring that our approaches remain effective irrespective of data scale and diversity.
DOI
10.25958/v1j4-6h36
Access Note
Access to this thesis is embargoed until 21st December 2026.
Recommended Citation
Shaikh, M. (2023). Multimodal human action recognition using deep learning. Edith Cowan University. https://doi.org/10.25958/v1j4-6h36