A hybrid deep learning framework for daily living human activity recognition with cluster-based video summarization

Document Type

Journal Article

Publication Title

Multimedia Tools and Applications




School of Science / Centre for Securing Digital Futures


Hossain, S., Deb, K., Sakib, S., & Sarker, I. H. (2024). A hybrid deep learning framework for daily living human activity recognition with cluster-based video summarization. Multimedia Tools and Applications. Advance online publication. https://doi.org/10.1007/s11042-024-19022-0


In assisted living facilities and nursing homes, Human Activity Recognition (HAR) can be used to monitor residents’ movements and actions, helping ensure they receive proper care and attention. HAR is also valuable for reviewing and updating emergency response plans that address unusual behavior patterns in daily living activities. Recognizing activity from video data entails extracting spatial features and then determining the temporal variations across those features. To capture the semantic associations across sequential frames, a specified number of frames must be sampled from the video. Although the sampled frames play an essential role, they are often selected at random or by sequential frame skipping, resulting in a loss of temporal information. A video summary that stays faithful to the original video while presenting its most important details is a possible solution to this problem. To address the issue, we propose a cluster-based approach to keyframe selection that generates a video summary by extracting the relevant frames. Additionally, we explore two deep learning strategies for action recognition to determine which is more effective: (a) a pose-based activity recognition model and (b) a single hybrid pre-trained CNN-LSTM model. The experimental findings demonstrate the efficacy of the single hybrid CNN-LSTM technique. Our proposed model yields a mean accuracy of 95.56% on the RGB video modality, surpassing several recent multimodal works on the MSRDailyActivity3D dataset. In addition, the proposed model is evaluated on two challenging datasets: PRECIS HAR and UCF11. Our proposed single hybrid CNN-LSTM model achieves 95.12% precision, 95.11% recall, and a 95.03% F1 score on the MSRDailyActivity3D dataset.
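
The idea of cluster-based keyframe selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm or frame features the authors use, so everything below is an assumption made for illustration: plain k-means over generic per-frame feature vectors, centroids initialized from evenly spaced frames, and a hypothetical function name `select_keyframes`. It is not the authors' implementation.

```python
import numpy as np


def select_keyframes(features, k, n_iter=20):
    """Return up to k frame indices, in temporal order, chosen by k-means.

    features: array-like of shape (n_frames, n_dims), one vector per frame.
    Assumption: k-means stands in for the unspecified clustering method.
    """
    features = np.asarray(features, dtype=float)
    n = len(features)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of frames")

    # Deterministic start: take centroids from evenly spaced frames.
    init_idx = np.linspace(0, n - 1, k).round().astype(int)
    centroids = features[init_idx].copy()

    for _ in range(n_iter):
        # Distance from every frame to every centroid, shape (n, k).
        dists = np.linalg.norm(
            features[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the frames assigned to it.
        for c in range(k):
            members = features[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)

    # The keyframe of a cluster is the real frame nearest its centroid.
    dists = np.linalg.norm(
        features[:, None, :] - centroids[None, :, :], axis=2
    )
    keyframes = {int(dists[:, c].argmin()) for c in range(k)}
    # Sorting keeps the summary in the original temporal order.
    return sorted(keyframes)
```

Unlike random or fixed-stride sampling, this kind of selection picks one representative frame per group of visually similar frames, so a short distinctive segment is less likely to be skipped. In practice the feature vectors might come from a pre-trained CNN, which is again an assumption rather than a detail given in the abstract.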



Access Rights

subscription content