Author Identifier (ORCID)
Arooba Maqsood: https://orcid.org/0009-0001-0853-9268
Martin Masek: https://orcid.org/0000-0001-8620-6779
Syed Zulqarnain Gilani: https://orcid.org/0000-0002-7448-2327
Abstract
Drunk driving remains a significant public safety challenge, demanding innovative alternatives to conventional methods such as field sobriety tests and breathalysers. Estimating a driver's level of intoxication through facial cues is particularly challenging due to the subtle and person-specific nature of alcohol-induced behaviours. In this paper, we present BiFuseNet, a 3D spatio-temporal multi-modal network designed to classify alcohol impairment levels into three categories: sober, moderate, and severe. Unlike prior approaches that rely on either uni-modal RGB video or hand-crafted facial features, our method exploits complementary physiological cues from RGB and infrared (IR) facial videos. We introduce a Bi-directional Hierarchical Fusion (BiHF) module that applies cross-attention mechanisms at multiple semantic levels of our BiFuseNet, including early, middle, and late feature stages. This enables deep integration of modality-specific signals across varying temporal and spatial contexts. To capture both short-term facial movements and sustained facial dynamics, we implement a sliding window strategy that samples over 30 frames across ten-minute recordings. Extensive experiments on a public dataset demonstrate that BiFuseNet outperforms uni-modal and traditional fusion baselines, achieving a classification accuracy of 88.41% and an AUC-ROC of 0.91, establishing a new state of the art in estimating blood alcohol concentration.
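The sliding window strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives only the approximate window length (about 30 frames) and the recording duration (ten minutes). The stride of 15 frames, the frame rate of 30 frames per second, and the function name `sliding_windows` are assumptions made for illustration and are not taken from the paper.

```python
from typing import Iterator, List


def sliding_windows(num_frames: int, window: int = 30, stride: int = 15) -> Iterator[List[int]]:
    """Yield one list of consecutive frame indices per window.

    window=30 follows the abstract; stride=15 is an assumed value.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Stop at the last start position that still leaves a full window.
    for start in range(0, num_frames - window + 1, stride):
        yield list(range(start, start + window))


# Hypothetical recording: ten minutes at an assumed 30 frames per second.
total_frames = 10 * 60 * 30
clips = list(sliding_windows(total_frames))
```

Each element of `clips` is a list of frame indices that could be used to cut a short clip from both the RGB and IR streams. Overlapping windows (stride smaller than window) are one possible way to cover brief facial movements as well as longer-term dynamics.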
Document Type
Conference Proceeding
Date of Publication
10-12-2025
Publication Title
ICMI 2025 Proceedings of the 27th International Conference on Multimodal Interaction
Publisher
Association for Computing Machinery, Inc
School
School of Science
RAS ID
84394
Funders
Western Australian Future Health Research and Innovation Fund
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
First Page
605
Last Page
613
Comments
Tariq, A., Maqsood, A., Masek, M., & Gilani, S. Z. (2025). BiFuseNet: A multimodal network for estimating blood alcohol concentration via bidirectional hierarchical fusion. In Proceedings of the 27th International Conference on Multimodal Interaction (pp. 605-613). https://doi.org/10.1145/3716553.3750808