Author Identifier (ORCID)

Arooba Maqsood: https://orcid.org/0009-0001-0853-9268

Martin Masek: https://orcid.org/0000-0001-8620-6779

Syed Zulqarnain Gilani: https://orcid.org/0000-0002-7448-2327

Abstract

Drunk driving remains a significant public safety challenge, demanding innovative alternatives to conventional methods such as field sobriety tests and breathalysers. Estimating a driver's level of intoxication through facial cues is particularly challenging due to the subtle and person-specific nature of alcohol-induced behaviours. In this paper, we present BiFuseNet, a 3D spatio-temporal multi-modal network designed to classify alcohol impairment levels into three categories: sober, moderate, and severe. Unlike prior approaches that rely on either uni-modal RGB video or hand-crafted facial features, our method exploits complementary physiological cues from RGB and infrared (IR) facial videos. We introduce a Bi-directional Hierarchical Fusion (BiHF) module that applies cross-attention mechanisms at multiple semantic levels of our BiFuseNet, including early, middle, and late feature stages. This enables deep integration of modality-specific signals across varying temporal and spatial contexts. To capture both short-term facial movements and sustained facial dynamics, we implement a sliding window strategy that samples over 30 frames across ten-minute recordings. Extensive experiments on a public dataset demonstrate that BiFuseNet outperforms uni-modal and traditional fusion baselines, achieving a classification accuracy of 88.41% and an AUC-ROC of 0.91, establishing a new state of the art in estimating blood alcohol concentration.
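The sliding window strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the frame rate, the stride, or the exact window length, so the values below (30-frame windows, a 30 fps recording, non-overlapping stride) and the helper name `sliding_windows` are assumptions made purely for illustration and are not taken from the paper.

```python
def sliding_windows(num_frames, window=30, stride=15):
    """Return (start, end) frame-index pairs, end exclusive, one per full window.

    window and stride are illustrative defaults, not values from the paper.
    Trailing frames that do not fill a whole window are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [(start, start + window)
            for start in range(0, num_frames - window + 1, stride)]


# Hypothetical ten-minute recording at an assumed 30 frames per second.
frames_in_recording = 10 * 60 * 30
clips = sliding_windows(frames_in_recording, window=30, stride=30)
print(len(clips), clips[0], clips[-1])
```

Each pair indexes one clip that could be passed to a spatio-temporal network. A stride smaller than the window yields overlapping clips, which trades more computation for denser temporal coverage.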

Document Type

Conference Proceeding

Date of Publication

10-12-2025

Publication Title

ICMI 2025 Proceedings of the 27th International Conference on Multimodal Interaction

Publisher

Association for Computing Machinery, Inc

School

School of Science

RAS ID

84394

Funders

Western Australian Future Health Research and Innovation Fund

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Comments

Tariq, A., Maqsood, A., Masek, M., & Gilani, S. Z. (2025). BiFuseNet: A multimodal network for estimating blood alcohol concentration via bidirectional hierarchical fusion. In Proceedings of the 27th International Conference on Multimodal Interaction (pp. 605-613). https://doi.org/10.1145/3716553.3750808

First Page

605

Last Page

613

Link to publisher version (DOI)

10.1145/3716553.3750808