M-AIDE: Mechanistic agentic interpretability for decoding empathy in language models
Author Identifier (ORCID)
Nima Mirnateghi: https://orcid.org/0000-0002-1814-7452
Syed Mohammed Shamsul Islam: https://orcid.org/0000-0002-3200-2903
Syed Afaq Ali Shah: https://orcid.org/0000-0003-2181-8445
Abstract
Large language models (LLMs) have transformed conversational agents, powering applications from everyday assistants to domain-specific systems. Yet their internal mechanisms remain opaque, limiting our understanding of how complex behaviours are represented. Therapeutic conversational agents provide a compelling setting to study this problem, as they require models to encode empathic behaviours. To better understand these behaviours, we present M-AIDE, an agentic framework designed to systematically interpret empathy-related features in LLMs. We apply this framework to therapeutic dialogue data, specifically to understand how LLMs may encode perceived empathy. Our approach leverages mechanistic interpretability to uncover artificial empathy features aligned with psychological categories of empathy. M-AIDE integrates automated interpretability into its pipeline, enabling large-scale classification and explanation of discovered features without exhaustive manual inspection. Our experiments reveal a gradient of representation: low-level features predominate in early layers, while distinct empathy features emerge in deeper layers. The source code is available at: https://github.com/ai-voyage/M-AIDE.git.
Keywords
Artificial empathy, large language models, mechanistic interpretability
Document Type
Conference Proceeding
Date of Publication
1-1-2025
Publication Title
2025 40th International Conference on Image and Vision Computing New Zealand (IVCNZ)
Publisher
IEEE
School
Centre for Artificial Intelligence and Machine Learning (CAIML)
Funders
Edith Cowan University
Copyright
Subscription content
Comments
Mirnateghi, N., Tahir, S., Islam, S. M. S., & Shah, S. A. A. (2025). M-AIDE: Mechanistic agentic interpretability for decoding empathy in language models. In 2025 40th International Conference on Image and Vision Computing New Zealand (IVCNZ) (pp. 1-6). IEEE. https://doi.org/10.1109/IVCNZ67716.2025.11281845