M-AIDE: Mechanistic agentic interpretability for decoding empathy in language models

Author Identifier (ORCID)

Nima Mirnateghi: https://orcid.org/0000-0002-1814-7452

Syed Mohammed Shamsul Islam: https://orcid.org/0000-0002-3200-2903

Syed Afaq Ali Shah: https://orcid.org/0000-0003-2181-8445

Abstract

Large language models (LLMs) have transformed conversational agents, powering applications from everyday assistants to domain-specific systems. Yet their internal mechanisms remain opaque, limiting our understanding of how complex behaviours are represented. Therapeutic conversational agents provide a compelling setting to study this problem, as they require models to encode empathic behaviours. To better understand these behaviours, we present M-AIDE, an agentic framework designed to systematically interpret empathy-related features in LLMs. We apply this framework to therapeutic dialogue data, specifically to understand how LLMs may encode perceived empathy. Our approach leverages mechanistic interpretability to uncover artificial empathy features aligned with psychological categories of empathy. M-AIDE integrates automated interpretability into its pipeline, enabling large-scale classification and explanation of discovered features without exhaustive manual inspection. Our experiments reveal a gradient of representation: low-level features predominate in early layers, while distinct empathy features emerge in deeper layers. The source code is available at https://github.com/ai-voyage/M-AIDE.git.

Keywords

Artificial empathy, large language models, mechanistic interpretability

Document Type

Conference Proceeding

Date of Publication

1-1-2025

Publication Title

2025 40th International Conference on Image and Vision Computing New Zealand (IVCNZ)

Publisher

IEEE

School

Centre for Artificial Intelligence and Machine Learning (CAIML)

Funders

Edith Cowan University

Comments

Mirnateghi, N., Tahir, S., Islam, S. M. S., & Shah, S. A. A. (2025). M-AIDE: Mechanistic agentic interpretability for decoding empathy in language models. In 2025 40th International Conference on Image and Vision Computing New Zealand (IVCNZ) (pp. 1-6). IEEE. https://doi.org/10.1109/IVCNZ67716.2025.11281845

Copyright

Subscription content

Link to publisher version (DOI)

10.1109/IVCNZ67716.2025.11281845