Research outputs 2022 to 2026

M-AIDE: Mechanistic agentic interpretability for decoding empathy in language models

Author Identifier (ORCID)

Nima Mirnateghi: https://orcid.org/0000-0002-1814-7452

Syed Mohammed Shamsul Islam: https://orcid.org/0000-0002-3200-2903

Syed Afaq Ali Shah: https://orcid.org/0000-0003-2181-8445

Abstract

Large language models (LLMs) have transformed conversational agents, powering applications from everyday assistants to domain-specific systems. Yet, their internal mechanisms remain opaque, limiting our understanding of how complex behaviours are represented. Therapeutic conversational agents provide a compelling setting to study this problem, as they require models to encode empathic behaviours. For a better understanding of these behaviours, we present M-AIDE, an agentic framework designed to systematically interpret empathy-related features in LLMs. We apply this technique to therapeutic dialogue data, specifically to understand how LLMs may encode perceived empathy. Our approach leverages mechanistic interpretability to uncover artificial empathy features aligned with psychological categories of empathy. M-AIDE integrates automated interpretability into its pipeline, enabling large-scale classification and explanation of discovered features without exhaustive manual inspection. Our experiments reveal a gradient of representation: low-level features predominate at early layers, while distinct empathy features emerge as layers become deeper. The source code is available at: https://github.com/ai-voyage/M-AIDE.git.

Keywords

Artificial empathy, large language models, mechanistic interpretability

Document Type

Conference Proceeding

Date of Publication

1-1-2025

Publication Title

2025 40th International Conference on Image and Vision Computing New Zealand (IVCNZ)

Publisher

IEEE

School

Centre for Artificial Intelligence and Machine Learning (CAIML)

Funders

Edith Cowan University

Comments

Mirnateghi, N., Tahir, S., Islam, S. M. S., & Shah, S. A. A. (2025). M-AIDE: Mechanistic agentic interpretability for decoding empathy in language models. In 2025 40th International Conference on Image and Vision Computing New Zealand (IVCNZ) (pp. 1-6). IEEE. https://doi.org/10.1109/IVCNZ67716.2025.11281845

Copyright

subscription content

Link to publisher version (DOI)

10.1109/IVCNZ67716.2025.11281845
