Author Identifier
Daniel Kam: http://orcid.org/0000-0003-0709-3484
Date of Award
2025
Document Type
Thesis
Publisher
Edith Cowan University
Degree Name
Doctor of Philosophy
School
School of Science
First Supervisor
Mike Johnstone
Second Supervisor
Patryk Szewczyk
Abstract
Digital forensic examiners often rely on keyword search techniques to locate evidential material. While effective for targeted retrieval, such methods are constrained by their dependence on the lexical matching of correctly selected search terms, potentially overlooking semantically relevant content expressed using alternate or unexpected language. Existing forensic search tools offer limited support for conceptual exploration of indexed textual data.
This thesis presents the design and implementation of an end-to-end topic modelling system for Autopsy. It includes an integrated Autopsy module, data mining of Autopsy’s Solr-based search index, and the application of three unsupervised topic modelling algorithms – Latent Dirichlet Allocation (LDA), BERTopic, and Top2Vec. Each algorithm was applied to three distinct datasets – 20Newsgroup, Hillary Clinton Emails, and Elliot Rodger’s Manifesto. A dual-layer LLM evaluation framework was devised to evaluate topic model outputs, focusing on interpretability, alignment with investigative themes, and forensic utility.
BERTopic and Top2Vec outperformed LDA in both interpretability and theme alignment. BERTopic achieved an average interpretability score (IR) of 1.83 and Top2Vec 1.78, compared to LDA’s 1.37. Notably, only the neural models generated topics rated as highly interpretable (IR = 3). Mann–Whitney U tests confirmed statistically significant differences: LDA vs BERTopic (U = 287.5, p = 0.008, r = –0.310) and LDA vs Top2Vec (U = 308.0, p = 0.019, r = –0.271), indicating medium effect sizes. These findings suggest that the neural topic models—particularly BERTopic—tend to yield more interpretable topics than LDA.
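The pairwise comparisons above can be sketched with a rank-based test. The following is a minimal pure-Python Mann–Whitney U test using the tie-corrected normal approximation, with the effect size computed as r = Z / √N. The abstract does not state how r was derived, so that formula, the function name `mann_whitney_u`, and the sample scores are assumptions for illustration only; they are not the thesis data or its implementation.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U for sample x, Z, p-value, effect size r = Z / sqrt(N)).
    Hypothetical helper: the thesis does not specify its exact procedure.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)

    # Assign average ranks to tied values and accumulate the tie correction.
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every observation identical: no evidence of a difference
        return u_x, 0.0, 1.0, 0.0

    z = (u_x - mean_u) / math.sqrt(var_u)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, z, p, z / math.sqrt(n)


if __name__ == "__main__":
    # Illustrative interpretability ratings on a 0-3 scale (made-up values).
    model_a = [1, 1, 2, 1, 2, 1, 0, 2]
    model_b = [2, 3, 2, 3, 1, 2, 2, 3]
    u, z, p, r = mann_whitney_u(model_a, model_b)
    print(f"U = {u}, Z = {z:.3f}, p = {p:.3f}, r = {r:.3f}")
```

With this convention, r is negative when the first sample tends to rank lower than the second. Statistical packages may report the smaller of the two U values or apply a continuity correction, so figures from this sketch need not match other tools exactly.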
More strong topic label-to-theme matches were observed between dataset-aligned topic models and investigative themes, and the association was statistically significant (χ²(4) = 15.36, p = .004). Standardised residuals reinforced this finding, with models trained on Elliot Rodger’s Manifesto showing the strongest association (+2.49) with corresponding investigative themes, followed by 20Newsgroup (+1.54) and Clinton (+0.46).
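The test of association above can be illustrated with a short sketch. It computes a chi-square test of independence and per-cell standardised residuals, taken here as Pearson residuals, (observed − expected) / √expected. The abstract does not say which residual definition was used, so that choice is an assumption, as are the function names and the toy 3 × 3 table, which is invented for illustration and is not the thesis data. The p-value helper uses the closed-form survival function that exists for even degrees of freedom, which covers the df = 4 case reported.

```python
import math


def chi_square_independence(table):
    """Return (chi2, dof, Pearson residuals) for a contingency table.

    `table` is a list of rows of observed counts. Hypothetical helper.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            res = (observed - expected) / math.sqrt(expected)
            chi2 += res ** 2  # chi2 is the sum of squared Pearson residuals
            res_row.append(res)
        residuals.append(res_row)

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof, residuals


def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-square variate; even dof only."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form used here requires a positive even dof")
    half = x / 2
    term, series = 1.0, 0.0
    for i in range(dof // 2):
        series += term          # term == half**i / i!
        term *= half / (i + 1)
    return math.exp(-half) * series


if __name__ == "__main__":
    # Toy counts: rows = training dataset, columns = investigative theme.
    toy = [[12, 4, 4],
           [4, 12, 4],
           [4, 4, 12]]
    chi2, dof, res = chi_square_independence(toy)
    print(f"chi2({dof}) = {chi2:.2f}, p = {chi2_sf_even_dof(chi2, dof):.4f}")
    print("residuals:", [[round(r, 2) for r in row] for row in res])
```

As a consistency check, `chi2_sf_even_dof(15.36, 4)` evaluates to roughly 0.004, which agrees with the p-value reported for χ²(4) = 15.36.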
Whilst neural models achieved higher interpretability scores and retained a greater proportion of meaningful (post-filtered) topic words, as measured by Retained Word(s) Percentage (RWP), LDA-derived topic words demonstrated higher rates of lexical matches when reapplied as search terms for exact string matching against indexed data. These findings reflect the impact of textual processing rather than model architecture alone.
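The two measures contrasted above can be made concrete with a small sketch. It assumes RWP is the percentage of a topic's words that survive filtering, and that a lexical match means a case-insensitive whole-word hit in at least one document. Both definitions are assumptions inferred from the abstract, and the function names, the noise-word set, and the sample documents are hypothetical. The thesis filters words and searches the Autopsy index itself; here a fixed set and an in-memory list stand in for those steps.

```python
import re


def retained_word_percentage(topic_words, noisy_words):
    """Percentage of topic words kept after removing words flagged as noisy."""
    if not topic_words:
        return 0.0
    retained = [w for w in topic_words if w not in noisy_words]
    return 100.0 * len(retained) / len(topic_words)


def exact_match_rate(terms, documents):
    """Percentage of terms found as whole words in at least one document."""
    if not terms:
        return 0.0
    hits = 0
    for term in terms:
        # re.escape keeps punctuation in a term from acting as regex syntax.
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        if any(pattern.search(doc) for doc in documents):
            hits += 1
    return 100.0 * hits / len(terms)


if __name__ == "__main__":
    # Made-up topic words and documents, for illustration only.
    topic = ["server", "http", "xx", "email"]
    noise = {"xx"}
    docs = ["The email arrived on Monday.", "Nothing relevant here."]
    kept = [w for w in topic if w not in noise]
    print("RWP:", retained_word_percentage(topic, noise))
    print("Match rate:", exact_match_rate(kept, docs))
```

Under these definitions the two measures are independent: a topic can retain most of its words after filtering and still match few documents if those words do not occur verbatim in the indexed text.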
Enriched through LLM evaluations, the topic modelling system (SolrSleuth) demonstrates a semi-automated approach designed to support the sensemaking activities of digital forensic examiners working with large volumes of textual data. By surfacing latent topic words, automatically labelling topics, filtering noisy terms and categorising topics by investigative theme, SolrSleuth aims to assist examiners in planning and executing keyword searches for evidence discovery.
Key contributions include: (1) the development of SolrSleuth, a plugin for Autopsy incorporating a custom ETL pipeline for topic modelling of the search index; (2) the integration of LLM-assisted evaluations as a proxy for human judgement; (3) an empirical comparison of three unsupervised topic modelling algorithms (LDA, BERTopic, and Top2Vec) across diverse forensic datasets; and (4) a demonstration of SolrSleuth’s potential utility in assisting with simulated investigative tasks.
Access Note
Access to this thesis is embargoed until 11th February 2030
DOI
10.25958/8mmg-8p59
Recommended Citation
Kam, D. (2025). Forensic corpus analysis via topic modelling of the Autopsy search index. Edith Cowan University. https://doi.org/10.25958/8mmg-8p59