Author Identifier
Joseph Kane: https://orcid.org/0000-0001-9728-4529
Date of Award
2025
Document Type
Thesis - ECU Access Only
Publisher
Edith Cowan University
Degree Name
Doctor of Philosophy (Integrated)
School
School of Science
First Supervisor
Mike Johnstone
Second Supervisor
Patryk Szewczyk
Abstract
Text-to-speech conversion has been extensively researched and developed since the advent of integrated circuits in computers in 1958. Over sixty years later, most computer-generated voices remained easily identifiable as robotic. The aim of this study was to enhance the realism of computer-generated text-to-speech systems. Increased realism improves artificial voices for individuals reliant on assistive technologies. This research demonstrated that variable, modulated timing of syllables was the most effective way of making a robotic-sounding voice more naturally human. The variable timings reflected the human need to draw breath, with faster speech and longer breaks between words for longer sentences. The research identified classification engines designed for prosody and emotional capture, and examined studies capturing paralinguistic elements capable of conveying more meaning than the literal interpretation of spoken words. Emotive text-to-speech engines were also analysed to leverage prior knowledge of techniques required for modifiable pitch, timbre, and tempo, thereby creating a richer audio experience within the text-to-speech algorithm. Through laboratory experiments, this research created a modular platform for digital speech enhancement. The study filled gaps in academic knowledge, contributing to the development of a flexible and scalable approach to text-to-speech enhancement. Applications of this algorithm include improving high-definition audio codecs for telephony, restoring old recordings, and enhancing human-computer interfaces. Such advancements have the potential to lower barriers to computing and improve accessibility for a wide range of users.
DOI
10.25958/gx2w-rk61
Access Note
Access to this thesis is embargoed until 7th August 2026
Recommended Citation
Kane, J. (2025). Artificial Intelligence audio upscaling by the addition of prosody. Edith Cowan University. https://doi.org/10.25958/gx2w-rk61