|
pp. 1447-1461
S&M4387 Research paper
https://doi.org/10.18494/SAM5946
Published: March 23, 2026
Design of a Mandarin Spoken Dialogue System Using Tacotron2-based Speech Synthesis with Dialogist-aware System-speaking-style Switching
Ing-Jr Ding, Po-Jung Chen, Xin-Bau Li, and Yih-Her Yan
(Received September 24, 2025; Accepted February 20, 2026)
Keywords: spoken dialogue system, Tacotron2 speech synthesis, model fine tuning, synthetic speech evaluation, YOLO dialogist identification
As global population aging intensifies, the demand for long-term care systems will continue to rise, and solutions are needed for the manpower shortage and the excessive burden placed on traditional human care. Among care systems that use AI techniques, a chat system that enables close interaction between older adults and the system has become an essential tool. However, for older adults, including those in Taiwan, text-based AI chat systems with a text-in–text-out interaction model are complicated and difficult to use. To tackle this issue, we develop a Mandarin spoken dialogue system in which chat interactions take place in a simple, direct speech-to-speech mode. In addition, to provide emotionally connected voice interactions that offer psychological comfort and social companionship, the system includes dialogist-aware system-speaking-style switching: according to the identity of the dialogist, the system responds with synthetic speech in the style of a target speaker matched to that dialogist. The Mandarin spoken dialogue system developed in this study comprises three computing modules: automatic speech recognition (ASR), semantic understanding by a large language model (LLM), and text-to-speech (TTS) speech synthesis. For the first two modules, the open-source Google ASR and Google Gemma LLM are integrated into the dialogue system. For TTS, to additionally support system-speaking-style switching, the well-known Tacotron2 speech synthesis approach is adopted. Tacotron2, presented by Google, is known for learning effectively from the available speech database through deep learning.
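The three-module flow with dialogist-aware voice selection can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `respond`, `select_tts`, and `BASE_VOICE` are hypothetical, the ASR, LLM, and TTS modules are passed in as plain callables, and the fallback to a base voice for an unrecognized dialogist is an assumption made here.

```python
from typing import Callable, Dict

# Each module is modeled as a plain callable so that any real engine
# (ASR, LLM, TTS) could be plugged in. These type names are illustrative.
Asr = Callable[[bytes], str]   # speech audio -> recognized text
Llm = Callable[[str], str]     # user text -> reply text
Tts = Callable[[str], bytes]   # reply text -> synthetic speech audio

# Hypothetical key for the non-adapted (initial) TTS model.
BASE_VOICE = "base"


def select_tts(dialogist_id: str, tts_by_dialogist: Dict[str, Tts]) -> Tts:
    """Pick the TTS model matched to the dialogist.

    Assumption: fall back to the base model when no matched model exists.
    """
    return tts_by_dialogist.get(dialogist_id, tts_by_dialogist[BASE_VOICE])


def respond(
    speech_in: bytes,
    dialogist_id: str,
    asr: Asr,
    llm: Llm,
    tts_by_dialogist: Dict[str, Tts],
) -> bytes:
    """Run one speech-to-speech turn: ASR, then LLM, then style-matched TTS."""
    user_text = asr(speech_in)
    reply_text = llm(user_text)
    tts = select_tts(dialogist_id, tts_by_dialogist)
    return tts(reply_text)
```

In this sketch the dialogist identity only affects which TTS callable is chosen; the ASR and LLM stages are shared by all dialogists.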
In this study, an initial Tacotron2 TTS model is first established using the Mandarin speech database ‘Biaobei’; a model fine-tuning procedure is then designed that uses a small amount of speech data from a specific target speaker to adjust the initial model parameters. For dialogist recognition, You Only Look Once (YOLO)-based face detection is performed to classify the dialogist's identity. Once the dialogist is recognized, the fine-tuned Tacotron2 model matched to this dialogist is used to perform speech synthesis. To evaluate the naturalness of the synthetic speech, several signal-analysis evaluation metrics, including Mel-cepstral distortion (MCD), linear prediction code distortion (LPCD), and peak signal-to-noise ratio (PSNR), are computed, and their effectiveness and accuracy are examined in comparison with the human-rated mean opinion score (MOS) approach.
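Two of the objective metrics named above have widely used textbook definitions, which can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, the reference and synthetic frames are assumed to be already time-aligned and of equal length, the 0th cepstral coefficient is excluded from MCD (a common convention), and the PSNR peak defaults to the largest absolute reference value. LPCD is not covered here.

```python
import math
from typing import Optional, Sequence


def mel_cepstral_distortion(
    ref_frames: Sequence[Sequence[float]],
    syn_frames: Sequence[Sequence[float]],
) -> float:
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), with d >= 1.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be non-empty and equally long")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        # Skip index 0 (energy term), following the common convention.
        sq_sum = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * sq_sum)
    return total / len(ref_frames)


def psnr(
    ref: Sequence[float],
    syn: Sequence[float],
    peak: Optional[float] = None,
) -> float:
    """PSNR in dB: 10 * log10(peak^2 / MSE) between two equal-length signals."""
    if not ref or len(ref) != len(syn):
        raise ValueError("signals must be non-empty and equally long")
    mse = sum((r - s) ** 2 for r, s in zip(ref, syn)) / len(ref)
    if mse == 0.0:
        return math.inf  # identical signals
    if peak is None:
        # Assumption: use the largest absolute reference value as the peak.
        peak = max(abs(r) for r in ref)
    if peak <= 0.0:
        raise ValueError("peak must be positive")
    return 10.0 * math.log10(peak ** 2 / mse)
```

Lower MCD and higher PSNR indicate synthetic speech closer to the reference, whereas MOS is a subjective rating collected from human listeners.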
Corresponding author: Ing-Jr Ding
This work is licensed under a Creative Commons Attribution 4.0 International License.
Cite this article: Ing-Jr Ding, Po-Jung Chen, Xin-Bau Li, and Yih-Her Yan, Design of a Mandarin Spoken Dialogue System Using Tacotron2-based Speech Synthesis with Dialogist-aware System-speaking-style Switching, Sens. Mater., Vol. 38, No. 3, 2026, p. 1447-1461. |