Sensors and Materials

Print: ISSN 0914-4935
Online: ISSN 2435-0869

Sensors and Materials, Volume 38, Number 3(3) (2026)
Copyright(C) MYU K.K.

pp. 1447-1461
S&M4387 Research paper
https://doi.org/10.18494/SAM5946
Published: March 23, 2026

Design of a Mandarin Spoken Dialogue System Using Tacotron2-based Speech Synthesis with Dialogist-aware System-speaking-style Switching

Ing-Jr Ding, Po-Jung Chen, Xin-Bau Li, and Yih-Her Yan

(Received September 24, 2025; Accepted February 20, 2026)

Keywords: spoken dialogue system, Tacotron2 speech synthesis, model fine tuning, synthetic speech evaluation, YOLO dialogist identification

As the global population ages, the demand for long-term care systems will continue to rise, which calls for solutions to the manpower shortage and the excessive burden placed on traditional human caregivers. Among AI-based care systems, chat systems that enable close interaction with elderly users have become an essential tool. However, for the elderly, including those in Taiwan, text-in/text-out chat systems that require typing are complicated and difficult to use. To tackle this issue, we develop a Mandarin spoken dialogue system in which interactions take place in a simple, direct speech-to-speech mode. In addition, to provide emotionally engaging voice interactions that offer psychological comfort and social companionship, the system includes dialogist-aware system-speaking-style switching: according to the identity of the dialogist, the synthetic speech response of the system is rendered in the style of a target speaker matched to that dialogist. The dialogue system developed in this study comprises three computing modules: automatic speech recognition (ASR), semantic understanding by a large language model (LLM), and text-to-speech (TTS) speech synthesis. For the first two modules, the open-source Google ASR and Google Gemma LLM are integrated into the dialogue system. For TTS, the well-known Tacotron2 speech synthesis approach is adopted so that system-speaking-style switching can also be performed. Tacotron2, introduced by Google, is known for learning effectively from an available speech database through deep learning.
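The three-module flow with dialogist-aware style switching can be pictured with a minimal sketch. It assumes each module is exposed as a plain callable; the class `DialogueSystem`, its field names, and the stub functions are hypothetical stand-ins for illustration and are not the authors' implementation or any real Google ASR, Gemma, or Tacotron2 API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DialogueSystem:
    """Hypothetical wiring of ASR -> LLM -> TTS with per-dialogist voices."""
    recognize: Callable[[bytes], str]          # ASR stand-in: audio -> text
    respond: Callable[[str], str]              # LLM stand-in: user text -> reply text
    voices: Dict[str, Callable[[str], bytes]]  # dialogist id -> matched TTS voice
    default_voice: str                         # voice used for unknown dialogists

    def turn(self, audio: bytes, dialogist: str) -> bytes:
        text = self.recognize(audio)
        reply = self.respond(text)
        # Style switching: select the voice matched to the identified dialogist,
        # falling back to the default voice when no match is registered.
        voice_id = dialogist if dialogist in self.voices else self.default_voice
        return self.voices[voice_id](reply)


def make_stub_voice(name: str) -> Callable[[str], bytes]:
    """Stub TTS that only tags the text with the voice name (no real audio)."""
    return lambda text: f"[{name}] {text}".encode("utf-8")


# Stub modules so the sketch runs without any speech or LLM backend.
system = DialogueSystem(
    recognize=lambda audio: audio.decode("utf-8"),
    respond=lambda text: f"You said: {text}",
    voices={"base": make_stub_voice("base"), "alice": make_stub_voice("alice")},
    default_voice="base",
)

print(system.turn("你好".encode("utf-8"), "alice").decode("utf-8"))
```

In a real system, the three callables would wrap the actual recognizer, language model, and fine-tuned synthesis models; keeping the voices in a lookup table is one simple way to make the switch depend only on the dialogist identity.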
In this study, an initial Tacotron2 TTS model is first trained on the Mandarin speech database ‘Biaobei’; a fine-tuning procedure is then designed that adjusts the initial model parameters using a small amount of speech data from the specific target speaker. For dialogist recognition, You Only Look Once (YOLO)-based face detection is performed to classify the dialogist identity. Once the dialogist is recognized, the fine-tuned Tacotron2 model matched to this dialogist is used to perform speech synthesis. To evaluate the naturalness of the synthetic speech, several signal-analysis metrics, including Mel-cepstral distortion (MCD), linear prediction code distortion (LPCD), and peak signal-to-noise ratio (PSNR), are computed to investigate their effectiveness, and their results are compared with those of the human-rated mean opinion score (MOS) approach.
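Two of the objective metrics named above have widely used textbook definitions, sketched here in plain Python. The abstract does not give the exact formulas or inputs used in the paper, so this is an assumption: MCD is computed per frame as (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), skipping the 0th (energy) coefficient, on frames that are assumed to be already time-aligned, and PSNR is computed as 10 * log10(peak^2 / MSE) on a generic signal with an assumed peak value. The function names are hypothetical.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    ref: Sequence[Sequence[float]], syn: Sequence[Sequence[float]]
) -> float:
    """Mean MCD in dB over frames; assumes ref and syn are time-aligned."""
    if len(ref) != len(syn) or not ref:
        raise ValueError("ref and syn need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for r, s in zip(ref, syn):
        if len(r) != len(s):
            raise ValueError("frames must have the same number of coefficients")
        # Skip index 0 (energy term), a common convention for MCD.
        squared = sum((a - b) ** 2 for a, b in zip(r[1:], s[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref)


def psnr(ref: Sequence[float], test: Sequence[float], peak: float = 1.0) -> float:
    """PSNR in dB; `peak` is the assumed maximum amplitude of the signal."""
    if len(ref) != len(test) or not ref:
        raise ValueError("ref and test need the same non-zero length")
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy inputs only; these are not data from the paper.
reference_frames = [[0.0, 1.0, 2.0], [0.0, 0.5, 0.5]]
synthetic_frames = [[0.0, 1.0, 1.0], [0.0, 0.5, 0.5]]
print(mel_cepstral_distortion(reference_frames, synthetic_frames))
print(psnr([1.0, 0.0, -1.0, 0.0], [1.1, 0.1, -0.9, 0.1]))
```

Lower MCD and higher PSNR indicate a closer match to the reference; whether these trends agree with listener MOS ratings is the question the study examines.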

Corresponding author: Ing-Jr Ding

This work is licensed under a Creative Commons Attribution 4.0 International License.

Cite this article
Ing-Jr Ding, Po-Jung Chen, Xin-Bau Li, and Yih-Her Yan, Design of a Mandarin Spoken Dialogue System Using Tacotron2-based Speech Synthesis with Dialogist-aware System-speaking-style Switching, Sens. Mater., Vol. 38, No. 3, 2026, p. 1447-1461.