Learning Prosodic Patterns for Mandarin Speech Synthesis

Authors: Yiqiang Chen, Wen Gao, Tingshao Zhu, Charles Ling

Abstract

Higher-quality synthesized speech is required for widespread use of text-to-speech (TTS) technology. The prosodic pattern, which mainly describes the variation of pitch, is the key feature whose poor modeling makes synthetic speech sound unnatural and monotonous. The rules used in most Chinese TTS systems are constructed by experts, with weak quality control and low precision. In this paper, we propose a combination of clustering and machine learning techniques to extract prosodic patterns from large databases of real Mandarin speech, in order to improve the naturalness and intelligibility of synthesized speech. Typical prosody models are found by cluster analysis. Several machine learning techniques, including rough sets, artificial neural networks (ANN), and decision trees, are trained to model fundamental frequency and energy contours, which can be used directly in a TTS system based on pitch-synchronous overlap-add (PSOLA). The experimental results show that the synthesized prosodic features closely resemble their original counterparts for most syllables.

Keywords: TTS, clustering, Rough Set, ANN, Decision tree, Bayesian network

Official paper URL: https://doi.org/10.1023/A:1015568521453