Audio samples from "Rapid Style Adaptation Using Residual Error Embedding for Expressive Speech Synthesis"

Authors: Xixin Wu, Yuewen Cao, Mu Wang, Songxiang Liu, Shiyin Kang, Zhiyong Wu, Xunying Liu, Dan Su, Dong Yu, Helen Meng
Abstract: Synthesizing expressive speech with appropriate prosodic variations, e.g., various styles, still has much room for improvement. Previous methods have explored using manual annotations as conditioning attributes to provide variation information. However, the related training data are expensive to obtain and the annotated style codes can be ambiguous and unreliable. In this paper, we explore utilizing the residual error as a conditioning attribute. The residual error is the difference between the prediction of the trained average model and the ground truth. We encode the residual error into a style embedding via a neural network-based error encoder. The style embedding is then fed to the target synthesis model to provide information for modeling various style distributions more accurately. The average model and the error encoder are jointly optimized with the target synthesis model. Our proposed method has two advantages: 1) the embedding is learned automatically, with no need for manual style annotations, which helps overcome data sparsity and ambiguity; 2) for any unseen audio utterance, the style embedding can be generated efficiently, enabling rapid adaptation to the desired style with only one adaptation utterance. Experimental results show that our proposed method outperforms the baseline model in both speech quality and style similarity.
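The following is a minimal PyTorch sketch of the residual-error style embedding described in the abstract. The module names, layer sizes, mel-spectrogram feature dimension, and mean-pooling readout are illustrative assumptions for this page, not the paper's exact architecture.

```python
# Minimal sketch of the residual-error style embedding. All architectural
# choices below (GRU encoder, sizes, mean pooling) are assumptions.
import torch
import torch.nn as nn

class ErrorEncoder(nn.Module):
    """Encodes a residual-error sequence into a fixed-size style embedding."""
    def __init__(self, acoustic_dim=80, emb_dim=32):
        super().__init__()
        self.rnn = nn.GRU(acoustic_dim, 128, batch_first=True)
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, residual):                      # (B, T, acoustic_dim)
        h, _ = self.rnn(residual)                     # (B, T, 128)
        return torch.tanh(self.proj(h.mean(dim=1)))   # (B, emb_dim)

def style_embedding(average_model, error_encoder, linguistic, acoustic_gt):
    # Residual error: average-model prediction minus the ground truth.
    pred = average_model(linguistic)                  # (B, T, acoustic_dim)
    residual = pred - acoustic_gt
    return error_encoder(residual)
```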

System Comparison

"Tacotron": Baseline Tacotron system trained with style-mixed data, without style embeddings.
"EEN-0": The error encoding network (EEN) system conditioned on style embedding with zero values. This system is based on the EEN trained with training data. The acoustic outputs are generated with the target model in the EEN, fed with linguistic input from testing sample and conditioned on style embedding with zero values.
"EEN-adpt": EEN conditioned on style embedding obtained from the adaptation utterance sample. This system is the same as the EEN-0, except that the style embedding is calculated based on the adaptation sample, with the average model and the error encoder.
"Adaptation Utterance": The adaptation utterance based on which the embedding is obtained. The EEN-adpt is adapted to the style of the adaptation utterance

"A few hours later, three weddings had taken place."
Tacotron
EEN-0
EEN-adpt 1 | Adaptation Utterance 1
EEN-adpt 2 | Adaptation Utterance 2

"It's a strange thing, but my love for Hermia has melted like snow."
Tacotron
EEN-0
EEN-adpt 1 | Adaptation Utterance 1
EEN-adpt 2 | Adaptation Utterance 2

Embedding Analysis

We found that the 23rd dimension of the style embedding is highly negatively correlated with the mean F0 value of the training samples. Hence we manipulate the value in that dimension while keeping the other values unchanged, and generate speech from the manipulated embedding vector. The samples can be found below (a code sketch of this manipulation follows the list):

"emb-0": embedding vector extract from a reference utterance.
"emb+0.2": add +0.2 to the 23-rd dimension of the emb-0.
"emb-0.2": add -0.2 to the 23-rd dimension of the emb-0.
"emb-0.4": add -0.4 to the 23-rd dimension of the emb-0.
"emb-0.6": add -0.6 to the 23-rd dimension of the emb-0.

"A few hours later, three weddings had taken place."
emb+0.2
emb-0
emb-0.2
emb-0.4
emb-0.6