Cross-Lingual Neural Network Speech Synthesis Based on Multiple Embeddings.
DOI: https://doi.org/10.9781/ijimai.2021.11.005

Keywords: Cross-lingual, Artificial Neural Networks, Speech Synthesis, Vocoder

Abstract
The paper presents a novel architecture and method for speech synthesis in multiple languages, in the voices of multiple speakers and in multiple speaking styles, even when no speech from a particular speaker in the target language was present in the training data. The method is based on the application of neural network embeddings to combinations of speaker and style IDs, as well as to phones in particular phonetic contexts, without any prior linguistic knowledge of their phonetic properties. This enables the network not only to efficiently capture similarities and differences between speakers and speaking styles, but also to establish appropriate relationships between phones belonging to different languages, and ultimately to produce synthetic speech in the voice of a given speaker in a language that he/she has never spoken. The validity of the proposed approach has been confirmed through experiments with models trained on speech corpora of American English and Mexican Spanish. It has also been shown that the proposed approach supports the use of neural vocoders, i.e., that they are able to produce synthetic speech of good quality even in languages they were not trained on.
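To make the central idea more concrete, the following is a minimal sketch in Python/PyTorch of an acoustic model conditioned on multiple learned embeddings. It is not the authors' implementation: all layer sizes, vocabulary sizes, feature dimensions and names are illustrative assumptions, and phonetic context is reduced to a generic feature vector.

import torch
import torch.nn as nn

class MultiEmbeddingAcousticModel(nn.Module):
    """Sketch: acoustic model conditioned on learned embeddings.

    Phone identities and (speaker, style) combinations are mapped to
    dense vectors by embedding layers trained jointly with the rest of
    the network, so no hand-crafted phonetic descriptors or speaker
    features are required. All sizes are illustrative assumptions.
    """

    def __init__(self, n_phones=120, n_speaker_styles=20,
                 phone_dim=32, speaker_style_dim=16,
                 context_dim=40, vocoder_dim=187):
        super().__init__()
        # A single shared phone inventory across languages lets the
        # network place acoustically similar phones of different
        # languages close together in the embedding space.
        self.phone_emb = nn.Embedding(n_phones, phone_dim)
        # Each (speaker, style) pair is assigned one ID and embedding.
        self.spk_style_emb = nn.Embedding(n_speaker_styles, speaker_style_dim)
        self.net = nn.Sequential(
            nn.Linear(phone_dim + speaker_style_dim + context_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            # Frame-level vocoder parameters (e.g. spectral envelope,
            # F0 and aperiodicity for a WORLD-style vocoder).
            nn.Linear(512, vocoder_dim),
        )

    def forward(self, phone_ids, spk_style_ids, context_feats):
        x = torch.cat([self.phone_emb(phone_ids),
                       self.spk_style_emb(spk_style_ids),
                       context_feats], dim=-1)
        return self.net(x)

# Example: predict vocoder features for a batch of 8 frames.
model = MultiEmbeddingAcousticModel()
phones = torch.randint(0, 120, (8,))
spk_styles = torch.randint(0, 20, (8,))
context = torch.randn(8, 40)
out = model(phones, spk_styles, context)
print(out.shape)  # torch.Size([8, 187])

Because the phone and speaker/style embeddings are trained jointly, one can in principle keep the phone sequence of one language while selecting the speaker/style ID of a voice seen only in another, which is the mechanism that allows a voice to "speak" a language absent from its training data.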