Few-shots Voice Cloning in Noisy Acoustic Conditions with Domain Adversarial Training

Authors: Jian Cong, Shan Yang, Lei Xie
Abstract: Data-efficient voice cloning can synthesize speech in the voice of a target speaker from only a few seconds of clean reference audio. In real scenarios, however, users usually record in public places, so the recordings are often accompanied by background noise. We propose using domain adversarial training for robust voice cloning from a few samples. Results indicate that, whether the reference audio is clean or noisy, the proposed method achieves similar performance in terms of naturalness and similarity. It also outperforms pre-denoising the reference audio with an external speech enhancement module.

1. Examples of clean audio, noisy audio, and denoised audio:

Speaker | Clean audio (C) | Noisy audio (N) | Denoised audio (D)
001
045
012
093

2. The results of speaker adaptation:

Test set | Original audio | Synthesized speech (baseline model) | Synthesized speech (proposed model)
001-C
001-N None
001-D *
045-C
045-N None
045-D *
012-C
012-N None
012-D *
093-C
093-N None
093-D *

3. The results of speaker encoding:

Test set | Original audio | Synthesized speech (baseline model) | Synthesized speech (proposed model)
001-C
001-N None
001-D *
045-C
045-N None
045-D *
012-C
012-N None
012-D *
093-C
093-N None
093-D *