StarGAN-VC: non-parallel many-to-many voice conversion with StarGAN

Conference: 2018 IEEE Spoken Language Technology Workshop (SLT)
Affiliation: NTT
Authors: Hirokazu Kameoka, Takuhiro Kaneko

abstract

This paper proposes StarGAN-VC. Advantages: (1) it needs no parallel data, transcriptions, or time alignment; (2) a single generator learns many-to-many mappings; (3) conversion can run in (near) real time; (4) only several minutes of training utterances are required. In subjective evaluations it outperforms an existing GAN-based VC variant.

1. introduction

Methods that do not need parallel data:
(1) ASR-based: a good recognition network, plus i-vectors to represent speaker identity.
Drawback: depends on having a good ASR network;
(2) VAE-VC: VAEs are the probabilistic counterpart of AEs, and a CVAE (conditional variational autoencoder) is a VAE with an extra conditioning input. Content (text) information as input plus an additional attribute label $c$ completes the source-to-target conversion.
Drawback: the decoder output is over-smoothed, which lowers speech quality.
(3) GAN-based:
The authors' group is also the one that proposed CycleGAN-VC, which uses an adversarial loss + cycle-consistency loss + identity-mapping loss to learn one-to-one mappings. To use CycleGAN-VC for many-to-many conversion across domains, a separate generator/discriminator pair must be trained for every speaker mapping pair; but in reality these domains overlap, since they all represent speech, so information can be shared between different attribute domains. As the number of attribute domains increases, the number of model parameters grows quadratically (on the order of $K^2$ mapping pairs for $K$ speakers), which also makes it hard to train with only a small amount of data.
Drawback (points I have not fully figured out: which must stay the same and which need not): as with CVAE, the attribute (speaker) at test time must have been seen in training (must the source also be fixed??). For CVAE, the attribute $c$ must have been seen; for CycleGAN-VC, the source must be the same at training and test time.

StarGAN was also first proposed for images. It needs only a single encoder-decoder (generator) to achieve many-to-many conversion, and the generator is conditioned on an extra attribute $c$ that controls generation. At test time there is no restriction on the attribute of the input speech.
(4) A VAE-GAN-VC structure was proposed to overcome the VAE drawback [23], but its speech quality and conversion effect are worse than this paper's.
(5) VQ-VAE [27] (vector quantized VAE) overcomes the VAE drawback by using a WaveNet model (the WaveNet learns the sample distribution from the samples produced by the generator; not entirely clear to me??), and there is also a faster version [43]. Overall, though, this approach is computationally expensive, hard to run in real time, and needs a lot of training data.

2. CycleGAN-VC

$x \in R^{Q \times N}$, $y \in R^{Q \times M}$
where $Q$ is the feature dimension, $N$ and $M$ are the lengths (numbers of frames) of the corresponding utterances, $G$ denotes the mapping $x \rightarrow y$, and $F$ denotes the mapping $y \rightarrow x$.

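The loss terms of CycleGAN-VC can be summarized as follows; this is a sketch of the standard formulation as I understand it (the discriminators $D_X$, $D_Y$ and the weights $\lambda_{cyc}$, $\lambda_{id}$ follow the usual notation and are not copied from this note):

$$
\begin{aligned}
\mathcal{L}_{adv}(G, D_Y) &= \mathbb{E}_{y}\left[\log D_Y(y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D_Y(G(x))\right)\right] \\
\mathcal{L}_{cyc}(G, F) &= \mathbb{E}_{x}\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y}\left[\lVert G(F(y)) - y \rVert_1\right] \\
\mathcal{L}_{id}(G, F) &= \mathbb{E}_{y}\left[\lVert G(y) - y \rVert_1\right] + \mathbb{E}_{x}\left[\lVert F(x) - x \rVert_1\right] \\
\mathcal{L}_{full} &= \mathcal{L}_{adv}(G, D_Y) + \mathcal{L}_{adv}(F, D_X) + \lambda_{cyc}\,\mathcal{L}_{cyc}(G, F) + \lambda_{id}\,\mathcal{L}_{id}(G, F)
\end{aligned}
$$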

3. StarGAN-VC

Explanation of the one-hot vector: each is filled with 1 at the index of a class in a certain category and with 0 everywhere else.
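A minimal sketch of building such an attribute code $c$, assuming the four VCC2018 speakers used later in Section 4 (the helper name and speaker ordering are my own choices):

```python
import numpy as np

# Hypothetical speaker inventory; any fixed ordering defines the one-hot index.
SPEAKERS = ["SF1", "SF2", "SM1", "SM2"]

def speaker_code(name: str) -> np.ndarray:
    """Return the one-hot attribute vector c: 1 at the speaker's index, 0 elsewhere."""
    c = np.zeros(len(SPEAKERS), dtype=np.float32)
    c[SPEAKERS.index(name)] = 1.0
    return c

print(speaker_code("SM1"))  # [0. 0. 1. 0.]
```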

training objective

Adversarial Loss
Domain Classification Loss
Cycle Consistency Loss

$\lambda_{cls}$, $\lambda_{cyc}$, $\lambda_{id}$ are all positive regularization (weighting) hyperparameters.
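A sketch of the full training objectives as I understand them from the paper (generator $G(x, c)$, real/fake discriminator $D(y, c)$, and domain classifier $C$ with output distribution $p_C(c \mid y)$; each of $G$, $D$, $C$ minimizes its own objective, and I write the cycle-consistency and identity terms with an L1 norm):

$$
\begin{aligned}
\mathcal{L}_{adv}^{D}(D) &= -\,\mathbb{E}_{c,\, y \sim p(y|c)}\left[\log D(y, c)\right] - \mathbb{E}_{x,\, c}\left[\log\left(1 - D(G(x, c), c)\right)\right] \\
\mathcal{L}_{adv}^{G}(G) &= -\,\mathbb{E}_{x,\, c}\left[\log D(G(x, c), c)\right] \\
\mathcal{L}_{cls}^{C}(C) &= -\,\mathbb{E}_{c,\, y \sim p(y|c)}\left[\log p_C(c \mid y)\right] \\
\mathcal{L}_{cls}^{G}(G) &= -\,\mathbb{E}_{x,\, c}\left[\log p_C(c \mid G(x, c))\right] \\
\mathcal{L}_{cyc}(G) &= \mathbb{E}_{c',\, x \sim p(x|c'),\, c}\left[\lVert G(G(x, c), c') - x \rVert_1\right] \\
\mathcal{L}_{id}(G) &= \mathbb{E}_{c',\, x \sim p(x|c')}\left[\lVert G(x, c') - x \rVert_1\right] \\
I_G(G) &= \mathcal{L}_{adv}^{G}(G) + \lambda_{cls}\,\mathcal{L}_{cls}^{G}(G) + \lambda_{cyc}\,\mathcal{L}_{cyc}(G) + \lambda_{id}\,\mathcal{L}_{id}(G) \\
I_D(D) &= \mathcal{L}_{adv}^{D}(D), \qquad I_C(C) = \mathcal{L}_{cls}^{C}(C)
\end{aligned}
$$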

model architecture

(Figures in the original: network architectures of the generator, discriminator, and domain classifier.)

4. subjective evaluation

Data: VCC2018. Two male and two female speakers are selected (SM1/SM2, SF1/SF2), with 116 sentences per speaker (81 for training, about 5 minutes; 35 for evaluation, about 2 minutes), so there are 4 × 3 = 12 different source-target combinations; $c$ is a 4-dimensional one-hot vector.

Feature extraction: a spectral envelope, a logarithmic fundamental frequency (log F0), and aperiodicities (APs) were extracted every 5 ms using the WORLD analyzer [46]. 36 mel-cepstral coefficients (MCCs) were then extracted from each spectral envelope. The F0 contours were converted using the logarithm Gaussian normalized transformation described in [51]. The aperiodicities were used directly without modification.
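A minimal sketch of this feature-extraction pipeline, assuming the pyworld and pysptk bindings for WORLD/SPTK (the sampling rate, the all-pass constant ALPHA, and the convert_f0 helper are my assumptions, not taken from the paper):

```python
import numpy as np
import pyworld
import pysptk

FRAME_PERIOD_MS = 5.0   # 5 ms analysis shift, as in the paper
MCC_ORDER = 35          # order 35 -> 36 mel-cepstral coefficients
FS = 22050              # assumed sampling rate
ALPHA = 0.455           # assumed all-pass constant for this sampling rate

def extract_features(wav: np.ndarray, fs: int = FS):
    """WORLD analysis: F0, 36-dim MCCs from the spectral envelope, and APs."""
    wav = wav.astype(np.float64)
    f0, t = pyworld.dio(wav, fs, frame_period=FRAME_PERIOD_MS)
    f0 = pyworld.stonemask(wav, f0, t, fs)                 # F0 refinement
    sp = pyworld.cheaptrick(wav, f0, t, fs)                # spectral envelope
    ap = pyworld.d4c(wav, f0, t, fs)                       # aperiodicities (kept unmodified)
    mcc = pysptk.sp2mc(sp, order=MCC_ORDER, alpha=ALPHA)   # spectral envelope -> MCCs
    return f0, mcc, ap

def convert_f0(f0, src_mean, src_std, tgt_mean, tgt_std):
    """Logarithm Gaussian normalized transformation of the F0 contour.

    src_/tgt_ statistics are the mean and std of log F0 over voiced frames.
    """
    f0_conv = np.zeros_like(f0)
    voiced = f0 > 0
    f0_conv[voiced] = np.exp(
        (np.log(f0[voiced]) - src_mean) / src_std * tgt_std + tgt_mean
    )
    return f0_conv
```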

Baseline model: VAE-GAN-VC [23]
Evaluation: an AB test (20 utterances each) comparing speech quality, and an ABX test (24 utterances each) comparing speaker similarity. Worth borrowing: there is no need to compare against many systems; a single baseline model is enough.
The paper also shows plots of the MCC sequences (over cepstral order) for comparison (though I do not really see what such plots can tell us for non-parallel data??). To confirm: are the GAN input and output aligned frame by frame in time??