ICASSP 2020: AlignTTS: Efficient Feed-Forward Text-to-Speech System Without Explicit Alignment

Summary in one sentence:

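End-to-end autoregressive models synthesize slowly, while non-autoregressive models are fast but learn the alignment poorly. Building on the Baum-Welch algorithm, the paper therefore proposes AlignTTS, which achieves fast synthesis, together with an alignment loss that improves accuracy and naturalness.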

Q&A

Contributions

Parallel generation of the mel-spectrum (thanks to the feed-forward Transformer model).
- Due to the feed-forward network structure, AlignTTS can generate the mel-spectrum in parallel.
- Combined with the WaveGlow vocoder, speech synthesis is more than 50 times faster than real time.
An alignment loss is proposed.
- Durations are predicted by a duration predictor rather than by attention, so the computation can run in parallel.
- The alignment loss is proposed to guide AlignTTS to learn the alignment between the text and mel-spectrum.
- Specifically, the learned alignment is more precise in aligning text and mel-spectrum than the attention alignment from Transformer TTS [9], so AlignTTS learns a more accurate conversion from text to mel-spectrum.

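The model consists of:
- character embedding;
- several Feed-Forward Transformer (FFT) blocks;
- length regulator;
- duration predictor;
- linear layer (generates the mel-spectrum).
- The FFT blocks are split into two groups by the length regulator.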
FFT block:
- Borrows the Transformer structure from "Attention Is All You Need",
- but with a 1D convolution added.
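Length regulator:
- Its input is a duration sequence, which at inference time is produced by the duration predictor.
- It adjusts the alignment between the text and the mel-spectrum according to the given duration sequence, using the same method as FastSpeech.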
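The expansion performed by the length regulator can be sketched as follows. This is a minimal illustration, assuming the FastSpeech-style behaviour of repeating each character's hidden vector once per predicted frame; the function name `length_regulate` and the toy values are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def length_regulate(hidden: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat the hidden vector of character i exactly durations[i] times.

    Assumption: durations are non-negative integer frame counts.
    """
    if len(hidden) != len(durations):
        raise ValueError("need one duration per character")
    frames: List[List[float]] = []
    for vec, d in zip(hidden, durations):
        if d < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so that frames do not share the same list object.
        frames.extend(list(vec) for _ in range(d))
    return frames


# Toy example: three characters with 2-dimensional hidden states.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]  # the second character gets no frames
frames = length_regulate(hidden, durations)
```

The output length equals the sum of the durations, so the text-side sequence is stretched to the mel-spectrum length before the second group of FFT blocks.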
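Duration predictor:
- Its structure resembles the feed-forward network: a character embedding, several FFT blocks, and a linear layer (which outputs scalars, one per character, each giving that character's duration).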
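Mixture density network (called the "mix density network" in the paper):
- The FFT blocks and the duration predictor need correct alignment information during training, so this network is used to learn it.
- It is trained with the alignment loss.
- It is a stack of linear layers, each with layer normalization, ending with ReLU + dropout.
- It sits on top of the FFT blocks on the character-embedding side.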
- Its output is used to derive the duration sequence.
- It is not used at inference time.
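To make the alignment loss more concrete, here is a rough sketch of a Baum-Welch-style forward recursion in the log domain. It assumes that the mixture density network yields one diagonal Gaussian (mean and variance) per character and that alignments are monotonic (each frame either stays on the current character or moves to the next one). The helper names and the toy numbers are hypothetical; the paper's exact formulation may differ.

```python
import math
from typing import List, Sequence

NEG_INF = float("-inf")


def log_add(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def gaussian_log_prob(y: Sequence[float], mean: Sequence[float],
                      var: Sequence[float]) -> float:
    """Log-density of a diagonal Gaussian evaluated at frame y."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (yi - mu) ** 2 / v)
               for yi, mu, v in zip(y, mean, var))


def alignment_loss(log_emit: List[List[float]]) -> float:
    """Negative log-likelihood summed over all monotonic alignments.

    log_emit[t][j] is log p(frame t | character j). Every alignment starts
    on the first character and ends on the last one.
    """
    T, M = len(log_emit), len(log_emit[0])
    if T < M:
        raise ValueError("need at least one frame per character")
    alpha = [NEG_INF] * M
    alpha[0] = log_emit[0][0]
    for t in range(1, T):
        new = [NEG_INF] * M
        for j in range(M):
            stay = alpha[j]                           # same character as frame t-1
            move = alpha[j - 1] if j > 0 else NEG_INF  # advanced by one character
            new[j] = log_add(stay, move) + log_emit[t][j]
        alpha = new
    return -alpha[M - 1]


# Toy example: 3 one-dimensional frames, 2 characters (all values made up).
mel = [[0.0], [0.4], [1.0]]
means, variances = [[0.0], [1.0]], [[0.5], [0.5]]
log_emit = [[gaussian_log_prob(y, mu, v) for mu, v in zip(means, variances)]
            for y in mel]
loss = alignment_loss(log_emit)
```

Working in the log domain avoids underflow when the product of many small frame probabilities is accumulated over a long utterance.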

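First, train the mixture density network together with the first FFT blocks using the alignment loss.
Next, convert the alignment matrix into a duration sequence:
- with the parameters of the first FFT blocks fixed,
- train the rest of the Feed-Forward Transformer using the mean squared error (MSE) loss between the predicted and target mel-spectrum.
Then train the FFT blocks and the mixture density network jointly to fine-tune the parameters.
Finally, use the trained MDN (the final mixture density network) to compute the character durations, and train the duration predictor with an MSE loss.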
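One plausible way to turn the learned alignment into per-character durations is a Viterbi search for the single most probable monotonic path, followed by counting the frames assigned to each character. The sketch below assumes this approach and a precomputed table of log emission probabilities; the function name and the toy table are hypothetical.

```python
import math
from typing import List


def viterbi_durations(log_emit: List[List[float]]) -> List[int]:
    """Return the number of frames per character on the best monotonic path.

    log_emit[t][j] is log p(frame t | character j). The path starts on the
    first character, ends on the last, and never skips a character.
    """
    T, M = len(log_emit), len(log_emit[0])
    if T < M:
        raise ValueError("need at least one frame per character")
    neg_inf = float("-inf")
    score = [[neg_inf] * M for _ in range(T)]
    moved = [[False] * M for _ in range(T)]  # True if the path advanced at (t, j)
    score[0][0] = log_emit[0][0]
    for t in range(1, T):
        for j in range(M):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            if move > stay:
                score[t][j] = move + log_emit[t][j]
                moved[t][j] = True
            else:
                score[t][j] = stay + log_emit[t][j]
    # Backtrack from the last frame and last character, counting frames.
    durations = [0] * M
    j = M - 1
    for t in range(T - 1, -1, -1):
        durations[j] += 1
        if moved[t][j]:
            j -= 1
    return durations


# Toy table: frame 0 matches character 0, frames 1 to 3 match character 1.
probs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.1, 0.9]]
log_emit = [[math.log(p) for p in row] for row in probs]
durations = viterbi_durations(log_emit)
```

The resulting durations always sum to the number of mel frames, which makes them directly usable as length-regulator input and as targets for the duration predictor.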