Within-sample variability-invariant loss for robust speaker recognition under noisy environments

Authors: Danwei Cai, Ming Li
Note: Accepted at ICASSP 2020

Despite the significant improvements in speaker recognition enabled by deep neural networks, unsatisfactory performance persists under noisy environments. In this paper, we train the speaker embedding network to learn the “clean” embedding of the noisy utterance. Specifically, the network is trained with the original speaker identification loss with an auxiliary within-sample variability-invariant loss. This auxiliary variability-invariant loss is used to learn the same embedding among the clean utterance and its noisy copies and prevents the network from encoding the undesired noises or variabilities into the speaker representation. Furthermore, we investigate the data preparation strategy for generating clean and noisy utterance pairs on-the-fly. The strategy generates different noisy copies for the same clean utterance at each training step, helping the speaker embedding network generalize better under noisy environments. Experiments on VoxCeleb1 indicate that the proposed training framework improves the performance of the speaker verification system in both clean and noisy conditions.

Index Terms— neural network, speaker recognition, speaker embedding, robustness, noisy conditions


Automatic speaker verification (ASV) refers to automatically making the decision to accept or reject a claimed speaker by analyzing the given speech from that speaker. In the past few years, the performance of ASV systems has improved significantly with the successful application of deep neural networks (DNNs) to speaker embedding modeling [1, 2]. However, unsatisfactory performance persists under noisy environments, which are commonly encountered in smartphones or smart speakers with ASV applications. Additive noise on clean speech contaminates the low-energy regions of the spectrogram and blurs the acoustic details [3]. This noise degrades speech intelligibility and quality, imposing great challenges on speaker recognition systems.
To compensate for these adverse impacts, various approaches have been proposed at different stages of ASV systems. At the signal level, DNN-based speech or feature enhancement [4, 5, 6, 7] has been investigated for ASV in complex environments. At the feature level, feature normalization techniques [8] and noise-robust features such as power-normalized cepstral coefficients (PNCC) [9] have also been applied to ASV systems. At the model level, robust back-end modeling methods such as multi-condition training of probabilistic linear discriminant analysis (PLDA) models [10] and mixture of PLDA [11] were employed in the i-vector [12] framework. Also, score normalization [13] can be used to improve the robustness of the ASV system under noisy scenarios.

More recently, researchers have been training deep speaker networks to cope with the distortions caused by noise. Within this framework, there are two main methods. The first regards the noisy data as a different domain from the clean data and applies adversarial training to deal with the domain mismatch and obtain a noise-invariant speaker embedding [14, 15]. The second employs a DNN speech enhancement network for ASV tasks. Shon et al. [16] train the speech enhancement network with feedback from the speaker network to find the time-frequency bins that are beneficial to ASV tasks with noisy speech. Zhao et al. [17] use the intermediate result of the speech enhancement network as an auxiliary input for the speaker embedding network and jointly optimize the two networks.
In this work, our network learns enhancement directly at the embedding level for speaker recognition under noisy environments. We train the deep speaker embedding network by combining the original speaker identification loss with an auxiliary within-sample loss. The speaker identification loss learns the speaker representation using the speaker label, while the within-sample loss aims to make the embedding of a noisy utterance as similar as possible to that of its clean version. In this way, the deep speaker embedding network is trained to avoid encoding the additive noise into the speaker representation and to learn the “clean” embedding for the noisy speech utterance. The loss that helps the speaker network learn a variability-invariant embedding is called the within-sample variability-invariant loss.
Furthermore, to fully explore the modeling ability of the within-sample variability-invariant loss, we dynamically generate the clean and noisy utterance pairs when preparing data for the training process. Different noisy copies of the same clean utterance are generated at different training steps, helping the speaker embedding network generalize better under noisy environments.


In this section, we describe the deep speaker embedding framework, which consists of a frame-level local pattern extractor, an utterance-level encoding layer, and several fully-connected layers for speaker embedding extraction and speaker classification.
Given a variable-length input feature sequence, the local pattern extractor, which is typically a convolutional neural network (CNN) [2] or a time-delay neural network (TDNN) [1], learns the frame-level representations. An encoding layer is then applied on top of it to get the utterance-level representation. The most common encoding method is the average pooling layer, which aggregates the statistics (i.e., mean, or mean and standard deviation) [1, 2]. The self-attentive pooling layer [18], the learnable dictionary encoding layer [19], and the dictionary-based NetVLAD layer [20, 21] are other commonly used encoding layers. Once the utterance-level representation is extracted, a fully connected layer and a speaker classifier are employed to further abstract the speaker representation and classify the training speakers. After training, the deep speaker embedding is extracted after the penultimate layer of the network for the given variable-length utterance.
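In this work, the local pattern extractor is a residual convolutional neural network (ResNet) [22], and the encoding layer is a global statistics pooling (GSP) layer. For the frame-level representation $F \in \mathbb{R}^{C \times H \times W}$, the output of GSP is an utterance-level representation $V = [\mu_1, \mu_2, \cdots, \mu_C, \sigma_1, \sigma_2, \cdots, \sigma_C]$, where $\mu_c$ and $\sigma_c$ are the mean and standard deviation of the $c$th feature map:
$$\mu_c = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{c,i,j}, \qquad \sigma_c = \sqrt{\frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left(F_{c,i,j} - \mu_c\right)^2}$$
and $C$, $H$, $W$ denote the number of channels, height, and width of the feature map, respectively.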
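The GSP layer described above can be illustrated with a minimal standard-library sketch. The function name `global_statistics_pooling` and the nested-list input layout are illustrative assumptions, as is the use of the population standard deviation (dividing by H·W); a real system would operate on framework tensors instead.

```python
import math


def global_statistics_pooling(feature_maps):
    """Pool a C x H x W nested list into [mu_1..mu_C, sigma_1..sigma_C]."""
    means, stds = [], []
    for fmap in feature_maps:                      # one H x W map per channel
        values = [v for row in fmap for v in row]  # flatten height and width
        mu = sum(values) / len(values)
        # Assumption: population variance (divide by H*W, not H*W - 1).
        var = sum((v - mu) ** 2 for v in values) / len(values)
        means.append(mu)
        stds.append(math.sqrt(var))
    return means + stds                            # length 2 * C


# Toy input with C = 2 channels and H = W = 2.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 5.0], [5.0, 5.0]]]
V = global_statistics_pooling(F)
```

The output length depends only on the number of channels, which is why the layer can map variable-length inputs to a fixed-size utterance-level vector.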


In this section, we describe the proposed framework with within-sample variability-invariant loss and online noisy data generation.

3.1. Within-sample variability-invariant loss

A clean utterance and its noisy copies contain the same acoustic content for recognizing speakers. Ideally, the speaker embedding of the noisy utterance should be the same as that of its clean version. In reality, however, the deep speaker embedding network usually encodes the noise as part of the speaker representation of the noisy speech.
The within-sample variability-invariant loss works together with the original speaker identification loss to train the speaker embedding network. The speaker identification loss is typically a cross-entropy loss. In our implementation, the parameters of the network are updated twice at each training step: the first update, from the speaker identification loss, is followed by a second update from the within-sample variability-invariant loss. Figure 1 shows the flowchart of our proposed framework.
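As a rough illustration of the two-update schedule, the sketch below pairs a within-sample term with a training step that applies the two updates in order. It assumes the mean-squared-error form that the results section refers to, averaged over embedding dimensions; that normalization, the function names, and the two update callables are hypothetical placeholders, not the authors' implementation.

```python
def within_sample_mse(clean_emb, noisy_emb):
    """Mean squared difference between a clean and a noisy embedding."""
    if len(clean_emb) != len(noisy_emb):
        raise ValueError("embeddings must have the same dimension")
    # Assumption: average over dimensions rather than sum.
    return sum((c - n) ** 2 for c, n in zip(clean_emb, noisy_emb)) / len(clean_emb)


def train_step(batch, update_with_id_loss, update_with_within_sample_loss):
    """Apply the two parameter updates of one training step, in order.

    Both callables are hypothetical: each is assumed to compute its loss
    on the batch and take one optimizer step.
    """
    update_with_id_loss(batch)             # first update: speaker identification
    update_with_within_sample_loss(batch)  # second update: clean vs. noisy embedding


# Identical embeddings give zero loss; the loss grows as they move apart.
loss_same = within_sample_mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_diff = within_sample_mse([0.0, 0.0], [1.0, 3.0])
```

Whether the clean embedding is treated as a fixed target during the second update is not specified here, so the sketch leaves that choice to the caller-supplied update function.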
3.2. Online data augmentation

In this work, we implement an online data augmentation strategy. The noise type, noise clip, and signal-to-noise ratio (SNR) are randomly selected to generate the clean-noisy utterance pair during training. Different combinations of these random parameters generate different noisy segments for the same utterance at different training steps, so the network never “sees” the same noisy segment from the same clean speech.
During training, the SNR is a continuous random variable uniformly distributed between 0 and 20 dB, and there are four types of noise: music, ambient noise, television, and babble. The television noise is generated with one music file and one speech file. The babble noise is constructed by mixing three to six speech files into one, which results in overlapping voices that are heard simultaneously with the foreground speech.
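The on-the-fly pair generation can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: waveforms are plain lists of samples of equal length, `noise_bank` is a hypothetical dictionary mapping each noise type to pre-built clips (so the construction of television and babble noise from several files is not shown), and the noise is scaled by average power to reach the sampled SNR.

```python
import math
import random

NOISE_TYPES = ("music", "ambient", "television", "babble")


def power(signal):
    """Average power of a waveform given as a list of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech, scaled so the mixture has the given SNR."""
    # SNR(dB) = 10 * log10(P_clean / P_scaled_noise); solve for the scale.
    scale = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]


def make_pair(clean, noise_bank, rng):
    """Draw a fresh noise type, clip, and SNR, and return a clean/noisy pair."""
    noise_type = rng.choice(NOISE_TYPES)
    noise = rng.choice(noise_bank[noise_type])
    snr_db = rng.uniform(0.0, 20.0)  # continuous SNR in [0, 20] dB
    return clean, mix_at_snr(clean, noise, snr_db), noise_type, snr_db


# Toy usage: every call draws new random parameters for the same utterance.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
bank = {t: [noise] for t in NOISE_TYPES}
rng = random.Random(0)
_, noisy, noise_type, snr_db = make_pair(clean, bank, rng)
```

Calling `make_pair` inside the data loader, rather than once before training, is what makes the augmentation online: the same clean utterance is paired with a different noisy copy at each step.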


The experiments are conducted on the VoxCeleb1 dataset [23]. The training data contain 148642 utterances from 1211 speakers. In the test data, 4874 utterances from 40 speakers form 37720 test trials. Although the VoxCeleb1 dataset, collected from online videos, is not strictly clean, we treat the original data as a clean dataset and generate noisy data from it.

The MUSAN dataset [24] is used as the noise source. We split MUSAN into two non-overlapping subsets, which are used to generate the noisy training data and the noisy test data, respectively.

4.2. Experimental setup

Speech signals are first converted to 64-dimensional log Mel-filterbank energies and then fed into the speaker embedding network. The detailed network architecture is given in Table 2. The front-end local pattern extractor is based on the well-known ResNet-34 architecture [22]. ReLU activation and batch normalization are applied to each convolutional layer.
For the speaker identification loss, a standard softmax-based cross-entropy loss or angular softmax loss (A-softmax) [25] is used. When training with softmax loss, dropout is added to the penultimate fully-connected layer to prevent overfitting.
Three training data settings are investigated: (1) the original VoxCeleb1 dataset (clean); (2) the original training dataset plus offline-generated noisy data, i.e., the noisy data are generated in advance (offline AUG); (3) the original training data with online data augmentation (online AUG).
At the testing stage, cosine similarity is used for scoring. We use equal error rate (EER) and detection cost function (DCF) as the performance metrics. The reported DCF is the average of the two minimum DCFs obtained with P_target set to 0.01 and 0.001.
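The scoring and the EER metric can be illustrated with a short standard-library sketch. The function names are illustrative, and the EER is estimated by a simple sweep over the observed scores as thresholds, which is an assumption made for brevity; evaluation toolkits typically interpolate the error curves. The DCF is not covered because its cost parameters are not stated here.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping thresholds over the observed scores."""
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: target trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trials: one target score falls below one non-target score.
eer = equal_error_rate([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
```

In this toy example one of three target trials is rejected and one of three non-target trials is accepted at the balancing threshold, so the estimate is one third.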

4.3. Experimental results

Eight deep speaker embedding networks are trained based on three training conditions and different loss functions. Table 1 shows the DCF and EER for three noise types (babble, ambient noise, and music) at five SNR settings (0, 5, 10, 15, and 20 dB). In addition, all 15 noisy testing trials are combined to form the “all noises” trial.
Several observations from the results are discussed in the following. 1) The experimental results confirm that the data augmentation strategy can greatly improve the performance of the deep speaker embedding system under noisy conditions. 2) Compared with the offline data augmentation strategy, the performance improvement achieved by online data augmentation is more pronounced in the low-SNR conditions. 3) Training the deep speaker embedding system with the within-sample variability-invariant loss improves the system performance in the clean condition and in all noisy conditions. 4) Compared with the network trained with offline data augmentation, the proposed framework using the within-sample variability-invariant loss with online data augmentation achieves 13.0% and 6.5% reductions in EER and DCF, respectively. 5) When the speaker embedding network is trained discriminatively using the A-softmax loss with angular margin, the proposed within-sample loss can still improve the system performance by constraining the distance between the clean utterance and its noisy copies.
The detection error tradeoff (DET) curves in Figure 2 compare four selected systems, two of which are trained with our proposed framework. The DET curves use testing trials from all the noisy conditions.
We also visualize the speaker embeddings using the t-distributed stochastic neighbor embedding (t-SNE) algorithm [26]. The two-dimensional projections of the speaker embeddings are shown in Figure 4. Four speakers, each with six clean utterances, are selected from the training dataset for visualization. Each clean utterance has three noisy copies at 5 dB SNR, with music, babble, and ambient noise. Compared with the clean training condition, data augmentation helps the clean and noisy embeddings of the same utterance cluster together. Further, after training the deep speaker embedding network with the within-sample variability-invariant loss, the clean and noisy embeddings of the same utterance are closer to each other.
The loss values at each training epoch are shown in Figure 3 for the network trained with the speaker softmax loss and the within-sample MSE loss. For reference, the MSE loss between the clean and noisy embeddings of the converged network trained with only the softmax loss is also given. We can observe that the MSE loss stays at a low level during training, which helps the network extract a noisy embedding similar to its clean version.
This paper has proposed the within-sample variability-invariant loss for deep speaker embedding networks under noisy conditions. By setting constraints on the embeddings extracted from the clean utterance and its noisy copies, the proposed loss works with the original speaker identification loss to learn robust embeddings for noisy speech. We also employ the data preparation strategy of generating the clean and noisy utterance pairs on-the-fly to help the speaker embedding network generalize better under noisy environments. The proposed framework is flexible and can be extended to other similar applications when multiple views of the same training speech sample are available.


This research is funded in part by the National Natural Science Foundation of China (61773413) and Duke Kunshan University.

