X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION
2019/04/02
- Abstract
DNN + data augmentation -> better performance for speaker recognition

- DNN
- i. variable-length utterances --(mapping)--> fixed-dimensional embeddings (x-vectors)
- ii. + embeddings leverage large-scale training datasets better than i-vectors
- iii. - challenging to collect substantial quantities of labeled data for training
Solution: data augmentation (noise and reverberation)
- i. An inexpensive method to multiply the amount of training data and improve robustness
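A minimal sketch of the additive-noise half of this augmentation. The paper uses MUSAN noises and real room impulse responses; the `add_noise` helper and its target-SNR scaling below are illustrative, not the paper's exact recipe:

```python
import numpy as np

def add_noise(speech, noise, snr_db, rng=None):
    """Mix a noise signal into speech at a target SNR in dB.

    Hypothetical helper illustrating additive-noise augmentation:
    the noise is cropped/tiled to the speech length, then scaled so
    that 10*log10(P_speech / P_noise_scaled) equals snr_db.
    """
    rng = rng or np.random.default_rng(0)
    if len(noise) < len(speech):
        # Tile short noise to cover the whole utterance
        noise = np.resize(noise, len(speech))
    else:
        # Pick a random segment of a longer noise recording
        start = rng.integers(0, len(noise) - len(speech) + 1)
        noise = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale factor that achieves the requested SNR
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Each clean training utterance can then be duplicated several times at different SNRs (the paper mixes in babble, music, and noise, and also convolves with room impulse responses for reverberation).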
Result
- i. Before augmentation: the x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese.
- ii. After augmentation:
  - + PLDA training: beneficial for both systems
  - - i-vector extractor: little benefit (trained unsupervised)
  - + x-vector DNN: largest gain, due to its supervised training
  - x-vectors achieve superior performance on both evaluation datasets.
Introduction
-
I-vectors
- i. Universal background model (high-dimensional statistics) --(large projection matrix T, learned unsupervised to maximize data likelihood)--> i-vector (a low-dimensional representation) + probabilistic linear discriminant analysis (PLDA) classifier
DNN used to enhance phonetic modeling in the i-vector UBM
- i. DNN: acoustic model trained for ASR
- ii. Posteriors from the DNN replace those from the GMM
- iii. Bottleneck features (BNFs) are extracted from the DNN and combined with acoustic features
- iv. For in-domain data, the improvement from the DNN is substantial
  - - requires transcribed training data
  - - increased computational complexity
DNNs that directly discriminate between speakers
- i. Frame-level representations extracted from a DNN + Gaussian speaker models
- ii. End-to-end system: jointly learns the embedding + similarity metric
- iii. ii -> text-independent: a temporal pooling layer is added to the network to handle variable-length segments
- iv. DNN embedding + separate classifier: reuses the backend technology developed over the years for i-vectors, such as length normalization, PLDA scoring, and domain adaptation techniques.
-
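One of the reusable backend steps named above, length normalization, simply projects each embedding onto the unit sphere before PLDA scoring. A minimal numpy sketch (the function name is my own, not from the paper):

```python
import numpy as np

def length_normalize(vectors):
    """Scale each row vector to unit Euclidean norm.

    Standard i-vector/x-vector length normalization: dividing by the
    norm makes the embeddings better match PLDA's Gaussian assumptions.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against zero vectors
    return vectors / np.maximum(norms, 1e-12)
```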
SPEAKER RECOGNITION SYSTEMS
-
Acoustic i-vector: GMM-UBM
- i. Features are 20 MFCCs with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds.
- ii. Deltas and accelerations are appended to create 60-dimensional feature vectors.
- iii. An energy-based speech activity detection (SAD) system selects features corresponding to speech frames.
- iv. The UBM is a 2048-component full-covariance GMM.
- v. The system uses a 600-dimensional i-vector extractor and PLDA for scoring.
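The sliding-window mean normalization in step i can be sketched as follows. 300 frames corresponds to roughly 3 s at the usual 10 ms frame shift; this is a naive O(T·window) version for clarity, not the production (Kaldi-style) implementation:

```python
import numpy as np

def sliding_cmn(feats, window=300):
    """Subtract, from each frame, the mean over a centered window.

    feats: (T, D) array of per-frame features (e.g. MFCCs).
    window: window length in frames (300 frames ~ 3 s at 10 ms shift).
    The window is truncated at utterance boundaries.
    """
    T = len(feats)
    out = np.empty_like(feats, dtype=float)
    half = window // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out
```

This removes slowly varying channel effects while preserving short-term spectral detail.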
-
Phonetic bottleneck i-vector
- i. DNN: time-delay acoustic model with p-norm nonlinearities
- ii. The penultimate layer is replaced with a 60-dimensional linear bottleneck layer
- iii. BNFs are concatenated with the same 20-dimensional MFCCs described in Section 2.1 plus deltas to create 100-dimensional features
- iv. Feature processing, UBM, i-vector extractor, and PLDA classifier are identical to the acoustic system above
-
The x-vector system
- i. The embedding DNN architecture: x-vectors are extracted at layer segment6, before the nonlinearity. The N in the softmax layer corresponds to the number of training speakers.
- ii. Features: 24-dimensional filterbanks with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds + the same energy SAD as above.
- iii. The statistics pooling layer aggregates information across the time dimension so that subsequent layers operate on the entire segment.
- iv. A training example consists of a chunk of speech features (about 3 seconds on average) and the corresponding speaker label.
- v. 4.2 million parameters.
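The statistics pooling layer in step iii can be sketched in isolation: it maps a variable-length (T, D) sequence of frame-level activations to a fixed 2D-dimensional vector by concatenating the per-dimension mean and standard deviation. A numpy sketch (in the real network this operates on hidden-layer activations, inside the computation graph):

```python
import numpy as np

def stats_pooling(frame_feats):
    """Temporal statistics pooling.

    frame_feats: (T, D) array of frame-level activations, T variable.
    Returns a fixed-length (2*D,) vector: [mean; std] over time, which
    is what lets the x-vector DNN handle variable-length segments.
    """
    mean = frame_feats.mean(axis=0)
    std = frame_feats.std(axis=0)
    return np.concatenate([mean, std])
```

Whatever the segment length T, the output dimension is the same, so the segment-level layers (segment6, segment7, softmax) see a fixed-size input.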
-
PLDA classifier
- i. Vectors are centered and projected using LDA
- ii. The LDA dimension was tuned on the SITW development set: 200 for i-vectors, 150 for x-vectors
- iii. Vectors are then length-normalized and modeled by PLDA
- iv. Scores are normalized using adaptive s-norm
-
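Step iv, s-norm, z-normalizes a raw PLDA score against cohort scores computed for the enrollment side and for the test side, then averages the two. A plain (non-adaptive) sketch assuming precomputed cohort score arrays; the adaptive variant used in the paper first restricts each cohort to its top-scoring members:

```python
import numpy as np

def s_norm(score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization (s-norm).

    score: raw trial score (e.g. from PLDA).
    enroll_cohort_scores: scores of the enrollment model vs. a cohort.
    test_cohort_scores: scores of the test segment vs. a cohort.
    Returns the average of the two z-normalized scores.
    """
    zn = (score - enroll_cohort_scores.mean()) / (enroll_cohort_scores.std() + 1e-12)
    tn = (score - test_cohort_scores.mean()) / (test_cohort_scores.std() + 1e-12)
    return 0.5 * (zn + tn)
```

Normalizing against both sides makes the score distribution more comparable across enrollment models and test conditions, which helps when a single decision threshold is applied to all trials.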
-
EXPERIMENTAL SETUP
- Training data
| dataset | contents | size | participants | sample rate | note | transcriptions |
| --- | --- | --- | --- | --- | --- | --- |
| SRE | 04 | unknown | | 8k | telephone speech | |
| SRE | 05 | 525h | | | English/Chinese | |
| SRE | 06 | 437h | | | | |
| SRE | 08 | 942h | | | multilingual | |
| SRE | 10 | 2255h | | | | |
| Swbd Phase 2 | Part 1 | 303h (3638*5min) | 657 | | conversation | |
| Swbd Phase 2 | Part 2 | 373h (4472*5min) | 679 | | telephone conversation | |
| Swbd Phase 2 | Part 3 | 222h (2728*5min) | 640 | | telephone speech, gender, environment | provided |
| Swbd Cellular | Part 1 | 109h (1309*5min) | 254 | | gender, environment | provided |
| Swbd Cellular | Part 2 | | 200 | | | |
| sre16_eval_enroll | Call My Net speech collection | | | | channel mismatch, language mismatch | |
| sre16_eval_test | Call My Net speech collection | | | | from both same and different telephone numbers as the enrollment | |
| sre16_major | SRE16 unlabeled data | | | | | |
- Results