X-VECTOR: ROBUST DNN EMBEDDING FOR SPEAKER RECOGNITION

2019/04/02


  1. Abstract

DNN embeddings + data augmentation -> better speaker recognition performance

    1. DNN
      • i. Variable-length utterances –(mapping)-> fixed-dimensional embeddings (x-vectors)
      • ii. + Embeddings leverage large-scale training datasets better than i-vectors
      • iii. - Challenging to collect substantial quantities of labeled data for training
    2. Solution: data augmentation (noise and reverberation)
      • i. An inexpensive method to multiply the amount of training data and improve robustness

 

    3. Results
      • i. Without augmentation: the x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese
      • ii. With augmentation:
        1. + PLDA training: augmentation is beneficial
        2. – i-vector extractor: augmentation gives little benefit
        3. + x-vector DNN: largest gains, due to its supervised training
        4. The augmented x-vector system achieves superior performance on the evaluation datasets
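The noise half of this augmentation can be sketched in a few lines of numpy. `augment_with_noise` is an illustrative helper: the name, interface, and SNR-mixing recipe are assumptions for the sketch, not the paper's exact implementation (which also includes reverberation):

```python
import numpy as np

def augment_with_noise(speech, noise, snr_db):
    """Mix a noise signal into speech at a target SNR (in dB).

    Both inputs are 1-D float arrays; the noise is tiled/trimmed
    to the speech length before scaling.
    """
    # Tile or trim the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Each training utterance can then be duplicated several times with different noises and SNRs, multiplying the effective training set size.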
  2. Introduction
    1. I-vectors
      • i. A universal background model (UBM) collects high-dimensional statistics, which a large projection matrix T (learned unsupervised to maximize the data likelihood) maps to a low-dimensional i-vector; a probabilistic linear discriminant analysis (PLDA) classifier then scores trials
    2. DNN used to enhance phonetic modeling in the i-vector UBM
      • i. DNN: acoustic models for ASR
      • ii. Posteriors from DNN replace those from GMM
      • iii. Bottleneck features are extracted from the DNN and combined with acoustic features
      • iv. For in-domain data, the DNN-based improvement is substantial, but it comes at a cost:
        1. – Requires transcribed training data
        2. – Increased computational complexity
    3. DNNs that directly discriminate between speakers
      • i. Frame-level representations extracted from a DNN + Gaussian speaker models
      • ii. End-to-end system: jointly learns the embedding and the similarity metric
      • iii. ii extended to text-independent recognition by adding a temporal pooling layer to the network to handle variable-length segments
      • iv. DNN embedding + separate classifier: reuses the backend technology accumulated over the years for i-vectors, such as length normalization, PLDA scoring, and domain adaptation techniques

 

  3. SPEAKER RECOGNITION SYSTEMS
    1. Acoustic-vector: GMM-UBM
      • i. Features are 20 MFCCs with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds
      • ii. Deltas and accelerations are appended to create 60-dimensional feature vectors
      • iii. An energy-based speech activity detection (SAD) system selects features corresponding to speech frames
      • iv. The UBM is a 2048-component full-covariance GMM
      • v. The system uses a 600-dimensional i-vector extractor and PLDA for scoring
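The sliding-window mean normalization in step i can be sketched as follows. `sliding_cmn` is a hypothetical helper name, and the 300-frame default assumes a 10 ms frame shift (so 3 s ≈ 300 frames):

```python
import numpy as np

def sliding_cmn(feats, window=300):
    """Mean-normalize features over a centered sliding window.

    feats: (num_frames, num_ceps) array of MFCCs.
    window: window size in frames (300 frames ~= 3 s at a 10 ms shift).
    """
    n = len(feats)
    out = np.empty_like(feats, dtype=float)
    half = window // 2
    for t in range(n):
        lo = max(0, t - half)
        hi = min(n, t + half)
        # Subtract the local mean so slow channel effects are removed.
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out
```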
    2. Phonetic bottleneck i-vector
      • i. DNN: a time-delay acoustic model with p-norm nonlinearities
      • ii. The penultimate layer is replaced with a 60-dimensional linear bottleneck layer
      • iii. BNFs are concatenated with the same 20-dimensional MFCCs described in Section 2.1, plus deltas, to create 100-dimensional features
      • iv. Feature processing, UBM, i-vector extractor, and PLDA classifier are identical to the previous acoustic system
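Step iii is a frame-wise concatenation. A minimal sketch, where the helper name `tandem_features` and the 60/40 split (60 BNFs + 20 MFCCs with deltas) are illustrative assumptions:

```python
import numpy as np

def tandem_features(bnf, mfcc_with_deltas):
    """Frame-wise concatenation of bottleneck features (e.g. 60-dim)
    with MFCC+delta features (e.g. 40-dim) into tandem features."""
    assert bnf.shape[0] == mfcc_with_deltas.shape[0], "frame counts must match"
    return np.hstack([bnf, mfcc_with_deltas])
```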
    3. The x-vector system
      • i. The embedding DNN architecture: x-vectors are extracted at layer segment6, before the nonlinearity. The N in the softmax layer corresponds to the number of training speakers


      • ii. 24-dimensional filterbanks with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds + the same energy SAD
      • iii. This pooling layer aggregates information across the time dimension so that subsequent layers operate on the entire segment.
      • iv. A training example consists of a chunk of speech features (about 3 seconds average), and the corresponding speaker label
      • v. 4.2 million parameters
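The statistics pooling layer in step iii computes the per-dimension mean and standard deviation of the frame-level activations over the whole segment. A minimal numpy sketch (`stats_pool` is an illustrative name, not the paper's code):

```python
import numpy as np

def stats_pool(frame_outputs):
    """Statistics pooling: aggregate frame-level DNN outputs over time.

    frame_outputs: (num_frames, dim) activations from the last
    frame-level layer. Returns a (2*dim,) segment-level vector of
    per-dimension means and standard deviations.
    """
    mean = frame_outputs.mean(axis=0)
    std = frame_outputs.std(axis=0)
    return np.concatenate([mean, std])
```

Because the pooled vector has a fixed size regardless of the number of input frames, all subsequent layers (and the extracted x-vector) operate on the entire segment.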
    4. PLDA classifier
      • i. Embeddings are centered and projected using LDA
      • ii. The LDA dimension was tuned on the SITW development set: 200 for i-vectors, 150 for x-vectors
      • iii. Vectors are length-normalized and modeled by PLDA
      • iv. Scores are normalized using adaptive s-norm
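Steps iii–iv of the backend can be sketched as follows. `length_normalize` and `adaptive_snorm` are hypothetical helper names, and the top-k cohort selection shown is one common adaptive s-norm variant (an assumption about the exact formulation used here):

```python
import numpy as np

def length_normalize(x):
    """Scale an embedding to unit L2 norm before PLDA modeling."""
    return x / np.linalg.norm(x)

def adaptive_snorm(score, enroll_cohort_scores, test_cohort_scores, top_k=100):
    """Adaptive s-norm: z-normalize a raw trial score against the
    top-k cohort scores of the enrollment and test sides, then average."""
    e = np.sort(np.asarray(enroll_cohort_scores))[-top_k:]
    t = np.sort(np.asarray(test_cohort_scores))[-top_k:]
    return 0.5 * ((score - e.mean()) / e.std() + (score - t.mean()) / t.std())
```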
  4. EXPERIMENTAL SETUP
    1. Training data

| Dataset             | Contents                     | Size                  | Participants | Sample rate | Note                                                          | Transcriptions |
|---------------------|------------------------------|-----------------------|--------------|-------------|---------------------------------------------------------------|----------------|
| SRE 04              |                              | unknown               |              | 8k          | Telephone speech                                              |                |
| SRE 05              |                              | 525h                  |              |             | English/Chinese                                               |                |
| SRE 06              |                              | 437h                  |              |             |                                                               |                |
| SRE 08              |                              | 942h                  |              |             | Multilingual                                                  |                |
| SRE 10              |                              | 2255h                 |              |             |                                                               |                |
| Swbd Phase 2 Part 1 |                              | 303h (3638 × 5 min)   | 657          |             | Conversation                                                  |                |
| Swbd Phase 2 Part 2 |                              | 373h (4472 × 5 min)   | 679          |             | Telephone conversation                                        |                |
| Swbd Phase 2 Part 3 |                              | 222h (2728 × 5 min)   | 640          |             | Telephone speech; gender, environment                         | Provided       |
| Swbd Cellular Part 1|                              | 109h (1309 × 5 min)   | 254          |             | Gender, environment                                           | Provided       |
| Swbd Cellular Part 2|                              | 200                   |              |             |                                                               |                |
| sre16_eval_enroll   | Call My Net speech collection|                       |              |             | Channel mismatch, language mismatch                           |                |
| sre16_eval_test     |                              |                       |              |             | From both same and different telephone numbers as the enrollment |             |
| sre16_major         | SRE16 unlabeled data         |                       |              |             |                                                               |                |


  5. Results
