X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION
2019/04/02
- Abstract
DNN + data augmentation -> better performance for speaker recognition

- DNN
- i. variable-length utterances --(mapping)--> fixed-dimensional embeddings (x-vectors)
- ii. + embeddings leverage large-scale training datasets better than i-vectors
- iii. - challenging to collect substantial quantities of labeled data for training
Solution: data augmentation (noise and reverberation)
- i. An inexpensive method to multiply the amount of training data and improve robustness
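A minimal sketch of the additive-noise half of this augmentation. The paper uses MUSAN noises and real room impulse responses; the `add_noise` helper and its target-SNR scaling below are illustrative, not the paper's exact recipe:

```python
import numpy as np

def add_noise(speech, noise, snr_db, rng=None):
    """Mix a noise signal into speech at a target SNR in dB.

    Hypothetical helper illustrating additive-noise augmentation:
    the noise is cropped/tiled to the speech length, then scaled so
    that 10*log10(P_speech / P_noise_scaled) equals snr_db.
    """
    rng = rng or np.random.default_rng(0)
    if len(noise) < len(speech):
        # Tile short noise to cover the whole utterance
        noise = np.resize(noise, len(speech))
    else:
        # Pick a random segment of a longer noise recording
        start = rng.integers(0, len(noise) - len(speech) + 1)
        noise = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale factor that achieves the requested SNR
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Each clean training utterance can then be duplicated several times at different SNRs (the paper mixes in babble, music, and noise, and also convolves with room impulse responses for reverberation).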
Result
- i. Before augmentation: the x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese.
- ii. After augmentation:
  - + PLDA training: beneficial for both systems
  - - i-vector extractor: little benefit (trained unsupervised)
  - + x-vector DNN: largest gain, due to its supervised training
  - x-vectors achieve superior performance on both evaluation datasets.
Introduction
-
I-vectors
- i. Universal background model (high-dimensional statistics) --(large projection matrix T, learned unsupervised to maximize data likelihood)--> i-vector (a low-dimensional representation) + probabilistic linear discriminant analysis (PLDA) classifier
DNN used to enhance phonetic modeling in the i-vector UBM
- i. DNN: acoustic model trained for ASR
- ii. Posteriors from the DNN replace those from the GMM
- iii. Bottleneck features (BNFs) are extracted from the DNN and combined with acoustic features
- iv. For in-domain data, the improvement from the DNN is substantial
  - - requires transcribed training data
  - - increased computational complexity
DNNs that directly discriminate between speakers
- i. Frame-level representations extracted from a DNN + Gaussian speaker models
- ii. End-to-end system: jointly learns the embedding + similarity metric
- iii. ii -> text-independent: a temporal pooling layer is added to the network to handle variable-length segments
- iv. DNN embedding + separate classifier: reuses the backend technology developed over the years for i-vectors, such as length normalization, PLDA scoring, and domain adaptation techniques.
-
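One of the reusable backend steps named above, length normalization, simply projects each embedding onto the unit sphere before PLDA scoring. A minimal numpy sketch (the function name is my own, not from the paper):

```python
import numpy as np

def length_normalize(vectors):
    """Scale each row vector to unit Euclidean norm.

    Standard i-vector/x-vector length normalization: dividing by the
    norm makes the embeddings better match PLDA's Gaussian assumptions.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against zero vectors
    return vectors / np.maximum(norms, 1e-12)
```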
SPEAKER RECOGNITION SYSTEMS
-
Acoustic i-vector: GMM-UBM
- i. Features are 20 MFCCs with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds.
- ii. Deltas and accelerations are appended to create 60-dimensional feature vectors.
- iii. An energy-based speech activity detection (SAD) system selects features corresponding to speech frames.
- iv. The UBM is a 2048-component full-covariance GMM.
- v. The system uses a 600-dimensional i-vector extractor and PLDA for scoring.
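The sliding-window mean normalization in step i can be sketched as follows. 300 frames corresponds to roughly 3 s at the usual 10 ms frame shift; this is a naive O(T·window) version for clarity, not the production (Kaldi-style) implementation:

```python
import numpy as np

def sliding_cmn(feats, window=300):
    """Subtract, from each frame, the mean over a centered window.

    feats: (T, D) array of per-frame features (e.g. MFCCs).
    window: window length in frames (300 frames ~ 3 s at 10 ms shift).
    The window is truncated at utterance boundaries.
    """
    T = len(feats)
    out = np.empty_like(feats, dtype=float)
    half = window // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out
```

This removes slowly varying channel effects while preserving short-term spectral detail.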
-
Phonetic bottleneck i-vector
- i. DNN: time-delay acoustic model with p-norm nonlinearities
- ii. The penultimate layer is replaced with a 60-dimensional linear bottleneck layer
- iii. BNFs are concatenated with the same 20-dimensional MFCCs described in Section 2.1 plus deltas to create 100-dimensional features
- iv. Feature processing, UBM, i-vector extractor, and PLDA classifier are identical to the acoustic system above
-
The x-vector system
- i. The embedding DNN architecture: x-vectors are extracted at layer segment6, before the nonlinearity. The N in the softmax layer corresponds to the number of training speakers.
- ii. Features: 24-dimensional filterbanks with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds + the same energy SAD as above.
- iii. The statistics pooling layer aggregates information across the time dimension so that subsequent layers operate on the entire segment.
- iv. A training example consists of a chunk of speech features (about 3 seconds on average) and the corresponding speaker label.
- v. 4.2 million parameters.
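The statistics pooling layer in step iii can be sketched in isolation: it maps a variable-length (T, D) sequence of frame-level activations to a fixed 2D-dimensional vector by concatenating the per-dimension mean and standard deviation. A numpy sketch (in the real network this operates on hidden-layer activations, inside the computation graph):

```python
import numpy as np

def stats_pooling(frame_feats):
    """Temporal statistics pooling.

    frame_feats: (T, D) array of frame-level activations, T variable.
    Returns a fixed-length (2*D,) vector: [mean; std] over time, which
    is what lets the x-vector DNN handle variable-length segments.
    """
    mean = frame_feats.mean(axis=0)
    std = frame_feats.std(axis=0)
    return np.concatenate([mean, std])
```

Whatever the segment length T, the output dimension is the same, so the segment-level layers (segment6, segment7, softmax) see a fixed-size input.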
-
PLDA classifier
- i. Vectors are centered and projected using LDA
- ii. The LDA dimension was tuned on the SITW development set: 200 for i-vectors, 150 for x-vectors
- iii. Vectors are then length-normalized and modeled by PLDA
- iv. Scores are normalized using adaptive s-norm
-
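Step iv, s-norm, z-normalizes a raw PLDA score against cohort scores computed for the enrollment side and for the test side, then averages the two. A plain (non-adaptive) sketch assuming precomputed cohort score arrays; the adaptive variant used in the paper first restricts each cohort to its top-scoring members:

```python
import numpy as np

def s_norm(score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization (s-norm).

    score: raw trial score (e.g. from PLDA).
    enroll_cohort_scores: scores of the enrollment model vs. a cohort.
    test_cohort_scores: scores of the test segment vs. a cohort.
    Returns the average of the two z-normalized scores.
    """
    zn = (score - enroll_cohort_scores.mean()) / (enroll_cohort_scores.std() + 1e-12)
    tn = (score - test_cohort_scores.mean()) / (test_cohort_scores.std() + 1e-12)
    return 0.5 * (zn + tn)
```

Normalizing against both sides makes the score distribution more comparable across enrollment models and test conditions, which helps when a single decision threshold is applied to all trials.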
-
EXPERIMENTAL SETUP
- Training data
| dataset | contents | size | participants | sample rate | note | transcriptions |
| --- | --- | --- | --- | --- | --- | --- |
| SRE | 04 | unknown | | 8k | telephone speech | |
| SRE | 05 | 525h | | | English/Chinese | |
| SRE | 06 | 437h | | | | |
| SRE | 08 | 942h | | | multilingual | |
| SRE | 10 | 2255h | | | | |
| Swbd Phase 2 | Part 1 | 303h (3638*5min) | 657 | | conversation | |
| Swbd Phase 2 | Part 2 | 373h (4472*5min) | 679 | | telephone conversation | |
| Swbd Phase 2 | Part 3 | 222h (2728*5min) | 640 | | telephone speech, gender, environment | provided |
| Swbd Cellular | Part 1 | 109h (1309*5min) | 254 | | gender, environment | provided |
| Swbd Cellular | Part 2 | | 200 | | | |
| sre16_eval_enroll | Call My Net speech collection | | | | channel mismatch, language mismatch | |
| sre16_eval_test | Call My Net speech collection | | | | from both same and different telephone numbers as the enrollment | |
| sre16_major | SRE16 unlabeled data | | | | | |
- Results