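Brief overview
Abstract (translated)
In 2017, the authors proposed a new loss function, the generalized end-to-end (GE2E) loss, which makes training speaker verification models more efficient than their earlier (2016) tuple-based end-to-end (TE2E) loss function.
Unlike TE2E, the GE2E loss function updates the network parameters in a way that emphasizes the examples that are difficult to verify at each step of the training process. In addition, the GE2E loss does not require an initial stage of example selection. With these properties, our model with the new loss function decreases speaker verification EER by more than 10%, while reducing the training time by 60%.
We also introduce the MultiReader technique, which allows us to do domain adaptation: training a more accurate model that supports multiple keywords (i.e., "OK Google" and "Hey Google") as well as multiple dialects.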
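Points that are easy to confuse
The paper covers quite a few things:
- The newly proposed GE2E loss is applied to two tasks: Text-independent Speaker Verification (TI-SV) and Text-dependent Speaker Verification (TD-SV).
- For the text-dependent task, the paper also proposes MultiReader (i.e. multiple keywords).
- For the GE2E loss itself, two implementations are proposed: one based on softmax and one based on contrast (which focuses on the examples that are hard to verify at each training step).
- As a result, there are many comparison experiments:
- Text-independent, 1 experiment: GE2E vs TE2E vs Softmax;
- Text-dependent, 4 experiments: (GE2E, TE2E) x (MultiReader, no MultiReader).
- Softmax vs contrast inside GE2E (softmax here refers to one step inside the GE2E loss), 1 comparison, but no concrete numbers are reported for it.
What the paper sidesteps: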
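- The abstract stresses up front: "Unlike TE2E, the GE2E loss function updates the network parameters in a way that emphasizes the examples that are difficult to verify at each step of the training process." This is what the paper calls the contrast form, which picks out the hardest-to-distinguish negative. However, the paper gives no concrete comparison between the softmax and contrast variants of the GE2E loss. It only states that contrast performs better for TD-SV, while softmax performs slightly better for TI-SV. Perhaps the contrast formula in GE2E is too similar to Triplet Loss, so the comparison was sidestepped.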
Background
Several related loss functions are involved here:
- Triplet Loss
- Softmax
- TE2E
Softmax
Softmax is used together with the cross-entropy loss function; the network directly outputs the class probabilities.

Softmax function, a wonderful activation function that turns numbers aka logits into probabilities that sum to one. Softmax function outputs a vector that represents the probability distributions of a list of potential outcomes.
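To make the description concrete, here is a minimal sketch of softmax and the cross-entropy loss built on it, in plain Python. The function names are illustrative, and the max-subtraction is a common numerical-stability trick rather than something this post prescribes.

```python
import math

def softmax(logits):
    """Turn a list of real-valued logits into probabilities that sum to one."""
    # Subtracting the maximum avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability that softmax assigns to the true class index."""
    return -math.log(softmax(logits)[target])

probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy([2.0, 1.0, 0.1], target=0)
```

The larger the logit of the true class relative to the others, the closer its probability is to one and the smaller the loss.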
Triplet Loss
Triplet Loss was proposed in the 2015 FaceNet paper.


The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.
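The problem Triplet Loss addresses:
- When the number of classes is fixed, a softmax-based cross-entropy loss can be used.
- When the number of classes is variable, triplet loss can be used, because triplet loss does not depend on the number of classes. Otherwise, if the number of classes is very large, e.g. a speaker dataset on the order of 10k speakers, the weight matrix of the softmax layer becomes too large and is hard to train.
The strength of triplet loss is fine-grained discrimination; its weakness is slow convergence, and sometimes it does not converge at all.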
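The anchor/positive/negative definition can be written as a hinge on squared Euclidean distances. Below is a minimal sketch in plain Python; the function names and the default margin are illustrative assumptions, not values taken from this post.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the positive should be closer to the anchor than the
    negative by at least `margin` (margin value is an assumption)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)

# An "easy" triplet: the negative is already far away, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0])
# A "hard" triplet: the negative sits closer to the anchor than the positive.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0])
```

Easy triplets contribute no gradient, which is one reason convergence can be slow when most sampled triplets are easy.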
Offline triplet
Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
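Based on the distances between samples, triplets fall into three kinds: semi-hard triplets, hard triplets and easy triplets. Semi-hard and hard triplets are selected for training.
This approach is not efficient enough, because every few epochs the negative examples have to be re-classified.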
Online triplet
Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.
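Selecting hard exemplars inside a mini-batch can be sketched as follows. This is a "batch-hard" variant (farthest positive, closest negative per anchor) written in plain Python; it illustrates the idea only and is not claimed to be the exact selection rule used in FaceNet.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def batch_hard_triplets(embeddings, labels):
    """For every anchor in the mini-batch, pick the farthest same-label
    sample and the closest different-label sample.
    Returns a list of (anchor, positive, negative) index triples."""
    triplets = []
    for a, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [i for i, l in enumerate(labels) if l == label and i != a]
        negatives = [i for i, l in enumerate(labels) if l != label]
        if not positives or not negatives:
            continue  # this anchor cannot form a triplet within the batch
        hardest_pos = max(positives, key=lambda i: squared_distance(anchor, embeddings[i]))
        hardest_neg = min(negatives, key=lambda i: squared_distance(anchor, embeddings[i]))
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets

embeddings = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.5, 0.0], [5.0, 0.0]]
labels = [0, 0, 0, 1, 1]
triplets = batch_hard_triplets(embeddings, labels)
```

Because the selection uses the embeddings of the current forward pass, no separate offline mining pass over a checkpoint is needed.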
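Classification with triplet embeddings
FaceNet is a feature extractor: its output is a vector in Euclidean space, which can then be classified with all kinds of machine learning algorithms.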
FaceNet. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
Prior work: Tuple Based End-to-End Loss
2016 End-to-end text-dependent speaker verification. ICASSP
Tuple Based End-to-End Loss:


Pros
- Simulates runtime behavior in the loss function
- Each (N+1)-tuple contains all utterances involved in a verification decision (unlike triplet)
Cons
- Most tuples are easy - training is inefficient
This paper: Generalized End-to-End Loss
2017 Generalized end-to-end loss for speaker verification.
GE2E Loss

In the figure above, points of the same color belong to the same class (speaker), and $\mathbf{c}_k$ denotes the centroid of each class.
GE2E loss pushes the embedding towards the centroid of the true speaker, and away from the centroid of the most similar different speaker.
Similarity Matrix

Construct a similarity matrix for each batch:
- Embeddings are L2-normalized:
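$$\mathbf{e}_{ji} = \frac{f(\mathbf{x}_{ji}; \mathbf{w})}{\lVert f(\mathbf{x}_{ji}; \mathbf{w}) \rVert_2}$$
- Centroids:
$$\mathbf{c}_j = \frac{1}{M} \sum_{m=1}^{M} \mathbf{e}_{jm}$$
- Similarity:
$$\mathbf{S}_{ji,k} = w \cdot \cos(\mathbf{e}_{ji}, \mathbf{c}_k) + b$$
Here $\mathbf{x}_{ji}$ is the $i$-th utterance of speaker $j$ (a batch has $N$ speakers with $M$ utterances each), $f(\cdot; \mathbf{w})$ is the network, and $w > 0$ and $b$ are learnable scalars.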
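The three steps (L2-normalize, average into centroids, scaled cosine similarity) can be sketched in plain Python. The nested-list layout `batch[j][i]` and the fixed values `w=10.0`, `b=-5.0` are assumptions made for this illustration; in the paper `w` and `b` are learned during training.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(batch, w=10.0, b=-5.0):
    """batch[j][i] is the raw network output for utterance i of speaker j.
    Returns S with S[j][i][k] = w * cos(e_ji, c_k) + b."""
    emb = [[l2_normalize(x) for x in speaker] for speaker in batch]
    cents = [centroid(speaker) for speaker in emb]
    return [[[w * cosine(e, c) + b for c in cents] for e in speaker]
            for speaker in emb]

# Two speakers (N=2) with two utterances each (M=2), 2-dimensional outputs.
batch = [[[1.0, 0.0], [2.0, 0.2]],
         [[0.0, 1.0], [0.1, 3.0]]]
S = similarity_matrix(batch)
```

Each entry `S[j][i]` is one row of the similarity matrix: the similarities of embedding `e_ji` to all `N` centroids.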
Softmax vs. Contrast

Each row of $\mathbf{S}$ for $\mathbf{e}_{ji}$ defines similarity between $\mathbf{e}_{ji}$ and every centroid $\mathbf{c}_k$. We want $\mathbf{e}_{ji}$ to be close to $\mathbf{c}_j$ and far away from $\mathbf{c}_k$ for $k \neq j$.
A comparison of the two ways of computing the loss, softmax and contrast. The recommendations below are the authors' experimental findings.
- Softmax: Good for text-independent applications
$$L(\mathbf{e}_{ji}) = -\mathbf{S}_{ji,j} + \log \sum_{k=1}^{N} \exp(\mathbf{S}_{ji,k})$$
- Contrast: Good for keyword-based applications. The contrast loss is defined on positive pairs and most aggressive negative pairs
$$L(\mathbf{e}_{ji}) = 1 - \sigma(\mathbf{S}_{ji,j}) + \max_{1 \le k \le N,\ k \neq j} \sigma(\mathbf{S}_{ji,k})$$
One open question: why does the contrast loss do worse than softmax on the text-independent task?
The contrast method is in fact similar to Triplet Loss: its loss is the sum of a positive-pair term and the most aggressive negative-pair term.
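Given one row of the similarity matrix, the two per-embedding losses from the GE2E paper can be sketched as below: softmax is `-S[j] + log(sum_k exp(S[k]))`, and contrast is `1 - sigmoid(S[j]) + max_{k != j} sigmoid(S[k])`. The function names and the example row are illustrative.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax_loss(row, j):
    """row[k] is the similarity of one embedding to centroid k; j is the
    index of its true speaker. Uses a stable log-sum-exp."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
    return -row[j] + log_sum_exp

def contrast_loss(row, j):
    """Positive term plus the single most similar wrong centroid
    (the most aggressive negative)."""
    hardest_negative = max(sigmoid(s) for k, s in enumerate(row) if k != j)
    return 1.0 - sigmoid(row[j]) + hardest_negative

row = [5.0, -5.0, -4.0]  # similarities of one embedding to 3 centroids
l_soft = softmax_loss(row, 0)
l_con = contrast_loss(row, 0)
```

Note the difference in what gets a gradient: softmax involves every centroid, while contrast only involves the true centroid and the hardest negative, which is where the resemblance to triplet loss comes from.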
Trick about centroid
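For true speaker centroid, we should exclude the embedding itself. That is, $\mathbf{e}_{ji}$ is excluded when computing $\mathbf{c}_j$:
$$\mathbf{c}_j^{(-i)} = \frac{1}{M-1} \sum_{m=1,\ m \neq i}^{M} \mathbf{e}_{jm}$$

This avoids the trivial solution in which all utterances have the same embedding. Combined into a single formula:
$$\mathbf{S}_{ji,k} = \begin{cases} w \cdot \cos(\mathbf{e}_{ji}, \mathbf{c}_j^{(-i)}) + b & \text{if } k = j \\ w \cdot \cos(\mathbf{e}_{ji}, \mathbf{c}_k) + b & \text{otherwise} \end{cases}$$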
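The centroid trick can be sketched as a small change to how one similarity row is computed: for the true speaker, the centroid is averaged over the other M-1 utterances only. The layout `emb[j][i]` and the fixed `w`, `b` values are assumptions of this sketch, and each speaker needs at least two utterances.

```python
import math

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_row(emb, j, i, w=10.0, b=-5.0):
    """Similarities of embedding emb[j][i] to all speaker centroids.
    For k == j the centroid leaves out emb[j][i] itself."""
    e = emb[j][i]
    row = []
    for k, speaker in enumerate(emb):
        if k == j:
            others = [x for m, x in enumerate(speaker) if m != i]
            c = mean(others)
        else:
            c = mean(speaker)
        row.append(w * cosine(e, c) + b)
    return row

# Speaker 0 has two orthogonal embeddings; speaker 1 has two identical ones.
emb = [[[1.0, 0.0], [0.0, 1.0]],
       [[0.0, 1.0], [0.0, 1.0]]]
row = similarity_row(emb, 0, 0)
```

In this toy batch the first embedding is orthogonal to everything else, so both entries are -5.0. If the embedding were included in its own centroid, the first entry would rise to about 2.07 purely because the embedding is compared with itself.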
Efficiency estimate
TE2E vs. GE2E
The main idea: TE2E computes one tuple at a time, while one GE2E step is roughly equivalent to a whole batch of tuples computed together on the GPU, which is more efficient.
For TE2E, assume a training batch has:
- N speakers
- Each speaker has M utterances
- P enrollment utterances per speaker
Number of all possible tuples, as given in the paper:
$$2 \times \max\left(\binom{M}{P},\ (N-1)\binom{M}{P}\right) \geq 2(N-1)$$
Theoretically, one GE2E step is equivalent to at least $2(N-1)$ TE2E steps.
TODO: I do not fully understand the calculation here, i.e. how the relation $2(N-1)$ is obtained.
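One way to read the count, following the argument in the GE2E paper as I understand it: for a single utterance there are C(M, P) ways to pick the P enrollment utterances of its own speaker (positive tuples) and (N-1) * C(M, P) ways across the other speakers (negative tuples). Positives and negatives are balanced one to one, which gives the factor 2 times the larger of the two counts. The count is smallest when P = M, where C(M, P) = 1, leaving 2(N-1). The sketch below only does this arithmetic; the function name and the example sizes are illustrative.

```python
from math import comb

def te2e_tuple_count(n_speakers, m_utterances, p_enroll):
    """Number of TE2E tuples needed per utterance under the reading above."""
    positives = comb(m_utterances, p_enroll)
    negatives = (n_speakers - 1) * comb(m_utterances, p_enroll)
    return 2 * max(positives, negatives)

# Lower bound: P == M, so C(M, P) == 1 and the count is 2 * (N - 1).
lower_bound = te2e_tuple_count(64, 10, 10)
# With fewer enrollment utterances the count grows quickly.
larger = te2e_tuple_count(64, 10, 5)
```

Under this reading, the 2(N-1) figure is a lower bound on how many TE2E tuples one GE2E update for a single utterance corresponds to.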
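Comparison with Triplet Loss
The authors' view of the pros and cons of Triplet Loss:
- Pros: Simple, and correctly models the embedding space
- Cons: Does NOT simulate runtime behavior.
"Runtime behavior" here means the workflow used in a speech/face verification scenario:
- Enrollment: enrollment utterances -> utterance embeddings -> average of the utterance embeddings
- Verification: extract an embedding from the utterance recorded at verification time -> compute its similarity (e.g. cosine similarity) to the averaged vector stored at enrollment -> decide, against a preset threshold, whether it is the same person.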
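The enrollment and verification flow can be sketched in a few lines of plain Python. The embeddings here are made-up 2-dimensional vectors standing in for network outputs, and the threshold of 0.7 is a hypothetical value; in practice it would be tuned on held-out data.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def enroll(utterance_embeddings):
    """Enrollment: average the normalized embeddings of several utterances."""
    normalized = [l2_normalize(e) for e in utterance_embeddings]
    return [sum(col) / len(normalized) for col in zip(*normalized)]

def verify(test_embedding, speaker_model, threshold=0.7):
    """Verification: accept if the cosine similarity reaches the threshold."""
    score = cosine(test_embedding, speaker_model)
    return score >= threshold, score

model = enroll([[1.0, 0.1], [0.9, 0.0], [1.0, -0.1]])
accepted, score = verify([0.95, 0.05], model)
impostor_accepted, impostor_score = verify([0.0, 1.0], model)
```

The point of the comparison is that TE2E and GE2E compare an embedding against an averaged speaker model during training, exactly as `verify` does, whereas triplet loss only ever compares single embeddings.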

Text-Independent
Text-Independent Speaker Verification
We are interested in identifying the speaker based on arbitrary speech.
Challenge:
- Length of utterance can vary
- Unlike keyword-based, where we assume fixed-length (0.8s) for keyword segment
Naive solution: Full sequence training?
- No batching - very very slow
- Dynamic RNN unrolling - even slower...
Solution: Sliding window inference
- At inference time, we extract sliding windows, and compute per-window d-vector
- For experiments, we use 1.6s window size, with 50% overlap
- L2-normalize the per-window d-vectors, then take their average as the embedding of the variable-length utterance.
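The sliding-window procedure can be sketched as follows. The window of 160 frames assumes a 10 ms frame step (so 160 frames is 1.6 s), and `embed_fn` is a hypothetical stand-in for the trained network that maps one window to a d-vector.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sliding_windows(frames, window=160, overlap=0.5):
    """Split a frame sequence into fixed-size windows with the given overlap."""
    hop = max(1, int(window * (1.0 - overlap)))
    return [frames[s:s + window]
            for s in range(0, len(frames) - window + 1, hop)]

def utterance_dvector(frames, embed_fn, window=160, overlap=0.5):
    """Embed each window, L2-normalize, and average into one d-vector."""
    windows = sliding_windows(frames, window, overlap)
    if not windows:
        raise ValueError("utterance is shorter than one window")
    dvecs = [l2_normalize(embed_fn(w)) for w in windows]
    return [sum(col) / len(dvecs) for col in zip(*dvecs)]

# 400 frames -> windows starting at frames 0, 80, 160 and 240.
frames = list(range(400))
windows = sliding_windows(frames)
dvector = utterance_dvector(frames, lambda w: [float(len(w)), 0.0])
```

Since every window has the same length, the windows of one utterance can be stacked and embedded as a single batch.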

Training
Text-independent Training.
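This speeds up training while making full use of the data: because all utterances within a batch have the same length, every utterance takes the same time to compute.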

- In training time, we need to group utterances by length
- Extract segments by minimal truncation
- Form batches of same-length segments
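One possible reading of these three steps is sketched below: sort utterances by length, cut the sorted list into groups, and truncate every utterance in a group to the length of its shortest member, so that as little audio as possible is discarded. The function name, the random crop offset and the fixed seed are assumptions of this sketch, not the authors' data pipeline.

```python
import random

def make_batches(utterances, batch_size, seed=0):
    """Group utterances of similar length and crop each group to a common
    length equal to its shortest member (minimal truncation)."""
    rng = random.Random(seed)
    ordered = sorted(utterances, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        group = ordered[start:start + batch_size]
        target = len(group[0])  # shortest in the group, because it is sorted
        batch = []
        for utt in group:
            # Crop a random segment of the target length from each utterance.
            offset = rng.randint(0, len(utt) - target)
            batch.append(utt[offset:offset + target])
        batches.append(batch)
    return batches

# Six utterances given as frame lists of different lengths.
utterances = [list(range(n)) for n in (10, 12, 30, 33, 11, 31)]
batches = make_batches(utterances, batch_size=3)
```

Here the first batch is cropped to 10 frames and the second to 30, so at most 3 frames are dropped from any utterance.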
Experiment
Text-independent experiments.

Text-dependent
TODO
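Training notes
Text-independent
The workflow has two steps: data preprocessing and training.
Audio preprocessing: steps such as converting wav to spectrogram are time-consuming; consider computing them on the GPU, e.g. with PyTorch audio.
Reference
Reference for training time: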
Dataset:
- LibriSpeech: train-other-500 (extract as LibriSpeech/train-other-500). 500 hours of audio.
- VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv). 1200 speakers.
- VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev). 6000 speakers.
time: trained 1.56M steps (20 days with a single GPU) with a batch size of 64. GPU: GTX 1080 Ti.
Training progress:

Final result: the embeddings are reduced in dimension with UMAP and then plotted.
