A Detailed Look at DNN Model Training for Acoustic Modeling

Author: AI研習社

2017-07-01 15:58

Lead: an analysis of the neural network's inputs and outputs when a DNN is trained as an acoustic model.

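雷鋒網 editor's note: this article was written by 張慶恒 and first appeared on the author's personal blog. 雷鋒網 republishes it with permission.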

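This article walks through a small amount of Kaldi source code to analyze the inputs and outputs of the neural network when a DNN acoustic model is trained. DNN training depends on a previously trained GMM-HMM model. Taking a trained mono (monophone) model as the example, we first run Viterbi alignment with that model. This step maps every frame of every utterance to a transition-id.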

Let's look at the alignment result:

$ copy-int-vector "ark:gunzip -c ali.1.gz|" ark,t:- | head -n 1
speaker001_00003 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 16 15 15 15 18 890 889 889 889 889 889 889 892 894 893 893 893 86 88 87 90 89 89 89 89 89 89 89 89 89 89 89 89 89 89 194 193 196 195 195 198 197 386 385 385 385 385 385 385 385 385 388 387 387 390 902 901 901 904 903 906 905 905 905 905 905 905 905 905 905 905 905 914 913 913 916 918 917 917 917 917 917 917 752 751 751 751 751 751 754 753 753 753 753 753 753 753 753 756 755 755 926 925 928 927 927 927 927 927 927 927 930 929 929 929 929 929 929 929 929 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 16 18

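For the training utterance speaker001_00003, each number that follows is a transition-id, and each transition-id corresponds to one feature vector (one frame). The feature vectors can be inspected with copy-matrix; see the notes on feature extraction at: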

http://t.cn/RX2n4Dx
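
To make the frame-to-transition-id mapping concrete, here is a minimal Python sketch that parses one line of the text-format alignment shown above. The function name parse_alignment_line and the shortened sample line are illustrative assumptions, not part of Kaldi.

```python
def parse_alignment_line(line):
    """Split one text-format alignment line into (utterance id, transition-ids)."""
    fields = line.split()
    utt_id = fields[0]
    transition_ids = [int(tok) for tok in fields[1:]]
    return utt_id, transition_ids


# Shortened, hypothetical sample in the same format as the output above.
sample = "speaker001_00003 4 1 1 1 16 15 15 15 18 890 889 889"
utt_id, tids = parse_alignment_line(sample)

# One transition-id per feature frame, so the list length is the frame count.
num_frames = len(tids)
print(utt_id, num_frames, tids[:2])
```

Because there is exactly one transition-id per frame, the length of the parsed list should match the number of rows in that utterance's feature matrix.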

The transition-ids themselves can be inspected as well:

$ show-transitions phones.txt final.mdl

Transition-state 1: phone = sil hmm-state = 0 pdf = 0
Transition-id = 1 p = 0.966816 [self-loop]
Transition-id = 2 p = 0.01 [0 -> 1]
Transition-id = 3 p = 0.01 [0 -> 2]
Transition-id = 4 p = 0.013189 [0 -> 3]
Transition-state 2: phone = sil hmm-state = 1 pdf = 1
Transition-id = 5 p = 0.970016 [self-loop]
Transition-id = 6 p = 0.01 [1 -> 2]
Transition-id = 7 p = 0.01 [1 -> 3]
Transition-id = 8 p = 0.01 [1 -> 4]
Transition-state 3: phone = sil hmm-state = 2 pdf = 2
Transition-id = 9 p = 0.01 [2 -> 1]
Transition-id = 10 p = 0.968144 [self-loop]
Transition-id = 11 p = 0.01 [2 -> 3]
Transition-id = 12 p = 0.0118632 [2 -> 4]
Transition-state 4: phone = sil hmm-state = 3 pdf = 3
Transition-id = 13 p = 0.01 [3 -> 1]
Transition-id = 14 p = 0.01 [3 -> 2]
Transition-id = 15 p = 0.932347 [self-loop]
Transition-id = 16 p = 0.0476583 [3 -> 4]
Transition-state 5: phone = sil hmm-state = 4 pdf = 4
Transition-id = 17 p = 0.923332 [self-loop]
Transition-id = 18 p = 0.0766682 [4 -> 5]
Transition-state 6: phone = a1 hmm-state = 0 pdf = 5
Transition-id = 19 p = 0.889764 [self-loop]
Transition-id = 20 p = 0.110236 [0 -> 1]
...

Each transition-state corresponds to exactly one pdf, and each transition-state in turn contains several transition-ids.
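
As a rough illustration of that hierarchy, the sketch below parses text in the show-transitions format into a lookup table from transition-id to (phone, hmm-state, pdf-id). The function name parse_show_transitions is hypothetical, and the parser assumes the exact line layout printed above.

```python
import re


def parse_show_transitions(text):
    """Build {transition-id: (phone, hmm-state, pdf-id)} from show-transitions text."""
    state_re = re.compile(
        r"Transition-state \d+: phone = (\S+) hmm-state = (\d+) pdf = (\d+)")
    tid_re = re.compile(r"Transition-id = (\d+)")
    table = {}
    current = None  # (phone, hmm-state, pdf-id) of the enclosing transition-state
    for line in text.splitlines():
        line = line.strip()
        m = state_re.match(line)
        if m:
            current = (m.group(1), int(m.group(2)), int(m.group(3)))
            continue
        m = tid_re.match(line)
        if m and current is not None:
            table[int(m.group(1))] = current
    return table


sample = """\
Transition-state 1: phone = sil hmm-state = 0 pdf = 0
Transition-id = 1 p = 0.966816 [self-loop]
Transition-id = 2 p = 0.01 [0 -> 1]
Transition-id = 3 p = 0.01 [0 -> 2]
Transition-id = 4 p = 0.013189 [0 -> 3]
Transition-state 2: phone = sil hmm-state = 1 pdf = 1
Transition-id = 5 p = 0.970016 [self-loop]
"""
tid_to_state = parse_show_transitions(sample)
print(tid_to_state[4], tid_to_state[5])
```

With such a table, transition-ids 1 and 4 both resolve to pdf 0, since they belong to the same transition-state.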

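Next, let's see what the neural network's inputs and outputs actually are, using steps/nnet as the example. Tracing the scripts down to steps/nnet/train.sh, we find the relevant commands: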

...
labels_tr="ark:ali-to-pdf $alidir/final.mdl \"ark:gunzip -c $alidir/ali.*.gz |\" ark:- | ali-to-post ark:- ark:- |"

...
feats_tr="ark:copy-feats scp:$dir/train.scp ark:- |"
...
# input-dim,
get_dim_from=$feature_transform
num_fea=$(feat-to-dim "$feats_tr nnet-forward \"$get_dim_from\" ark:- ark:- |" -)
# output-dim,
num_tgt=$(hmm-info --print-args=false $alidir/final.mdl | grep pdfs | awk '{ print $NF }')
...

  dnn)
    utils/nnet/make_nnet_proto.py $proto_opts \
      ${bn_dim:+ --bottleneck-dim=$bn_dim} \
      $num_fea $num_tgt $hid_layers $hid_dim >$nnet_proto
    ;;

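These key preparation steps make it clear that the network's input is the transformed feature vectors (feats_tr) and that its output is labels_tr. Below we run the commands above one at a time to see what the network's output (the target) looks like. labels_tr is generated in two steps: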

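ali-to-pdf: converts each transition-id in the alignment file above into its corresponding pdf-id;
ali-to-post: turns the resulting pdf-ids into [pdf, post] pairs, that is, each pdf-id together with its posterior probability.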

$ ali-to-pdf final.mdl "ark:gunzip -c ali.1.gz|" ark,t:- | head -n 1
speaker001_00003 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 3 3 4 440 440 440 440 440 440 440 441 442 442 442 442 38 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 92 92 93 93 93 94 94 188 188 188 188 188 188 188 188 188 189 189 189 190 446 446 446 447 447 448 448 448 448 448 448 448 448 448 448 448 448 452 452 452 453 454 454 454 454 454 454 454 371 371 371 371 371 371 372 372 372 372 372 372 372 372 372 373 373 373 458 458 459 459 459 459 459 459 459 459 460 460 460 460 460 460 460 460 460 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 4

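Look at the first two frames. In the alignment at the start of this article their transition-ids are 4 and 1, and both map to pdf 0, because both transition-ids belong to transition-state 1. Now run ali-to-post on this result: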

$ ali-to-pdf final.mdl "ark:gunzip -c ali.1.gz|" ark,t:- | head -n 1 | ali-to-post ark,t:- ark,t:-
speaker001_00003 [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] ...... [ 3 1 ] [ 3 1 ] [ 3 1 ] [ 3 1 ] [ 4 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 441 1 ] [ 442 1 ] [ 442 1 ] [ 442 1 ] [ 442 1 ] [ 38 1 ] [ 39 1 ] [ 39 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 92 1 ] [ 92 1 ]...... [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 3 1 ] [ 4 1 ]

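This gives the pdf-ids together with their posterior probabilities, which are all 1 here because each frame is assigned to a single pdf by the alignment.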
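
The following sketch shows, under simplifying assumptions, how such hard-alignment posteriors relate to dense training targets: each frame has one (pdf-id, 1.0) pair, which is equivalent to a one-hot row over all pdfs. The helper names and the tiny num_pdfs value are made up for illustration; Kaldi itself keeps the sparse Posterior form.

```python
def pdf_ids_to_post(pdf_ids):
    """Mimic the idea of ali-to-post: one (pdf-id, 1.0) pair per frame."""
    return [[(pdf, 1.0)] for pdf in pdf_ids]


def post_to_dense(post, num_pdfs):
    """Expand sparse per-frame posteriors into dense rows of length num_pdfs."""
    rows = []
    for frame in post:
        row = [0.0] * num_pdfs
        for pdf, prob in frame:
            row[pdf] = prob
        rows.append(row)
    return rows


pdf_ids = [0, 0, 3, 4]                    # hypothetical short alignment
post = pdf_ids_to_post(pdf_ids)
dense = post_to_dense(post, num_pdfs=6)   # num_pdfs=6 is illustrative only
for row in dense:
    print(row)
```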

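We now have the training data and the corresponding target labels. Next, consider the dimensions of the network's input and output. The network structure is written to the nnet_proto file by utils/nnet/make_nnet_proto.py. Two important arguments of that Python script, num_fea and num_tgt, are the input and output dimensions of the network. num_fea is obtained with feat-to-dim: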

$ feat-to-dim scp:../tri4b_dnn/train.scp ark,t:- | grep speaker001_00003
speaker001_00003 40

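These are fbank features with dimension 40. Before they are actually fed to the network, the feature vectors are transformed further. In steps/nnet/train.sh there is a splice parameter (default 5) that controls this transformation: each frame is joined with the 5 frames before it and the 5 frames after it, giving one large vector built from 11 frames (11 x 40 = 440 dimensions). The topology of this feature transform is also saved to final.feature_transform: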

$ more final.feature_transform
<Nnet>
<Splice> 440 40
[ -5 -4 -3 -2 -1 0 1 2 3 4 5 ]
<!EndOfComponent>
...

During training, the feature vectors are passed through this transform, so the final input dimension of the network is 440.
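
A minimal sketch of frame splicing with offsets -5 to 5 is given below. It assumes that frames near the start and end of an utterance are padded by repeating the first or last frame; that edge handling, the function name splice_frames, and the toy features are assumptions made for illustration.

```python
def splice_frames(frames, context=5):
    """Concatenate each frame with `context` frames on either side."""
    num_frames = len(frames)
    spliced = []
    for t in range(num_frames):
        row = []
        for offset in range(-context, context + 1):
            # Assumption: clamp out-of-range indices to the first/last frame.
            src = min(max(t + offset, 0), num_frames - 1)
            row.extend(frames[src])
        spliced.append(row)
    return spliced


# Hypothetical utterance: 8 frames of 40-dim features, frame t filled with t.
feats = [[float(t)] * 40 for t in range(8)]
spliced = splice_frames(feats, context=5)
print(len(spliced), len(spliced[0]))
```

The number of frames is unchanged; only the width of each row grows from 40 to 440.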

The value of num_tgt is obtained with hmm-info:

$ hmm-info final.mdl
number of phones 218
number of pdfs 1026
number of transition-ids 2834
number of transition-states 1413

$ hmm-info final.mdl | grep pdfs | awk '{ print $NF }'
1026
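
For readers less familiar with the grep and awk idiom, the same extraction can be sketched in Python. The function name parse_hmm_info is hypothetical, and the sample text is copied from the hmm-info output above.

```python
def parse_hmm_info(text):
    """Return {description: count} from hmm-info style output."""
    info = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep lines whose last field is a number, like awk's $NF.
        if len(parts) >= 2 and parts[-1].isdigit():
            info[" ".join(parts[:-1])] = int(parts[-1])
    return info


sample = """\
number of phones 218
number of pdfs 1026
number of transition-ids 2834
number of transition-states 1413
"""
info = parse_hmm_info(sample)
num_tgt = info["number of pdfs"]
print(num_tgt)
```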

So the output dimension of the network is 1026. Now look at nnet_proto:

<AffineTransform> <InputDim> 440 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.037344 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.109375 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.109375 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.109375 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1026 <BiasMean> 0.000000 <BiasRange> 0.000000 <ParamStddev> 0.109322 <LearnRateCoef> 1.000000 <BiasLearnRateCoef> 0.100000
<Softmax> <InputDim> 1026 <OutputDim> 1026

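Here we can see that the network's input dimension has grown from 40 to 440, and that the output dimension equals the number of pdfs (1026), which corresponds to the number of distinct HMM emission states.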
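
As a quick consistency check, the sketch below reads the layer dimensions from an abbreviated copy of the prototype and counts weights and biases. The proto text is shortened to the InputDim and OutputDim fields only, and the helper names are hypothetical.

```python
import re


def affine_shapes(proto_text):
    """Extract (input_dim, output_dim) for every <AffineTransform> line."""
    pattern = re.compile(
        r"<AffineTransform>\s+<InputDim>\s+(\d+)\s+<OutputDim>\s+(\d+)")
    return [(int(i), int(o)) for i, o in pattern.findall(proto_text)]


def count_parameters(shapes):
    """Weights plus biases for each affine layer."""
    return sum(i * o + o for i, o in shapes)


# Abbreviated prototype: initialisation fields are omitted on purpose.
proto = """\
<AffineTransform> <InputDim> 440 <OutputDim> 1024
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1026
<Softmax> <InputDim> 1026 <OutputDim> 1026
"""
shapes = affine_shapes(proto)
# Each layer's output must feed the next layer's input.
chained = all(shapes[k][1] == shapes[k + 1][0] for k in range(len(shapes) - 1))
total = count_parameters(shapes)
print(shapes[0], shapes[-1], chained, total)
```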

Tracing the code further, we eventually reach the implementation of a single training pass of the network, kaldi/src/nnetbin/nnet-train-frmshuff.cc:

Perform one iteration (epoch) of Neural Network training with mini-batch Stochastic Gradient Descent. The training targets are usually pdf-posteriors, prepared by ali-to-post.

Reading further into the code, a few key steps stand out:

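Parse the training options and configure the network.
Read the feature vectors and the target labels. The input is of type Matrix<BaseFloat>, and the target is of type Posterior, that is, <pdf-id, posterior> pairs.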

// get feature / target pair,
Matrix<BaseFloat> mat = feature_reader.Value();
Posterior targets = targets_reader.Value(utt);

Randomly shuffle the training data, which then serves as the network input and the expected output:

const CuMatrixBase<BaseFloat>& nnet_in = feature_randomizer.Value();
const Posterior& nnet_tgt = targets_randomizer.Value();
const Vector<BaseFloat>& frm_weights = weights_randomizer.Value();
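
The important property of this shuffling is that features and targets stay paired. A minimal sketch of that idea, assuming a single shared permutation is applied to both lists, is shown below; shuffle_pairs and the toy data are hypothetical and do not reproduce Kaldi's randomizer classes.

```python
import random


def shuffle_pairs(features, targets, seed=0):
    """Apply one random permutation to features and targets together."""
    assert len(features) == len(targets)
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    return [features[i] for i in order], [targets[i] for i in order]


# Toy data: frame k has the single feature value k and the k-th target.
features = [[0.0], [1.0], [2.0], [3.0]]
targets = [0, 0, 3, 4]
shuf_feats, shuf_tgts = shuffle_pairs(features, targets, seed=0)
print(shuf_feats, shuf_tgts)
```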

Forward pass, which computes the network's estimate nnet_out:

// forward pass,
nnet.Propagate(nnet_in, &nnet_out);
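
For intuition, a forward pass through the kind of network described by nnet_proto (affine layers with sigmoid hidden units and a softmax output) can be sketched in plain Python. The tiny layer sizes and weights are invented for illustration, and this is not how Kaldi's Propagate is implemented internally.

```python
import math


def affine(x, weights, bias):
    """Compute W x + b, with one row of `weights` per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def propagate(x, layers):
    """layers: list of (weights, bias); sigmoid on hidden layers, softmax last."""
    h = x
    for k, (w, b) in enumerate(layers):
        h = affine(h, w, b)
        h = softmax(h) if k == len(layers) - 1 else sigmoid(h)
    return h


# Tiny hypothetical network: 2 inputs -> 3 hidden units -> 2 outputs.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.1, 0.2], [-0.2, 0.4, 0.1]], [0.0, 0.0]),
]
nnet_out = propagate([1.0, 2.0], layers)
print(nnet_out)
```

The softmax output is a probability distribution over the output units, which in the real network are the 1026 pdfs.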

Compute the cost. Cross-entropy, mean squared error, and multitask objectives are supported. The result is stored in obj_diff:

// evaluate objective function we've chosen,
if (objective_function == "xent") {
  // gradients re-scaled by weights in Eval,
  xent.Eval(frm_weights, nnet_out, nnet_tgt, &obj_diff);
} else if (objective_function == "mse") {
  // gradients re-scaled by weights in Eval,
  mse.Eval(frm_weights, nnet_out, nnet_tgt, &obj_diff);
}
...
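
To illustrate what the two Eval calls compute, here is a hedged Python sketch of both objectives. It assumes dense one-hot targets, a softmax output layer, and an obj_diff equal to the frame-weighted difference between output and target; the function names are hypothetical and the real Kaldi code works on sparse Posterior targets and GPU matrices.

```python
import math


def xent_eval(frame_weights, nnet_out, nnet_tgt):
    """Cross-entropy loss and the frame-weighted (output - target) difference."""
    loss = 0.0
    obj_diff = []
    for w, out, tgt in zip(frame_weights, nnet_out, nnet_tgt):
        loss -= w * sum(t * math.log(o) for o, t in zip(out, tgt) if t > 0.0)
        obj_diff.append([w * (o - t) for o, t in zip(out, tgt)])
    return loss, obj_diff


def mse_eval(frame_weights, nnet_out, nnet_tgt):
    """Half squared error and the frame-weighted (output - target) difference."""
    loss = 0.0
    obj_diff = []
    for w, out, tgt in zip(frame_weights, nnet_out, nnet_tgt):
        loss += w * 0.5 * sum((o - t) ** 2 for o, t in zip(out, tgt))
        obj_diff.append([w * (o - t) for o, t in zip(out, tgt)])
    return loss, obj_diff


# One hypothetical frame: the target is pdf 0, the network gives it 0.7.
frame_weights = [1.0]
nnet_out = [[0.7, 0.2, 0.1]]
nnet_tgt = [[1.0, 0.0, 0.0]]
xent_loss, xent_diff = xent_eval(frame_weights, nnet_out, nnet_tgt)
mse_loss, mse_diff = mse_eval(frame_weights, nnet_out, nnet_tgt)
print(xent_loss, xent_diff, mse_loss)
```

In this sketch obj_diff is what would be passed to backpropagation: the output for the correct pdf is pushed up and the others are pushed down.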