From Principles to Practice: NVIDIA Teaches You to Build an RNN with PyTorch (Part 2)

Author: 三川

2017-05-06 20:36

Lead: With dynamic computation graphs behind it, does PyTorch have an edge over TensorFlow?

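Leiphone editor's note: This is the second part of "From Principles to Practice: NVIDIA Teaches You to Build an RNN with PyTorch"; the first part was published separately. The article originally appeared on the NVIDIA blog and was translated by Leiphone.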

Hands-on code

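Before I start building the network, I need to set up a data loader. In deep learning it is common to run models on batches of examples, which speeds up training through parallelism and gives a smoother gradient at each step. Let's do that here; later I will explain how to batch the stack-manipulation process described in Part 1. The PyTorch text library has a built-in system that automatically generates batches by grouping examples of similar length, and the following Python code loads some data through it. After running this code, train_iter, dev_iter, and test_iter hold iterators that loop over batches in the training, validation, and test splits of SNLI.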

from torchtext import data, datasets
TEXT = datasets.snli.ParsedTextField(lower=True)
TRANSITIONS = datasets.snli.ShiftReduceField()
LABELS = data.Field(sequential=False)
train, dev, test = datasets.SNLI.splits(
    TEXT, TRANSITIONS, LABELS, wv_type='glove.42B')
TEXT.build_vocab(train, dev, test)
train_iter, dev_iter, test_iter = data.BucketIterator.splits(
    (train, dev, test), batch_size=64)
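To make the idea of grouping examples by length concrete, here is a minimal pure-Python sketch. It is not torchtext's implementation: the helper name `bucket_batches` and the toy sentences are hypothetical, and details a real iterator would handle, such as padding and shuffling, are left out.

```python
def bucket_batches(examples, batch_size):
    """Group examples of similar length into batches (illustrative only)."""
    # Sort by length so that neighbours in the list have similar lengths.
    ordered = sorted(examples, key=len)
    # Slice the sorted list into consecutive chunks of batch_size.
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]


sentences = [
    "a dog runs".split(),
    "the cat sat on the mat".split(),
    "hello".split(),
    "two people are walking outside".split(),
]
batches = bucket_batches(sentences, batch_size=2)
for b in batches:
    print([len(s) for s in b])
```

Each batch ends up holding sentences of similar length, which keeps the amount of padding small when the batch is later turned into a tensor.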

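You can find the rest of the code, including the training loop and the accuracy measurement, in train.py. Now for the model. As described in Part 1, a SPINN encoder contains a parameterized Reduce layer and an optional recurrent Tracker that keeps track of sentence context by updating a hidden state every time the network reads a word or applies Reduce. The following code shows that creating a SPINN simply means creating these two submodules and putting them in a container to be used later.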

import itertools  # used by forward() and unbatch() further down
import torch
from torch import nn
# subclass the Module class from PyTorch’s neural network package
class SPINN(nn.Module):
   def __init__(self, config):
       super(SPINN, self).__init__()
       self.config = config
       self.reduce = Reduce(config.d_hidden, config.d_tracker)
       if config.d_tracker is not None:
           self.tracker = Tracker(config.d_hidden, config.d_tracker)

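SPINN.__init__ is called once, when the model is created. It allocates and initializes parameters but performs no neural network operations and builds no computation graph. The code that runs on each new batch of data is defined in SPINN.forward; "forward" is the standard PyTorch name for the user-implemented method that defines a model's forward pass. It is effectively just an implementation of the stack-manipulation algorithm described above, written in ordinary Python and operating on a batch of buffers and stacks, one of each for every example. I iterate over the "shift" and "reduce" operations contained in transitions, running the Tracker if it exists, and go through each example in the batch to either apply the "shift" operation or add the example to a list of those that need the "reduce" operation. Then I run the Reduce layer on all the examples in that list and push the results back onto their respective stacks.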

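   # Method of SPINN. SHIFT and REDUCE are transition constants that are
   # assumed to be defined elsewhere in the full code.
   def forward(self, buffers, transitions):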
       # The input comes in as a single tensor of word embeddings;
       # I need it to be a list of stacks, one for each example in
       # the batch, that we can pop from independently. The words in
       # each example have already been reversed, so that they can
       # be read from left to right by popping from the end of each
       # list; they have also been prefixed with a null value.
       buffers = [list(torch.split(b.squeeze(1), 1, 0))
                  for b in torch.split(buffers, 1, 1)]
       # we also need two null values at the bottom of each stack,
       # so we can copy from the nulls in the input; these nulls
       # are all needed so that the tracker can run even if the
       # buffer or stack is empty
       stacks = [[buf[0], buf[0]] for buf in buffers]
       if hasattr(self, 'tracker'):
           self.tracker.reset_state()
       for trans_batch in transitions:
           if hasattr(self, 'tracker'):
               # I described the Tracker earlier as taking 4
               # arguments (context_t, b, s1, s2), but here I
               # provide the stack contents as a single argument
               # while storing the context inside the Tracker
               # object itself.
               tracker_states, _ = self.tracker(buffers, stacks)
           else:
               tracker_states = itertools.repeat(None)
           lefts, rights, trackings = [], [], []
           batch = zip(trans_batch, buffers, stacks, tracker_states)
           for transition, buf, stack, tracking in batch:
               if transition == SHIFT:
                   stack.append(buf.pop())
               elif transition == REDUCE:
                   rights.append(stack.pop())
                   lefts.append(stack.pop())
                   trackings.append(tracking)
           if rights:
               reduced = iter(self.reduce(lefts, rights, trackings))
               for transition, stack in zip(trans_batch, stacks):
                   if transition == REDUCE:
                       stack.append(next(reduced))
       return [stack.pop() for stack in stacks]
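The control flow of this method can be tried out without PyTorch. The sketch below is a toy stand-in, not the author's code: words are plain strings, the hypothetical `toy_reduce` simply brackets its two children instead of running a learned layer, and the Tracker and the null padding entries are omitted. `SHIFT` and `REDUCE` are placeholder constants chosen for this example.

```python
SHIFT, REDUCE = "shift", "reduce"  # placeholder constants for this sketch


def toy_reduce(lefts, rights):
    # Stand-in for the learned Reduce layer: bracket the two children.
    return ["(" + left + " " + right + ")"
            for left, right in zip(lefts, rights)]


def toy_forward(buffers, transitions):
    # Reverse each buffer so that pop() reads the sentence left to right.
    buffers = [list(reversed(words)) for words in buffers]
    stacks = [[] for _ in buffers]
    for trans_batch in transitions:  # one list of transitions per time step
        lefts, rights = [], []
        for transition, buf, stack in zip(trans_batch, buffers, stacks):
            if transition == SHIFT:
                stack.append(buf.pop())
            elif transition == REDUCE:
                rights.append(stack.pop())
                lefts.append(stack.pop())
        if rights:
            # Run the "layer" once over every example that needs a reduce,
            # then hand the results back to the matching stacks, in order.
            reduced = iter(toy_reduce(lefts, rights))
            for transition, stack in zip(trans_batch, stacks):
                if transition == REDUCE:
                    stack.append(next(reduced))
    return [stack.pop() for stack in stacks]


buffers = [["the", "cat", "sat"], ["dogs", "chase", "cats"]]
# transitions[t][k] is the operation for example k at time step t
transitions = [
    [SHIFT, SHIFT],
    [SHIFT, SHIFT],
    [REDUCE, SHIFT],
    [SHIFT, REDUCE],
    [REDUCE, REDUCE],
]
result = toy_forward(buffers, transitions)
print(result)
```

The two examples follow different parse trees, yet at every time step all pending reduces are collected and processed in a single call, which is the point where the real model gets its batched GPU work.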

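Calling self.tracker or self.reduce runs the forward method of the Tracker or Reduce submodule, respectively, which takes a list of examples on which to apply the operation. All the math-heavy, GPU-accelerated operations that benefit from batched execution take place inside Tracker and Reduce, so it makes sense for the main forward method to operate on the examples independently, keeping a separate buffer and stack for each example in the batch. To write those functions more cleanly, I will use some helpers that turn these lists of examples into batched tensors and back.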

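I would like the Reduce module to batch its arguments automatically to speed up computation, then unbatch them so they can be pushed and popped independently later. The composition function that combines each pair of left and right sub-phrases into a representation of the parent phrase is a TreeLSTM, a variant of the ordinary LSTM. This composition function requires the state of each child to consist of two tensors, a hidden state h and a memory cell state c. The function is defined by two pieces: two linear layers (nn.Linear) operating on the children's hidden states, and a nonlinear combination function, tree_lstm, that combines the output of the linear layers with the children's memory cell states. In the SPINN, this is extended with a third linear layer that operates on the Tracker's hidden state.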


def tree_lstm(c1, c2, lstm_in):
   # Takes the memory cell states (c1, c2) of the two children, as
   # well as the sum of linear transformations of the children’s
   # hidden states (lstm_in)
   # That sum of transformed hidden states is broken up into a
   # candidate output a and four gates (i, f1, f2, and o).
   a, i, f1, f2, o = lstm_in.chunk(5, 1)
   c = a.tanh() * i.sigmoid() + f1.sigmoid() * c1 + f2.sigmoid() * c2
   h = o.sigmoid() * c.tanh()
   return h, c

class Reduce(nn.Module):
   def __init__(self, size, tracker_size=None):
       super(Reduce, self).__init__()
       self.left = nn.Linear(size, 5 * size)
       self.right = nn.Linear(size, 5 * size, bias=False)
       if tracker_size is not None:
           self.track = nn.Linear(tracker_size, 5 * size, bias=False)

   def forward(self, left_in, right_in, tracking=None):
       left, right = batch(left_in), batch(right_in)
       tracking = batch(tracking)
       lstm_in = self.left(left[0])
       lstm_in += self.right(right[0])
       if hasattr(self, 'track'):
           lstm_in += self.track(tracking[0])
       return unbatch(tree_lstm(left[1], right[1], lstm_in))
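To see what the gate arithmetic in tree_lstm does, it can help to run it on single numbers. The following sketch repeats the same formulas with plain floats; the names `sigmoid` and `scalar_tree_lstm` are hypothetical helpers for illustration and are not part of the original code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def scalar_tree_lstm(c1, c2, a, i, f1, f2, o):
    # Same arithmetic as tree_lstm, on single floats instead of tensors.
    c = math.tanh(a) * sigmoid(i) + sigmoid(f1) * c1 + sigmoid(f2) * c2
    h = sigmoid(o) * math.tanh(c)
    return h, c


# With every pre-activation at zero, each gate equals 0.5 and the candidate
# tanh(a) equals 0, so the new cell state is the average of the children's.
h, c = scalar_tree_lstm(c1=1.0, c2=3.0, a=0.0, i=0.0, f1=0.0, f2=0.0, o=0.0)
print(h, c)
```

The two forget gates f1 and f2 decide independently how much of each child's memory survives, which is what distinguishes this cell from a sequential LSTM with a single previous state.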

Since both the Reduce layer and the similarly implemented Tracker work with LSTMs, the batch and unbatch helper functions operate on pairs of hidden and memory states (h, c).

def batch(states):
   if states is None:
       return None
   states = tuple(states)
   if states[0] is None:
       return None
   # states is a list of B tensors of dimension (1, 2H)
   # this returns two tensors of dimension (B, H)
   return torch.cat(states, 0).chunk(2, 1)

def unbatch(state):
   if state is None:
       return itertools.repeat(None)
   # state is a pair of tensors of dimension (B, H)
   # this returns a list of B tensors of dimension (1, 2H)
   return torch.split(torch.cat(state, 1), 1, 0)
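The shape bookkeeping in these two helpers can be checked with nested lists. This is a rough analogue under the assumption, taken from the comments above, that each per-example state is one row holding h followed by c; `batch_lists` and `unbatch_lists` are hypothetical names and use no tensors.

```python
def batch_lists(states):
    # states: B rows, each a flat list [h_1..h_H, c_1..c_H] of length 2H
    half = len(states[0]) // 2
    hs = [row[:half] for row in states]  # B rows of length H
    cs = [row[half:] for row in states]  # B rows of length H
    return hs, cs


def unbatch_lists(state):
    # state: a pair (hs, cs); returns B flat rows of length 2H again
    hs, cs = state
    return [h + c for h, c in zip(hs, cs)]


rows = [[1, 2, 10, 20], [3, 4, 30, 40]]  # B = 2, H = 2
hs, cs = batch_lists(rows)
print(hs, cs)
print(unbatch_lists((hs, cs)) == rows)
```

Batching followed by unbatching returns the original rows, so the stacks can keep storing one combined state per example while the layers see separate h and c batches.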

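That is all there is to the hands-on walkthrough. The rest of the code, including the Tracker, is in spinn.py. The classifier layers, which compute an SNLI category from the two sentence encodings and compare it with the target to produce the final loss variable, are in model.py. The forward code of SPINN and its submodules produces an extraordinarily complex computation graph (illustrated by a figure in the original post) that culminates in loss. Its details are completely different for every batch in the dataset, yet it can be backpropagated automatically each time, with very little overhead, simply by calling loss.backward(), a function built into PyTorch that performs backpropagation from any point in a graph.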

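The model and hyperparameters in the full code match the performance reported in the original SPINN paper, but train several times faster on a GPU because the implementation takes full advantage of batching and of PyTorch's efficiency. The original SPINN implementation takes 21 minutes to compile its computation graph (which means the debugging cycle during implementation is at least that long) and then about five days to train. The version described here has no compilation step and takes about 13 hours to train on a Tesla K40 GPU, or about nine hours on a Quadro GP100.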


Integrating reinforcement learning

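The version of the model described above without a Tracker is actually well suited to TensorFlow's tf.fold, a new domain-specific language for special cases of dynamic computation graphs. The version with a Tracker would be much harder to implement, because adding a Tracker means switching from the recursive approach to the stack-based approach. As in the code above, that approach is most naturally implemented with conditional branches that depend on the values of the input. Fold has no built-in conditional branch operation, so the graph structure of a model built with it can depend only on the structure of the input, not on its values. In addition, it would be effectively impossible to build a SPINN whose Tracker decides how to parse the input sentence, because graph structures in Fold, although they depend on the structure of an input example, must be completely fixed once that example is loaded.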

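Researchers at DeepMind and Google Brain have explored one such model. They applied reinforcement learning to train a SPINN's Tracker to parse input sentences without any external parsing data. Essentially, such a model starts with random guesses and learns by rewarding itself when its parses happen to produce good accuracy on the overall classification task. The researchers wrote that they "use batch size 1 since the computation graph needs to be rebuilt for every example at every iteration depending on the samples from the policy network [Tracker]." With PyTorch, however, they could use batched training even on a network like this one, with its complex, stochastically varying structure.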

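PyTorch is also the first framework with reinforcement learning (RL) built into the library, in the form of stochastic computation graphs. This makes policy gradient RL as easy to use as backpropagation. To add it to the model described above, you only need to rewrite the first few lines of the main SPINN forward loop as shown below, letting the Tracker define the probability of each kind of parser transition.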

# nn.functional contains neural network operations without parameters
from torch.nn import functional as F
transitions = []
for i in range(len(buffers[0]) * 2 - 3): # we know how many steps
   # obtain raw scores for each kind of parser transition
   tracker_states, transition_scores = self.tracker(buffers, stacks)
   # use a softmax function to normalize scores into probabilities,
   # then sample from the distribution these probabilities define
   transition_batch = F.softmax(transition_scores).multinomial()
   transitions.append(transition_batch)
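Two details of this loop can be illustrated without PyTorch: the number of steps and the sampling itself. On my reading of the earlier comments, each buffer holds the N words plus one leading null entry, so `len(buffers[0]) * 2 - 3` equals 2N - 1, that is, N shifts plus N - 1 reduces for a binary parse. The sketch below uses hypothetical helpers (`softmax`, `sample_transition`, `num_transitions`) and made-up scores; like the excerpt, it does not check that the sampled sequence forms a valid parse.

```python
import math
import random


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_transition(scores, rng):
    # Draw one transition index according to the softmax probabilities.
    probs = softmax(scores)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def num_transitions(buffer_len):
    # buffer_len counts the N words plus one leading null entry.
    return buffer_len * 2 - 3


rng = random.Random(0)
n_words = 4
steps = num_transitions(n_words + 1)  # 2 * 4 - 1
sampled = [sample_transition([0.5, 1.5], rng) for _ in range(steps)]
print(steps, sampled)
```

Because each transition is sampled rather than chosen by argmax, the same sentence can be parsed differently on different passes, which is what gives the policy gradient something to learn from.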

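Once the batch has run all the way through and the model knows how accurately it predicted its categories, I can send reward signals back through these stochastic computation graph nodes, in addition to backpropagating through the rest of the graph in the traditional way: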

# losses should contain a loss per example, while mean and std
# represent averages across many batches
rewards = (-losses - mean) / std
for transition in transitions:
   transition.reinforce(rewards)
# connect the stochastic nodes to the final loss variable
# so that backpropagation can find them, multiplying by zero
# because this trick shouldn’t change the loss value
loss = losses.mean() + 0 * sum(transitions).sum()
# perform backpropagation through deterministic nodes and
# policy gradient RL for stochastic nodes
loss.backward()
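The reward line is the part of this excerpt that is easiest to check by hand. The sketch below reproduces only that normalization in plain Python with made-up losses. It assumes that mean and std are statistics of the negated losses, which the excerpt does not spell out, and it computes them from a single toy batch rather than across many batches. Later PyTorch releases dropped `reinforce()` in favor of the `torch.distributions` package, so treat the excerpt itself as a 2017-era API.

```python
import statistics


def normalized_rewards(losses, mean, std):
    # A lower loss is a better outcome, so the raw reward is -loss.
    # Subtracting a baseline and dividing by a scale mirrors the line
    # rewards = (-losses - mean) / std from the excerpt.
    return [(-loss - mean) / std for loss in losses]


losses = [0.2, 0.4, 0.6]          # made-up per-example losses
raw = [-loss for loss in losses]
mean = statistics.mean(raw)       # assumption: statistics of -loss
std = statistics.pstdev(raw)
rewards = normalized_rewards(losses, mean, std)
print(rewards)
```

Examples with below-average loss receive a positive reward and those with above-average loss a negative one, so the sampled transitions that led to better predictions become more likely.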

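The results the Google researchers reported for SPINN plus reinforcement learning were slightly better than those the original SPINN obtained on SNLI, even though the reinforcement learning version used no precomputed parse trees. Deep reinforcement learning for NLP is a brand-new field with wide-open research problems. By building reinforcement learning into the framework, PyTorch greatly lowers the barrier to entry.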

Via NVIDIA; translated by Leiphone.


Copyright Leiphone. Reproduction without authorization is prohibited.