Seq2seq Pay Attention to Self Attention: Part 2 (Chinese Version)

21 min read · Oct 3, 2018

Series links

Part 1 https://medium.com/%40bgg/seq2seq-pay-attention-to-self-attention-part-1-%E4%B8%AD%E6%96%87%E7%89%88-2714bbd92727
English Version https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-cf81bf32c73d

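Part 1 introduced Seq2seq and the attention model. This article focuses on the Transformer, the model proposed in Google's 2017 paper "Attention Is All You Need". The Transformer is built on two main ideas: (1) self-attention and (2) multi-head attention. It also removes the main drawback of the earlier attention model, which could not be parallelized, and achieves excellent results.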

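Introduction

In Part 1 we saw how the attention model works. Its drawbacks are that it cannot be parallelized and that it ignores the relationships among the words within the input sentence and among the words within the target sentence.

Self-attention was introduced in 2017 to address these problems.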

Self Attention

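Self-attention is one of the main concepts of the Transformer, the model Google proposed in "Attention Is All You Need". We can think of the Transformer as a black box: feed in the input sentence and it produces the target sentence.

What makes it special is that the Transformer drops the RNN and CNN architectures entirely.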

The transformer

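Like the Seq2seq model, the Transformer consists of two parts: an Encoder and a Decoder. What is different is that the Transformer's Encoder is a stack of 6 encoder layers (N=6 in the paper), and so is the Decoder, whereas the earlier attention model used a single encoder and a single decoder.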

Query, Key, Value

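Before entering the Transformer, let us review the attention model. From the input sentence <x1, x2, x3, …, xm> it produces the hidden states h1, h2, …, hm. The context vector c_{i} is the sum of these hidden states weighted by the attention scores α, and from the context vector and the decoder hidden state the model computes the target sentence <y1, …, yn>. In other words, the input sentence is the input and the target sentence is the output.

Here is another way to put it:

Each word in the input sentence is stored as an <address Key, element Value> pair, and each word in the target sentence is a Query. The context vector can then be re-explained in terms of Key, Value, and Query: compute the similarity between the Query and each Key to obtain a weight for the Value that corresponds to that Key. The weight expresses how important that piece of information is, which is the attention score; the Value is the information itself. The weighted sum of the Values is the final attention/context vector.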

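I find this idea very original. Few papers explain how to get from the attention model to the Transformer in these terms, which indirectly makes "Attention Is All You Need" hard to get into. If you are interested, see the paper where the key/value idea originated, Key-Value Memory Networks for Directly Reading Documents [8].

In NLP, the Key and the Value usually both point to the same word embedding vector.

With the Key, Value, and Query concepts in hand, we can rewrite the Decoder formulas of the attention model. 1. The score is e_{ij} = Similarity(Query, Key_{i}); the previous article listed three ways to compute it, and here we choose the dot product. 2. Given Similarity(Query, Key_{i}), softmax gives a_{i} = Softmax(sim_{i}). The sum of the Value_{i} weighted by the attention scores a_{i} is then Attention(Query, Source), that is, the context/attention vector.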
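The two steps above can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions: the function names `softmax` and `attention` and the toy keys, values, and query are made up for the example and do not come from the paper.

```python
import numpy as np

def softmax(x):
    # Subtracting the max improves numerical stability and does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(query, keys, values):
    """Context vector for a single query (hypothetical helper)."""
    scores = keys @ query        # e_i = <Query, Key_i>, shape (m,)
    weights = softmax(scores)    # a_i, non-negative and summing to 1
    context = weights @ values   # sum_i a_i * Value_i, shape (d_v,)
    return context, weights

# Toy data (made up): 3 source positions with 2-dimensional keys and values.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
query = np.array([2.0, 0.0])

context, weights = attention(query, keys, values)
print(weights)   # positions 0 and 2 match the query best and share the largest weight
print(context)
```

Keys that are more similar to the query receive larger weights, so their values contribute more to the context vector.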

With Key, Value, and Query understood, we are ready to enter the world of the Transformer.

Scaled Dot-Product Attention

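Looking closely, the Transformer computes attention scores the same way the attention model does, except that it also divides by the square root of d_{k}. The purpose is to keep large dot products from pushing the softmax output to values that are almost all 0 or 1.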
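A small experiment makes the effect of the scaling visible. The sketch below assumes query and key components drawn from a standard normal distribution, in which case the dot product has a variance of about d_{k}; the random data and variable names are illustrative only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64                                # key dimension used in the paper
q = rng.standard_normal(d_k)            # one query (random, for illustration)
K = rng.standard_normal((4, d_k))       # four keys (random, for illustration)

raw = K @ q                             # unscaled dot products
scaled = raw / np.sqrt(d_k)             # scaled dot products

print(softmax(raw).round(3))            # tends to put almost all mass on one key
print(softmax(scaled).round(3))         # noticeably smoother distribution
```

Dividing the scores by a constant larger than 1 can only flatten the softmax, so the largest weight after scaling is never larger than before scaling.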

Three kinds of Attention

The Transformer uses attention in three ways: 1. encoder self-attention, inside the encoder; 2. decoder self-attention, inside the decoder; 3. encoder-decoder attention, which works much like the earlier attention model.

Next we go through the encoder and the decoder in turn to introduce encoder and decoder self-attention.

Encoder

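We split the Transformer into a left half and a right half. The left half is the Encoder; as mentioned, N=6 in "Attention Is All You Need", so the Encoder is a stack of 6 encoder layers. When computing encoder self-attention, the model also uses multiple heads to learn features in different subspaces; multi-head attention is discussed later.

How is encoder self-attention computed?

We first take a close-up view and look at Attention(q_{t}, K, V), the attention for a single word of the input sentence, and then handle all the words of the input sentence at once with the matrix form Attention(Q, K, V).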

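The first step is to create the three encoder input vectors Q, K, and V. For example, each word in "Are you very big?" has an embedding vector, and each embedding gets its own q, k, and v by multiplying it by three weight matrices, which are randomly initialized and then learned during training. In the paper the model dimension is d_{model}=512.

The second step is to compute the score <q_{t}, k_{s}> with a dot product, similar to the score e_{ij} in the attention model. Suppose we are computing self-attention for the first word, "Are". We compare every word of the input sentence, "Are", "you", "very", "big", against "Are". These scores determine how much attention to give each word when encoding the word at a particular position. So when computing self-attention for position #1, the first score is the dot product of q1 and k1 ("Are" vs "Are"), the second is the dot product of q1 and k2 ("Are" vs "you"), and so on.

The third step is to divide the scores by the square root of d_{k} (d_{k}=64 in the paper), pass them through the exponential function, and multiply by 1/Z. The result is simply the attention (softmax) score: Z is the sum of the exponentials that softmax divides by. The final score is the attention score, which tells us how much attention to put on each position, the same idea as in the attention model. Interestingly, a word's own position usually gets the highest score, but sometimes other positions turn out to be related as well.

The last step is to multiply each value by its attention score and sum the results to obtain the attention vector z_{i}; for position #1 this is z1. The idea is the same as in the earlier attention model.

That completes the self-attention computation. The resulting vector is passed on to the feed-forward neural network. In practice all the words are processed at the same time, so the inputs form a matrix rather than a single word vector, and the resulting attention vectors form an attention matrix Z.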
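The four steps can be written in matrix form as Z = softmax(Q K^T / sqrt(d_{k})) V. Below is a minimal NumPy sketch of that computation. The toy sizes (d_model=8, d_k=4), the random embeddings, and the names `self_attention`, `W_q`, `W_k`, `W_v` are assumptions made for the example; in a real model the weight matrices are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over all positions at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v     # step 1: project the embeddings
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # steps 2-3: dot products, then scaling
    A = softmax(scores, axis=-1)            # step 3: one softmax per row (per query)
    Z = A @ V                               # step 4: weighted sum of the values
    return Z, A

rng = np.random.default_rng(0)
tokens = ["Are", "you", "very", "big"]
d_model, d_k = 8, 4                         # toy sizes; the paper uses 512 and 64
X = rng.standard_normal((len(tokens), d_model))   # stand-in word embeddings
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

Z, A = self_attention(X, W_q, W_k, W_v)
print(A.round(2))   # row t holds the attention scores of word t over all words
print(Z.shape)      # (4, 4): one attention vector per word
```

Row t of `A` corresponds to Attention(q_{t}, K, V) from the close-up view, and stacking all rows gives the matrix Z.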

Multi-head attention

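If we compute only one attention, it is hard to capture information from all the representation subspaces of the input sentence. To improve the model, the paper proposes a new approach, multi-head attention. Instead of performing a single attention with d_{model}-dimensional keys, values, and queries, it linearly projects the keys, values, and queries h times into different subspaces, with dimensions d_{q}, d_{k}, and d_{v} (queries and keys must share the same dimension for the dot product, so d_{q}=d_{k}), and performs attention on each projection separately. Here d_{k}=d_{v}=d_{model}/h=64; that is, the inputs are projected onto h heads.

The Transformer uses 8 attention heads, so each encoder/decoder layer has 8 sets of query/key/value projection matrices, and each set projects the input embeddings into a different subspace. Repeating the self-attention computation above for every head gives 8 different matrices Z. The feed-forward layer, however, expects one matrix rather than 8, so we concatenate the 8 matrices and multiply by a weight matrix to get back a single matrix Z.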
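As a rough sketch of the idea, the code below runs h=8 heads with d_{k}=64 on random inputs, concatenates the 8 head outputs, and multiplies by an output weight matrix. The function name, the names `W_q`, `W_k`, `W_v`, `W_o`, and the random initialization scale are assumptions for illustration; real implementations usually batch the heads instead of looping.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):        # one set of projections per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)                      # Z for this head, shape (m, d_k)
    concat = np.concatenate(heads, axis=-1)      # (m, h * d_k)
    return concat @ W_o                          # back to a single (m, d_model) matrix

rng = np.random.default_rng(0)
m, d_model, h = 4, 512, 8
d_k = d_model // h                               # 64, as in the paper
X = rng.standard_normal((m, d_model))
W_q, W_k, W_v = (rng.standard_normal((h, d_model, d_k)) * 0.02 for _ in range(3))
W_o = rng.standard_normal((h * d_k, d_model)) * 0.02

Z = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
print(Z.shape)   # (4, 512)
```

Because h * d_{k} = d_{model}, the output has the same shape as the input, which is what lets the layers be stacked.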

Residual Connections

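The Encoder has one more architectural detail: every sub-layer, both the multi-head attention and the feed-forward layer that follows it, is wrapped in a residual connection followed by layer normalization.

A residual connection builds a residual structure: the layer learns the difference (residual) between its output and its input, so the output becomes the input plus that learned residual. This makes small changes easier to pick up during training. The structure is widely used in computer vision; if you are interested, see Kaiming He's Deep Residual Learning for Image Recognition [10].

Layer normalization is one of the normalization methods used in deep learning and is most often compared with batch normalization. Its advantage is that it is computed independently for each sample, normalizing across the features of a single sample. Batch normalization instead normalizes each feature dimension across the batch, so it depends on the batch size. See Layer Normalization [11].

Fig. 13. Layer Normalization and Residual Connections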
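The "add and normalize" step can be sketched as LayerNorm(x + Sublayer(x)). The version below is simplified: it leaves out the learned gain and bias of layer normalization, and the sub-layer is just a random linear map standing in for attention or the feed-forward network. The names `layer_norm` and `add_and_norm` are made up for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each row (one position of one sample) across its features.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection first, then layer normalization.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))          # 4 positions, 8 features (toy sizes)
W = rng.standard_normal((8, 8))          # stand-in for a real sub-layer
out = add_and_norm(x, lambda t: t @ W)

print(out.mean(axis=-1).round(6))        # close to 0 for every position
print(out.std(axis=-1).round(3))         # close to 1 for every position
```

Nothing in `layer_norm` looks at other rows, which is why the result does not depend on the batch size.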

Position-wise Feed-Forward Networks

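Each attention sub-layer in the Encoder and Decoder feeds into a feed-forward network (FFN): two linear transformations with a ReLU in between. The paper applies the FFN to each position (each word of the input sentence) separately; for example, if the input is <x1, x2, …, xm>, there are m words and the FFN is applied once per position.

Every position goes through the same linear transformations. This is equivalent to a 1D convolution with kernel size 1, which treats each position independently and so keeps the positions intact (see CNN). The model's input and output dimension is d_{model}=512, while the inner layer has dimension 2048. Note that this is the reverse of the bottleneck block in Kaiming He's Deep Residual Learning for Image Recognition [10], which narrows the inner layers to reduce computation: here the inner layer is four times wider than the input and output.

Fig. 14. Position-wise Feed-Forward Networks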
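A minimal sketch of the position-wise FFN, FFN(x) = max(0, x W1 + b1) W2 + b2, using the paper's sizes d_{model}=512 and inner dimension 2048. The weights are random stand-ins and the names are assumptions for the example.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    hidden = np.maximum(0.0, X @ W1 + b1)    # first linear layer + ReLU, (m, d_ff)
    return hidden @ W2 + b2                  # second linear layer, (m, d_model)

rng = np.random.default_rng(0)
m, d_model, d_ff = 4, 512, 2048
X = rng.standard_normal((m, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

out = position_wise_ffn(X, W1, b1, W2, b2)
print(out.shape)   # (4, 512)

# The same weights are used at every position, so running the first
# position alone gives the same vector as running the whole sentence.
row0 = position_wise_ffn(X[:1], W1, b1, W2, b2)
print(np.allclose(out[0], row0[0]))
```

Since no position looks at any other, all m positions can be computed in a single matrix multiplication.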

Positional Encoding

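Unlike an RNN, multi-head attention cannot learn the position of each word in the input sentence. For example, to multi-head attention, "Are you very big?" and "Are big very you?" are the same sentence. The Transformer therefore uses positional encoding to represent the relative/absolute position of each word, and adds it to the embedding vector of each word in the input sentence.

The paper computes the positional encoding with PE(pos, 2i) = sin(pos/10000^{2i/d_{model}}) and PE(pos, 2i+1) = cos(pos/10000^{2i/d_{model}}), where pos is the position and i indexes the dimension: even dimensions use the sin function and odd dimensions use the cos function. Because of the trigonometric functions, each dimension of the positional encoding is a periodic wave with its own wavelength. For any fixed offset k, PE[pos+k] can be written as a linear transformation of PE[pos], which lets the model learn the relative positions between words at different positions.

In the figure below, assume the embedding dimension is 4:

Each row is the vector after positional encoding. The first row, for example, is the sum of the embedding vector of the first word in the input sentence and its positional encoding vector. Every row therefore has dimension d_{model}, and there are pos rows in total, one for each word in the input sentence.

The figure below shows an input sentence of 20 words with 512-dimensional word vectors; the pattern changes with position.

Fig. 15. Positional Encoding example (embedding dim=4)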
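The sinusoidal formulas can be sketched directly in NumPy. The function name `positional_encoding` is made up for the example and it assumes an even d_{model}. The second half of the code checks the linearity claim numerically: each (sin, cos) pair at position pos+k is a rotation of the pair at position pos, with an angle that depends only on k.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cos
    return pe

d_model = 4
pe = positional_encoding(max_len=20, d_model=d_model)
print(pe[:3].round(3))   # first three positions; position 0 is [0, 1, 0, 1]

# Build the block-diagonal rotation matrix for a fixed offset k.
k = 3
freqs = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)
R = np.zeros((d_model, d_model))
for j, w in enumerate(freqs):
    c, s = np.cos(w * k), np.sin(w * k)
    R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]

print(np.allclose(pe[5 + k], R @ pe[5]))   # PE[pos + k] equals R applied to PE[pos]
```

In the model, `pe[:m]` would simply be added to the (m, d_{model}) matrix of word embeddings.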

That wraps up the Encoder. Next, let us look at how the Decoder works.

Decoder

Masked multi-head attention

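The Decoder works much like the Encoder; its sub-layers also go through residual connections and then layer normalization. In the Encoder's self-attention, the keys, values, and queries all come from the output of the previous encoder layer, and the same holds for the Decoder.

The difference is that, to keep decoding from jumping ahead to the second half of the sentence while still translating the first half, future positions are masked (set to -∞) before the softmax in the self-attention computation. This ensures that the prediction for position i depends only on the outputs at positions before i. It is a complementary measure that goes with encoder-decoder attention, because encoder-decoder attention can see the entire encoder sentence.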
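The masking step can be sketched as follows: scores above the diagonal (future positions) are set to -∞ before the softmax, so they receive a weight of exactly 0. The function name and the random inputs are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    n, d_k = Q.shape[0], K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future positions
    A = softmax(scores)                                  # exp(-inf) = 0
    return A @ V, A

rng = np.random.default_rng(0)
n, d_k = 4, 8                                            # toy sizes
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))

Z, A = masked_self_attention(Q, K, V)
print(A.round(2))   # lower-triangular: row i has zeros in every column j > i
```

The first row can only attend to the first position, so its single non-zero weight is 1.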

Encoder-Decoder Attention

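Encoder-decoder attention differs from encoder/decoder self-attention: its Query comes from the decoder self-attention, while its Key and Value are the encoder output.

We have now covered all three kinds of attention. Next, let us look at how the whole model works.

It starts by feeding the input word sequence to the Encoder. The Encoder output becomes the Key and Value attention vectors, which are passed to the encoder-decoder attention layer to help the Decoder decide which positions of the input sequence to focus on while decoding.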
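The only change from self-attention is where Q, K, and V come from. In the sketch below the queries are computed from the decoder states and the keys and values from the encoder output, so the attention matrix has one row per target position and one column per source position. The sizes, random inputs, and names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    Q = decoder_states @ W_q                       # (n, d_k) queries from the decoder
    K = encoder_output @ W_k                       # (m, d_k) keys from the encoder
    V = encoder_output @ W_v                       # (m, d_k) values from the encoder
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))    # (n, m)
    return A @ V, A

rng = np.random.default_rng(0)
m, n, d_model, d_k = 5, 3, 8, 4                    # 5 source words, 3 target words
encoder_output = rng.standard_normal((m, d_model))
decoder_states = rng.standard_normal((n, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

context, A = encoder_decoder_attention(decoder_states, encoder_output, W_q, W_k, W_v)
print(A.shape)         # (3, 5): each target position attends over all source positions
print(context.shape)   # (3, 4)
```

No mask is applied here, which matches the point that encoder-decoder attention can see the entire input sentence.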

The Final Linear and Softmax Layer

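The Decoder finally outputs a vector, which is passed to a last linear layer followed by a softmax. The linear layer is a plain fully connected layer that produces a score for every word in the vocabulary; the softmax layer turns the scores into probabilities, and the word with the highest probability is the word to output at this time step.

Fig. 18. The Linear layer + Softmax layer in the Transformer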
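A toy version of this last step is shown below. The five-word vocabulary, the random weights, and the greedy argmax choice are assumptions made for the example; real systems use vocabularies of tens of thousands of entries and often decode with beam search instead.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

vocab = ["<eos>", "are", "you", "very", "big"]    # hypothetical toy vocabulary
rng = np.random.default_rng(0)
d_model = 8                                       # toy size; the paper uses 512
W = rng.standard_normal((d_model, len(vocab)))    # linear layer weights (random here)
b = np.zeros(len(vocab))
decoder_output = rng.standard_normal(d_model)     # stand-in for the decoder's vector

logits = decoder_output @ W + b                   # one score per vocabulary word
probs = softmax(logits)                           # scores turned into probabilities
next_word = vocab[int(np.argmax(probs))]          # pick the most probable word

print(probs.round(3), next_word)
```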

Why self attention?

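In the past, the core architecture of both the Encoder and the Decoder was the RNN. An RNN converts the input word sequence (x1, …, xn) into hidden encodings (h1, …, hn) one step at a time, in order, and then produces the target word sequence (y1, …, yn). This sequential nature prevents the model from computing in parallel across time steps, makes computation expensive, and makes it hard to capture long-range dependencies between words in long sequences.

With the Transformer, multi-head attention addresses the problems of parallelization and high computational cost, and long-range dependencies are handled by self-attention, where any two words are compared directly, so the path between them has length 1.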

Future

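In the financial industry, companies can use customer journeys to understand customer behavior in depth, and in turn offer better products and services, raise customer satisfaction, and create value. Unlike traditional basic features, however, extracting information from sequential customer-journey data is very difficult. With self-attention in our toolbox, we can apply this way of handling sequence data to complex customer journeys and explore the business opportunities behind customers' latent behavior.

For readers who want to dig deeper into self-attention, I also recommend ATRank [7], a paper from Alibaba that applies self-attention to product recommendation and reports better results.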

References

[1] Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. arXiv:1406.1078v3 (2014).

[2] Sequence to Sequence Learning with Neural Networks. arXiv:1409.3215v3 (2014).

[3] Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473v7 (2016).

[4] Effective Approaches to Attention-based Neural Machine Translation. arXiv:1508.04025v5 (2015).

[5] Convolutional Sequence to Sequence Learning. arXiv:1705.03122v3 (2017).

[6] Attention Is All You Need. arXiv:1706.03762v5 (2017).

[7] ATRank: An Attention-Based User Behavior Modeling Framework for Recommendation. arXiv:1711.06632v2 (2017).

[8] Key-Value Memory Networks for Directly Reading Documents. arXiv:1606.03126v2 (2016).

[9] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044v3 (2016).

[10] Deep Residual Learning for Image Recognition. arXiv:1512.03385v1 (2015).

[11] Layer Normalization. arXiv:1607.06450v1 (2016).

深度学习中的注意力机制(2017版), 张俊林的博客, CSDN博客 (blog.csdn.net)

The Illustrated Transformer (jalammar.github.io)

論文解説 Attention Is All You Need (Transformer), ディープラーニングブログ (deeplearning.hatenablog.com)

tensor-to-tensor[理论篇], daiwk-github博客 (daiwk.github.io)

Attention? Attention! (lilianweng.github.io)

Paper Dissected: "Attention is All You Need" Explained (mlexplained.com)

Weight Normalization and Layer Normalization Explained (Normalization in Deep Learning Part 2)