https://github.com/guillaume-chevalier/Linear-Attention-Recurrent-Neural-Network Linear Attention Recurrent Neural Network (LARNN)

The LARNN is a recurrent attention module derived from the Long Short-Term Memory (LSTM) cell and from ideas of the consciousness Recurrent Neural Network (RNN). Yes, it LARNNs. The LARNN applies attention to its past cell state values over a limited window of size k. Its formulas are also derived from the Batch Normalized LSTM (BN-LSTM) cell and from the Transformer network's Multi-Head Attention Mechanism, which is used inside the cell so that the cell can query its own k past cell states through the attention window.
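
A minimal sketch of that idea in PyTorch (an assumption here, not the exact formulation from the repository): an LSTM-style cell whose fresh cell state queries a window of its k most recent cell states with multi-head attention. The window size, head count, and the linear mixing of the attended summary back into the cell state are illustrative choices.

<code python>
import torch
import torch.nn as nn


class LARNNCellSketch(nn.Module):
    """LSTM-style cell that attends over its own k most recent cell states."""

    def __init__(self, input_size, hidden_size, window_k=5, num_heads=4):
        super().__init__()
        self.window_k = window_k
        self.lstm_cell = nn.LSTMCell(input_size, hidden_size)
        # Multi-head attention used *inside* the cell to query past cell states.
        # hidden_size must be divisible by num_heads.
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # How the attended summary is mixed back into the cell state is an assumption.
        self.mix = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, x, state, past_cells):
        # x: (batch, input_size); state: (h, c); past_cells: list of (batch, hidden_size)
        h, c = self.lstm_cell(x, state)
        if past_cells:
            window = torch.stack(past_cells[-self.window_k:], dim=1)  # (batch, <=k, hidden)
            query = c.unsqueeze(1)                                    # (batch, 1, hidden)
            attended, _ = self.attention(query, window, window)
            c = self.mix(torch.cat([c, attended.squeeze(1)], dim=-1))
        past_cells.append(c)
        return h, c


# Usage: unroll the cell over a sequence, keeping the list of past cell states.
cell = LARNNCellSketch(input_size=8, hidden_size=32)
h = c = torch.zeros(4, 32)
past = []
for x_t in torch.randn(10, 4, 8):   # 10 time steps, batch of 4
    h, c = cell(x_t, (h, c), past)
</code>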

http://petar-v.com/GAT/ Graph Attention Networks

https://arxiv.org/abs/1808.04444 Character-Level Language Modeling with Deeper Self-Attention

In this paper, we show that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.
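
A hedged sketch of the auxiliary-loss part in PyTorch: intermediate transformer layers get their own prediction loss, added to the final-layer loss with a small weight. The shared output head, the choice of layers, and the weight 0.5 are assumptions; the paper's additional losses at intermediate sequence positions would be added in the same spirit.

<code python>
import torch
import torch.nn as nn
import torch.nn.functional as F


def char_lm_loss(layer_outputs, output_head, targets, aux_weight=0.5):
    """layer_outputs: list of (batch, seq_len, d_model) tensors, one per transformer layer.
    output_head: a linear layer mapping d_model -> vocab_size, shared across layers (assumption).
    targets: (batch, seq_len) character ids."""
    def ce(hidden):
        logits = output_head(hidden)                     # (batch, seq_len, vocab)
        return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    main_loss = ce(layer_outputs[-1])
    # Auxiliary losses at intermediate layers, down-weighted and averaged.
    aux = [ce(h) for h in layer_outputs[:-1]]
    return main_loss + aux_weight * sum(aux) / max(len(aux), 1)


# Usage with dummy tensors standing in for a 4-layer model's outputs.
head = nn.Linear(16, 27)
outputs = [torch.randn(2, 12, 16) for _ in range(4)]
targets = torch.randint(0, 27, (2, 12))
loss = char_lm_loss(outputs, head, targets)
</code>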

https://arxiv.org/abs/1808.08946v2 Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures

Our experimental results show that: 1) self-attentional networks and CNNs do not outperform RNNs in modeling subject-verb agreement over long distances; 2) self-attentional networks perform distinctly better than RNNs and CNNs on word sense disambiguation.

https://arxiv.org/abs/1808.03867 Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction

https://arxiv.org/abs/1809.11087 Learning to Remember, Forget and Ignore using Attention Control in Memory

Applying knowledge gained from psychological studies, we designed a new model called Differentiable Working Memory (DWM) in order to specifically emulate human working memory. As it shows the same functional characteristics as working memory, it robustly learns psychology-inspired tasks and converges faster than comparable state-of-the-art models. Moreover, the DWM model successfully generalizes to sequences two orders of magnitude longer than the ones used in training. Our in-depth analysis shows that the behavior of DWM is interpretable and that it learns to have fine control over memory, allowing it to retain, ignore or forget information based on its relevance.

https://openreview.net/forum?id=rJxHsjRqFQ Hyperbolic Attention Networks

By changing only the geometry of the embedding of object representations, we can use the embedding space more efficiently without increasing the number of parameters of the model. Because the number of objects grows exponentially with semantic distance from the query, hyperbolic geometry, as opposed to Euclidean geometry, can encode those objects without interference. Our method shows improvements in generalization on neural machine translation on WMT'14 (English to German), learning on graphs (both on synthetic and real-world graph tasks) and visual question answering (CLEVR) tasks while keeping the neural representations compact.
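
A minimal sketch of distance-based attention in the Poincaré ball, assuming PyTorch; the softmax over negative hyperbolic distances and the temperature beta are illustrative assumptions rather than the paper's exact operator.

<code python>
import torch


def poincare_distance(u, v, eps=1e-5):
    # u, v: (..., d) points inside the unit ball (norm < 1).
    sq_dist = torch.sum((u - v) ** 2, dim=-1)
    u_factor = torch.clamp(1 - torch.sum(u ** 2, dim=-1), min=eps)
    v_factor = torch.clamp(1 - torch.sum(v ** 2, dim=-1), min=eps)
    return torch.acosh(1 + 2 * sq_dist / (u_factor * v_factor))


def hyperbolic_attention(queries, keys, values, beta=1.0):
    # queries: (n_q, d); keys, values: (n_k, d); embeddings assumed to lie in the ball.
    dist = poincare_distance(queries.unsqueeze(1), keys.unsqueeze(0))  # (n_q, n_k)
    weights = torch.softmax(-beta * dist, dim=-1)                      # closer points get more weight
    return weights @ values


# Usage: small random vectors stay inside the unit ball, standing in for learned embeddings.
q = torch.randn(3, 8) * 0.1
k = torch.randn(5, 8) * 0.1
v = torch.randn(5, 8)
out = hyperbolic_attention(q, k, v)   # (3, 8)
</code>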