# Long Short-Term Memory

**Alias** Recurrent Neural Network

**Intent**

Handle data sequences by folding the network into itself.

**Motivation**

How can we handle data that consists of sequences?

**Structure**

A schematic illustration of an LSTM neuron. Each LSTM neuron has an input gate, a forget gate, and an output gate.
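
The three gates can be sketched in code. The following is a minimal illustration of one time step, assuming the commonly used LSTM formulation without peephole connections; the parameter names (`W_i`, `b_f`, and so on) and the `zero_params` helper are hypothetical and chosen only for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, b):
    """Return W @ x + b for a list-of-rows matrix W and list vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One time step of an LSTM cell (assumed standard form, no peepholes)."""
    z = x + h_prev  # concatenate the input with the previous hidden state
    i = [sigmoid(v) for v in affine(params["W_i"], z, params["b_i"])]    # input gate
    f = [sigmoid(v) for v in affine(params["W_f"], z, params["b_f"])]    # forget gate
    o = [sigmoid(v) for v in affine(params["W_o"], z, params["b_o"])]    # output gate
    g = [math.tanh(v) for v in affine(params["W_g"], z, params["b_g"])]  # candidate values
    # The forget gate scales the old cell state; the input gate scales the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # The output gate decides how much of the cell state is exposed as hidden state.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def zero_params(input_size, hidden_size):
    """Hypothetical helper: all-zero weights and biases for the four transforms."""
    params = {}
    for name in ("i", "f", "o", "g"):
        params["W_" + name] = [[0.0] * (input_size + hidden_size)
                               for _ in range(hidden_size)]
        params["b_" + name] = [0.0] * hidden_size
    return params

if __name__ == "__main__":
    h, c = lstm_step([1.0, -1.0], [0.0], [1.0], zero_params(2, 1))
    print(h, c)
```

With all-zero parameters every gate sits at 0.5 and the candidate is 0, so the step simply halves the previous cell state. This makes the role of the forget gate easy to see in isolation.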

**Discussion**

**Known Uses**

**Related Patterns**

<Diagram>

**References**

http://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

http://arxiv.org/abs/1602.02218 Strongly-Typed Recurrent Neural Networks

This paper imports ideas from physics and functional programming into RNN design to provide guiding principles. From physics, we introduce type constraints, analogous to the constraints that forbid adding meters to seconds. From functional programming, we require that strongly-typed architectures factorize into stateless learnware and state-dependent firmware, reducing the impact of side-effects. The features learned by strongly-typed nets have a simple semantic interpretation via dynamic average-pooling on one-dimensional convolutions. We also show that strongly-typed gradients are better behaved than in classical architectures, and characterize the representational power of strongly-typed nets. Finally, experiments show that, despite being more constrained, strongly-typed architectures achieve lower training and comparable generalization error to classical architectures.

Two recent papers provide empirical evidence that recurrent (horizontal) connections are problematic even after gradients are stabilized: (Zaremba et al., 2015) find that Dropout performs better when restricted to vertical connections and (Laurent et al., 2015) find that Batch Normalization fails unless restricted to vertical connections (Ioffe & Szegedy, 2015). More precisely, (Laurent et al., 2015) find that Batch Normalization improves training but not test error when restricted to vertical connections; it fails completely when also applied to horizontal connections.

http://colah.github.io/posts/2015-09-NN-Types-FP/

https://arxiv.org/pdf/1603.08983v4.pdf Adaptive Computation Time for Recurrent Neural Networks

This paper has introduced Adaptive Computation Time (ACT), a method that allows recurrent neural networks to learn how many updates to perform for each input they receive. Experiments on synthetic data prove that ACT can make otherwise inaccessible problems straightforward for RNNs to learn, and that it is able to dynamically adapt the amount of computation it uses to the demands of the data.
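
One way to picture "learning how many updates to perform" is a halting rule over per-update halting probabilities. The sketch below is an assumption-laden illustration, not the paper's implementation: the function name `act_halting_weights`, the `epsilon` default, and the choice to treat the end of the list as the update limit are all hypothetical. It stops at the first update where the cumulative halting probability reaches `1 - epsilon` and gives that final update the remaining weight, so the weights sum to 1.

```python
def act_halting_weights(halting_probs, epsilon=0.01):
    """Weights for combining intermediate states under an ACT-style halting rule.

    halting_probs: values in (0, 1), one per candidate update (in a real model
    these would be produced by the network; here they are given).
    Assumption: the length of the list acts as the maximum number of updates.
    """
    weights = []
    cumulative = 0.0
    for n, h in enumerate(halting_probs):
        is_last = n == len(halting_probs) - 1
        if cumulative + h >= 1.0 - epsilon or is_last:
            # Final update: assign the remainder so that the weights sum to 1.
            weights.append(1.0 - cumulative)
            break
        weights.append(h)
        cumulative += h
    return weights

if __name__ == "__main__":
    w = act_halting_weights([0.3, 0.4, 0.5, 0.9])
    print(len(w), w)  # the number of weights is the number of updates performed
```

An input whose halting probabilities are large early on stops after few updates, while an input with small halting probabilities keeps computing until the limit.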

http://arxiv.org/abs/1503.04069 LSTM: A Search Space Odyssey

http://papers.nips.cc/paper/5642-parallel-multi-dimensional-lstm-with-application-to-fast-biomedical-volumetric-image-segmentation.pdf Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation

http://torch.ch/blog/2016/07/25/nce.html

http://arxiv.org/pdf/1607.03474v2.pdf Recurrent Highway Networks

http://arxiv.org/abs/1511.07916 Natural Language Understanding with Distributed Representation

http://arxiv.org/abs/1506.00019v4 A Critical Review of Recurrent Neural Networks for Sequence Learning

https://arxiv.org/abs/1607.03085v3 Recurrent Memory Array Structures

https://arxiv.org/abs/1602.06291 Contextual LSTM (CLSTM) models for Large scale NLP tasks

We present CLSTM (Contextual LSTM), an extension of the recurrent neural network LSTM (Long-Short Term Memory) model, where we incorporate contextual features (e.g., topics) into the model.

http://openreview.net/pdf?id=rJsiFTYex A Way out of the Odyssey: Analyzing and Combining Recent Insights for LSTMs

We propose and analyze a series of architectural modifications for LSTM networks resulting in improved performance for text classification datasets. We observe compounding improvements on traditional LSTMs using Monte Carlo test-time model averaging, deep vector averaging (DVA), and residual connections, along with four other suggested modifications.

When exploring a new problem, having a simple yet competitive off-the-shelf baseline is fundamental to new research. For instance, Caruana et al. (2008) showed random forests to be a strong baseline for many high-dimensional supervised learning tasks. For computer vision, off-the-shelf convolutional neural networks (CNNs) have earned their reputation as a strong baseline (Sharif Razavian et al., 2014) and basic building block for more complex models like visual question answering (Xiong et al., 2016). For natural language processing (NLP) and other sequential modeling tasks, recurrent neural networks (RNNs), and in particular Long Short-Term Memory (LSTM) networks, with a linear projection layer at the end have begun to attain a similar status. However, the standard LSTM is in many ways lacking as a baseline. Zaremba (2015), Gal (2015), and others show that large improvements are possible using a forget bias, inverted dropout regularization or bidirectionality. We add three major additions with similar improvements to off-the-shelf LSTMs: Monte Carlo model averaging, deep vector averaging, and residual connections. We analyze these and other more common improvements.

http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf An Empirical Exploration of Recurrent Network Architectures

The Recurrent Neural Network (RNN) is an extremely powerful sequence model that is often difficult to train. The Long Short-Term Memory (LSTM) is a specific RNN architecture whose design makes it much easier to train. While wildly successful in practice, the LSTM’s architecture appears to be ad-hoc so it is not clear if it is optimal, and the significance of its individual components is unclear.

In this work, we aim to determine whether the LSTM architecture is optimal or whether much better architectures exist. We conducted a thorough architecture search where we evaluated over ten thousand different RNN architectures, and identified an architecture that outperforms both the LSTM and the recently-introduced Gated Recurrent Unit (GRU) on some but not all tasks. **We found that adding a bias of 1 to the LSTM’s forget gate closes the gap between the LSTM and the GRU.**
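
A toy calculation hints at why the forget-gate bias matters. The sketch below is a simplification under stated assumptions, not an experiment from the paper: the forget gate's pre-activation is taken to equal its bias (as if weights and inputs were zero, roughly the situation at initialization), nothing new is written to the cell, and the helper name `retained_fraction` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def retained_fraction(forget_bias, steps):
    """Fraction of the initial cell state left after `steps` time steps.

    Assumes the forget gate value is sigmoid(forget_bias) at every step
    and that the input gate contributes nothing.
    """
    return sigmoid(forget_bias) ** steps

if __name__ == "__main__":
    for bias in (0.0, 1.0):
        print("bias", bias, "retained after 10 steps:", retained_fraction(bias, 10))
```

With a bias of 0 the gate starts at 0.5 and the cell state halves at every step; with a bias of 1 the gate starts near 0.73, so far more of the state (and of the gradient flowing through it) survives over the same number of steps.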

http://www.aclweb.org/anthology/P15-1150 Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

Because of their superior ability to preserve sequence information over time, Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. TreeLSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank).

http://videolectures.net/deeplearning2016_grefenstette_augmented_rnn/ Beyond Sequence to Sequence with Augmented RNNs

https://arxiv.org/abs/1602.08210v3 Architectural Complexity Measures of Recurrent Neural Networks

Our main contribution is twofold: first, we present a rigorous graph-theoretic framework describing the connecting architectures of RNNs in general. Second, we propose three architecture complexity measures of RNNs:

(a) the recurrent depth, which captures the RNN's over-time nonlinear complexity,

(b) the feedforward depth, which captures the local input-output nonlinearity (similar to the “depth” in feedforward neural networks (FNNs)), and

(c) the recurrent skip coefficient, which captures how rapidly the information propagates over time.

We rigorously prove each measure's existence and computability. Our experimental results show that RNNs might benefit from larger recurrent depth and feedforward depth.

https://arxiv.org/abs/1707.05589 On the State of the Art of Evaluation in Neural Language Models

We reevaluate several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models. We establish a new state of the art on the Penn Treebank and Wikitext-2 corpora, as well as strong baselines on the Hutter Prize dataset.

https://arxiv.org/pdf/1706.02222v1.pdf Gated Recurrent Neural Tensor Network

Representing the hidden layers of an RNN with more expressive operations (i.e., tensor products) helps it learn a more complex relationship between the current input and the previous hidden layer information.

https://arxiv.org/abs/1801.10308v1 Nested LSTMs

Nested LSTMs outperform both stacked and single-layer LSTMs with similar numbers of parameters in our experiments on various character-level language modeling tasks, and the inner memories of an LSTM learn longer term dependencies compared with the higher-level units of a stacked LSTM.

https://arxiv.org/abs/1803.01271 An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks.

We have described a simple temporal convolutional network (TCN) that combines best practices such as dilations and residual connections with the causal convolutions needed for autoregressive prediction. The experimental results indicate that TCN models substantially outperform generic recurrent architectures such as LSTMs and GRUs.

The distinguishing characteristics of TCNs are: 1) the convolutions in the architecture are causal, meaning that there is no information “leakage” from future to past; 2) the architecture can take a sequence of any length and map it to an output sequence of the same length, just as with an RNN. Beyond this, we emphasize how to build very long effective history sizes (i.e., the ability for the networks to look very far into the past to make a prediction) using a combination of very deep networks (augmented with residual layers) and dilated convolutions.
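
The two characteristics above, causality and equal input and output length, can be sketched with a plain-Python dilated convolution. This is an illustrative sketch rather than the paper's code: the function names are hypothetical, the sequence is implicitly zero-padded on the left, and `receptive_field` assumes a stack with exactly one convolution per layer and no residual blocks.

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Positions before the start of the sequence count as zero, so y[t]
    depends only on x[0..t] and the output has the same length as x.
    """
    y = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # implicit left zero-padding
                total += w * x[idx]
        y.append(total)
    return y

def receptive_field(kernel_size, dilations):
    """History length seen by a stack with one causal convolution per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print(causal_dilated_conv1d(x, [1.0, 1.0], dilation=1))
    print(causal_dilated_conv1d(x, [1.0, 1.0], dilation=2))
    print(receptive_field(2, [1, 2, 4, 8]))
```

Doubling the dilation at each layer makes the history size grow exponentially with depth while the parameter count grows only linearly, which is the reason dilations are combined with deep stacks.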

https://arxiv.org/abs/1803.02839 The emergent algebraic structure of RNNs and embeddings in NLP

We conclude that words naturally embed themselves in a Lie group and that RNNs form a nonlinear representation of the group. Appealing to these results, we propose a novel class of recurrent-like neural networks and a word embedding scheme.

https://arxiv.org/abs/1805.09692 Been There, Done That: Meta-Learning with Episodic Recall

Meta-learning agents excel at rapidly learning new tasks from open-ended task distributions; yet, they forget what they learn about each task as soon as the next begins. When tasks reoccur, as they do in natural environments, meta-learning agents must explore again instead of immediately exploiting previously discovered solutions. We propose a formalism for generating open-ended yet repetitious environments, then develop a meta-learning architecture for solving these environments. This architecture melds the standard LSTM working memory with a differentiable neural episodic memory. We explore the capabilities of agents with this episodic LSTM in five meta-learning environments with reoccurring tasks, ranging from bandits to navigation and stochastic sequential decision problems.

https://arxiv.org/abs/1808.09357 Rational Recurrences

Recently, connections have been shown between convolutional neural networks (CNNs) and weighted finite state automata (WFSAs), leading to new interpretations and insights. In this work, we show that some recurrent neural networks also share this connection to WFSAs. We characterize this connection formally, defining rational recurrences to be recurrent hidden state update functions that can be written as the Forward calculation of a finite set of WFSAs. We show that several recent neural models use rational recurrences. Our analysis provides a fresh view of these models and facilitates devising new neural architectures that draw inspiration from WFSAs. We present one such model, which performs better than two recent baselines on language modeling and text classification. Our results demonstrate that transferring intuitions from classical models like WFSAs can be an effective approach to designing and understanding neural models.

https://github.com/Noahs-ARK/soft_patterns

https://arxiv.org/pdf/1810.12456.pdf Learning Distributed Representations of Symbolic Structure Using Binding and Unbinding Operations

We propose the TPRU, a recurrent unit that, at each time step, explicitly executes structural-role binding and unbinding operations to incorporate structural information into learning. Experiments are conducted on both the Logical Entailment task and the Multi-genre Natural Language Inference (MNLI) task, and our TPR-derived recurrent unit provides strong performance with significantly fewer parameters than LSTM and GRU baselines. Furthermore, our learnt TPRU trained on MNLI demonstrates solid generalisation ability on downstream tasks.