Name Layer Sharing

Intent Improve accuracy by using implicit ensembles that span network layers.

Motivation How can we share weights across several layers?




Known Uses

Related Patterns

Batch Normalization


References

Deep Networks with Stochastic Depth

Stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and obtain deep networks. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function.

Swapout: Learning an ensemble of deep architectures

When viewed as a regularization method swapout not only inhibits co-adaptation of units in a layer, similar to dropout, but also across network layers. We conjecture that swapout achieves strong regularization by implicitly tying the parameters across layers. When viewed as an ensemble training method, it samples a much richer set of architectures than existing methods such as dropout or stochastic depth.

Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex

We begin with the observation that a shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers.
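The equivalence quoted above can be checked numerically on a toy example. The following is only a minimal sketch under stated assumptions: the tanh residual block, the absence of a per-step input, and all function names are illustrative choices, not taken from the paper. It shows that a residual stack whose blocks all reuse one weight matrix computes the same thing as a recurrent transition unrolled for the same number of steps, with a fraction of the parameters.

```python
import math
import random


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def residual_block(W, h):
    """One residual block, assumed here to be h + tanh(W h)."""
    return [x + math.tanh(z) for x, z in zip(h, matvec(W, h))]


def resnet_forward(weights, h):
    """Deep residual network: one weight matrix per block."""
    for W in weights:
        h = residual_block(W, h)
    return h


def rnn_forward(W, h, steps):
    """Recurrent form: the same transition applied `steps` times."""
    for _ in range(steps):
        h = residual_block(W, h)
    return h


rng = random.Random(0)
dim, depth = 4, 6
W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
h0 = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Tying every block to the same matrix turns the deep stack into a recurrence.
tied = resnet_forward([W] * depth, h0)
unrolled = rnn_forward(W, h0, depth)

untied_params = depth * dim * dim  # separate matrix per block
tied_params = dim * dim            # one matrix shared by all blocks
```

In this toy setting the tied network and the unrolled recurrence run exactly the same operations, so their outputs match, while the parameter count no longer grows with depth.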

A ResNet can be reformulated into a recurrent form that is almost identical to a conventional RNN.

Low-rank passthrough neural networks

We observe that these architectures, hereby characterized as Passthrough Networks, in addition to the mitigation of the vanishing gradient problem, enable the decoupling of the network state size from the number of parameters of the network, a possibility that is exploited in some recent works but not thoroughly explored.

FractalNet: Ultra-Deep Neural Networks without Residuals

Drop-path. A fractal network block functions with some connections between layers disabled, provided some path from input to output is still available. Drop-path guarantees at least one such path, while sampling a subnetwork with many other paths disabled. During training, presenting a different active subnetwork to each mini-batch prevents co-adaptation of parallel paths. A global sampling strategy returns a single column as a subnetwork. Alternating it with local sampling encourages the development of individual columns as performant stand-alone subnetworks.
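The drop-path description above can be sketched for a single join. This is a simplified illustration, not the FractalNet implementation: the function names, the `drop_prob` and `global_prob` parameters, and the use of an element-wise mean at the join are assumptions made for the sketch. In a full network, global sampling would pick one column for the whole network rather than for one join.

```python
import random


def local_mask(num_paths, drop_prob, rng):
    """Drop each incoming path independently, but always keep at least one."""
    mask = [rng.random() >= drop_prob for _ in range(num_paths)]
    if not any(mask):
        mask[rng.randrange(num_paths)] = True
    return mask


def global_mask(num_paths, column):
    """Keep only the chosen column; every other path is disabled."""
    return [i == column for i in range(num_paths)]


def sample_mask(num_paths, drop_prob, rng, global_prob=0.5):
    """Pick global or local sampling at random for the current mini-batch."""
    if rng.random() < global_prob:
        return global_mask(num_paths, rng.randrange(num_paths))
    return local_mask(num_paths, drop_prob, rng)


def join(path_outputs, mask):
    """Element-wise mean over the paths that are still active."""
    active = [out for out, keep in zip(path_outputs, mask) if keep]
    return [sum(vals) / len(active) for vals in zip(*active)]
```

Because every sampled mask keeps at least one path, `join` never divides by zero, which mirrors the guarantee that some route from input to output always remains.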

A related paper that I think is more significant is this one: [1603.09382] Deep Networks with Stochastic Depth
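For reference, the training procedure of that paper (randomly dropping layers per mini-batch and bypassing them with the identity) could be sketched roughly as follows. This is a hedged illustration only: blocks are modeled as plain functions on lists of floats, the names `stochastic_depth_forward` and `linear_decay` are hypothetical, and the test-time scaling by the survival probability and the linear decay schedule are assumptions of this sketch rather than details stated on this page.

```python
import random


def stochastic_depth_forward(x, blocks, survival_probs, training, rng=random):
    """Residual forward pass in which each block may be skipped during training."""
    for block, p in zip(blocks, survival_probs):
        if training:
            # Keep the block with probability p; otherwise pass x through unchanged.
            if rng.random() < p:
                x = [xi + fi for xi, fi in zip(x, block(x))]
        else:
            # At test time every block is used, scaled by its survival probability.
            x = [xi + p * fi for xi, fi in zip(x, block(x))]
    return x


def linear_decay(num_blocks, final_prob=0.5):
    """Survival probabilities falling linearly from near 1 to final_prob."""
    return [
        1.0 - (l / num_blocks) * (1.0 - final_prob)
        for l in range(1, num_blocks + 1)
    ]
```

With this structure the network seen during any one training step is shorter than the full stack, while the same single representation `x` is updated block after block.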

The stochastic depth paper has broader significance because it shows that deep models that don’t fall under the paradigm of “deep neural networks learn a different representation at every layer” can work really well. This gives a lot of evidence in favor of the “deep neural networks learn a multi-step program” paradigm. Previously, people were aware of both interpretations. Most deep learning models could be described by both paradigms equally well but I feel like the representation learning interpretation was more popular (considering one of the main deep learning conferences is called the International Conference on Learning Representations). The stochastic depth paper shows that you can have a single representation that gets updated by many steps of a program and that that works really well. It also shows that just letting this program run longer is helpful, even if it wasn’t trained to run that long. This suggests that the relatively neglected multi-step program interpretation may have been the more important one all along.

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization.

Hierarchical Multiscale Recurrent Neural Networks

In this paper, we propose a novel multiscale approach, called the hierarchical multiscale recurrent neural networks, which can capture the latent hierarchical structure in the sequence by encoding the temporal dependencies with different timescales using a novel update mechanism. We show some evidence that our proposed multiscale architecture can discover underlying hierarchical structure in the sequences without using explicit boundary information.