Decoupled Back-Propagation

There are many variants of “back-propagation”. The most common is gradient descent; a notable variant called Rprop (see: https://en.wikipedia.org/wiki/Rprop ) is an extreme simplification that uses only the sign of the gradient to perform its update. Natural-gradient-based methods are second-order update mechanisms; an interesting variant called NES (https://en.wikipedia.org/wiki/Natural_evolution_strategy) employs evolutionary methods. Feedback alignment is another simple method that is extremely efficient (see: http://arxiv.org/pdf/1411.0247.pdf ). In general, back-propagation does not necessarily require a strict application of an analytic gradient calculation. What is essential is that there is some approximation of an appropriate weight-change update and a corresponding structure to propagate the updates. Incidentally, recent research (see: http://cbmm.mit.edu/publications/how-important-weight-symmetry-backpropagation-0 ) appears to conclude that the magnitude of the gradient update isn’t as important as its sign.
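
To make the sign-only idea concrete, here is a minimal Rprop-style step in numpy. It is a sketch of the basic scheme only (the backtracking variants are omitted), and the constants are the commonly quoted defaults, assumed here for illustration.

```python
import numpy as np

# Sketch of an Rprop-style update: the per-weight step size grows when
# consecutive gradient signs agree and shrinks when they disagree; the
# gradient's magnitude is never used.
def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    agree = np.sign(grad) * np.sign(prev_grad)
    step = np.where(agree > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(agree < 0, np.maximum(step * eta_minus, step_min), step)
    w = w - np.sign(grad) * step      # only the sign of the gradient matters
    return w, step
```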

https://arxiv.org/abs/1502.04156 Towards Biologically Plausible Deep Learning

A recently discovered method called feedback alignment shows that the weights used for propagating the error backward do not have to be symmetric with the weights used for propagating the activations forward. In fact, random feedback weights work equally well, because the network learns how to make the feedback useful.
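
A minimal numpy sketch of this idea for a single hidden layer is shown below: the backward pass multiplies the output error by a fixed random matrix B rather than by the transpose of the forward weights. All shapes, names, and the tanh nonlinearity are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(784, 256))   # forward weights (trained)
W2 = rng.normal(scale=0.01, size=(256, 10))    # forward weights (trained)
B  = rng.normal(scale=0.01, size=(10, 256))    # fixed random feedback weights

def feedback_alignment_step(x, y_onehot, lr=0.1):
    h = np.tanh(x @ W1)
    y = h @ W2
    e = y - y_onehot                    # output error
    dh = (e @ B) * (1 - h ** 2)         # random feedback instead of e @ W2.T
    W2[:] -= lr * h.T @ e
    W1[:] -= lr * x.T @ dh
```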

http://arxiv.org/abs/1407.7906 How Auto-Encoders Could Provide Credit Assignment in Deep Networks via Target Propagation

http://arxiv.org/abs/1505.05424 Weight Uncertainty in Neural Networks

The method regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood. We show that this principled kind of regularisation yields comparable performance to dropout on MNIST classification.
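
The objective being minimised can be sketched as below, assuming a diagonal Gaussian posterior over the weights of a single linear layer and a standard normal prior; this is an illustrative reconstruction of the idea, not the paper's code.

```python
import numpy as np

def variational_free_energy(mu, sigma, x, y_onehot, rng=np.random.default_rng(0)):
    """Negative ELBO = data-fit term + compression cost (KL to the prior)."""
    w = mu + sigma * rng.normal(size=mu.shape)           # reparameterised weight sample
    logits = x @ w
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    nll = -np.sum(y_onehot * np.log(p + 1e-12))          # expected negative log-likelihood
    kl = np.sum(np.log(1.0 / sigma) + (sigma ** 2 + mu ** 2) / 2 - 0.5)  # KL(q || N(0, 1))
    return nll + kl   # minimised w.r.t. (mu, sigma), e.g. via reparameterised gradients
```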

http://deliprao.com/archives/187 https://arxiv.org/abs/1608.05343 Decoupled Neural Interfaces using Synthetic Gradients

https://deepmind.com/blog/decoupled-neural-networks-using-synthetic-gradients/

The true gradient information does percolate backwards through the network, but more slowly and over many training iterations, through the losses of the synthetic gradient models. The synthetic gradient models approximate and smooth over the absence of true gradients.
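
A minimal sketch of one synthetic-gradient module is given below: a linear model predicts the layer's gradient from its activation, so the layer can update immediately, and the model itself is regressed onto the true gradient whenever that eventually arrives. The linear form, shapes, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden = 64
M = rng.normal(scale=0.01, size=(d_hidden, d_hidden))   # synthetic-gradient model

def synthetic_grad(h):
    """Predict dL/dh from the activation alone, decoupling this layer's update
    from the rest of the backward pass."""
    return h @ M

def update_sg_model(h, true_grad, lr=1e-3):
    """When the true gradient arrives, regress the predictor onto it; this is
    how true gradient information percolates backwards over many iterations."""
    global M
    err = h @ M - true_grad             # prediction error of the SG model
    M -= lr * h.T @ err / len(h)        # gradient step on the SG model's L2 loss
```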

https://arxiv.org/pdf/1609.01596v1.pdf Direct Feedback Alignment Provides Learning in Deep Neural Networks

In this work, the feedback alignment principle is used for training hidden layers more independently from the rest of the network, and from a zero initial condition. The error is propagated through fixed random feedback connections directly from the output layer to each hidden layer. This simple method is able to achieve zero training error even in convolutional networks and very deep networks, completely without error backpropagation.
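
A sketch of this wiring for a two-hidden-layer MLP is shown below; B1 and B2 are fixed random projections carrying the output error straight to each hidden layer. Shapes, names, and the tanh nonlinearity are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(784, 256))
W2 = rng.normal(scale=0.01, size=(256, 256))
W3 = rng.normal(scale=0.01, size=(256, 10))
B1 = rng.normal(scale=0.01, size=(10, 256))   # fixed, never trained
B2 = rng.normal(scale=0.01, size=(10, 256))   # fixed, never trained

def dfa_step(x, y_onehot, lr=0.1):
    h1 = np.tanh(x @ W1)
    h2 = np.tanh(h1 @ W2)
    y = h2 @ W3
    e = y - y_onehot
    d2 = (e @ B2) * (1 - h2 ** 2)   # error sent directly from the output layer
    d1 = (e @ B1) * (1 - h1 ** 2)   # to each hidden layer, no chain of transposes
    W3[:] -= lr * h2.T @ e
    W2[:] -= lr * h1.T @ d2
    W1[:] -= lr * x.T @ d1
```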

http://www.breloff.com/no-backprop-part2/

https://arxiv.org/abs/1602.05179 Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation

This algorithm involves only one kind of neural computation both for the first phase (when the prediction is made) and the second phase (after the target is revealed) of training.

https://arxiv.org/abs/1612.02734v1 Learning in the Machine: Random Backpropagation and the Learning Channel

Random backpropagation (RBP) is a variant of the backpropagation algorithm for training neural networks, where the transposes of the forward matrices are replaced by fixed random matrices in the calculation of the weight updates. It is remarkable both because of its effectiveness, in spite of using random matrices to communicate error information, and because it completely removes the taxing requirement of maintaining symmetric weights in a physical neural system. To better understand random backpropagation, we first connect it to the notions of local learning and the learning channel. Through this connection, we derive several alternatives to RBP, including skipped RBP (SRBP), adaptive RBP (ARBP), sparse RBP, and their combinations (e.g. ASRBP) and analyze their computational complexity. We then study their behavior through simulations using the MNIST and CIFAR-10 benchmark datasets. These simulations show that most of these variants work robustly, almost as well as backpropagation, and that multiplication by the derivatives of the activation functions is important. As a follow-up, we also study the low end of the number of bits required to communicate error information over the learning channel. We then provide partial intuitive explanations for some of the remarkable properties of RBP and its variations. Finally, we prove several mathematical results, including the convergence to fixed points of linear chains of arbitrary length, the convergence to fixed points of linear autoencoders with decorrelated data, the long-term existence of solutions for linear systems with a single hidden layer, and the convergence to fixed points of non-linear chains, when the derivative of the activation functions is included.

https://arxiv.org/pdf/1702.06463v1.pdf Predicting non-linear dynamics: a stable local learning scheme for recurrent spiking neural networks

The error in the output is fed back through fixed random connections with a negative gain, causing the network to follow the desired dynamics, while an online and local rule changes the weights; hence we call the scheme FOLLOW (Feedback-based Online Local Learning Of Weights). The rule is local in the sense that weight changes depend on the presynaptic activity and the error signal projected onto the post-synaptic neuron. We provide examples of learning linear, non-linear and chaotic dynamics, as well as the dynamics of a two-link arm. Using the Lyapunov method, and under reasonable assumptions and approximations, we show that FOLLOW learning is uniformly stable, with the error going to zero asymptotically.

https://arxiv.org/abs/1702.07097 Bidirectional Backpropagation: Towards Biologically Plausible Error Signal Transmission in Neural Networks

The back-propagation (BP) algorithm has been considered the de facto method for training deep neural networks. It back-propagates errors from the output layer to the hidden layers in an exact manner using the feedforward weights. In this work, we propose a more biologically plausible paradigm of neural architecture based on biological findings. Specifically, we propose two bidirectional learning algorithms with two sets of trainable weights. Preliminary results show that our models perform best on the MNIST and CIFAR10 datasets among the asymmetric error-signal-passing methods, and their performance is closer to that of BP.

https://arxiv.org/abs/1703.00522v1 Understanding Synthetic Gradients and Decoupled Neural Interfaces

We show that the incorporation of SGs does not affect the representational strength of the learning system for a neural network, and prove the convergence of the learning system for linear and deep linear models. On practical problems we investigate the mechanism by which synthetic gradient estimators approximate the true loss, and, surprisingly, how that leads to drastically different layer-wise representations. Finally, we also expose the relationship of using synthetic gradients to other error approximation techniques and find a unifying language for discussion and comparison.

https://iamtrask.github.io/2017/03/21/synthetic-gradients/

https://openreview.net/pdf?id=HkXKUTVFl Explaining the Learning Dynamics of Direct Feedback Alignment

https://arxiv.org/abs/1707.04585v1 The Reversible Residual Network: Backpropagation Without Storing Activations

Deep residual networks (ResNets) have significantly pushed forward the state-of-the-art on image classification, increasing in performance as networks grow both deeper and wider. However, memory consumption becomes a bottleneck, as one needs to store the activations in order to calculate gradients using backpropagation. We present the Reversible Residual Network (RevNet), a variant of ResNets where each layer's activations can be reconstructed exactly from the next layer's. Therefore, the activations for most layers need not be stored in memory during backpropagation. https://github.com/renmengye/revnet-public
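
The core trick can be sketched in a few lines: the activations are split into two halves, and the block's inputs are recovered exactly from its outputs, so they need not be stored for the backward pass. F and G below are stand-in residual functions, assumed for illustration; in a real RevNet they are small convolutional sub-networks.

```python
import numpy as np

def F(x): return np.tanh(x)   # illustrative residual functions
def G(x): return np.tanh(x)

def reversible_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    x2 = y2 - G(y1)            # recover x2 first...
    x1 = y1 - F(x2)            # ...then x1, with no stored activations
    return x1, x2
```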

https://arxiv.org/abs/1710.05958 Gradient-free Policy Architecture Search and Adaptation

We develop a method for policy architecture search and adaptation via gradient-free optimization which can learn to perform autonomous driving tasks. By learning from both demonstration and environmental reward we develop a model that can learn with relatively few early catastrophic failures. We first learn an architecture of appropriate complexity to perceive aspects of world state relevant to the expert demonstration, and then mitigate the effect of domain-shift during deployment by adapting a policy demonstrated in a source domain to rewards obtained in a target environment. We show that our approach allows safer learning than baseline methods, offering a reduced cumulative crash metric over the agent's lifetime as it learns to drive in a realistic simulated environment.

https://arxiv.org/abs/1702.07817 Unsupervised Sequence Classification using Sequential Output Statistics

We propose an unsupervised learning cost function and study its properties. We show that, compared to earlier works, it is less inclined to get stuck in trivial solutions and avoids the need for a strong generative model. Although it is harder to optimize in its functional form, a stochastic primal-dual gradient method is developed to effectively solve the problem.

https://arxiv.org/abs/1605.02026v1 Training Neural Networks Without Gradients: A Scalable ADMM Approach

This paper explores an unconventional training method that uses alternating direction methods and Bregman iteration to train networks without gradient descent steps.

https://arxiv.org/pdf/1802.05642v1.pdf The Mechanics of n-Player Differentiable Games

http://www.jmlr.org/papers/volume18/17-653/17-653.pdf Maximum Principle Based Algorithms for Deep Learning

https://github.com/facebookresearch/nevergrad