Learning Patterns

This chapter covers mechanisms that are known to lead to a trained model. Why are neural networks able to generalize? Why does back-propagation eventually converge? Many such questions still lack a good theoretical explanation. However, deep learning (DL) is an experimental science, and in practice the simple method of back-propagation is surprisingly effective.

An early objection to neural networks was that the underlying optimization problem is non-convex. This suggested that training would get stuck in poor local minima and that it would be extremely difficult to train a model to convergence. However, more recent research challenges this intuition. In high-dimensional spaces, a point where the gradient vanishes is far more likely to be a saddle point than a local minimum, so there is usually still a direction along which gradient descent can continue to roll down the loss surface.
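The saddle-point argument can be illustrated with a minimal sketch. The two-variable function below is a toy surface chosen for illustration (it is not taken from the research mentioned above): it has a saddle at (0, 0) and minima at (0, 1) and (0, -1). The names `f`, `grad` and `gradient_descent`, the learning rate and the starting point are all assumptions. Starting very close to the saddle's ridge, plain gradient descent first approaches the saddle along x and then drifts away from it along y.

```python
def f(x, y):
    # Toy surface (an assumption for illustration): saddle at (0, 0),
    # minima at (0, 1) and (0, -1), where f = -0.25.
    return x ** 2 + y ** 4 / 4 - y ** 2 / 2


def grad(x, y):
    # Analytic gradient of f.
    return 2 * x, y ** 3 - y


def gradient_descent(x, y, lr=0.1, steps=500):
    # Plain gradient descent with a fixed learning rate.
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Start almost on the ridge y = 0 that leads straight into the saddle.
x_final, y_final = gradient_descent(1.0, 1e-3)
print(x_final, y_final, f(x_final, y_final))
```

Note that a start with y exactly equal to 0 would stay on the ridge and converge to the saddle itself. In this toy setting a tiny perturbation is enough to escape, which is one reason the noise in stochastic gradient methods is often considered helpful.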

The requirements for back-propagation in deep learning are surprisingly simple. If one can calculate the derivative of each layer's output with respect to its inputs and its model parameters, then one can apply it. Back-propagation works extremely well at discovering a basin of convergence in which a model has learned to generalize.
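As a minimal sketch of that requirement, consider a hypothetical network with one tanh hidden unit and one linear output unit, trained on a single example with a squared-error loss. All names and values here (`forward`, `backward`, `numeric_grad`, `train`, the weights, the input and the target) are assumptions made for illustration. The backward pass uses only the local derivative of each layer, chained together, and the result is checked against a finite-difference estimate.

```python
import math


def forward(w1, w2, x):
    h = math.tanh(w1 * x)   # hidden layer
    y = w2 * h              # linear output layer
    return h, y


def loss(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2


def backward(w1, w2, x, t):
    # Chain rule, applied from the loss back towards the input.
    h, y = forward(w1, w2, x)
    dy = y - t                      # dL/dy
    dw2 = dy * h                    # dL/dw2 (output layer parameter)
    dh = dy * w2                    # dL/dh  (passed to the layer below)
    dw1 = dh * (1.0 - h * h) * x    # dL/dw1, using tanh'(z) = 1 - tanh(z)^2
    return dw1, dw2


def numeric_grad(w1, w2, x, t, eps=1e-6):
    # Central finite differences, used only to check the analytic gradient.
    dw1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
    dw2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
    return dw1, dw2


def train(w1, w2, x, t, lr=0.1, steps=200):
    # Gradient descent driven by the back-propagated gradients.
    for _ in range(steps):
        dw1, dw2 = backward(w1, w2, x, t)
        w1, w2 = w1 - lr * dw1, w2 - lr * dw2
    return w1, w2


w1, w2, x, t = 0.5, 0.3, 1.5, 0.8
analytic = backward(w1, w2, x, t)
numeric = numeric_grad(w1, w2, x, t)
initial_loss = loss(w1, w2, x, t)
w1_trained, w2_trained = train(w1, w2, x, t)
final_loss = loss(w1_trained, w2_trained, x, t)
print(analytic, numeric, initial_loss, final_loss)
```

The same pattern extends to deeper networks: each layer only needs to supply its own local derivatives, and the chain rule composes them.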

This chapter covers recurring learning patterns that we find across different neural network architectures. In its most abstract form, learning is a credit assignment problem: given the observed data, which parts of a model do we need to change, and by how much? We will explore many of the techniques that have been shown to be effective in practice.

Relaxed Backpropagation = Credit Assignment

Stochastic Gradient Descent

Natural Gradient Descent

Random Orthogonal Initialization

Transfer Learning

Curriculum Training


Domain Adaptation

Unsupervised Pretraining

Differential Training

Genetic Algorithm

Unsupervised Learning

Mutable Layer

Program Induction

Learning to Optimize (note: different from Meta-learning)

Simulated Annealing


Continuous Learning

Feedback Network

Network Generation

Learning to Purpose

Planning to Learn


Learning to Communicate

Predictive Learning

Temporal Learning

Intrinsic Decomposition


Active Learning

Primal Dual

Transport Related

Structure Evolution

Self-Supervised Learning

Knowledge Gradient

Option Discovery

Infusion Learning

Ensemble Reinforcement Learning

Learning from Demonstration


Iterative Teaching

Reasoning by Analogy

References

Optimization Methods for Large-Scale Machine Learning

we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.

Recent Advances in Non-Convex Optimization and its Implications to Learning, Anima Anandkumar, ICML 2016 Tutorial

Simple Statistical Gradient Following for Connectionist Reinforcement Learning