Empirical Risk Minimization


Generalization measures predictive accuracy on data that was not used for training.
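The pattern's name points at the standard formalization of this idea. As a sketch in conventional textbook notation (the symbols below are the usual ones and are not defined elsewhere in this section):

```latex
% Expected risk, empirical risk, and the generalization gap,
% stated abstractly (not tied to any particular model class).
\begin{align}
R(f) &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x), y)\big]
        && \text{expected (true) risk under the data distribution } \mathcal{D} \\
\hat{R}_n(f) &= \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i)
        && \text{empirical risk on } n \text{ training samples} \\
\operatorname{gap}(f) &= R(f) - \hat{R}_n(f)
        && \text{generalization gap}
\end{align}
```

Empirical risk minimization chooses f to minimize the empirical risk; generalization is the question of whether the gap stays small.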


How can we train a network to make accurate predictions on data it has never encountered?


This section provides alternative descriptions of the pattern in the form of an illustration or an alternative formal expression. By looking at the sketch, a reader can quickly grasp the essence of the pattern.


Generalization is an interesting term in deep learning. We all talk about it, yet we lack a precise definition.

We can define it as the behavior of our system in response to validation data, that is, data we have not included in the training set. We can be a bit more ambitious and define it as the system's behavior when deployed to analyze real-world data. Essentially, we would like our trained system to perform accurately on data it has never seen. This is one definition.
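This first definition can be sketched directly: compute the empirical risk (average loss) on the training set and on held-out data, and look at the difference. The threshold classifier and the toy data below are illustrative assumptions, not part of any framework.

```python
def zero_one_loss(y_true, y_pred):
    # 0/1 loss: wrong prediction costs 1, correct prediction costs 0.
    return 0.0 if y_true == y_pred else 1.0

def empirical_risk(model, data):
    """Average loss of `model` over a list of (x, y) pairs."""
    return sum(zero_one_loss(y, model(x)) for x, y in data) / len(data)

# A trivial "model": classify x as 1 if it exceeds a threshold.
model = lambda x: 1 if x > 0.5 else 0

train = [(0.1, 0), (0.2, 0), (0.7, 1), (0.9, 1)]
valid = [(0.3, 0), (0.6, 1), (0.4, 1)]  # includes a point the rule gets wrong

train_risk = empirical_risk(model, train)
valid_risk = empirical_risk(model, valid)
gap = valid_risk - train_risk  # crude, data-dependent proxy for generalization
print(train_risk, valid_risk, gap)
```

The gap between the two risks is what the various definitions of generalization in this section try to characterize.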

Another definition of generalization revolves around the idea of minimizing risk. When we train a system, there is uncertainty about the context in which it will be deployed, so we build mechanisms into training that anticipate unpredictable situations. The hope is that the system is robust to contexts that were never predicted. This is a game-theoretic definition: we can envision an environment where information always remains imperfect, and generalization effectively means executing a good strategy within that environment. This may be the most abstract definition of generalization that we have.
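One minimal way to make this game-theoretic view concrete is to compare minimizing the average risk over the environment we trained in against minimizing the worst-case risk over a set of environments we might be deployed in. The quadratic risk and the environment values below are made-up assumptions for illustration.

```python
def risk(theta, env):
    """Hypothetical quadratic risk of parameter `theta` in environment `env`."""
    return (theta - env) ** 2

environments = [-1.0, 0.0, 2.0]                # contexts we might face at deployment
candidates = [x / 10 for x in range(-20, 31)]  # candidate parameter settings

# ERM-style choice: minimize the average risk across environments.
erm_theta = min(candidates, key=lambda t: sum(risk(t, e) for e in environments))

# Robust (minimax) choice: minimize the worst-case risk across environments.
robust_theta = min(candidates, key=lambda t: max(risk(t, e) for e in environments))

print(erm_theta, robust_theta)
```

The robust choice hedges against the worst environment rather than fitting the average one, which is the strategic flavor of generalization described above.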

A third definition is based on Occam's Razor: the simplest explanation is likely the best one. Here we make certain assumptions about the form of the data and design our regularization to constrain the solution toward those assumptions. For example, in the field of compressive sensing we assume that a sparse basis exists, and from there we can derive an optimization problem that searches for solutions with a sparse basis.
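The sparsity prior behind compressive sensing can be sketched with the simplest case: under an orthonormal design, the L1-regularized problem min_x 0.5·||x − b||² + λ·||x||₁ decouples coordinate-wise, and its exact solution is soft-thresholding, which drives small coefficients exactly to zero. The "observed" coefficients below are invented for illustration.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam*|.|: shrink toward zero, clipping small values to 0."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

observed = [3.0, -0.2, 0.05, -4.0, 0.1]  # dense, noisy coefficients
lam = 0.5                                # regularization strength
sparse = [soft_threshold(v, lam) for v in observed]

print(sparse)  # small entries are zeroed out; large ones survive, shrunk by lam
```

This is the mechanism by which the regularizer encodes the "simple explanation" assumption: solutions that use fewer coefficients are preferred.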

A fourth definition is based on a system's ability to recreate or reconstruct its input features. This is the approach taken by generative models: if a neural network is able to generate realistic images, then it has captured the concept of images in its entirety.

A fifth definition involves ignoring nuisance variables, that is, features that are irrelevant to the task. A system generalizes well if it is able to ignore the features that do not matter for its task: strip away as many features as possible until you cannot remove any more. This is somewhat similar to the third definition, but it tackles the problem from another perspective.
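A toy sketch of this idea: if a representation is invariant to a nuisance transform (here, circular shifts of the input), then a predictor built on it automatically handles shifted inputs it never saw. The feature and the data are illustrative assumptions.

```python
def shift(xs, k):
    """Circularly shift a sequence by k positions (the nuisance transform)."""
    k %= len(xs)
    return xs[k:] + xs[:k]

def invariant_feature(xs):
    """Sorting discards ordering, so this feature is shift-invariant."""
    return tuple(sorted(xs))

pattern = [0, 1, 1, 0, 2]
seen = invariant_feature(pattern)

# Every shifted variant, including ones never "trained" on, maps to the same
# feature, so a classifier built on it ignores the nuisance for free.
all_match = all(invariant_feature(shift(pattern, k)) == seen
                for k in range(len(pattern)))
print(all_match)
```

Data augmentation and architectural constraints (e.g., convolution for translation) are the usual ways of building such invariances into deep networks.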

In looking through these definitions, it becomes apparent that generalization concerns our ignorance of a future observation. This differs from entropy, which measures our ignorance of a current observation.

A useful schema for discussing ignorance divides the potential capabilities of deep learning into four classes. (1) Knowable knowns: given a large enough set of knowns (i.e., training data), we can achieve good convergence of our prediction errors. (2) Knowable unknowns: with good generalization, a trained model can make accurate predictions on data it has never seen. This is the domain of generalization, and it is what drives our preference for certain models and regularization choices. (3) Unknowable knowns: there are classes of system that deep learning can never learn, namely computationally irreducible systems; appreciating this involves an appreciation of anti-causality. A deep learning system may detect the direction of causality and thereby know that it can learn from a system; what it cannot know (the unknowable part) is whether a system belongs to the class of computationally irreducible systems. (4) Unknowable unknowns: finally, there is a class of total ignorance. This is really just a metaphysical statement: the class of unknowables that cannot be known is itself unknowable.

Even though the above schema of ignorance looks complicated, it is only the tip of the iceberg in understanding ignorance. Consider further complexities such as misinformation, detecting model bias, ambiguity, disagreement, or accommodating change.

This leads us to a conclusion about achieving good generalization: given that it can be measured only through future observations, an appropriate learning process will require iterative feedback. This takes us into the realm of reinforcement learning, which we discuss only briefly in this book and which is likely to be the main topic of a future sequel.

Known Uses

Here we review several projects or papers that have used this pattern.

Related Patterns

In this section we describe in a diagram how this pattern is conceptually related to other patterns. The relationships may be precise or fuzzy, so we provide further explanation of the nature of each relationship. We also describe other patterns that may not be conceptually related but work well in combination with this pattern.

Relationship to other canonical patterns:

Relationship to other Patterns

Cited in these patterns:

Further Reading

We provide here some additional external material that will help in exploring this pattern in more detail.

References

Generalization Bounds and Stability

Relative Entropy in Biological Systems

In this paper we review various information-theoretic characterizations of the approach to equilibrium in biological systems. The replicator equation, evolutionary game theory, Markov processes and chemical reaction networks all describe the dynamics of a population or probability distribution. Under suitable assumptions, the distribution will approach an equilibrium with the passage of time. Relative entropy - that is, the Kullback–Leibler divergence, or various generalizations of this - provides a quantitative measure of how far from equilibrium the system is. We explain various theorems that give conditions under which relative entropy is nonincreasing. In biochemical applications these results can be seen as versions of the Second Law of Thermodynamics, stating that free energy can never increase with the passage of time. In ecological applications, they make precise the notion that a population gains information from its environment as it approaches equilibrium.

Algorithmic Game Theory

Semantics, Representations and Grammars for Deep Learning

The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).

Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

In this paper we introduce the first scalable end-to-end approach to learning approximate Nash equilibria without prior domain knowledge.

Causal Entropic Forces

Taxonomies of chance, risk and unpredictability

AdaNet: Adaptive Structural Learning of Artificial Neural Networks

Our method optimizes for generalization performance, and it explicitly and automatically addresses the trade-off between model architecture and empirical risk minimization, ideas that have been under-explored in deep learning.

Generalizable Features from Unsupervised Learning

In this work, we explore the potential of unsupervised learning to find features that promote better generalization to settings outside the supervised training distribution. Our task is predicting the stability of towers of square blocks. We demonstrate that an unsupervised model, trained to predict future frames of a video sequence of stable and unstable block configurations, can yield features that support extrapolating stability prediction to blocks configurations outside the training set distribution.

In this paper, we showed that data generated from an unsupervised model could help a supervised learner to generalize to unseen scenarios. We argue that this ability of transfer learning and generalization by observing the world could be one of the ingredients to construct a model of the world that could have applications in many tasks, such as model-based RL.

Understanding Deep Learning Requires Rethinking Generalization

What is it then that distinguishes neural networks that generalize well from those that don’t? A satisfying answer to this question would not only help to make neural networks more interpretable, but it might also lead to more principled and reliable model architecture design.

Sharp Minima Can Generalize For Deep Nets

This paper argues that most notions of flatness are problematic for deep models and can not be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we allow to reparametrize a function, the geometry of its parameters can change drastically without affecting its generalization properties.

Theory II: Landscape of the Empirical Risk in Deep Learning

In this work, we characterize with a mix of theory and experiments, the landscape of the empirical risk of overparametrized DCNNs. We first prove the existence of a large number of degenerate global minimizers with zero empirical error (modulo inconsistent equations). The zero-minimizers – in the case of classification – have a non-zero margin. The same minimizers are degenerate and thus very likely to be found by SGD that will furthermore select with higher probability the zero-minimizer with larger margin, as discussed in Theory III (to be released).

On Generalization and Regularization in Deep Learning

Adaptive Learning with Robust Generalization Guarantees

Geometry of Optimization and Implicit Regularization in Deep Learning

In this work we show proof-of-concept success, and we expect Path-SGD to be beneficial also in large-scale training for very deep convolutional networks. Combining Path-SGD with AdaGrad, with momentum or with other optimization heuristics might further enhance results.

On the Emergence of Invariance and Disentangling in Deep Representations

Using classical notions of statistical decision and information theory, we show that invariance in a deep neural network is equivalent to minimality of the representation it computes, and can be achieved by stacking layers and injecting noise in the computation, under realistic and empirically validated assumptions. We use an Information Decomposition of the empirical loss to show that overfitting can be reduced by limiting the information content stored in the weights. We then present a sharp inequality that relates the information content in the weights – which are a representation of the training set and inferred by generic optimization agnostic of invariance and disentanglement – and the minimality and total correlation of the activation functions, which are a representation of the test datum. This allows us to tackle recent puzzles concerning the generalization properties of deep networks and their relation to the geometry of the optimization residual.

Exploring Generalization in Deep Learning

Learning with deep neural networks displays good generalization behavior in practice, a phenomenon that remains largely unexplained. In this paper we discussed different candidate complexity measures that might explain generalization in neural networks. We outline a concrete methodology for investigating such measures, and report on experiments studying how well the measures explain different phenomena. While there is no clear choice yet, some combination of expected sharpness and norms do seem to capture much of the generalization behavior of neural networks. A major issue still left unresolved is how the choice of optimization algorithm biases such complexity to be low, and what is the precise relationship between optimization and implicit regularization.

Implicit Regularization in Deep Learning

Generalization in Deep Learning

On the Discrimination-Generalization Tradeoff in GANs

Generative adversarial training can be generally understood as minimizing certain moment matching loss defined by a set of discriminator functions, typically neural networks. The discriminator set should be large enough to be able to uniquely identify the true distribution (discriminative), and also be small enough to go beyond memorizing samples (generalizable). In this paper, we show that a discriminator set is guaranteed to be discriminative whenever its linear span is dense in the set of bounded continuous functions. This is a very mild condition satisfied even by neural networks with a single neuron. Further, we develop generalization bounds between the learned distribution and true distribution under different evaluation metrics. When evaluated with neural or Wasserstein distances, our bounds show that generalization is guaranteed as long as the discriminator set is small enough, regardless of the size of the generator or hypothesis set. When evaluated with KL divergence, our bound provides an explanation on the counter-intuitive behaviors of testing likelihood in GAN training. Our analysis sheds light on understanding the practical performance of GANs.

On the Importance of Single Directions for Generalization

We find that deep networks which generalize poorly are more reliant on single directions than those that generalize well, and evaluate the impact of dropout and batch normalization, as well as class selectivity on single direction reliance.

mixup: Beyond Empirical Risk Minimization

Deep Learning Generalizes Because the Parameter-Function Map Is Biased Towards Simple Functions

We show explicitly for supervised learning of Boolean functions that the intrinsic simplicity bias of deep neural networks means that they generalize significantly better than an unbiased learning algorithm does.

A Capacity Scaling Law for Artificial Neural Networks

First, we derive the calculation of what we call the lossless memory (LM) dimension. The LM dimension is a generalization of the Vapnik-Chervonenkis (VC) dimension that avoids structured data and therefore provides an upper bound for perfectly fitting any training data. Second, we derive what we call the MacKay (MK) dimension. This limit indicates necessary forgetting, that is, the lower limit for most generalization uses of the network.

Generalisation in Humans and Deep Neural Networks

We compare the robustness of humans and current convolutional deep neural networks (DNNs) on object recognition under twelve different types of image degradations. First, using three well known DNNs (ResNet-152, VGG-19, GoogLeNet) we find the human visual system to be more robust to nearly all of the tested image manipulations, and we observe progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker. Secondly, we show that DNNs trained directly on distorted images consistently surpass human performance on the exact distortion types they were trained on, yet they display extremely poor generalisation abilities when tested on other distortion types. For example, training on salt-and-pepper noise does not imply robustness on uniform white noise and vice versa. Thus, changes in the noise distribution between training and testing constitutes a crucial challenge to deep learning vision systems that can be systematically addressed in a lifelong machine learning approach. Our new dataset consisting of 83K carefully measured human psychophysical trials provides a useful reference for lifelong robustness against image degradations set by the human visual system.

Learning Weighted Representations for Generalization Across Designs

Towards Generalization and Simplicity in Continuous Control