# Regularization

Intent

Improve the generalization of the learned Model by introducing an additional constraint term into the fitness function.

Motivation

A network with a large number of model parameters tends to overfit the training data and thus generalizes poorly.

Structure

<Diagram>

Discussion

A general issue with numerical optimization is that it can overfit the Model and, as a consequence, generalize poorly. Finding the right balance so that your Model fits your data optimally is a difficult problem. In practice, we usually begin with a Model that is large relative to our data and then do our best to prevent it from overfitting. The most basic way to prevent overfitting is to observe the Model's performance on a validation set and stop training as soon as performance stops improving. This is referred to as early stopping (or early termination), and it remains a basic method for preventing your Model from overfitting the training set. The more advanced approach is called Regularization. Regularizing means applying artificial constraints to your learning machine that, as a consequence, reduce the effective number of free parameters of the Model.
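The early-stopping loop described above can be sketched as follows. This is a minimal illustration, not from the original text; `train_step` and `validate` are placeholder callables, and `patience` controls how many non-improving epochs are tolerated before stopping:

```python
import numpy as np

def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    # Train until the validation loss stops improving for `patience` epochs.
    # `train_step` and `validate` are caller-supplied callables (hypothetical
    # names); `validate` must return the current validation loss.
    best_loss = np.inf
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping: validation performance has plateaued
    return best_loss
```

Note that the best validation loss is tracked separately from the current one, so a temporary plateau followed by improvement resets the patience counter.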

Regularization is actually a very old idea, going back to the method of Lagrange multipliers invented by Joseph-Louis Lagrange in the late 18th century. The method adds constraint expressions to the objective function (see: Fitness Function). These constraint expressions are termed regularizers in the machine learning literature. Two of the best known are the L1 and L2 forms. L1 regularization leads to sparser solutions; L2 regularization leads to smoother solutions. Both penalize large weights.
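As a concrete sketch (not from the original text), the two penalties can be written as extra terms added to the fitness function, with `lam` playing the role of the Lagrange-multiplier-like coefficient:

```python
import numpy as np

def l1_penalty(weights, lam):
    # L1 norm: drives small weights to exactly zero, hence sparser solutions.
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    # Squared L2 norm (weight decay): shrinks all weights, hence smoother solutions.
    return lam * np.sum(weights ** 2)

# The regularized objective is the data loss plus the constraint term.
w = np.array([0.5, -2.0, 0.0])
data_loss = 1.25  # placeholder value for illustration
objective = data_loss + l2_penalty(w, lam=0.01)
```

The coefficient `lam` trades data fit against the constraint: larger values push the weights harder toward zero.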

In its more general sense there are two kinds of Regularization. There is Regularization by Training, which is the conventional use of the term and is what we describe above. There is also Regularization by Construction, which is a consequence of the Model choices we make as we construct the elements of our network. The reason for the distinction, even though both appear mathematically as constraint terms, is that conventional Regularization is not present after training, that is, in the inference path. Regularization by Construction is always present, in both the training and the inference stages.

What we will find is that there are many more kinds of Regularization, of both kinds, that go beyond the common L1 and L2 regularizations found in introductory machine learning literature. One example is variational regularization, which constrains the representation to stay close to a specified prior. This is a stronger constraint than the point-wise L1 and L2 regularizations.
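A minimal sketch of what "close to a specified prior" means in practice: the KL divergence between a diagonal Gaussian posterior and a standard normal prior, used as a penalty term (the closed form familiar from variational autoencoders; an illustrative assumption, not the only form of variational regularization):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions.
    # The penalty is zero only when the representation matches the prior exactly.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
```

Unlike a point-wise L1 or L2 penalty on individual values, this term constrains the whole distribution of the representation.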

The influence of Regularization is in achieving generalization. Generalization is a term used often in machine learning, yet it is not actually well defined. Conventionally it is a measure of how well a machine's predictions align with validation data, or with data not included in the training set. In the most abstract sense, though, we think of generalization as the ability to predict because the learned model has captured the essence of the problem. Furthermore, in the sciences there is a general bias that simpler models generalize better than more complex ones. We could, for example, measure generalization by how well we can perform Transfer Learning from a larger Model to a smaller Model. The importance of having a measure of generalization is that we can leverage it as part of our Regularization schemes. The more advanced Regularization schemes do in fact use information-theoretic measures to compose better regularizers.

Known Uses

http://arxiv.org/pdf/1511.06328v3.pdf MANIFOLD REGULARIZED DISCRIMINATIVE NEURAL NETWORKS We have proposed two regularizers that make use of the data manifold to guide the learning of a discriminative neural network. By encouraging the flatness of either the loss function or the prediction scores along the data manifold, we are able to significantly improve a DNN’s generalization ability. Moreover, our label independent manifold regularization allows one to incorporate unlabeled data directly with the supervised learning task and effectively performs semi-supervised learning.

http://arxiv.org/abs/1511.06068 Reducing Overfitting in Deep Networks by Decorrelating Representations

We propose a new regularizer called DeCov which leads to significantly reduced overfitting (as indicated by the difference between train and val performance), and better generalization. Our regularizer encourages diverse or non-redundant representations in Deep Neural Networks by minimizing the cross-covariance of hidden activations. This simple intuition has been explored in a number of past works but surprisingly has never been applied as a regularizer in supervised learning.

Motivation also comes from the classical literature on bagging and ensemble averaging which suggests that decorrelated ensembles perform better than correlated ones.
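A sketch of the DeCov idea (scaling constants may differ from the paper): penalize the squared off-diagonal entries of the sample covariance of a layer's activations over a batch:

```python
import numpy as np

def decov_penalty(activations):
    # `activations` has shape (batch, units). Center per unit, form the sample
    # covariance, and penalize everything off the diagonal, pushing units
    # toward decorrelated (non-redundant) representations.
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / activations.shape[0]
    off_diagonal = cov - np.diag(np.diag(cov))
    return 0.5 * np.sum(off_diagonal ** 2)
```

The diagonal is excluded so that each unit's own variance is not penalized, only the redundancy between units.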

We develop a family of gradient regularization methods that effectively penalize the gradient of the loss function w.r.t. the inputs.

http://arxiv.org/pdf/1604.06985v3.pdf Deep Learning with Eigenvalue Decay Regularizer

http://arxiv.org/abs/1609.06693v1 SoftTarget Regularization: An Effective Technique to Reduce Over-Fitting in Neural Networks

In this paper we introduce a new form of regularization that guides the learning problem in a way that reduces over-fitting without sacrificing the capacity of the model. The mistakes that models make in early stages of training carry information about the learning problem. By adjusting the labels of the current epoch of training through a weighted average of the real labels, and an exponential average of the past soft-targets we achieved a regularization scheme as powerful as Dropout without necessarily reducing the capacity of the model, and simplified the complexity of the learning problem.
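The label-adjustment step described above can be sketched as follows; `beta` and `gamma` are assumed hyperparameter names for the label weight and the exponential-average decay, not taken from the paper:

```python
import numpy as np

def update_soft_targets(y_true, ema_predictions, current_predictions,
                        beta=0.9, gamma=0.5):
    # Keep an exponential moving average of past predictions, then blend it
    # with the real labels to form the targets for the next epoch.
    ema = gamma * ema_predictions + (1.0 - gamma) * current_predictions
    targets = beta * y_true + (1.0 - beta) * ema
    return targets, ema
```

Early-epoch mistakes thus leak into the targets in softened form, carrying information about the learning problem without reducing model capacity.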

https://arxiv.org/abs/1606.01305 Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization.
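The per-timestep zoneout rule can be sketched as follows (an assumed function shape for illustration, not the authors' code):

```python
import numpy as np

def zoneout(h_prev, h_new, rate, rng):
    # With probability `rate`, each hidden unit keeps its previous value
    # instead of updating; dropout would zero the unit instead.
    zoned = rng.random(h_new.shape) < rate
    return np.where(zoned, h_prev, h_new)
```

Because the preserved value is the previous state rather than zero, information and gradients still flow through zoned-out units.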

https://arxiv.org/abs/1608.03665v4 Learning Structured Sparsity in Deep Neural Networks

In this work, we propose a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNNs evaluation.
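Structured sparsity of this kind is typically induced with a group Lasso penalty; a minimal sketch, assuming one group per row of a weight matrix (e.g. one filter per row):

```python
import numpy as np

def group_lasso_penalty(weight, lam):
    # Sum of (unsquared) L2 norms over groups. Unlike plain L2, this drives
    # entire groups to exactly zero, removing whole filters or channels.
    return lam * np.sum(np.linalg.norm(weight, axis=1))
```

Choosing the grouping (filters, channels, filter shapes, layers) is what makes the resulting sparsity hardware-friendly.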

Related Patterns

<Diagram>

Relationship to Canonical Patterns

• Entropy participates in conjunction with regularization terms to define the optimization problem's objective function.
• Random Matrices that are constrained in certain ways can exhibit structure despite their randomness.
• Structured Factorization raises the open question of how regularization and factorization are connected.
• Hyperuniformity is a reflection of hidden structure in random-like distributions.
• Dissipative Adaptation has regularization expressions that place constraints on the final state.
• Mutual Information-based regularization has been shown to be effective in generative models.
• Disentangled Basis can be encouraged using regularization, for example L1 regularization.
• Compressed Sensing algorithms use sparsity-inducing regularization for signal recovery.
• Hierarchical Abstraction can be finely tuned by defining regularization on a per-layer basis.
• Risk Minimization, and thus generalization, is achieved through regularization.

References

Graph-Guided Banding of the Covariance Matrix

We develop convex regularizers occupying the broad middle ground between the former approach of “patternless sparsity” and the latter reliance on having a known ordering. Our framework defines bandedness with respect to a known graph on the measured variables. Such a graph is available in diverse situations, and we provide a theoretical, computational, and applied treatment of two new estimators.

http://arxiv.org/pdf/1606.07326v2.pdf DropNeuron: Simplifying the Structure of Deep Neural Networks

http://jmlr.org/proceedings/papers/v37/blundell15.pdf Weight Uncertainty in Neural Networks

It regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood. We show that this principled kind of regularisation yields comparable performance to dropout on MNIST classification.

https://arxiv.org/abs/1610.07675v4 Surprisal-Driven Zoneout

In this method, states zone out (maintain their previous value rather than updating) when the surprisal (discrepancy between the last state’s prediction and target) is small. Thus regularization is adaptive and input-driven on a per-neuron basis.

http://openreview.net/pdf?id=Sy8gdB9xx UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION

The role of explicit regularization. If the model architecture itself isn’t a sufficient regularizer, it remains to see how much explicit regularization helps. We show that explicit forms of regularization, such as weight decay, dropout, and data augmentation, do not adequately explain the generalization error of neural networks. Put differently: Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error. In contrast with classical convex empirical risk minimization, where explicit regularization is necessary to rule out trivial solutions, we found that regularization plays a rather different role in deep learning. It appears to be more of a tuning parameter that often helps improve the final test error of a model, but the absence of all regularization does not necessarily imply poor generalization error. As reported by Krizhevsky et al. (2012), ℓ2-regularization (weight decay) sometimes even helps optimization, illustrating its poorly understood nature in deep learning.

REGULARIZATION Various regularization techniques have been applied to neural networks for the purpose of improving generalization and reducing overfitting. They can be roughly divided into two categories, depending on whether they regularize the weights or the activations.

Regularization on Weights: The most common regularizer on weights is weight decay which just amounts to using the L2 norm squared of the weight vector. An L1 regularizer on the weights can also be adopted to push the learned weights to become sparse. Scardapane et al. (2016) investigated mixed norms in order to promote group sparsity.

Regularization on Activations: Several regularizers have been proposed that act directly on the neural activations. Glorot et al. (2011) add a sparse regularizer on the activations after ReLU to encourage sparse representations. Dropout developed by Srivastava et al. (2014) applies random masks to the activations in order to discourage them from co-adapting. DeCov proposed by Cogswell et al. (2015) tries to minimize the off-diagonal terms of the sample covariance matrix of activations, thus encouraging the activations to be as decorrelated as possible. Liao et al. (2016b) utilize a clustering-based regularizer to encourage the representations to be compact.
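As an illustration of an activation regularizer, inverted dropout can be sketched as follows (a minimal version for clarity, not any particular library's implementation):

```python
import numpy as np

def dropout(activations, rate, rng, train=True):
    # Inverted dropout: randomly zero activations during training and rescale
    # the survivors so the expected activation is unchanged; identity at inference.
    if not train or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)
```

The rescaling by `1 / (1 - rate)` at training time is what lets the inference path use the activations unchanged.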

https://papers.nips.cc/paper/2647-non-local-manifold-tangent-learning.pdf Non Local Manifold Tangent Learning

It quantizes the features, and encodes them via an adaptive arithmetic coding scheme applied on their binary expansions. An adaptive codelength regularization is introduced to penalize the entropy of the features, which the coding scheme exploits to achieve better compression.

https://arxiv.org/abs/1704.03976 Virtual Adversarial Training: a Regularization Method for Supervised and Semi-supervised Learning

Virtual adversarial loss is defined as the robustness of the model's posterior distribution against local perturbation around each input data point. Our method is similar to adversarial training, but differs from adversarial training in that it determines the adversarial direction based only on the output distribution and that it is applicable to a semi-supervised setting.

https://arxiv.org/pdf/1708.06742.pdf Twin Networks: Using the Future as a Regularizer

https://arxiv.org/pdf/1709.06680.pdf Deep Lattice Networks and Partial Monotonic Functions

https://openreview.net/pdf?id=SkHkeixAW Regularization for Deep Learning: A Taxonomy

https://openreview.net/pdf?id=SJSVuReCZ SHADE: SHANNON DECAY INFORMATION-BASED REGULARIZATION FOR DEEP LEARNING

https://arxiv.org/pdf/1805.06440.pdf Regularization Learning Networks

Despite their impressive performance, Deep Neural Networks (DNNs) typically underperform Gradient Boosting Trees (GBTs) on many tabular-dataset learning tasks. We propose that applying a different regularization coefficient to each weight might boost the performance of DNNs by allowing them to make more use of the more relevant inputs. However, this will lead to an intractable number of hyperparameters. Here, we introduce Regularization Learning Networks (RLNs), which overcome this challenge by introducing an efficient hyperparameter tuning scheme that minimizes a new Counterfactual Loss. Our results show that RLNs significantly improve DNNs on tabular datasets, and achieve comparable results to GBTs, with the best performance achieved with an ensemble that combines GBTs and RLNs. RLNs produce extremely sparse networks, eliminating up to 99.8% of the network edges and 82% of the input features, thus providing more interpretable models and revealing the importance that the network assigns to different inputs. RLNs could efficiently learn a single network in datasets that comprise both tabular and unstructured data, such as in the setting of medical imaging accompanied by electronic health records.

https://arxiv.org/abs/1705.07485 Shake-Shake regularization

http://www.fast.ai/2018/07/02/adam-weight-decay/ AdamW and Super-convergence is now the fastest way to train neural nets
