# Entropy

“The functional order maintained within living systems seems to defy the Second Law; nonequilibrium thermodynamics describes how such systems come to terms with entropy.” - Ilya Prigogine

Aliases

Relative Entropy, Perplexity

Intent

A generalization of a similarity metric that measures the difference between two probability distributions.

Motivation

How do we measure the similarity between a model's predicted distribution and the observed data?

Structure

<Diagram>

Discussion

Entropy is a very general concept with wide implications for Deep Learning. Machine Learning, and Deep Learning in particular, revolves around an understanding of information flow. Entropy is essentially a measure of information; equivalently, it is a measure of our ignorance about observations. In fact, you can argue that whenever we must make a decision under complete ignorance, a random choice is as good as any other!
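As a concrete illustration (a minimal sketch; the distributions are made up for this example): Shannon entropy is maximized by the uniform distribution, which formalizes the point that under complete ignorance every choice is equally good.

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum p_i * log2(p_i), in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [0.25] * 4                # complete ignorance over 4 outcomes
peaked = [0.97, 0.01, 0.01, 0.01]   # near-certainty about one outcome

print(shannon_entropy(uniform))  # 2.0 bits: the maximum for 4 outcomes
print(shannon_entropy(peaked))   # ~0.24 bits: little remaining ignorance
```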

Relative entropy, specifically, measures the difference between two probability distributions. This difference is described mathematically by the Kullback–Leibler divergence (KL divergence). The KL divergence, or an approximation of it, is commonly used as the fitness (or objective) function of a machine learning problem; the goal, then, is to minimize this divergence. There is an analogy here with living systems: unlike inanimate systems, which tend to evolve towards maximum entropy, a learning system tends to evolve towards minimizing entropy.
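A minimal sketch of the KL divergence in code (the two distributions here are illustrative, not drawn from any dataset):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum p_i * log(p_i / q_i); zero iff p equals q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist  = [0.7, 0.2, 0.1]   # hypothetical "data" distribution
model_dist = [0.5, 0.3, 0.2]   # model's predicted distribution

print(kl_divergence(true_dist, model_dist))   # > 0: model mismatch
print(kl_divergence(true_dist, true_dist))    # 0.0: perfect fit
```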

The goal of learning is to minimize entropy. This is done through training via gradient descent, which evolves the system in the direction of minimal entropy. The equilibrium state of any system is one of high entropy; learning, however, happens in the non-equilibrium regime. High entropy is simply a consequence of the observation that high-entropy states exist with higher probability.
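This can be sketched as a toy gradient descent that fits a softmax model to a target distribution by minimizing their KL divergence; the gradient q - p is the standard result for a softmax parameterization, and the numbers are illustrative:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # target distribution
z = [0.0, 0.0, 0.0]   # start from the maximum-entropy (uniform) guess
lr = 0.5

for step in range(200):
    q = softmax(z)
    # gradient of KL(p || softmax(z)) with respect to z is q - p
    z = [zi - lr * (qi - pi) for zi, qi, pi in zip(z, q, p)]

print(kl(p, softmax(z)))   # close to 0 after training
```

Training moves the model from the high-entropy uniform state towards the lower-divergence fit, mirroring the non-equilibrium picture above.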

This high entropy manifests itself in the initial conditions of a network. It turns out that the best guess for initializing a network's weights is a random one.

It is also curious that sampling of the training set is best done at random. In fact, we see randomness in many places in deep learning. We see it in hyper-parameter tuning, where a favored method is random search. We see it in the construction of auto-encoders, where training data is augmented with noise. Simulated annealing, a method that adds randomness to gradient descent, is employed to improve learning. Dropout is another method that uses randomness in training to improve generalization. Randomness is conspicuous in too many places for this to be a coincidence: high-entropy states, which reflect our ignorance, are simply more probable than low-entropy ones. You could therefore make the claim that deep learning would be improbable without randomness. This is a very counter-intuitive claim.
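As one concrete example of injected randomness, here is a minimal inverted-dropout sketch; the layer values are hypothetical:

```python
import random

def dropout(activations, p_drop=0.5):
    """Inverted dropout: randomly zero units, rescaling the survivors
    so the expected activation is unchanged."""
    scale = 1.0 / (1.0 - p_drop)
    return [a * scale if random.random() > p_drop else 0.0
            for a in activations]

random.seed(0)
layer = [0.8, 1.5, 0.3, 2.1, 0.9, 1.2]
print(dropout(layer, p_drop=0.5))   # roughly half the units are zeroed
```

At test time dropout is disabled; the inverted scaling during training keeps train-time and test-time activations on the same scale.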

Many kinds of entropy can be found in the literature: Shannon entropy, differential entropy, cross-entropy, and relative entropy (the KL divergence), among others.

How does the objective of minimal entropy lead to self-organization?

Known Uses

Related Patterns

Related to Canonical Patterns:
• Similarity shares the general concept of calculating similarity.
• Irreversibility may be interpreted as evaluating a fitness function. Entropy is commonly the network's fitness function.
• Merge leads to higher local entropy.
• Geometry leads to the Fisher Information Matrix which can be interpreted as a metric in model space. Entropy is related to KL divergence.
• Distributed Representation may be a reflection of a system of higher entropy; models with distributed representations therefore occur with higher probability.
• Dissipative Adaptation is globally driven by relative entropy minimization.
• Ensembles or mixtures of experts on average lead to an increase in entropy.
• Mutual Information is related to Shannon entropy.
• Hierarchical Abstraction should ideally lead to lower entropy at each higher layer of the network.
• Regularization, Risk Minimization and entropy should form the constraints that drive network training (i.e. optimization).
• Anti-causality can be analyzed from the perspective of entropy.

Cited in other Patterns:

References

Show the relationship with Shannon entropy and maximum likelihood:

http://thirdorderscientist.org/homoclinic-orbit/2013/4/1/maximum-likelihood-and-entropy Maximum Likelihood and Entropy

The mean negative log-likelihood converges to the differential entropy under the true distribution plus the Kullback–Leibler divergence between the true distribution and the distribution we guess at.
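This relationship can be checked numerically in the discrete case (a sketch with made-up distributions; for discrete variables the differential entropy becomes the Shannon entropy):

```python
import math
import random

p = [0.5, 0.3, 0.2]   # true distribution over 3 symbols
q = [0.4, 0.4, 0.2]   # the distribution we guess at

H_p = -sum(pi * math.log(pi) for pi in p)                    # entropy of p
KL_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # KL(p || q)

random.seed(0)
samples = random.choices(range(3), weights=p, k=200_000)
mean_nll = -sum(math.log(q[x]) for x in samples) / len(samples)

print(mean_nll, H_p + KL_pq)   # the two numbers agree closely
```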

https://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/ Why Minimize Negative Log Likelihood?

http://www.stat.cmu.edu/~cshalizi/754/2006/notes/lecture-28.pdf Shannon Entropy and KL Divergence

https://arxiv.org/abs/1512.02742 Relative Entropy in Biological Systems

In this paper we review various information-theoretic characterizations of the approach to equilibrium in biological systems. The replicator equation, evolutionary game theory, Markov processes and chemical reaction networks all describe the dynamics of a population or probability distribution. Under suitable assumptions, the distribution will approach an equilibrium with the passage of time. Relative entropy - that is, the Kullback–Leibler divergence, or various generalizations of this - provides a quantitative measure of how far from equilibrium the system is. We explain various theorems that give conditions under which relative entropy is nonincreasing. In biochemical applications these results can be seen as versions of the Second Law of Thermodynamics, stating that free energy can never increase with the passage of time. In ecological applications, they make precise the notion that a population gains information from its environment as it approaches equilibrium.

http://arxiv.org/abs/1603.06653 Information Theoretic-Learning Auto-Encoder

http://arxiv.org/abs/1106.1791 A Characterization of Entropy in Terms of Information Loss

Instead of focusing on the entropy of a probability measure on a finite set, this characterization focuses on the 'information loss', or change in entropy, associated with a measure-preserving function. Information loss is a special case of conditional entropy: namely, it is the entropy of a random variable conditioned on some function of that variable. We show that Shannon entropy gives the only concept of information loss that is functorial, convex-linear and continuous.

Category-theoretic properties of entropy.

http://www.ttic.edu/dl/dark14.pdf Dark Knowledge

Soft targets: softened outputs reveal the dark knowledge in the ensemble.

https://arxiv.org/abs/1606.00709 f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization

We show that the generative-adversarial approach is a special case of an existing more general variational divergence estimation approach.

Statistical divergences such as the well-known Kullback-Leibler divergence measure the difference between two given probability distributions. A large class of different divergences are the so called f-divergences
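For reference, a minimal definition (standard material, not taken from the quoted abstract): for a convex function $f$ with $f(1) = 0$, the f-divergence between distributions $P$ and $Q$ with densities $p$ and $q$ is

```latex
D_f(P \,\|\, Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx
```

Choosing $f(t) = t \log t$ recovers the Kullback–Leibler divergence.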

http://arxiv.org/abs/1606.06996v1 The Word Entropy of Natural Languages

The Kullback-Leibler divergence between two mixture models is a core primitive in many signal processing tasks. Since the Kullback-Leibler divergence of mixtures does not admit closed-form formula, it is in practice either estimated using costly Monte-Carlo stochastic integration or approximated using various techniques. We present a fast and generic method that builds algorithmically closed-form lower and upper bounds on the entropy, the cross-entropy and the Kullback-Leibler divergence of mixtures. We illustrate the versatile method by reporting on our experiments for approximating the Kullback-Leibler divergence between univariate exponential mixtures, Gaussian mixtures and Rayleigh mixtures.

http://arxiv.org/pdf/1604.04451v2.pdf Delta divergence: A novel decision cognizant measure of classifier incongruence

http://arxiv.org/pdf/1001.0785.pdf On the Origin of Gravity and the Laws of Newton

http://arxiv.org/pdf/1310.4139.pdf Entropic Forces and Brownian Motion

http://arxiv.org/pdf/1102.2468v1.pdf Algorithmic Randomness as Foundation of Inductive Reasoning and Artificial Intelligence

https://arxiv.org/abs/1104.1110 Randomness and Multi-level Interactions in Biology

https://ganguli-gang.stanford.edu/pdf/14.DeepNoiseAnalysis.pdf Analyzing noise in autoencoders and deep networks

We show that a wide variety of previous methods, including denoising, contractive, and sparse autoencoders, as well as dropout can be interpreted using this framework. This noise injection framework reaps practical benefits by providing a unified strategy to develop new internal representations by designing the nature of the injected noise. We show that noisy autoencoders outperform denoising autoencoders at the very task of denoising, and are competitive with other single-layer techniques on MNIST, and CIFAR- 10. We also show that types of noise other than dropout improve performance in a deep network through sparsifying, decorrelating, and spreading information across representations.

http://arxiv.org/abs/1609.04846 A Tutorial about Random Neural Networks in Supervised Learning

We give a general description of these models using almost indistinctly the terminology of Queuing Theory and the neural one.

http://openreview.net/pdf?id=SkYbF1slg An Information-Theoretic Framework for Fast and Robust Unsupervised Learning via Neural Population Infomax

We present a novel information-theoretic framework for fast and robust unsupervised Learning via information maximization for neural population coding.

http://openreview.net/pdf?id=B1YfAfcgl ENTROPY-SGD: BIASING GRADIENT DESCENT INTO WIDE VALLEYS

Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local entropy based objective that favors well-generalizable solutions lying in the flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Our algorithm resembles two nested loops of SGD, where we use Langevin dynamics to compute the gradient of local entropy at each update of the weights.

We introduced an algorithm named Entropy-SGD for optimization of deep networks. This was motivated from the observation that the energy landscape near a local minimum discovered by SGD is almost flat for a wide variety of deep networks irrespective of their architecture, input data or training methods. We connected this observation to the concept of local entropy which we used to bias the optimization towards flat regions that have low generalization error. Our experiments showed that this algorithm is applicable to large deep networks used in practice.

http://openreview.net/pdf?id=HkCjNI5ex Regularizing Neural Networks by Penalizing Confident Output Distributions

We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning.

http://www.mdpi.com/1099-4300/18/7/251 Maximum Entropy Learning with Deep Belief Networks

Conventionally, the maximum likelihood (ML) criterion is applied to train a deep belief network (DBN). We present a maximum entropy (ME) learning algorithm for DBNs, designed specifically to handle limited training data. Maximizing only the entropy of parameters in the DBN allows more effective generalization capability, less bias towards data distributions, and robustness to over-fitting compared to ML learning. Results of text classification and object recognition tasks demonstrate ME-trained DBN outperforms ML-trained DBN when training data is limited.

http://www.bourbaphy.fr/jarzynskitemps.pdf Equalities and Inequalities : Irreversibility and the Second Law of Thermodynamics at the Nanoscale

https://arxiv.org/abs/0911.3984 The Physical Origins of Entropy Production, Free Energy Dissipation and their Mathematical Representations

http://csc.ucdavis.edu/~cmg/papers/if.pdf Information Flows? A Critique of Transfer Entropies

The proliferation of networks as a now-common theoretical model for large-scale systems, in concert with the use of transfer-like entropies, has shoehorned dyadic relationships into our structural interpretation of the organization and behavior of complex systems. This interpretation thus fails to include the effects of polyadic dependencies. The net result is that much of the sophisticated organization of complex systems may go undetected.

https://arxiv.org/pdf/1706.10295v1.pdf Noisy Networks for Exploration

We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent’s policy can be used to aid efficient exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. NoisyNet is straightforward to implement and adds little computational overhead. We find that replacing the conventional exploration heuristics for A3C, DQN and dueling agents (entropy reward and ε-greedy, respectively) with NoisyNet yields substantially higher scores for a wide range of Atari games, in some cases advancing the agent from sub to super-human performance.

https://arxiv.org/pdf/1708.03030.pdf Above and Beyond the Landauer Bound: Thermodynamics of Modularity

https://arxiv.org/abs/1703.04379 Langevin Dynamics with Continuous Tempering for Training Deep Neural Networks

In this paper, a novel approach is proposed which divides the training process into two consecutive phases to obtain better generalization performance: Bayesian sampling and stochastic optimization. The first phase is to explore the energy landscape and to capture the “fat” modes; and the second one is to fine-tune the parameter learned from the first phase. In the Bayesian learning phase, we apply continuous tempering and stochastic approximation into the Langevin dynamics to create an efficient and effective sampler, in which the temperature is adjusted automatically according to the designed “temperature dynamics”.