
# Decision Operator

Aliases: Selection, Information Loss Function, Compressive Function

Intent

A generalized operator that organizes information by discarding part of it.

Motivation

Sketch

$Irreversible: R \times R \rightarrow R$

<Diagram>

Discussion

The Irreversible Operator conventionally appears in neural networks as the Activation Function, which the original perceptron model conjectured to be a required component. It is essentially a selection mechanism: it takes a measure as input (usually a scalar) and outputs a decision (in the simplest case, one of two numbers). The perceptron model originally proposed a hard binary function. This was later relaxed to continuous functions such as the sigmoid and its close approximations. Smooth, saturating non-linearities became dogma until the experimental discovery that ReLU-based activation functions, which are piecewise linear, performed surprisingly well. The particular shape of the non-linearity therefore does not appear to be what matters. What matters is that some kind of selection function (alternatively, decision function) is applied, so that a similarity measure below a certain threshold is ignored. This selection function is an irreversible process, and that irreversibility is an essential requirement for any kind of hierarchical learning.
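As a minimal sketch of the three activation functions mentioned above, the following compares how many distinct outputs each produces for the same inputs. The function names, the threshold parameter `theta`, and the sample inputs are illustrative assumptions, not part of the pattern itself.

```python
import math

def hard_threshold(x, theta=0.0):
    """Perceptron-style decision: outputs one of two numbers."""
    return 1.0 if x > theta else 0.0

def sigmoid(x):
    """Continuous relaxation of the hard threshold."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Ignore anything at or below zero, pass the rest through unchanged."""
    return max(0.0, x)

# Hypothetical similarity measures, two below and two above the threshold.
inputs = [-3.0, -1.0, 0.5, 2.0]

step_out = [hard_threshold(x) for x in inputs]
relu_out = [relu(x) for x in inputs]
sigmoid_out = [sigmoid(x) for x in inputs]

# Fewer distinct outputs than inputs means the input cannot be recovered.
for name, out in [("step", step_out), ("relu", relu_out), ("sigmoid", sigmoid_out)]:
    print(name, len(set(out)), "distinct outputs for", len(inputs), "inputs")
```

In this sketch the hard threshold and the ReLU both map `-3.0` and `-1.0` to the same value, so the original input cannot be recovered from the output. The sigmoid squashes its inputs but stays one-to-one in exact arithmetic.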

In the absence of irreversibility, machine learning systems are template-matching machines, unable to construct the high-level abstractions required for generalized inference. However, it is important to note that the kind of irreversibility found in Dissipative Adaptation does not require an irreversible operator. There, irreversibility emerges out of the dynamic behavior of the system rather than from the way the system is constructed.

How is this related to invariance discovery and nuisance variable reduction? In seeking invariance, we are trying to eliminate from the features those facts that we don't care about. The conjecture here is that this operator applies a kind of symmetry breaking. Furthermore, in the case of the ReLU, which has been shown to be more effective in deep networks, we would like to preserve as much as we can of the information that lies above the threshold.
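One way to make "preserving information above the threshold" concrete is to measure the information loss of a deterministic function f as H(X) - H(f(X)), the definition used in the information-loss reference listed on this page. The sketch below compares a hard threshold with a ReLU. The uniform eight-point input and the helper names are assumptions chosen only for illustration.

```python
import math
from collections import Counter

def entropy_bits(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_loss(f, samples):
    """H(X) - H(f(X)), treating each entry of `samples` as equally likely."""
    return entropy_bits(samples) - entropy_bits([f(x) for x in samples])

def hard_threshold(x):
    return 1 if x > 0 else 0

def relu(x):
    return max(0, x)

# Assumed input: X uniform over eight integers, so H(X) is 3 bits.
xs = list(range(-4, 4))

loss_step = information_loss(hard_threshold, xs)
loss_relu = information_loss(relu, xs)
print("loss (hard threshold):", loss_step)
print("loss (ReLU):", loss_relu)
```

Under these assumptions both operators discard information, but the ReLU discards less, because values above the threshold remain distinguishable from one another.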

[ decision is necessary to reduce the size of attractor basin ]

Are spatial transforms a generalization of this?

Known Uses

Related Patterns

Relationship to other Canonical Patterns:

• Similarity and Merge serve as the essential building blocks for any deep learning system.
• Clustering defines the cutoff distance as to what belongs in or out of a set (i.e. membership).
• Entropy, specifically local entropy, is lowered by decision operators. Decision boundaries are a mechanism for removing nuisance features or invariant features.
• Geometry provides an intuition of how the decision operator folds information spaces into itself.
• Dissipative Adaptation shows why the effect of a decision boundary is irreversible and can lead to self-replication of more complex information matching structures.
• Hierarchical Abstraction is enabled by a decision operator. Only subspaces of a lower layer may influence a higher layer. Backpropagation to a lower layer is cut off when the input to the decision operator is below the threshold. The process of evolving information abstraction is an irreversible process.
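The backpropagation cutoff described in the Hierarchical Abstraction item can be sketched with a single ReLU unit and hand-written gradients. The function `unit_forward_backward`, its parameters, and the numeric values are hypothetical. Taking the ReLU gradient at exactly zero to be 0 is a common convention assumed here.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU; the value at z == 0 is set to 0 by convention."""
    return 1.0 if z > 0.0 else 0.0

def unit_forward_backward(x, w, b, upstream_grad=1.0):
    """Forward pass y = relu(w*x + b), then gradients with respect to w and x."""
    z = w * x + b
    y = relu(z)
    local = relu_grad(z)               # 0 when the unit is below threshold
    grad_w = upstream_grad * local * x
    grad_x = upstream_grad * local * w  # what flows back to the lower layer
    return y, grad_w, grad_x

# Above threshold: the gradient reaches the lower layer.
active = unit_forward_backward(x=2.0, w=0.5, b=0.1)

# Below threshold: the output is 0 and nothing flows back.
inactive = unit_forward_backward(x=2.0, w=-0.5, b=0.1)

print("active:", active)
print("inactive:", inactive)
```

When the pre-activation is below the threshold, both gradients are zero, so for that input the lower layer receives no learning signal through this unit.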

Pattern is cited in:

References

http://arxiv.org/pdf/cond-mat/9901352v4.pdf Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences

Gavin Crooks mathematically described microscopic irreversibility. Crooks showed that a small open system driven by an external source of energy could change in an irreversible way, as long as it dissipates its energy as it changes.

http://www.nobelprize.org/nobel_prizes/chemistry/laureates/1977/prigogine-lecture.pdf Time, Structure and Fluctuations

http://arxiv.org/pdf/1606.05990v1.pdf A New Training Method for Feedforward Neural Networks Based on Geometric Contraction Property of Activation Functions

We have presented a new method for training a feedforward neural network. The core of the method relies on the contraction property of some activation functions (e.g. sigmoid function) and the geometry underlying the training of FNN. As a result, we have obtained a new cost function that needs to be minimized during the training process. The main advantage of the new functional resides in the fact that we have less non-linearities introduced by activation functions.

http://arxiv.org/pdf/1106.1791v3.pdf A Characterization of Entropy in Terms of Information Loss

There are numerous characterizations of Shannon entropy and Tsallis entropy as measures of information obeying certain properties. Using work by Faddeev and Furuichi, we derive a very simple characterization. Instead of focusing on the entropy of a probability measure on a finite set, this characterization focuses on the `information loss', or change in entropy, associated with a measure-preserving function. Information loss is a special case of conditional entropy: namely, it is the entropy of a random variable conditioned on some function of that variable. We show that Shannon entropy gives the only concept of information loss that is functorial, convex-linear and continuous. This characterization naturally generalizes to Tsallis entropy as well.

http://arxiv.org/pdf/1604.01952v1.pdf Deep Online Convex Optimization with Gated Games

http://s3.amazonaws.com/burjorjee/www/hypomixability_elimination_foga2015.pdf Efficient Hypomixability Elimination in Recombinative Evolutionary Systems

http://arxiv.org/abs/1607.05966 Onsager-corrected deep learning for sparse linear inverse problems

We consider the application of deep learning to the sparse linear inverse problem encountered in compressive sensing, where one seeks to recover a sparse signal from a small number of noisy linear measurements. In this paper, we propose a novel neural-network architecture that decouples prediction errors across layers in the same way that the approximate message passing (AMP) algorithm decouples them across iterations: through Onsager correction.

https://arxiv.org/pdf/1510.00831v1.pdf Partial Information Decomposition as a Unified Approach to the Specification of Neural Goal Functions

We argue that neural information processing crucially depends on the combination of multiple inputs to create the output of a processor. To account for this, we use a very recent extension of Shannon Information theory, called partial information decomposition (PID). PID allows to quantify the information that several inputs provide individually (unique information), redundantly (shared information) or only jointly (synergistic information) about the output.

https://arxiv.org/pdf/1612.03902v5.pdf Design of General-Purpose Artificial Pattern Recognition Machines

The main purpose of the paper is to rigorously define three classes of learning machines which are fundamental building blocks for Bayes’ minimax classification systems, where any given learning machine minimizes Bayes’ minimax risk. A decision rule that minimizes Bayes’ minimax risk is called an equalizer rule or a minimax test. Equalizer rules divide two-class feature spaces into decision regions that have equivalent probabilities of classification error, i.e., Bayes’ error. In some cases, the threshold for a minimax test is identical to the threshold for a minimum probability of error test (VanTrees, 1968; Poor, 1994; Srinath et al., 1996). Bayes’ minimax learning machines are general-purpose artificial pattern recognition machines that are capable of acting accurately on a wide variety of tasks.