Name Activation Function aka Threshold, Partition, Space Folding, Squash Function

Intent

Motivation

How do we divide representation space?

Structure

<Diagram>

Discussion

An activation function serves as a threshold, alternatively called a classification or a partition. Bengio et al. refer to this as "space folding". It essentially divides the original space into, typically, two partitions. Activation functions are usually introduced with the requirement that they be non-linear. This requirement may be too restrictive; recently, piecewise linear functions (e.g. ReLU, Maxout) have been shown to work just as well in practice.
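As a concrete illustration, here is a minimal NumPy sketch (not tied to any framework) of the two piecewise linear activations named above; the weight shapes used for Maxout are one plausible convention, not a prescribed one:

```python
import numpy as np

def relu(x):
    # Partition the input space: zero out the negative half-space,
    # pass the positive half-space through unchanged.
    return np.maximum(0.0, x)

def maxout(x, W, b):
    # Maxout: maximum over k affine pieces. Assumed shapes:
    # W is (k, d_out, d_in), b is (k, d_out), x is (d_in,).
    return np.max(np.einsum('kod,d->ko', W, x) + b, axis=0)

x = np.linspace(-2, 2, 5)
print(relu(x))  # [0. 0. 0. 1. 2.]
```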

The purpose of an activation function in a deep learning context (i.e. multiple layers) is to ensure that the representation in the input space is mapped to a different space in the output. In all cases, a neural network layer computes a similarity function between the input and the weights. This can be an inner product, a correlation function, or a convolution; in every case it is a measure of similarity between the learned weights and the input. This is then followed by an activation function that applies a threshold to the calculated similarity measure. In its most general sense, a neural network layer performs a projection that is followed by a selection.
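A minimal sketch of this projection-then-selection view, assuming a plain fully connected layer in NumPy:

```python
import numpy as np

def dense_layer(x, W, b, activation):
    similarity = W @ x + b          # projection: inner product with the learned weights
    return activation(similarity)   # selection: threshold / squash the similarity

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
h = dense_layer(x, W, b, activation=lambda z: np.maximum(0.0, z))  # ReLU as the selection
```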

Both projection and selection are necessary for the learning dynamics. With projection alone and no selection, a network remains in the same space and is unable to create higher levels of abstraction between its layers. The projection operation may in fact be non-linear, but without the threshold function there is no mechanism to consolidate information. The selection operation enforces information irreversibility, a necessary criterion for learning.
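The collapse that happens without selection can be seen in a few lines (a NumPy sketch, nothing framework-specific):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, W2 = rng.normal(size=(5, 4)), rng.normal(size=(3, 5))

# Two projections with no activation collapse into a single projection,
# so the stacked layers stay in the same (linear) space.
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# Inserting a threshold between the projections breaks this equivalence:
relu = lambda z: np.maximum(0.0, z)
h = W2 @ relu(W1 @ x)   # no single matrix reproduces this map for all x
```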

Many kinds of activation functions have been proposed over the years. Duch and Jankowski (1999) documented over 640 different activation function proposals. Best practice, however, confines use to a limited set of activation functions. The hyperbolic tangent and the ReLU have seen considerably more mileage than most others.

Previous ANN research assumed that the activation function had to be non-linear; however, recent results using the piecewise linear ReLU show that this requirement may need to be relaxed. The ReLU performs better in much deeper networks, and it is conjectured that its information-preserving linear region may be an important capability. Historically, the introduction of the ReLU has made unsupervised pre-training less necessary.
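A deliberately simplified sketch of that conjecture: the product of activation derivatives across a deep stack shrinks for tanh but is preserved in the ReLU's active region (weights are ignored here to isolate the effect of the activation):

```python
import numpy as np

def gradient_through_stack(pre_activation, depth, d_activation):
    # Product of activation derivatives, one factor per layer,
    # evaluated at the same pre-activation value for simplicity.
    grad = 1.0
    for _ in range(depth):
        grad *= d_activation(pre_activation)
    return grad

d_tanh = lambda x: 1.0 - np.tanh(x) ** 2   # < 1 everywhere except x = 0
d_relu = lambda x: float(x > 0)            # exactly 1 throughout the active region

print(gradient_through_stack(0.5, 20, d_tanh))  # ~ 8e-3, shrinking with depth
print(gradient_through_stack(0.5, 20, d_relu))  # 1.0, preserved at any depth
```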

It is also surprising that neural networks are more expressive than boolean circuits. A shallow network can perform classifications that would require many more layers of boolean circuits. One reason DL networks are more expressive than boolean circuits is the presence of the activation function, which cannot be computed by a simple boolean circuit.

Known Uses

In the GradNet proposal, the activation functions are not fixed but are able to evolve over time, from computationally cheaper activation functions early in the learning process into more complicated, smoother functions.
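The GradNet implementation itself is not reproduced here; the following is only a hypothetical sketch of the general idea, blending a cheap piecewise linear activation into a smoother one as training progresses:

```python
import numpy as np

def evolving_activation(x, progress):
    """progress in [0, 1]: 0 = start of training (cheap ReLU), 1 = end (smooth softplus)."""
    relu = np.maximum(0.0, x)
    softplus = np.log1p(np.exp(x))  # smooth counterpart of the ReLU
    return (1.0 - progress) * relu + progress * softplus

x = np.linspace(-3.0, 3.0, 7)
early = evolving_activation(x, progress=0.0)  # purely piecewise linear
late  = evolving_activation(x, progress=1.0)  # purely smooth
```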

Related Patterns

<Diagram>

  • Irreversible Operator - Activation functions have the property of being irreversible.
  • Fitness Function - At a micro-level, the Activation function acts like a fitness function.
  • Hierarchical Models - Activation functions between layers ensure that the entire network is not equivalent to a single layer.

**References**

http://www.bcp.psych.ualberta.ca/~mike/Pearl_Street/Dictionary/contents/A/activation.html

Duch, W., & Jankowski, N. (1999). Survey of neural transfer functions. Neural Computing Surveys, 2, 163-212.

http://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf Rectifier Nonlinearities Improve Neural Network Acoustic Models

https://www.academia.edu/7826776/Mathematical_Intuition_for_Performance_of_Rectified_Linear_Unit_in_Deep_Neural_Networks Some Intuition about Activation Functions in Feed-Forward Neural Networks

Everyone thought it was great to use differentiable, symmetric, non-linear activation functions in feed-forward neural networks, until Alex Krizhevsky [8] found that Rectifier Linear Units, despite being not entirely differentiable, nor symmetric, and most of all, piece-wise linear, were computationally cheaper and worth the trade-off with their more sophisticated counterparts. Here are just a few thoughts on the properties of these activation functions, a potential explanation for why using ReLUs speeds up training, and possible ways of applying these insights for better learning strategies.

https://en.wikipedia.org/wiki/Activation_function

http://papers.nips.cc/paper/5422-on-the-number-of-linear-regions-of-deep-neural-networks.pdf On the Number of Linear Regions of Deep Neural Networks

http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf Deep Sparse Rectifier Neural Networks

While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero.

http://jmlr.org/proceedings/papers/v28/goodfellow13.pdf Maxout Networks

Maxout abandons many of the mainstays of traditional activation function design. The representation it produces is not sparse at all (see Fig. 2), though the gradient is highly sparse and dropout will artificially sparsify the effective representation during training. While maxout may learn to saturate on one side or the other this is a measure zero event (so it is almost never bounded from above). While a significant proportion of parameter space corresponds to the function being bounded from below, maxout is not constrained to learn to be bounded at all. Maxout is locally linear almost everywhere, while many popular activation functions have significant curvature. Given all of these departures from standard practice, it may seem surprising that maxout activation functions work at all, but we find that they are very robust and easy to train with dropout, and achieve excellent performance.

https://arxiv.org/abs/1611.00740v1 Why and When Can Deep – but Not Shallow – Networks Avoid the Curse of Dimensionality

We describe how networks with a univariate ReLU nonlinearity may perform multivariate function approximation with a polynomial basis and with a spline basis respectively. The first result is known and we give it for completeness. The second is simple but new.

The effect of the ReLU nonlinearity is to select a subset of the training points that satisfy positivity conditions. The selection is discontinuous with respect to the parameters values.
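A small sketch of that spline view (a toy under stated assumptions: knots are fixed and only the output weights are fitted, by least squares rather than backpropagation):

```python
import numpy as np

x = np.linspace(-3, 3, 200)
target = np.sin(x)

# One ReLU feature per knot gives a piecewise linear (spline) basis.
knots = np.linspace(-3, 3, 20)
basis = np.maximum(0.0, x[:, None] - knots[None, :])
basis = np.column_stack([np.ones_like(x), x, basis])   # affine terms + ReLU basis

coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
approx = basis @ coeffs   # piecewise linear approximation of sin(x)
```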

https://arxiv.org/pdf/1703.01775v1.pdf Building a Regular Decision Boundary with Deep Networks

Our first contribution is to introduce a state-of-the-art framework that depends upon few hyper-parameters and to study the network when we vary them. It has no max pooling, no biases, only 13 layers, is purely convolutional and yields up to 95.4% and 79.6% accuracy respectively on CIFAR10 and CIFAR100. We show that the nonlinearity of a deep network does not need to be continuous, non expansive or point-wise, to achieve good performance. We show that increasing the width of our network permits being competitive with very deep networks. Our second contribution is an analysis of the contraction and separation properties of this network. Indeed, a 1-nearest neighbor classifier applied on deep features progressively improves with depth, which indicates that the representation is progressively more regular.

https://arxiv.org/abs/1603.05201 Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

Specifically, we first examine existing CNN models and observe an intriguing property that the filters in the lower layers form pairs (i.e., filters with opposite phase). Inspired by our observation, we propose a novel, simple yet effective activation scheme called concatenated ReLU (CRelu) and theoretically analyze its reconstruction property in CNNs. We integrate CRelu into several state-of-the-art CNN architectures and demonstrate improvement in their recognition performance on CIFAR-10/100 and ImageNet datasets with fewer trainable parameters.
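The CReLU scheme itself is simple enough to sketch directly (NumPy, feature axis assumed last):

```python
import numpy as np

def crelu(x, axis=-1):
    # Keep both phases of the pre-activation, doubling the feature dimension.
    return np.concatenate([np.maximum(0.0, x), np.maximum(0.0, -x)], axis=axis)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(crelu(x))  # [0.  0.  0.  1.5 2.  0.5 0.  0. ]
```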

https://github.com/szagoruyko/diracnets PyTorch code and models for DiracNets: Training Very Deep Neural Networks Without Skip-Connections https://arxiv.org/abs/1706.00388

https://github.com/kevinzakka/research-paper-notes/blob/master/snn.md Self-Normalizing Neural Networks

The authors introduce self-normalizing neural networks (SNNs) whose layer activations automatically converge towards zero mean and unit variance and are robust to noise and perturbations. Significance: Removes the need for the finicky batch normalization and permits training deeper networks with a robust training scheme.
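The activation behind SNNs is the SELU; a minimal sketch using the constants published in the paper:

```python
import numpy as np

ALPHA = 1.6732632423543772   # SELU constants from the SNN paper, chosen so that
SCALE = 1.0507009873554805   # activations are pushed toward zero mean, unit variance

def selu(x):
    return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))
```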

https://stats.stackexchange.com/questions/115258/comprehensive-list-of-activation-functions-in-neural-networks-with-pros-cons

https://arxiv.org/abs/1707.04199v1 Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting

In this work, we show that saturating output activation functions, such as the softmax, impede learning on a number of standard classification tasks. Moreover, we present results showing that the utility of softmax does not stem from the normalization, as some have speculated. In fact, the normalization makes things worse. Rather, the advantage is in the exponentiation of error gradients. This exponential gradient boosting is shown to speed up convergence and improve generalization. To this end, we demonstrate faster convergence and better performance on diverse classification tasks: image classification using CIFAR-10 and ImageNet, and semantic segmentation using PASCAL VOC 2012. In the latter case, using the state-of-the-art neural network architecture, the model converged 33% faster with our method (roughly two days of training less) than with the standard softmax activation, and with a slightly better performance to boot.

Taking the consequence of this, by e.g. skipping the normalization term of the softmax, we get significant improvement in our NN training—and at no other cost than a few minutes of coding. The only drawback is the introduction of some new hyper-parameters, α, β, and the target values. However, these have been easy to choose, and we do not expect that a lot of tedious fine-tuning is required in the general case.

https://arxiv.org/abs/1710.05941 Swish: a Self-Gated Activation Function

https://arxiv.org/pdf/1712.01897.pdf Online Learning with Gated Linear Networks

Rather than relying on non-linear transfer functions, our method gains representational power by the use of data conditioning. We state under general conditions a learnable capacity theorem that shows this approach can in principle learn any bounded Borel-measurable function on a compact subset of euclidean space; the result is stronger than many universality results for connectionist architectures because we provide both the model and the learning procedure for which convergence is guaranteed.

https://arxiv.org/abs/1811.05381v1 Sorting out Lipschitz function approximation

Training neural networks subject to a Lipschitz constraint is useful for generalization bounds, provable adversarial robustness, interpretable gradients, and Wasserstein distance estimation. By the composition property of Lipschitz functions, it suffices to ensure that each individual affine transformation or nonlinear activation function is 1-Lipschitz. The challenge is to do this while maintaining the expressive power. We identify a necessary property for such an architecture: each of the layers must preserve the gradient norm during backpropagation. Based on this, we propose to combine a gradient norm preserving activation function, GroupSort, with norm-constrained weight matrices. We show that norm-constrained GroupSort architectures are universal Lipschitz function approximators. Empirically, we show that norm-constrained GroupSort networks achieve tighter estimates of Wasserstein distance than their ReLU counterparts and can achieve provable adversarial robustness guarantees with little cost to accuracy.
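GroupSort itself is easy to sketch: the pre-activations are split into groups and sorted within each group, which is a data-dependent permutation and therefore preserves the gradient norm:

```python
import numpy as np

def group_sort(x, group_size=2):
    # x: 1-D array whose length is divisible by group_size.
    groups = x.reshape(-1, group_size)
    return np.sort(groups, axis=1).reshape(-1)

x = np.array([3.0, -1.0, 0.5, 2.0])
print(group_sort(x))  # [-1.   3.   0.5  2. ]
```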