Name Differentiable Layer

Intent

Be able to calculate the gradient for every layer in a network.

Motivation

Relax the definition of a layer so that the only requirement is that it be piecewise differentiable.

Structure

<Diagram>

Discussion

Neural networks in their original form consisted of neurons that were each defined as a sum of products of inputs and weights feeding into an activation function. Current practice has relaxed that requirement: each neuron need only be a function that is differentiable. More accurately, the function only needs to be locally differentiable. There may be some points where the derivative is undefined, and those gaps are filled with a patch that assigns a value by convention. For ReLU, the derivative at the origin is undefined; in practice it is patched over and assigned a value (say zero).
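The ReLU patch described above can be sketched in a few lines. This is a minimal illustration, not any framework's implementation; the names relu_grad, grad_at_zero and numeric_grad are hypothetical, and the default of zero at the origin simply follows the value suggested in the text.

```python
def relu(x: float) -> float:
    """ReLU activation: max(0, x)."""
    return x if x > 0.0 else 0.0


def relu_grad(x: float, grad_at_zero: float = 0.0) -> float:
    """Derivative of ReLU with the undefined point x == 0 patched.

    grad_at_zero is a hypothetical knob for this sketch; the text
    suggests zero, which is used as the default here.
    """
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return grad_at_zero


def numeric_grad(f, x: float, eps: float = 1e-6) -> float:
    """Central finite difference, used only as a sanity check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)
```

Away from the origin the patched derivative agrees with a finite-difference estimate. At the origin a central difference averages the two one-sided slopes, which is one way to see that no single value is forced there and a convention has to be chosen.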

Differentiable layers are the minimum requirement for every proposed DL system. The key requirement is that a natural transformation is present. Credit assignment via layer-wise differentiation ensures that this requirement is satisfied.
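Layer-wise differentiation can be illustrated with a toy scalar network in which every layer knows how to pass a gradient back to its input. The sketch below is an assumption-laden illustration: the class names Scale and ReLU, the squared-error loss and the backprop helper are all hypothetical and chosen only to keep the chain rule visible.

```python
class Scale:
    """Layer y = w * x with a single trainable weight w."""

    def __init__(self, w: float):
        self.w = w
        self.grad_w = 0.0

    def forward(self, x: float) -> float:
        self.x = x                       # cache the input for the backward pass
        return self.w * x

    def backward(self, grad_out: float) -> float:
        self.grad_w = grad_out * self.x  # dL/dw: credit assigned to this layer
        return grad_out * self.w         # dL/dx: passed on to the layer below


class ReLU:
    """Piecewise differentiable layer; derivative at 0 patched to 0."""

    def forward(self, x: float) -> float:
        self.x = x
        return x if x > 0.0 else 0.0

    def backward(self, grad_out: float) -> float:
        return grad_out if self.x > 0.0 else 0.0


def backprop(layers, x: float, target: float):
    """Run forward, then apply the chain rule layer by layer in reverse."""
    for layer in layers:
        x = layer.forward(x)
    loss = 0.5 * (x - target) ** 2       # assumed loss for this sketch
    grad = x - target                    # dL/d(output)
    for layer in reversed(layers):
        grad = layer.backward(grad)
    return loss, grad


layers = [Scale(2.0), ReLU(), Scale(3.0)]
loss, grad_input = backprop(layers, x=1.0, target=5.0)
# loss == 0.5, grad_input == 6.0
# layers[0].grad_w == 3.0, layers[2].grad_w == 2.0
```

Because each layer only needs its own local derivative, any function with a usable backward rule can be dropped into the list.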

Known Uses

Convolution Networks have differentiable layers. Spatial Networks are differentiable. Every network that employs gradient descent as its credit assignment mechanism requires differentiable layers.
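To see why gradient descent depends on differentiability, consider a single linear layer trained with stochastic gradient descent. The following is a minimal sketch under assumed settings: the function name sgd_fit, the squared-error loss, the learning rate and the toy data are all hypothetical.

```python
import random


def sgd_fit(data, lr: float = 0.05, epochs: int = 200, seed: int = 0):
    """Fit pred = w * x + b by stochastic gradient descent.

    Each update uses the derivative of L = 0.5 * (pred - y)**2 with
    respect to w and b, which exists because the layer is differentiable.
    """
    rng = random.Random(seed)
    samples = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)             # one sample at a time, random order
        for x, y in samples:
            err = (w * x + b) - y        # dL/dpred
            w -= lr * err * x            # dL/dw = err * x
            b -= lr * err                # dL/db = err
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for the demo).
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = sgd_fit(data)
```

On this noise-free data the fitted w and b approach 2 and 1. If the layer had no usable derivative, the two update lines would have nothing to compute.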

http://arxiv.org/abs/1511.05946 ACDC: A Structured Efficient Linear Layer

Here, we introduce a deep, differentiable, fully-connected neural network module composed of diagonal matrices of parameters, A and D, and the discrete cosine transform C. The core module, structured as ACDC⁻¹, has O(N) parameters and incurs O(N log N) operations.

Many DL frameworks offer auto-differentiation as a convenience for developers. Auto-differentiation computes derivatives mechanically from the operations a program performs, so a developer does not have to derive the correct mathematical transformation by hand.
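One simple way to see how auto-differentiation spares the developer from deriving formulas is forward-mode differentiation with dual numbers. The sketch below is a toy, not the mechanism of any particular framework; the names Dual, tanh and derivative are hypothetical, and only addition, multiplication and tanh are supported.

```python
import math


class Dual:
    """A value together with its derivative with respect to one input."""

    def __init__(self, val: float, dot: float = 0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def tanh(x: Dual) -> Dual:
    t = math.tanh(x.val)
    return Dual(t, (1.0 - t * t) * x.dot)   # chain rule through tanh


def derivative(f, x: float) -> float:
    """Evaluate f at x while carrying the derivative along."""
    return f(Dual(x, 1.0)).dot


# A one-input "neuron": the developer writes only the forward computation.
neuron = lambda w: tanh(w * 0.5 + 0.1)
slope = derivative(neuron, 2.0)
```

The developer writes only the forward expression; the derivative falls out of the overloaded operators.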

Related Patterns

<Diagram>

• Stochastic Gradient Descent - Requires that layers be differentiable.
• Similarity Operator - The operation that calculates similarity with the internal model (i.e. a projection) is differentiable. This includes generalizations like Convolution.
• Irreversible Operator - In theory this operation should be differentiable, but in practice piecewise differentiability is sufficient (see: ReLU).

References

http://arxiv.org/abs/1404.7456 Automatic Differentiation of Algorithms for Machine Learning

http://arxiv.org/pdf/1206.5533v2.pdf - Automatic Differentiation

The gradient can be either computed manually or through automatic differentiation. Either way, it helps to structure this computation as a flow graph, in order to prevent mathematical mistakes and make sure an implementation is computationally efficient. The computation of the loss L(z, θ) as a function of θ is laid out in a graph whose nodes correspond to elementary operations such as addition, multiplication, and non-linear operations such as the neural networks activation function (e.g., sigmoid or hyperbolic tangent), possibly at the level of vectors, matrices or tensors. The flow graph is directed and acyclic and has three types of nodes: input nodes, internal nodes, and output nodes. Each of its nodes is associated with a numerical output which is the result of the application of that computation (none in the case of input nodes), taking as input the output of previous nodes in a directed acyclic graph. Example z and parameter vector θ (or their elements) are the input nodes of the graph (i.e., they do not have inputs themselves) and L(z, θ) is a scalar output of the graph. Note that here, in the supervised case, z can include an input part x (e.g. an image) and a target part y (e.g. a target class associated with an object in the image). In the unsupervised case z = x. In a semi-supervised case, there is a mix of labeled and unlabeled examples, and z includes y on the labeled examples but not on the unlabeled ones.
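The flow graph described in this excerpt can be sketched as a small reverse-mode differentiator over scalar nodes. This is an illustrative toy under stated assumptions: the names Node, add, mul, sigmoid and backward are hypothetical, the loss is an assumed squared error, and only scalar operations are covered even though the excerpt also allows vectors, matrices and tensors.

```python
import math


class Node:
    """One node of the flow graph: a value plus links to its input nodes."""

    def __init__(self, value: float, parents=()):
        self.value = value
        self.parents = parents    # tuples of (input_node, local_derivative)
        self.grad = 0.0           # dL/d(this node), filled in by backward()


def add(a: Node, b: Node) -> Node:
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def mul(a: Node, b: Node) -> Node:
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def sigmoid(a: Node) -> Node:
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, ((a, s * (1.0 - s)),))


def backward(output: Node) -> None:
    """Propagate dL/d(node) from the scalar output back to the inputs."""
    order, seen = [], set()

    def visit(node):
        # Depth-first search gives a topological order of the acyclic graph.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):          # consumers before their inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local


# Input nodes: example z = (x, y) and parameter theta (values are arbitrary).
x, y, theta = Node(1.5), Node(1.0), Node(0.4)
pred = sigmoid(mul(theta, x))
diff = add(pred, mul(Node(-1.0), y))      # pred - y
loss = mul(Node(0.5), mul(diff, diff))    # L = 0.5 * (pred - y)**2
backward(loss)
# theta.grad now holds dL/dtheta
```

Input nodes have no parents, internal nodes record one local derivative per input, and the single scalar output is where the backward sweep starts, mirroring the three node types in the excerpt.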

https://arxiv.org/abs/1611.02109v2 Differentiable Programs with Neural Libraries

We develop a framework for combining differentiable programming languages with neural networks. Using this framework we create end-to-end trainable systems that learn to write interpretable algorithms with perceptual components. We explore the benefits of inductive biases for strong generalization and modularity that come from the program-like structure of our models. In particular, modularity allows us to learn a library of (neural) functions which grows and improves as more tasks are solved. Empirically, we show that this leads to lifelong learning systems that transfer knowledge to new tasks more effectively than baselines.

https://arxiv.org/abs/1710.08717 Auto-Differentiating Linear Algebra

However, it is currently not easy to implement many basic machine learning primitives in these systems (such as Gaussian processes, least squares estimation, principal components analysis, Kalman smoothing), mainly because they lack efficient support of linear algebra primitives as differentiable operators. We detail how a number of matrix decompositions (Cholesky, LQ, symmetric eigen) can be implemented as differentiable operators.