**Name** Convolution Layer

**Intent**
Convolution operations process spatial data more expressively and efficiently than the fully connected layers of conventional neural networks.

**Motivation** How can we more efficiently and effectively process spatial data?

**Structure**

<Diagram>

**Discussion**

Convolution layers are found in Convolutional Networks (ConvNets), which were originally designed for image classification. When first introduced, they do not appear to be anything like an artificial neural network. They are, however, conceptually based on the same principles. An artificial neural network (ANN) layer consists of units that perform a similarity operation followed by an irreversible operation. The similarity operation in an ANN is an inner product between the inputs and the model weights. In a ConvNet, the similarity operation is a convolution. The convolution operation is a more general form of the inner product: in addition to template matching, it can perform smoothing and change detection.
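As a rough illustration of how one sliding inner product can smooth a signal or detect changes depending on the weights, here is a minimal one-dimensional sketch in plain Python. The helper name `slide_inner_product`, the sample signal, and the two kernels are assumptions made up for this example. Like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def slide_inner_product(signal, kernel):
    """Take the inner product of `kernel` with every window of `signal`."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# A flat signal with one raised plateau in the middle (hypothetical data).
signal = [0, 0, 0, 8, 8, 8, 0, 0]

# Averaging weights blur the sharp edges of the plateau (smoothing).
smoothed = slide_inner_product(signal, [0.25, 0.5, 0.25])

# A difference kernel responds only where the signal changes (change detection).
changes = slide_inner_product(signal, [-1, 1])

print(smoothed)  # [0.0, 2.0, 6.0, 8.0, 6.0, 2.0]
print(changes)   # [0, 0, 8, 0, 0, -8, 0]
```

Each output value is still an ordinary inner product between a window of the input and the weights, which is why the same operation also serves as template matching.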

ConvNets are also more space efficient in that they share their parameters across space. Say you have a color image with a width, a height, and a depth in the form of color channels (i.e. red, green, blue). Now imagine taking a patch of this image and running an inner product against a patch of weights of the same shape. The output of this operation is a single number. Slide that inner product across the image while keeping the same weights. As we slide across the image, we construct at the output another image, but with a different width and height. This operation is described as a convolution. Now perform the same operation with several neurons, each with its own patch of weights; this results in several images, one per patch of weights, so each output location holds a vector of responses.
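The sliding procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `conv2d`, the nested-list layout (height x width x channels), the toy 4x4 image, and the two hand-picked kernels are all assumptions, and the kernel never leaves the image and moves one pixel at a time.

```python
def conv2d(image, kernels):
    """Slide each kernel over the image, one inner product per location.

    image:   height x width x channels, as nested lists
    kernels: list of patches, each kh x kw x channels, as nested lists
    returns: out_h x out_w x len(kernels), one feature map per kernel
    """
    height, width = len(image), len(image[0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    output = []
    for top in range(height - kh + 1):
        row = []
        for left in range(width - kw + 1):
            # The same weights are reused at every (top, left) location.
            row.append([
                sum(
                    image[top + i][left + j][c] * kernel[i][j][c]
                    for i in range(kh)
                    for j in range(kw)
                    for c in range(len(kernel[i][j]))
                )
                for kernel in kernels
            ])
        output.append(row)
    return output


# Toy 4x4 image with 3 channels; the pixel at (y, x) holds [y, x, 1].
image = [[[y, x, 1] for x in range(4)] for y in range(4)]

# Two hypothetical 2x2x3 kernels: one sums every value in the patch,
# the other looks only at the third channel.
sum_all = [[[1, 1, 1] for _ in range(2)] for _ in range(2)]
third_only = [[[0, 0, 1] for _ in range(2)] for _ in range(2)]

feature_maps = conv2d(image, [sum_all, third_only])
print(len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))  # 3 3 2
print(feature_maps[0][0])  # [8, 4]
```

The 4x4 input becomes a 3x3 output with depth 2, one channel per kernel, while the number of weights stays fixed at 2 x 2 x 3 per kernel regardless of the image size.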

The convolution is identical to a fully connected ANN layer if the patch size is the entire image. By using a smaller patch, however, we have fewer weights than the fully connected network, and in addition we are constrained to share these weights as the convolution operation slides across the space.
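To make the weight savings concrete, the following sketch counts weights for a convolution layer and for a fully connected layer. The sizes (a 32x32 color image, 5x5 kernels, 16 kernels) and the helper names are assumptions picked for illustration, and biases are ignored.

```python
def conv_weights(kernel_h, kernel_w, in_channels, num_kernels):
    """Weights in a convolution layer: one small patch per kernel, shared across space."""
    return kernel_h * kernel_w * in_channels * num_kernels


def dense_weights(in_h, in_w, in_channels, num_units):
    """Weights in a fully connected layer: every unit sees the entire image."""
    return in_h * in_w * in_channels * num_units


# Hypothetical sizes: 32x32 RGB input, sixteen 5x5 kernels.
shared = conv_weights(5, 5, 3, 16)

# A fully connected layer producing the same 28x28x16 output
# needs a separate set of weights for every output value.
unshared = dense_weights(32, 32, 3, 28 * 28 * 16)

# When the patch covers the entire image, the two counts coincide.
full_patch = conv_weights(32, 32, 3, 16)
full_dense = dense_weights(32, 32, 3, 16)

print(shared, unshared)          # 1200 38535168
print(full_patch == full_dense)  # True
```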

Convolution layers form a kind of pyramid structure whose base is the original image. As we apply convolutions, the layers on top squeeze the image's spatial dimensions while increasing the depth. This reflects a kind of image abstraction. The patches of weights are called convolution kernels and the transformed image is called a feature map.

Convolution layers have additional knobs. The “stride” is the number of pixels by which the kernel shifts as it moves. A stride of 1 makes the output approximately the same size as the input. A stride of 2 outputs a feature map approximately half the input size. The output size also depends on whether the convolution goes past the edge of the original image, which is referred to as “padding”. With “valid” padding the kernel never goes past an edge. Alternatively, if the kernel goes off the edge and the image is padded with zeros in such a way that the output feature map is the same size as the input map, that is referred to as “same” padding.
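The effect of stride and padding on the output size along one dimension can be sketched as a small calculation. The function name `conv_output_size` is hypothetical, and the two formulas assume the common convention (used, for example, by TensorFlow) in which "valid" keeps the kernel inside the input and "same" pads with zeros so that a stride of 1 preserves the input size.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial dimension (assumed convention, see above)."""
    if padding == "valid":
        # The kernel never crosses the edge of the input.
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # Zero padding is added so that stride 1 preserves the input size.
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")


# Hypothetical example: a 28-pixel-wide input and a 3-pixel-wide kernel.
print(conv_output_size(28, 3, stride=1, padding="valid"))  # 26
print(conv_output_size(28, 3, stride=1, padding="same"))   # 28
print(conv_output_size(28, 3, stride=2, padding="valid"))  # 13
print(conv_output_size(28, 3, stride=2, padding="same"))   # 14
```

With a stride of 1 the output is the same size as the input or slightly smaller, and with a stride of 2 it is roughly half, matching the description above.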

ConvNets were originally designed to process images. In fact, the two-dimensional convolution acts as a regularization that builds the spatial structure of images into the network. Two-dimensional locality is encouraged by the subsampling pooling layer, which aggregates two-dimensional patches. Translation invariance comes courtesy of the convolution's sliding operation across the image. These are side effects of the standard convolution method, but as we shall see in In Layer Transform, additional invariant-finding methods can be constructed.
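A small one-dimensional sketch can make the translation behaviour tangible. Sliding the same weights means that a shifted input produces a correspondingly shifted feature map, and pooling over the whole feature map then gives the same response wherever the pattern sits. The helper names, the signal, and the kernel below are assumptions for illustration, and a single global max stands in for the pooling layer.

```python
def slide_inner_product(signal, kernel):
    """Take the inner product of `kernel` with every window of `signal`."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def global_max_pool(feature_map):
    """Collapse a feature map to its single strongest response."""
    return max(feature_map)


kernel = [1, 2, 1]
signal = [0, 0, 3, 5, 3, 0, 0, 0]
shifted = [0, 0, 0, 3, 5, 3, 0, 0]  # the same bump, moved one step to the right

response = slide_inner_product(signal, kernel)
response_shifted = slide_inner_product(shifted, kernel)

print(response)          # [3, 11, 16, 11, 3, 0]
print(response_shifted)  # [0, 3, 11, 16, 11, 3]

# The feature map moves with the input, and the pooled value does not change.
print(global_max_pool(response), global_max_pool(response_shifted))  # 16 16
```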

The accidental discovery of the effectiveness of ConvNets led to the realization of the real flexibility and power of deep learning. Almost all of the benchmarked super-human capabilities of the last five years can be attributed to ConvNets. The flexibility to introduce regularizations that build in invariances, so that the network ignores variations that do not matter in the training data, is unmatched by any other machine learning method.

ConvNets have been employed in applications outside of image classification and are now regarded as a general technique that can be applied to almost any kind of data.

**Known Uses**

https://arxiv.org/abs/1502.01710 Text Understanding from Scratch

In this article we provide a first evidence on ConvNets’ applicability to text understanding tasks from scratch, that is, ConvNets do not need any knowledge on the syntactic or semantic structure of a language to give good benchmarks text understanding.

http://arxiv.org/abs/1606.01781v1 Very Deep Convolutional Networks for Natural Language Processing

The dominant approach for many NLP tasks are recurrent neural networks, in particular LSTMs, and convolutional neural networks. However, these architectures are rather shallow in comparison to the deep convolutional networks which are very successful in computer vision. We present a new architecture for text processing which operates directly on the character level and uses only small convolutions and pooling operations. We are able to show that the performance of this model increases with the depth: using up to 29 convolutional layers, we report significant improvements over the state-of-the-art on several public text classification tasks.

https://arxiv.org/pdf/1609.03193v2.pdf Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

https://arxiv.org/abs/1609.03499v2 WaveNet: A Generative Model for Raw Audio

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

https://arxiv.org/abs/1610.03017 Fully Character-Level Neural Machine Translation without Explicit Segmentation

We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT'15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs.

https://arxiv.org/pdf/1610.10099v1.pdf Neural Machine Translation in Linear Time

We present a neural translation model, the ByteNet, and a neural language model, the ByteNet Decoder, that aim at addressing these drawbacks. The ByteNet uses convolutional neural networks with dilation for both the source network and the target network. The ByteNet connects the source and target networks via stacking and unfolds the target network dynamically to generate variable length output sequences. We view the ByteNet as an instance of a wider family of sequence-mapping architectures that stack the sub-networks and use dynamic unfolding. The sub-networks themselves may be convolutional or recurrent.

https://arxiv.org/abs/1611.02344 A Convolutional Encoder Model for Neural Machine Translation

The prevalent approach to neural machine translation relies on bi-directional LSTMs to encode the source sentence. In this paper we present a faster and conceptually simpler architecture based on a succession of convolutional layers. This allows to encode the entire source sentence simultaneously compared to recurrent networks for which computation is constrained by temporal dependencies.

http://metamind.io/research/new-neural-network-building-block-allows-faster-and-more-accurate-text-understanding/ New neural network building block allows faster and more accurate text understanding

https://blog.acolyer.org/2016/11/22/achieving-human-parity-in-conversational-speech-recognition/

**Related Patterns**

<Diagram>

**Further Reading**

**References**

http://colah.github.io/posts/2014-07-Conv-Nets-Modular/

http://neuralnetworksanddeeplearning.com/chap6.html

http://arxiv.org/pdf/1605.06489v1.pdf

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going Deeper with Convolutions. In: eprint arXiv:1409.4842 (2014)

GoogLeNet. In contrast to much other work, Szegedy et al. [35] propose a CNN architecture that is highly optimized for computational efficiency. GoogLeNet uses, as a basic building block, a mixture of low-dimensional embeddings [27] and heterogeneously sized spatial filters – collectively an ‘inception’ module. There are two distinct forms of convolutional layers in the inception module, low-dimensional embeddings (1×1) and spatial (3×3, 5×5). GoogLeNet keeps large, expensive spatial convolutions (i.e. 5×5) to a minimum by using few of these filters, using more 3×3 convolutions, and even more 1×1 filters again. The motivation is that most of the convolutional filters respond to localized patterns in a small receptive field, with few requiring a larger receptive field. The number of filters in each successive inception unit increase slowly with decreasing feature map size, in order to maintain computational performance. GoogLeNet is by far the most efficient state-of-the-art network for ILSVRC, achieving near state-of-the-art accuracy with the lowest computation/model size. However, we will show that even such an efficient and optimized network architecture benefits from our method.

http://cs231n.github.io/convolutional-networks/#layerpat

http://arxiv.org/pdf/1512.07108.pdf Recent Advances in Convolutional Neural Networks

https://www.quora.com/What-is-an-intuitive-explanation-for-convolution

https://arxiv.org/pdf/1410.0781.pdf The SimNet architecture consists of two operators – a “similarity” operator that generalizes the inner-product operator found in ConvNets, and a soft max-average-min operator called MEX that replaces the ConvNet ReLU activation and max/average pooling layers.

(a) SimNet similarity layer (b) SimNet MEX layer (c) ConvNet ReLU activation layer (d) ConvNet max/average pooling layer (e) SimNet MLP with multiple outputs (f) SimNet with locality, sharing and pooling (g) Patch-labeling SimNet (h) SimNet lp-similarity layer followed by MEX layer

http://www.trivialorwrong.com/2016/06/01/laws-sausages-and-convnets.html

Full-blown ConvNets may incorporate a variety of ideas and mechanisms, but in the following I’m going to focus on their very core: convolutional layers. Convolution is a simple mathematical operation, so the enormous complexity involved in implementing convolutional layers may be surprising.

http://timdettmers.com/2015/03/26/convolution-deep-learning/

http://arxiv.org/pdf/1603.07285v1.pdf A guide to convolution arithmetic for deep learning

https://github.com/vdumoulin/conv_arithmetic

http://arxiv.org/abs/1606.02228 Systematic evaluation of CNN advances on the ImageNet

https://culurciello.github.io/tech/2016/06/04/nets.html

http://arxiv.org/pdf/1508.01983v4.pdf Digging Deep into the Layers of CNNs: In Search of How CNNs Achieve View Invariance

We find that a pre-trained network captures representations that highly preserve the manifold structure at most of the network layers, including the fully connected layers, except the final layer. Although the model is pre-trained on ImageNet, not a densely sampled multi-view dataset, still, the layers have the capacity to encode view manifold structure. It is clear from the analysis that, except of the last layer, the representation tries to achieve view invariance by separating individual instances’ view manifolds while preserving them, instead of collapsing the view manifolds to degenerate representations. This is violated at the last layer which enforces view invariance.

http://brohrer.github.io/how_convolutional_neural_networks_work.html

http://arxiv.org/abs/1606.00915 DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

http://www.matthewzeiler.com/pubs/cvpr2010/cvpr2010.pdf Deconvolutional Networks

https://arxiv.org/abs/1605.06743v2 Inductive Bias of Deep Convolutional Networks through Pooling Geometry

Our formal understanding of the inductive bias that drives the success of convolutional networks on computer vision tasks is limited. In particular, it is unclear what makes hypotheses spaces born from convolution and pooling operations so suitable for natural images. In this paper we study the ability of convolutional networks to model correlations among regions of their input. We theoretically analyze convolutional arithmetic circuits, and empirically validate our findings on other types of convolutional networks as well. **Correlations are formalized through the notion of separation rank, which for a given partition of the input, measures how far a function is from being separable.** We show that a polynomially sized deep network supports exponentially high separation ranks for certain input partitions, while being limited to polynomial separation ranks for others. The network's pooling geometry effectively determines which input partitions are favored, thus serves as a means for controlling the inductive bias. Contiguous pooling windows as commonly employed in practice favor interleaved partitions over coarse ones, orienting the inductive bias towards the statistics of natural images. Other pooling schemes lead to different preferences, and this allows tailoring the network to data that departs from the usual domain of natural imagery. In addition to analyzing deep networks, we show that shallow ones support only linear separation ranks, and by this gain insight into the benefit of functions brought forth by depth - they are able to efficiently model strong correlation under favored partitions of the input.

https://arxiv.org/pdf/1512.07108v5.pdf Recent Advances in Convolutional Neural Networks

http://arxiv.org/pdf/1702.07664v1.pdf How ConvNets model Non-linear Transformations

We theoretically address three fundamental problems involving deep convolutional networks regarding invariance, depth and hierarchy. We introduce the paradigm of Transformation Networks (TN) which are a direct generalization of Convolutional Networks (ConvNets). Theoretically, we show that TNs (and thereby ConvNets) can be invariant to non-linear transformations of the input despite pooling over mere local translations.

https://arxiv.org/pdf/1412.6806.pdf Striving for Simplicity: The All Convolutional Net

Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding – and building on other recent work for finding simple network structures – we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the “deconvolution approach” for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.

https://arxiv.org/abs/1601.04920 Understanding Deep Convolutional Networks

Deep convolutional networks provide state of the art classifications and regressions results over many high-dimensional problems. We review their architecture, which scatters data with a cascade of linear filter weights and non-linearities. A mathematical framework is introduced to analyze their properties. Computations of invariants involve multiscale contractions, the linearization of hierarchical symmetries, and sparse separations. Applications are discussed.

https://arxiv.org/pdf/1705.06820v1.pdf Pixel Deconvolutional Networks

https://arxiv.org/abs/1705.03122 Convolutional Sequence to Sequence Learning

The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. Compared to recurrent models, computations over all elements can be fully parallelized during training and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.

https://arxiv.org/pdf/1711.08920.pdf SplineCNN: Fast Geometric Deep Learning with Continuous B-Spline Kernels

Our main contribution is a novel convolution operator based on B-splines, that makes the computation time independent from the kernel size due to the local support property of the B-spline basis functions. As a result, we obtain a generalization of the traditional CNN convolution operator by using continuous kernel functions parametrized by a fixed number of trainable weights. In contrast to related approaches that filter in the spectral domain, the proposed method aggregates features purely in the spatial domain. As a main advantage, SplineCNN allows entire end-to-end training of deep architectures, using only the geometric structure as input, instead of handcrafted feature descriptors.

https://arxiv.org/abs/1712.09662 CNN Is All You Need

In this work we introduce an extended CNN model with strengthen position-sensitivity, called PoseNet. A notable feature of PoseNet is the asymmetric treatment of position information in the encoder and the decoder. Experiments shows that PoseNet allows us to improve the accuracy of CNN based sequence-to-sequence learning significantly, achieving around 33-36 BLEU scores on the WMT 2014 English-to-German translation task, and around 44-46 BLEU scores on the English-to-French translation task.

https://arxiv.org/abs/1804.11191v1 How convolutional neural network see the world - A survey of convolutional neural network visualization methods

https://arxiv.org/abs/1512.06293v3 A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction