Multiscale Hierarchical Convolutional Networks

Deep neural network algorithms are difficult to analyze because they lack a structure that makes it possible to understand the properties of the underlying transforms and invariants. Multiscale hierarchical convolutional networks are structured deep convolutional networks where layers are indexed by progressively higher dimensional attributes, which are learned from training data. Each new layer is computed with multidimensional convolutions along spatial and attribute variables. We introduce an efficient implementation of such networks where the dimensionality is progressively reduced by averaging intermediate layers along attribute indices. Hierarchical networks are tested on CIFAR image databases, where they obtain accuracies comparable to state-of-the-art networks with far fewer parameters. We study some properties of the attributes learned from these databases.
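
The link between convolutions along attribute variables and averaging along attribute indices can be illustrated with a small sketch. It is not the paper's implementation: it assumes a single attribute index with circular (periodic) boundary handling, and the names `circular_conv`, `shift`, and `average`, as well as the numbers, are hypothetical. Under those assumptions, a convolution along the attribute index commutes with translations along that index, so averaging along the index yields a value that does not change when the input is translated.

```python
import math

def circular_conv(x, h):
    """Circular convolution of x with filter h along one attribute index."""
    n = len(x)
    return [sum(h[k] * x[(i - k) % n] for k in range(len(h))) for i in range(n)]

def shift(x, s):
    """Circularly translate x by s positions along the attribute index."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]

def average(x):
    """Average along the attribute index (reduces the dimensionality by one)."""
    return sum(x) / len(x)

# Toy layer: one spatial position, five values of a single attribute index.
# The values and the filter are made up for illustration only.
x = [1.0, 4.0, 2.0, 0.5, 3.0]
h = [0.5, -1.0, 0.25]

# Convolution along the attribute commutes with translation along it.
conv_then_shift = shift(circular_conv(x, h), 2)
shift_then_conv = circular_conv(shift(x, 2), h)
commutes = conv_then_shift == shift_then_conv

# Averaging along the attribute therefore gives a translation invariant.
invariant = math.isclose(average(circular_conv(x, h)), average(shift_then_conv))

print(commutes, invariant)
```

In a full network the same reasoning would apply to each attribute index separately, with spatial convolutions handled in the usual way.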

Multiscale hierarchical convolutional networks give a mathematical framework to study invariants computed by deep neural networks. Layers are parameterized in progressively higher dimensional spaces of hierarchical attributes, which are learned from training data. All network operators are multidimensional convolutions along attribute indices, so that invariants can be computed by summations along these attributes.

Self-Normalizing Neural Networks

The authors introduce self-normalizing neural networks (SNNs) whose layer activations automatically converge towards zero mean and unit variance and are robust to noise and perturbations. Significance: Removes the need for the finicky batch normalization and permits training deeper networks with a robust training scheme.

Multiscale sequence modeling with a learned dictionary

Instead of predicting one symbol at a time, our multi-scale model makes predictions over multiple, potentially overlapping multi-symbol tokens. A variation of the byte-pair encoding (BPE) compression algorithm is used to learn the dictionary of tokens that the model is trained with. When applied to language modelling, our model has the flexibility of character-level models while maintaining many of the performance benefits of word-level models. Our experiments show that this model performs better than a regular LSTM on language modeling tasks, especially for smaller models.

An Analysis of Neural Language Modeling at Multiple Scales

Many of the leading approaches in language modeling introduce novel, complex and specialized architectures. We take existing state-of-the-art word-level language models based on LSTMs and QRNNs and extend them to both larger vocabularies and character-level granularity. When properly tuned, LSTMs and QRNNs achieve state-of-the-art results on character-level (Penn Treebank, enwik8) and word-level (WikiText-103) datasets, respectively. Results are obtained in only 12 hours (WikiText-103) to 2 days (enwik8) using a single modern GPU.

By extending an existing state-of-the-art word-level language model based on LSTMs and QRNNs, we show that a well-tuned baseline can achieve state-of-the-art results on both character-level (Penn Treebank, enwik8) and word-level (WikiText-103) datasets without relying on complex or specialized architectures. We additionally perform an empirical investigation of the learning and network dynamics of both LSTM and QRNN cells across different language modeling tasks, highlighting the differences between the learned character-level and word-level models. Finally, we present results that shed light on the relative importance of the various hyperparameters in neural language models. On the WikiText-2 dataset, the AWD-QRNN model exhibited higher sensitivity to the hidden-to-hidden weight dropout and input dropout terms and relative insensitivity to the embedding and hidden layer sizes. We hope that this insight will be useful for practitioners intending to tune similar models on new datasets.

We analyze the relative importance of the hyperparameters defining the model using a Random Forest approach for the word-level task on the smaller WikiText-2 dataset for the AWD-QRNN model. The results show that weight dropout, hidden dropout and embedding dropout impact performance the most, while the number of layers and the embedding and hidden dimension sizes matter relatively less. Similar results are observed on the PTB word-level dataset.
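
As a rough illustration of how a forest-based importance analysis of this kind can work, the sketch below fits a heavily simplified stand-in for a random forest: an ensemble of depth-one regression trees, each trained on a bootstrap sample with a random subset of features, with importance measured as the normalized total variance reduction attributed to each feature. This is not the authors' procedure or code. The function names are hypothetical, and the data is synthetic, generated so that the first feature dominates by construction; it says nothing about real language-model results.

```python
import random

def variance(ys):
    """Population variance of a list of numbers."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_stump(rows, ys, features):
    """Return (feature, variance reduction) of the best single split."""
    n = len(ys)
    parent = variance(ys)
    best = (None, 0.0)
    for f in features:
        order = sorted(range(n), key=lambda i: rows[i][f])
        for cut in range(1, n):
            left = [ys[i] for i in order[:cut]]
            right = [ys[i] for i in order[cut:]]
            child = (len(left) * variance(left) + len(right) * variance(right)) / n
            if parent - child > best[1]:
                best = (f, parent - child)
    return best

def stump_forest_importance(rows, ys, n_features, n_trees=50, max_features=2, seed=0):
    """Normalized variance-reduction importance from a forest of stumps."""
    rng = random.Random(seed)
    n = len(ys)
    totals = [0.0] * n_features
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        feats = rng.sample(range(n_features), max_features)  # random feature subset
        f, gain = best_stump([rows[i] for i in idx], [ys[i] for i in idx], feats)
        if f is not None:
            totals[f] += gain
    s = sum(totals)
    return [t / s for t in totals] if s > 0 else totals

# Synthetic configurations (made-up values scaled to [0, 1]); the synthetic
# score depends strongly on the first column by construction.
names = ["weight_dropout", "embedding_size", "hidden_size"]
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in names] for _ in range(60)]
ys = [10.0 * r[0] + 1.0 * r[1] + 0.5 * r[2] + data_rng.gauss(0, 0.1) for r in rows]

importance = stump_forest_importance(rows, ys, len(names))
for name, value in zip(names, importance):
    print(f"{name}: {value:.3f}")
```

In practice one would more likely use a full random forest implementation from an established library, fitted on real (configuration, validation score) pairs collected from tuning runs; the stump ensemble here only keeps the example dependency-free.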