
Model Factorization


Name

This identifies the pattern and should be representative of the concept that it describes. The name should be a noun that fits naturally into a sentence; we would like the pattern to be easy to reference in conversation between practitioners.


Intent

Describes the meaning of the pattern in a single concise sentence.


Motivation

This section describes why the pattern is needed in practice. Other pattern languages refer to this as the Problem. In our pattern language, we express it as one or more questions and then provide further explanation behind each question.


Sketch

This section provides alternative descriptions of the pattern in the form of an illustration or an alternative formal expression. By looking at the sketch, a reader may quickly understand the essence of the pattern.


Discussion

This is the main section of the pattern, explaining it in greater detail. We leverage the vocabulary described in the theory section of this book. Rather than providing detailed proofs, we reference their sources. This section expounds on how the motivation is addressed, and also notes additional questions that may be interesting topics for future research.

Known Uses

Here we review several projects or papers that have used this pattern.

Related Patterns

In this section we describe, in a diagram, how this pattern is conceptually related to other patterns. These relationships may be precise or fuzzy, so we provide further explanation of the nature of each relationship. We also describe other patterns that may not be conceptually related but that work well in combination with this pattern.

Relationship to Canonical Patterns

Relationship to other Patterns

Further Reading

We provide here some additional external material that will help in exploring this pattern in more detail.

References

Global Optimality in Matrix and Tensor Factorization, Deep Learning, and Beyond

Matrix, tensor, and other factorization techniques are used in a wide range of applications and have enjoyed significant empirical success in many fields. However, common to a vast majority of these problems is the significant disadvantage that the associated optimization problems are typically non-convex due to a multilinear form or other convexity-destroying transformation. Building on ideas from convex relaxations of matrix factorizations, in this talk I will present a very general framework which allows for the analysis of a wide range of non-convex factorization problems, including matrix factorization, tensor factorization, and deep neural network training formulations. In particular, I will present sufficient conditions under which a local minimum of the non-convex optimization problem is a global minimum and show that if the size of the factorized variables is large enough then from any initialization it is possible to find a global minimizer using a purely local descent algorithm. Our framework also provides a partial theoretical justification for the increasingly common use of Rectified Linear Units (ReLUs) in deep neural networks and offers guidance on deep network architectures and regularization strategies to facilitate efficient optimization.

Sparse Matrix Factorization

A deep matrix factorization method for learning attribute representations

Tensor Decomposition for Signal Processing and Machine Learning (see slide 87)

A Purely Geometric Approach to Non-Negative Matrix Factorization

When Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?

New developments in nonnegative matrix factorization

Alternating optimization method based on nonnegative matrix factorizations for deep neural networks
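Several of the entries above concern non-negative matrix factorization. As a concrete illustration, here is a minimal NumPy sketch of the classic Lee-Seung multiplicative updates; it is an illustrative baseline, not the algorithm of any one paper listed, and all names and dimensions are arbitrary:

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V ~ W @ H with the classic
    Lee-Seung multiplicative updates (Frobenius objective).
    Each update is a ratio of non-negative terms, so W and H
    remain non-negative by construction."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Usage: rank-5 parts-based approximation of a random non-negative matrix.
V = np.abs(np.random.default_rng(1).normal(size=(30, 20)))
W, H = nmf(V, rank=5)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The multiplicative form is what enforces the non-negativity constraint without any explicit projection step.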

We proposed an alternating optimization algorithm for computing the weight matrices of fully-connected DNNs by using the semi-NMF and the nonlinear semi-NMF.

Learning understandable neural networks with non-negative weight constraints

Deep Learning of Part-based Representation of Data Using Sparse Autoencoders with Nonnegativity Constraints

The results indicate that the nonnegativity constraints in the NCAE method force the autoencoder to learn features that capture a part-based representation of data, while achieving lower reconstruction error and better sparsity in the hidden encoding as compared with SAE and NMF.

Hierarchical feature extraction by multi-layer non-negative matrix factorization network for classification task

A deep matrix factorization method for learning attribute representations

Meta-Prod2Vec - Product Embeddings Using Side-Information for Recommendation

Our method leverages past user interactions with items and their attributes to compute low-dimensional embeddings of items. Specifically, the item metadata is injected into the model as side information to regularize the item embeddings.

This work makes a novel connection between the recent embedding-based methods and consecrated Matrix Factorization methods by introducing learning with side information in the context of embeddings.

Higher-Order Factorization Machines

The key idea is to introduce a tensor from user-item-rating, thus being able to recommend even from negative feedback.

Factorized Convolutional Neural Networks

The convolutional layer proposed in this work requires only a quarter of the computation required by normal 1 × 1 convolution, and is able to achieve surprisingly good performance.

Don't interpret linear hidden units, they don't exist.

Factorization Machines

We introduce Factorization Machines (FM), a new model class that combines the advantages of Support Vector Machines (SVM) with factorization models. Like SVMs, FMs are a general predictor working with any real-valued feature vector. In contrast to SVMs, FMs model all interactions between variables using factorized parameters. Thus they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail.

Field-aware Factorization Machines for CTR Prediction
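The second-order Factorization Machine model quoted above admits an O(kn) prediction rule thanks to a well-known algebraic identity. A minimal sketch, with illustrative names and shapes:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order Factorization Machine prediction.
    x: (n,) features; w0: bias; w: (n,) linear weights;
    V: (n, k) factor matrix, so the interaction weight of the
    feature pair (i, j) is <V[i], V[j]> and can be estimated
    even if features i and j never co-occur in training data."""
    # O(n*k) identity: sum_{i<j} <V[i],V[j]> x_i x_j
    #   = 0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2]
    s = V.T @ x                  # (k,)
    s2 = (V ** 2).T @ (x ** 2)   # (k,)
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)
```

Factorizing the pairwise weights is precisely what lets the model generalize to interactions it has never observed, which dense pairwise models cannot do under extreme sparsity.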

A variant of FMs, field-aware factorization machines (FFMs), outperforms existing models in some worldwide CTR-prediction competitions.

Deep Learning over Multi-field Categorical Data: A Case Study on User Response Prediction

To get our DNNs to work efficiently, we propose to leverage three feature-transformation methods, i.e., factorisation machines (FMs), restricted Boltzmann machines (RBMs) and denoising auto-encoders (DAEs). This paper presents the structure of our models and their efficient training algorithms. The large-scale experiments with real-world data demonstrate that our methods work better than major state-of-the-art models.

Dynamic Matrix Factorization with Priors on Unknown Values

In this work, we build on this assumption and introduce a novel dynamic matrix factorization framework that allows setting an explicit prior on unknown values.

Why does deep and cheap learning work so well?

Era of Big Data Processing: A New Approach via Tensor Networks and Tensor Decompositions

A Practical Introduction to Tensor Networks: Matrix Product States and Projected Entangled Pair States

Semi-Supervised Learning with the Deep Rendering Mixture Model

A Probabilistic Framework for Deep Learning

Ultimate tensorization: compressing convolutional and FC layers alike

Convolutional neural networks excel in image recognition tasks, but this comes at the cost of high computational and memory complexity. To tackle this problem, [1] developed a tensor factorization framework to compress fully-connected layers. In this paper, we focus on compressing convolutional layers. We show that while the direct application of the tensor framework [1] to the 4-dimensional kernel of convolution does compress the layer, we can do better. We reshape the convolutional kernel into a tensor of higher order and factorize it. We combine the proposed approach with the previous work to compress both convolutional and fully-connected layers of a network and achieve 80x network compression rate with 1.1% accuracy drop on the CIFAR-10 dataset.

Introduction to Nonnegative Matrix Factorization

In this paper, we introduce and provide a short overview of nonnegative matrix factorization (NMF). Several aspects of NMF are discussed, namely, the application in hyperspectral imaging, geometry and uniqueness of NMF solutions, complexity, algorithms, and its link with extended formulations of polyhedra. In order to put NMF into perspective, the more general problem class of constrained low-rank matrix approximation problems is first briefly introduced.

Orthogonalized ALS: A Theoretically Principled Tensor Decomposition Algorithm for Practical Use

We propose a modification of the ALS approach that is as efficient as standard ALS, but provably recovers the true factors with random initialization under standard incoherence assumptions on the factors of the tensor.

Tractable Generative Convolutional Arithmetic Circuits

Casting neural networks in generative frameworks is a highly sought-after endeavor these days. Existing methods, such as Generative Adversarial Networks, capture some of the generative capabilities, but not all. To truly leverage the power of generative models, tractable marginalization is needed, a feature outside the realm of current methods. We present a generative model based on convolutional arithmetic circuits, a variant of convolutional networks that computes high-dimensional functions through tensor decompositions. Our method admits tractable marginalization, combining the expressive power of convolutional networks with all the abilities that may be offered by a generative framework. We focus on the application of classification under missing data, where unknown portions of classified instances are absent at test time. Our model, which theoretically achieves optimal classification, provides state of the art performance when classifying images with missing pixels, as well as promising results when treating speech with occluded samples.

Sharing Residual Units Through Collective Tensor Factorization in Deep Neural Networks

Residual functions can actually be explained with a unified framework based on generalized block term decomposition. Then, based on the new explanation, we propose a new architecture, Collective Residual Unit (CRU), which enhances the parameter efficiency of deep neural networks through collective tensor factorization. CRU enables knowledge sharing across different residual units using shared factors.

Low-Rank Tensor Networks for Dimensionality Reduction and Large-Scale Optimization Problems: Perspectives and Challenges PART 1

DeepFM: A Factorization-Machine based Neural Network for CTR Prediction

In this paper, we show that it is possible to derive an end-to-end learning model that emphasizes both low- and high-order feature interactions. The proposed model, DeepFM, combines the power of factorization machines for recommendation and deep learning for feature learning in a new neural network architecture.

Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions

In this paper we study the expressive efficiency brought forth by the architectural feature of connectivity, motivated by the observation that nearly all state of the art networks these days employ elaborate connection schemes, running layers in parallel while splitting and merging them in various ways. A formal treatment of this question would shed light on the effectiveness of modern connectivity schemes, and in addition, could provide new tools for network design. We focus on dilated convolutional networks, a family of deep models gaining increased attention, underlying state of the art architectures like Google's WaveNet and ByteNet. By introducing and studying the concept of mixed tensor decompositions, we prove that interconnecting dilated convolutional networks can lead to expressive efficiency. In particular, we show that a single connection between intermediate layers can already lead to an almost quadratic gap, which in large-scale settings typically makes the difference between a model that is practical and one that is not.

Factorization tricks for LSTM networks

We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is "matrix factorization by design" of the LSTM matrix into the product of two smaller matrices, and the second one is partitioning of the LSTM matrix, its inputs and states into independent groups. Both approaches allow us to train large LSTM networks significantly faster to state-of-the-art perplexity. On the One Billion Word Benchmark we improve single-model perplexity down to 24.29.

Word Embeddings via Tensor Factorization
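The "matrix factorization by design" trick described above amounts to parameterizing the large LSTM weight matrix as a product of two thin matrices. A rough sketch of the parameter accounting; all sizes here are illustrative, not taken from the paper:

```python
import numpy as np

# Instead of learning the full LSTM parameter matrix W of shape
# (4*hidden, hidden + input), learn two smaller factors whose
# product plays its role.
hidden, inp, r = 1024, 1024, 128

full_params = 4 * hidden * (hidden + inp)   # standard LSTM weight count
W1 = np.zeros((r, hidden + inp))            # first (thin) factor
W2 = np.zeros((4 * hidden, r))              # second factor
factored_params = W1.size + W2.size

# Gate pre-activations become W2 @ (W1 @ [h; x]) instead of W @ [h; x].
h, x = np.zeros(hidden), np.zeros(inp)
pre = W2 @ (W1 @ np.concatenate([h, x]))    # shape (4*hidden,)
```

With these sizes the factored parameterization is roughly an order of magnitude smaller, at the cost of constraining the effective weight matrix to rank r.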

Tensor factorization appears to be a highly applicable and effective approach to learning generic word embeddings. It is advantageous to encode many different types of information in word embeddings, so we should not restrict ourselves to solutions which only are trained on pairwise word relations. There seems to be much information encoded in tensor-based word embedding vectors that is not encoded in the vectors produced by matrix-based word embeddings, and such information will likely prove useful in downstream NLP tasks.

Building effective deep neural network architectures one feature at a time

In consequence we propose a novel algorithm to evolve fixed-depth architectures, starting from just a single feature per layer, to attain the effective representational capacity needed for a specific task by greedily adding feature by feature. We revisit popular CNN architectures and demonstrate how evolved architectures not only converge to similar topologies that benefit from fewer parameters or improved accuracy, but furthermore exhibit systematic correspondence in representational complexity with the specified task.

In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

Implicit Regularization in Matrix Factorization

We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.
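A toy numerical version of this setting: gradient descent on a full-dimensional factorization X = UU^T, initialized near zero, fitting the observed entries of a low-rank matrix. Dimensions, step size, and iteration count below are arbitrary illustrative choices, not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
u_true = rng.normal(size=(d, 1))
X_true = u_true @ u_true.T              # rank-1 PSD ground truth
mask = rng.random((d, d)) < 0.5
mask = mask | mask.T                    # symmetric observation pattern

U = 0.01 * rng.normal(size=(d, d))      # full-dimensional factor, small init
lr = 0.005
loss0 = np.linalg.norm(mask * (U @ U.T - X_true))
for _ in range(10000):
    R = mask * (U @ U.T - X_true)       # residual on observed entries
    U -= lr * 4 * (R @ U)               # gradient of ||mask*(U U^T - X)||_F^2
loss = np.linalg.norm(mask * (U @ U.T - X_true))
```

Nothing in the objective restricts the rank of UU^T; any preference the solution shows for low rank or small nuclear norm comes from the optimization dynamics themselves, which is the phenomenon the paper studies.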

We also discuss how the bias in the non-convex case is much more delicate than in convex gradient descent: since we are not restricted to a flat manifold, the bias introduced by optimization depends on the step sizes taken. Furthermore, for linear least square problems (i.e. methods based on the gradients w.r.t. X in our formulation), any global optimization method that uses linear combinations of gradients, including conjugate gradient descent, Nesterov acceleration and momentum methods, remains on the manifold spanned by the gradients, and so leads to the same minimum norm solution. This is not true if the manifold is curved, as using momentum or past gradients will lead us to "shoot off" the manifold.

Tensor Contraction Layers for Parsimonious Deep Nets

We propose the Tensor Contraction Layer (TCL), the first attempt to incorporate tensor contractions as end-to-end trainable neural network layers.

Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks

Neural Factorization Machines for Sparse Predictive Analytics

Tensor Regression Networks

To date, most convolutional neural network architectures output predictions by flattening 3rd-order activation tensors and applying fully-connected output layers. This approach has two drawbacks: (i) we lose rich, multi-modal structure during the flattening process and (ii) fully-connected layers require many parameters. We present the first attempt to circumvent these issues by expressing the output of a neural network directly as the result of a multi-linear mapping from an activation tensor to the output. By imposing low-rank constraints on the regression tensor, we can efficiently solve problems for which existing solutions are badly parametrized. Our proposed tensor regression layer replaces flattening operations and fully-connected layers by leveraging multi-modal structure in the data and expressing the regression weights via a low rank tensor decomposition. Additionally, we combine tensor regression with tensor contraction to further increase efficiency. Augmenting the VGG and ResNet architectures, we demonstrate large reductions in the number of parameters with negligible impact on performance on the ImageNet dataset.

Domain-adaptive deep network compression
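The low-rank tensor regression idea above can be sketched directly in NumPy: a CP-factored regression tensor maps a (channels, height, width) activation tensor to class scores without flattening. All sizes and the rank below are illustrative:

```python
import numpy as np

C, H, W, n_classes, rank = 64, 8, 8, 10, 4
rng = np.random.default_rng(0)

# CP factors of the regression tensor:
# G[c, h, w, o] = sum_r A[c, r] * B[h, r] * D[w, r] * E[o, r]
A = rng.normal(size=(C, rank))
B = rng.normal(size=(H, rank))
D = rng.normal(size=(W, rank))
E = rng.normal(size=(n_classes, rank))

def tensor_regression(X):
    """X: (C, H, W) activations -> (n_classes,) scores, without
    ever materializing the full C*H*W*n_classes weight tensor."""
    z = (X[..., None] * A[:, None, None, :]).sum(0)  # (H, W, rank)
    z = (z * B[:, None, :]).sum(0)                   # (W, rank)
    z = (z * D).sum(0)                               # (rank,)
    return E @ z

X = rng.normal(size=(C, H, W))
scores = tensor_regression(X)
# Dense output layer: C*H*W*n_classes weights;
# CP factors: only rank*(C + H + W + n_classes).
```

Because the weight tensor never has to be materialized, the parameter count scales with the sum of the mode sizes rather than their product.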

We focus on compression algorithms based on low-rank matrix decomposition. Existing methods base compression solely on learned network weights and ignore the statistics of network activations. We show that domain transfer leads to large shifts in network activations and that it is desirable to take this into account when compressing. We demonstrate that considering activation statistics when compressing weights leads to a rank-constrained regression problem with a closed-form solution. Because our method takes into account the target domain, it can more optimally remove the redundancy in the weights.

On Compressing Deep Models by Low Rank and Sparse Decomposition

Expressive power of recurrent neural networks

A certain class of deep convolutional networks, namely those that correspond to the Hierarchical Tucker (HT) tensor decomposition, has been proven to have exponentially higher expressive power than shallow networks.

Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition

Block-Term tensor decomposition greatly reduces the parameters of RNNs and improves their training efficiency. Compared with alternative low-rank approximations, such as tensor-train RNN (TT-RNN), our method, Block-Term RNN (BT-RNN), is not only more concise (when using the same rank), but also able to attain a better approximation to the original RNNs with much fewer parameters. On three challenging tasks, including Action Recognition in Videos, Image Captioning and Image Generation, BT-RNN outperforms TT-RNN and the standard RNN in terms of both prediction accuracy and convergence rate. Specifically, BT-LSTM utilizes 17,388 times fewer parameters than the standard LSTM to achieve an accuracy improvement of over 15.6% in the Action Recognition task on the UCF11 dataset.

Covariant Compositional Networks For Learning Graphs

Most existing neural networks for learning graphs address permutation invariance by conceiving of the network as a message passing scheme, where each node sums the feature vectors coming from its neighbors. We argue that this imposes a limitation on their representation power, and instead propose a new general architecture for representing objects consisting of a hierarchy of parts, which we call Covariant Compositional Networks (CCNs). Here, covariance means that the activation of each neuron must transform in a specific way under permutations, similarly to steerability in CNNs. We achieve covariance by making each activation transform according to a tensor representation of the permutation group, and derive the corresponding tensor aggregation rules that each neuron must implement. Experiments show that CCNs can outperform competing methods on standard graph learning benchmarks.