Merge Operator

Aliases Crossover Operator, Entanglement Operator


A generalization of an operator that combines the outputs of Ensembles.


How do we compose features into features of a higher abstraction?


$Merge: R^N \times R^M \rightarrow R^{N+M}$ <Diagram>
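Read literally, the signature above describes concatenation: an N-dimensional and an M-dimensional feature vector go in, and an (N+M)-dimensional vector comes out. A minimal sketch in plain Python follows; the function name `merge` and the sample values are illustrative assumptions, not something this pattern defines.

```python
from typing import List, Sequence


def merge(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Concatenate two feature vectors: R^N x R^M -> R^(N+M)."""
    return list(a) + list(b)


# Outputs of two hypothetical ensembles (N = 3, M = 2).
features_a = [0.2, 0.7, 0.1]
features_b = [1.5, -0.3]

merged = merge(features_a, features_b)
print(len(merged))  # 5, i.e. N + M
print(merged)       # [0.2, 0.7, 0.1, 1.5, -0.3]
```

Concatenation is only the simplest reading of the signature; it keeps both inputs intact and leaves any mixing to whatever layer consumes the merged vector.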


A key point about neural networks is that an individual neuron does not do much in isolation. One has to consider the collective effect of all the neurons in a layer. We can treat a layer as an Ensemble of simple classifiers, and an entire network as an Ensemble of Ensembles. As a consequence, neural networks are self-similar networks of Ensembles.

How one layer provides input to another involves a lot of entanglement. If we look at an individual neuron, we see that its inputs come from every dimension of the input Feature vector. This is how a fully connected network is constructed: every input is involved in every neuron in the layer. The output of a single neuron is relevant on its own only if it is the final output of the network. Intermediate outputs of internal neurons are relevant only when they are considered collectively. In other words, we treat the outputs collectively as a single Feature vector.
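The entanglement described above can be made concrete with a small sketch. It assumes a bias-free, purely linear layer for brevity, and the name `fully_connected` and the weight values are hypothetical. Changing a single input dimension shifts every neuron's output, because every neuron reads every input.

```python
from typing import List


def fully_connected(x: List[float], weights: List[List[float]]) -> List[float]:
    """Each row of `weights` is one neuron; every neuron reads every input."""
    # Fan-in: all input dimensions converge on each neuron's weighted sum.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


weights = [
    [1.0, 2.0, 3.0],   # neuron 0
    [-1.0, 0.5, 2.0],  # neuron 1
]

before = fully_connected([1.0, 1.0, 1.0], weights)
after = fully_connected([2.0, 1.0, 1.0], weights)  # only x[0] changed

print(before)  # [6.0, 1.5]
print(after)   # [7.0, 0.5]
```

The two returned values are meant to be read together as one Feature vector for the next layer, not as two independent answers.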

The Merge operator is explicitly present in a fully connected layer, where it is traditionally depicted as the lines that converge into a neuron. You can see it as the fan-out of links emanating from a neuron or the fan-in of links arriving at a neuron. It is rarely depicted as a function block, which diminishes its apparent importance. In fact, the Merge operator isn't present in a single-layer neural network. It is one of those overlooked constructs that, as we shall see, is crucial to our understanding of why deep learning is able to build more complex abstractions from simpler ones.

A single neuron is interested in a collection of inputs from the previous layer. The neuron's function is to perform a similarity operation against its model, followed by an irreversible fitness test to arrive at its final output. A neuron by itself is a classifier; however, we are always going to be interested in the collection of classifiers rather than a single classifier.
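One way to picture the two stages of a neuron is the sketch below. It assumes the similarity operation is a dot product against the neuron's weights and the fitness test is a ReLU-style threshold; both choices, and the names `similarity`, `fitness_test` and `neuron`, are illustrative assumptions rather than the only options.

```python
from typing import Sequence


def similarity(x: Sequence[float], model: Sequence[float]) -> float:
    """Dot product: how well the input matches the neuron's model (weights)."""
    return sum(xi * mi for xi, mi in zip(x, model))


def fitness_test(score: float, threshold: float = 0.0) -> float:
    """ReLU-style gate: scores at or below the threshold all collapse to 0.0."""
    return max(0.0, score - threshold)


def neuron(x: Sequence[float], model: Sequence[float], threshold: float = 0.0) -> float:
    """Similarity against the model, followed by the fitness test."""
    return fitness_test(similarity(x, model), threshold)


model = [1.0, 1.0]
print(neuron([2.0, 1.0], model))   # 3.0
print(neuron([-1.0, 0.0], model))  # 0.0
print(neuron([-5.0, 2.0], model))  # 0.0
```

The last two inputs differ but produce the same output, so the input cannot be recovered from the output. That loss is what makes the fitness test irreversible.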

The math of a single layer is typically formulated as a matrix equation, in which the Merge Operator is defined only implicitly: it is the matrix multiplication operator. Matrix multiplication takes the elements in one row of a matrix and combines them with the elements of a column of the other matrix. In fact, who invented matrix multiplication is a bit shrouded in mystery! There are also alternative kinds of matrix product, such as the Hadamard product. In most conventions no separate symbol is written for matrix multiplication, but that doesn't imply that it isn't there!
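As a worked illustration of the row-by-column combination, consider the equations below. The symbols $W$ (layer weights), $x$ (input Feature vector), $y$ (layer output), and $a$, $b$ (arbitrary vectors) are notation assumed here, not taken from the text above. Each output of the matrix product mixes all inputs, whereas the Hadamard product pairs elements index by index and mixes nothing.

```latex
% Matrix multiplication: every output mixes all inputs.
y = W x, \qquad y_i = \sum_{j=1}^{N} W_{ij}\, x_j

% Worked 2x2 example.
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\begin{pmatrix} 5 \\ 6 \end{pmatrix}
=
\begin{pmatrix} 1\cdot 5 + 2\cdot 6 \\ 3\cdot 5 + 4\cdot 6 \end{pmatrix}
=
\begin{pmatrix} 17 \\ 39 \end{pmatrix}

% Hadamard (elementwise) product: no mixing across indices.
(a \odot b)_i = a_i\, b_i, \qquad
\begin{pmatrix} 1 \\ 2 \end{pmatrix} \odot \begin{pmatrix} 5 \\ 6 \end{pmatrix}
= \begin{pmatrix} 5 \\ 12 \end{pmatrix}
```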

Known Uses

X-CNN: Cross-modal Convolutional Neural Networks for Sparse Datasets

The constituent networks are individually designed to learn the output function on their own subset of the input data, after which cross-connections between them are introduced after each pooling operation to periodically allow for information exchange between them. This injection of knowledge into a model (by prior partition of the input data through domain knowledge or unsupervised methods) is expected to yield greatest returns in sparse data environments, which are typically less suitable for training.

we have further verified that the introduced cross-connection layers perform rather complex functions (thus they are not limited to simple feature map passing) and are capable of mimicking human vision processes—confirming that the biological inspiration behind such a model is justified.
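The quoted passages do not spell out how a cross-connection is computed, so the following is only a sketch under an assumed design: after a pooling step, each stream keeps its own features and appends a linear projection of the other stream's features. The names `project` and `cross_connect`, the weight matrices, and the use of flat vectors instead of feature maps are all hypothetical simplifications.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def project(features: Vector, weights: Matrix) -> Vector:
    """Linear map standing in for a learned cross-connection (assumption)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def cross_connect(a: Vector, b: Vector, a_to_b: Matrix, b_to_a: Matrix) -> Tuple[Vector, Vector]:
    """Each stream keeps its own features and appends a projection of the other's."""
    new_a = a + project(b, b_to_a)
    new_b = b + project(a, a_to_b)
    return new_a, new_b


# Pooled features of two hypothetical streams.
stream_a = [1.0, 2.0]
stream_b = [3.0]

new_a, new_b = cross_connect(
    stream_a, stream_b,
    a_to_b=[[0.5, 0.5]],  # maps stream A's 2 features to 1 value
    b_to_a=[[2.0]],       # maps stream B's 1 feature to 1 value
)
print(new_a)  # [1.0, 2.0, 6.0]
print(new_b)  # [3.0, 1.5]
```

In this reading, the cross-connection is a Merge applied between otherwise separate networks: the streams stay specialized on their own inputs but periodically see a summary of each other.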

Related Patterns

Relationship to Canonical Patterns

  • Similarity and Irreversibility, together with this pattern, are the essential operators for the construction of deep learning networks. This operator, however, is the only one that increases local entropy.
  • Ensembles, or mixtures of experts in general, lead to higher local entropy. However, ensembles are the key notion that leads to better generalization.
  • Distributed Representation, Mutual Information and Disentangled Basis all relate to a notion of mixing of information. We could reduce mixing by employing different forms of Regularization or by constructing merge layers that minimize mixing.
  • Associative Memory provides a model that expresses the degree of interaction between network nodes.
  • Hierarchical Abstraction is enabled through mixing. The composition of higher abstractions from simpler models enables the expressivity of deep networks.
  • Self Similarity implies that a strict demarcation of layers may be unnecessarily constraining, which may reduce expressivity.
  • Modularity is achieved through layers connected using this operator.

References

A Random Matrix Approach to Language Acquisition

Our model of linguistic interaction is analytically studied using methods of statistical physics and simulated by Monte Carlo techniques. The analysis reveals an intricate relationship between the innate propensity for language acquisition β and the lexicon size N, N ∼ exp(β).