# Similarity Operator

Aliases: Projection, Inner Product

Intent

A generalization of an operator that computes the similarity between a Model and a Feature.

Motivation

How do we calculate the similarity between the model and input? Features found in practice may require different kinds of measures to determine similarity.

Sketch

Similarity: $R^n \times R^n \rightarrow R$

<Diagram>

Discussion

In its more general sense, similarity is a measure of equivalence between two objects. For vectors, it is commonly described by the inner product. For distributions, it can be described by the KL divergence between the two distributions (strictly a dissimilarity, since it is zero for identical distributions and grows as they differ). There are many kinds of similarity measures; they are documented in a survey [Cha 2007], which classifies similarity functions into eight families.
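A minimal sketch of the two measures named above, using only the Python standard library. The function names and the sample vectors and distributions are illustrative assumptions, not taken from the survey.

```python
import math

def inner_product(x, y):
    """Similarity between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions.

    Terms with p_i == 0 contribute nothing; assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical model and feature vectors.
model = [1.0, 2.0, 3.0]
feature = [0.5, -1.0, 2.0]
print(inner_product(model, feature))  # 4.5

# Hypothetical discrete distributions.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))                        # 0.0 for identical distributions
print(kl_divergence(p, q), kl_divergence(q, p))   # the two directions differ
```

Note that the inner product grows with agreement while the KL divergence grows with disagreement, and that the KL divergence is not symmetric in its arguments.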

Similarities are also tightly related to hashing functions. Hash algorithms can be classified into several families: pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, and quantization.

In its most general sense, a neuron can be thought of as a similarity function between input and parameters whose resulting measure is fed through an activation function. The conventional neuron computes an inner product between the input vector and an internal weight vector. With randomly initialized weights, a layer of such neurons is equivalent to projecting the inputs onto a random matrix of weight vectors.
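The neuron described above can be sketched as a similarity function followed by an activation. This is only an illustration: the names `neuron`, `layer`, and `neg_squared_distance`, and the sample weights, are assumptions introduced here.

```python
import math

def dot(x, w):
    """Inner product: the similarity used by the conventional neuron."""
    return sum(a * b for a, b in zip(x, w))

def neg_squared_distance(x, w):
    """An alternative similarity: larger when x is closer to w."""
    return -sum((a - b) ** 2 for a, b in zip(x, w))

def relu(s):
    return max(0.0, s)

def neuron(x, w, similarity=dot, activation=relu):
    """Similarity between input and parameters, fed through an activation."""
    return activation(similarity(x, w))

def layer(x, weights, similarity=dot, activation=relu):
    """One similarity score per weight vector (row of the weight matrix)."""
    return [neuron(x, w, similarity, activation) for w in weights]

x = [1.0, -2.0]
W = [[0.5, 0.25], [-1.0, -1.0]]

print(layer(x, W))  # [0.0, 1.0]
# Swapping the similarity and activation gives a radial-basis style unit.
print(layer(x, W, similarity=neg_squared_distance, activation=math.exp))
```

Keeping the similarity as a parameter makes it easy to compare the inner product against other measures without changing the rest of the layer.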

Convolution is closely related to the correlation operation: it is a correlation with the kernel flipped. Convolution is therefore equivalent to correlation when the kernel is symmetric.
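The relationship between convolution and correlation can be checked on a small one-dimensional example. This sketch assumes "valid" mode (no padding) and uses hypothetical signal and kernel values.

```python
def correlate(signal, kernel):
    """'Valid' cross-correlation: slide the kernel without flipping it."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def convolve(signal, kernel):
    """'Valid' convolution: correlation with the kernel reversed."""
    return correlate(signal, kernel[::-1])

signal = [1, 2, 3, 4, 5]
symmetric = [1, 2, 1]
asymmetric = [1, 0, -1]

print(convolve(signal, symmetric) == correlate(signal, symmetric))    # True
print(convolve(signal, asymmetric) == correlate(signal, asymmetric))  # False
```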

Shannon's entropy is a similarity measure closely tied to the KL divergence between the observed distribution and the uniform (maximally random) distribution: for $n$ outcomes, $H(p) = \log n - KL(p \| u)$.
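One way to make the link between entropy and KL divergence concrete, assuming the "random distribution" is read as the uniform distribution $u$ over $n$ outcomes, is the identity $H(p) = \log n - KL(p \| u)$. The sketch below checks it numerically on a hypothetical distribution.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]          # hypothetical observed distribution
n = len(p)
uniform = [1.0 / n] * n

lhs = entropy(p)
rhs = math.log(n) - kl_divergence(p, uniform)
print(math.isclose(lhs, rhs))  # True
```

Under this reading, entropy is largest when the observed distribution is closest to uniform, so the two quantities move in opposite directions.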

Fisher's Information Matrix (FIM) is a multi-dimensional generalization of the similarity measure. The metric resides in a non-Euclidean space.
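As a sketch of why the FIM can be read as a local similarity measure, consider a parametric family $p_\theta$ (the notation here is an assumption, not from the original page). The standard definition is

$$F(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right]$$

and, to second order, the KL divergence between two nearby distributions in the family is a quadratic form in the FIM:

$$KL(p_\theta \,\|\, p_{\theta + d\theta}) \approx \tfrac{1}{2}\, d\theta^\top F(\theta)\, d\theta$$

Because $F(\theta)$ varies with $\theta$, the induced notion of distance changes from point to point, which is what makes the space non-Euclidean.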

Does the metric have to map to 1-dimensional space?

Does the metric have to be Euclidean?

What are the minimal characteristics for a metric?

Are neural embeddings favorable if they preserve a similarity measure?

Known Uses

Related Patterns

Pattern is related to the following Canonical Patterns:

• Irreversibility and Merge form the essential mechanisms of any DL system.
• Entropy is a global similarity measure that drives the evolution of the aggregate system. The local effect of a similarity operator is to be neutral(?) with respect to entropy.
• Distance Measure generalizes the many ways we can define similarity beyond the vector dot product.
• Random Projections shows how a collection of similarity operators can lead to a mapping that is able to preserve distance.
• Clustering is a generalization of how space can be partitioned and at its core requires a heuristic for determining similarity.
• Geometry provides a framework for understanding information spaces.
• Random Orthogonal Initialization is a beneficial initialization that leads to good projections and clustering.
• Dissipative Adaptation, where the energy absorption is equivalent to similarity matching.
• Adversarial Features are a consequence of the use of a linear similarity measure.
• Anti-causality expresses the direction of predictability that is a consequence of performing a similarity measure.
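The Random Projections relationship listed above can be illustrated with a small sketch: each output coordinate is one inner-product similarity with a random vector, and together they approximately preserve distances. The Gaussian entries, the `1/sqrt(k)` scaling, the dimensions, and the function names are assumptions chosen for illustration.

```python
import math
import random

def random_projection_matrix(k, n, rng):
    """k random Gaussian rows, scaled so squared norms are preserved on average."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]

def project(matrix, x):
    """Each output coordinate is an inner product between x and one random row."""
    return [sum(r * a for r, a in zip(row, x)) for row in matrix]

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

rng = random.Random(0)       # fixed seed so the run is repeatable
n, k = 500, 400              # input and projected dimensions (hypothetical)
x = [rng.uniform(-1, 1) for _ in range(n)]
y = [rng.uniform(-1, 1) for _ in range(n)]
R = random_projection_matrix(k, n, rng)

original = distance(x, y)
projected = distance(project(R, x), project(R, y))
print(original, projected)   # the two distances should be close
```

The approximation tightens as `k` grows, since the projected squared distance is an average of `k` independent estimates.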

Pattern is cited in:

References

See Sung-Hyuk Cha, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” International Journal of Mathematical Models and Methods in Applied Sciences, Volume 1, Issue 4, 2007, pp. 300-307 for a survey. The author identifies 45 PDF distance functions and classifies them into eight families: Lp Minkowski, L1, intersection, inner product, fidelity (squared chord), squared L2 (χ²), Shannon’s entropy, and combinations.

http://turing.cs.washington.edu/papers/uai11-poon.pdf Sum-Product Networks: A New Deep Architecture

A Survey on Learning to Hash

Learning to hash is one of the major solutions to this problem and has been widely studied recently. In this paper, we present a comprehensive survey of the learning to hash algorithms, and categorize them according to the manners of preserving the similarities into: pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, as well as quantization, and discuss their relations.

http://psl.umiacs.umd.edu/files/broecheler-uai10.pdf Probabilistic Similarity Logic

http://arxiv.org/pdf/1606.06086v1.pdf Uncertainty in Neural Network Word Embedding Exploration of Threshold for Similarity

http://arxiv.org/abs/1306.6709v4 A Survey on Metric Learning for Feature Vectors and Structured Data

https://arxiv.org/pdf/1602.01321.pdf A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks

http://openreview.net/pdf?id=r17RD2oxe DEEP NEURAL NETWORKS AND THE TREE OF LIFE

By applying the inner product similarity of the activation vectors at the last fully connected layer for different species, we can roughly build their tree of life. Our work provides a new perspective to the deep representation and sheds light on possible novel applications of deep representation to other areas like Bioinformatics.

Mercer kernels are essentially a generalization of the inner-product for any kind of data — they are symmetric though self-similarity may not be the maximum. They are quite popular in machine learning and Mercer kernels have been defined for text, graphs, time series, images.
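A small sketch of the two properties mentioned for Mercer kernels, using the polynomial kernel as an example (the choice of kernel, its parameters, and the sample points are assumptions made for illustration).

```python
def polynomial_kernel(x, y, degree=2, c=1.0):
    """(x . y + c) ** degree, a standard Mercer kernel for c >= 0."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** degree

x = [1.0, 0.0]
y = [3.0, 0.0]

print(polynomial_kernel(x, y) == polynomial_kernel(y, x))  # True: symmetric
print(polynomial_kernel(x, x))  # 4.0
print(polynomial_kernel(x, y))  # 16.0, larger than the self-similarity of x
```

Here `x` is "more similar" to `y` than to itself, which is the sense in which self-similarity need not be the maximum for an unnormalized kernel.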

https://arxiv.org/abs/1702.05870 Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks

To bound dot product and decrease the variance, we propose to use cosine similarity instead of dot product in neural networks, which we call cosine normalization. Our experiments show that cosine normalization in fully-connected neural networks notably reduces the test err with lower divergence, compared to other normalization techniques. Applied to convolutional networks, cosine normalization also significantly enhances the accuracy of classification and accelerates the training.
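The bounding effect described in the abstract can be seen by comparing the two measures on a rescaled input. This is a sketch of cosine similarity itself, not of the paper's full method; the function names, the `eps` guard, and the sample vectors are assumptions.

```python
import math

def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

def cosine_similarity(x, w, eps=1e-12):
    """Dot product divided by both norms; magnitude is at most 1 up to rounding."""
    norm_x = math.sqrt(dot(x, x))
    norm_w = math.sqrt(dot(w, w))
    # eps guards against division by zero for all-zero vectors.
    return dot(x, w) / max(norm_x * norm_w, eps)

x = [3.0, 4.0]
w = [4.0, 3.0]
scaled_x = [100.0 * a for a in x]

print(dot(x, w), dot(scaled_x, w))  # 24.0 2400.0: unbounded, grows with scale
print(cosine_similarity(x, w), cosine_similarity(scaled_x, w))  # unchanged by scale
```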

https://arxiv.org/abs/1708.00138 The differential geometry of perceptual similarity

Human similarity judgments are inconsistent with Euclidean, Hamming, Mahalanobis, and the majority of measures used in the extensive literatures on similarity and dissimilarity. From intrinsic properties of brain circuitry, we derive principles of perceptual metrics, showing their conformance to Riemannian geometry. As a demonstration of their utility, the perceptual metrics are shown to outperform JPEG compression. Unlike machine-learning approaches, the outperformance uses no statistics, and no learning. Beyond the incidental application to compression, the metrics offer broad explanatory accounts of empirical perceptual findings such as Tverskys triangle inequality violations, contradictory human judgments of identical stimuli such as speech sounds, and a broad range of other phenomena on percepts and concepts that may initially appear unrelated. The findings constitute a set of fundamental principles underlying perceptual similarity.

https://arxiv.org/abs/1410.5792v1 Generalized Compression Dictionary Distance as Universal Similarity Measure

https://arxiv.org/abs/1804.08071v1 Decoupled Networks

we first reparametrize the inner product to a decoupled form and then generalize it to the decoupled convolution operator which serves as the building block of our decoupled networks. We present several effective instances of the decoupled convolution operator. Each decoupled operator is well motivated and has an intuitive geometric interpretation. Based on these decoupled operators, we further propose to directly learn the operator from data.

Decoupling the intra-class and interclass variation gives us the flexibility to design better models that are more suitable for a given ta

https://arxiv.org/pdf/1804.09458v1.pdf Dynamic Few-Shot Visual Learning without Forgetting

we propose a novel attention based few-shot classification weight generator as well as a cosine-similarity based ConvNet classifier. This allows to recognize in a unified way both novel and base categories and also leads to learn feature representations with better generalization capabilities

https://arxiv.org/abs/1712.07136 Low-Shot Learning with Imprinted Weights

by directly setting the final layer weights from novel training examples during low-shot learning. We call this process weight imprinting as it directly sets weights for a new category based on an appropriately scaled copy of the embedding layer activations for that training example.

https://arxiv.org/abs/1805.06576 A Spline Theory of Deep Networks (Extended Version)

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal with each other; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up new geometric avenue to study how DNs organize signals in a hierarchical fashion. To validate the utility of the VQ interpretation, we develop and validate a new distance metric for signals and images that quantifies the difference between their VQ encodings. (This paper is a significantly expanded version of a paper with the same title that will appear at ICML 2018.).

Orthogonality penalty: a term that penalizes non-zero off-diagonal entries in the matrix, leading to a new loss that includes the extra penalty.
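One plausible form of such a penalty, sketched under the assumption that "the matrix" is the Gram matrix of inner products between templates. The function names, the `strength` parameter, and the sample templates are hypothetical and not taken from the paper.

```python
def gram_matrix(templates):
    """Pairwise inner products between template vectors."""
    return [
        [sum(a * b for a, b in zip(u, v)) for v in templates]
        for u in templates
    ]

def orthogonality_penalty(templates):
    """Sum of squared off-diagonal Gram entries; zero when templates are orthogonal."""
    g = gram_matrix(templates)
    size = len(g)
    return sum(g[i][j] ** 2 for i in range(size) for j in range(size) if i != j)

def penalized_loss(base_loss, templates, strength=0.1):
    """New loss: the original loss plus the weighted penalty (strength is assumed)."""
    return base_loss + strength * orthogonality_penalty(templates)

orthogonal = [[1.0, 0.0], [0.0, 2.0]]
overlapping = [[1.0, 0.0], [1.0, 1.0]]

print(orthogonality_penalty(orthogonal))   # 0.0
print(orthogonality_penalty(overlapping))  # 2.0
```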

https://arxiv.org/abs/1807.02873v1 Separability is not the best goal for machine learning

https://arxiv.org/abs/1807.11440v1 Comparator Networks

(i) We propose a Deep Comparator Network (DCN) that can ingest a pair of sets (each may contain a variable number of images) as inputs, and compute a similarity between the pair–this involves attending to multiple discriminative local regions (landmarks), and comparing local descriptors between pairs of faces; (ii) To encourage high-quality representations for each set, internal competition is introduced for recalibration based on the landmark score; (iii) Inspired by image retrieval, a novel hard sample mining regime is proposed to control the sampling process, such that the DCN is complementary to the standard image classification models.

https://arxiv.org/abs/1808.00508v1 Neural Arithmetic Logic Units

Experiments show that NALU-enhanced neural networks can learn to track time, perform arithmetic over images of numbers, translate numerical language into real-valued scalars, execute computer code, and count objects in images. In contrast to conventional architectures, we obtain substantially better generalization both inside and outside of the range of numerical values encountered during training, often extrapolating orders of magnitude beyond trained numerical ranges.

https://arxiv.org/pdf/1808.07526.pdf Deep Neural Network Structures Solving Variational Inequalities

We propose a novel theoretical framework to investigate deep neural networks using the formalism of proximal fixed point methods for solving variational inequalities. We first show that almost all activation functions used in neural networks are actually proximity operators. This leads to an algorithmic model alternating firmly nonexpansive and linear operators. We derive new results on averaged operator iterations to establish the convergence of this model, and show that the limit of the resulting algorithm is a solution to a variational inequality

https://arxiv.org/abs/1810.02906v1 Network Distance Based on Laplacian Flows on Graphs

Our key insight is to define a distance based on the long term diffusion behavior of the whole network. We first introduce a dynamic system on graphs called Laplacian flow. Based on this Laplacian flow, a new version of diffusion distance between networks is proposed. We will demonstrate the utility of the distance and its advantage over various existing distances through explicit examples. The distance is also applied to subsequent learning tasks such as clustering network objects.

https://arxiv.org/pdf/1810.13337v1.pdf LEARNING TO REPRESENT EDITS

By combining a “neural editor” with an “edit encoder”, our models learn to represent the salient information of an edit and can be used to apply edits to new inputs. We experiment on natural language and source code edit data.

https://arxiv.org/abs/1808.10584 Learning to Describe Differences Between Pairs of Similar Images

We collect a new dataset by crowd-sourcing difference descriptions for pairs of image frames extracted from video-surveillance footage.