**Name** Fitness Function (aka fitness criteria, cost function, loss function, objective function, reward function, profit function, utility function)

**Intent**

A learning machine is trained with the goal of finding the extremum of an objective function over a complex domain.

**Motivation**

How do we specify the goals or objective function of a learning machine?

How do we formulate an objective function beyond classification? Typically the objective function consists of a measure that quantifies the difference between the observed value and its predicted value. However, alternative fitness measures may be devised to create networks that extend beyond classification.
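The "difference between the observed value and its predicted value" can be made concrete with a minimal sketch. The two measures below (mean squared error for regression, cross-entropy for classification) are the conventional choices; the function names are illustrative, not tied to any particular library.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average squared gap between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# A perfect prediction drives the measure to zero; worse predictions score higher.
print(mse([1.0, 2.0], [1.0, 2.0]))  # 0.0
```

Both are instances of the same pattern: a scalar that shrinks as the model's output approaches the observation.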

**Structure**

<Diagram>

**Discussion**

Machine learning (ML) algorithms can be reduced to algorithms whose goal is to find the best fit between observed data and an internal model of that data. Simplistically, this is curve fitting. How the internal model, that is, the "curve", fits the data is the crux of the problem. In ML the expectation is that the internal model is precise enough to predict data it has not previously seen. The fitness function is also found in reinforcement learning and in genetic evolution. In fact, the activation function itself can be viewed as a fitness function, but at the level of an individual neuron.
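The curve-fitting view can be shown in a few lines: gradient descent on a mean-squared-error fitness function recovers the line that generated the data. This is a hypothetical toy example; every ML framework runs this same loop at scale.

```python
# Fit y = w*x + b to data generated by y = 2x + 1, by descending the
# gradient of the mean-squared-error fitness function.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # should be close to 2.0 and 1.0
```

The fitness function is the only place where "goodness of fit" is defined; changing it changes what the same loop learns.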

Is the fitness function required to be a function that maps R^n to R?

Is the fitness function always a function of the outputs of the final layer, or can intermediate layers also serve as a source of inputs?

Is the fitness function related to evaluation metrics?

**Known Uses**

Fitness functions are conventionally based on an entropy measure; however, we have found alternative functions in the literature.

https://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf Learning word embeddings efficiently with noise-contrastive estimation

We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method.

**Related Patterns**

<Diagram>

- Entropy - The fitness function typically involves the calculation of an entropy measure.
- Similarity Operator - The fitness function is also a similarity operator in its use of entropy to calculate model closeness.
- Regularization - The fitness function involves a regularization term that expresses constraints on the desired model.
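The Entropy and Regularization patterns above combine into the standard shape of a fitness function: a data term plus a penalty term. A minimal sketch, with illustrative names and a hypothetical L2 penalty weight `lam`:

```python
import math

def regularized_loss(target, predicted, weights, lam=0.01, eps=1e-12):
    """Cross-entropy data term plus an L2 regularization term on the weights."""
    data_term = -sum(t * math.log(p + eps) for t, p in zip(target, predicted))
    reg_term = lam * sum(w * w for w in weights)  # expresses a preference for small weights
    return data_term + reg_term
```

For the same predictions, a model with larger weights scores worse: the regularization term encodes a constraint on the desired model, independent of the data.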

**References**

http://travelbythenumbers.com/2015/11/18/so-you-want-to-implement-a-custom-loss-function/

http://davmre.github.io/inference/2015/11/13/elbo-in-5min/

http://hunch.net/?p=269 Loss function semantics

http://arxiv.org/pdf/1605.07588v2.pdf Surrogate Loss Function

http://arxiv.org/pdf/1511.06411v2.pdf Training Deep Neural Networks via Direct Loss Minimization

http://lossfunctions.tumblr.com/

https://arxiv.org/pdf/1510.00831v1.pdf Partial Information Decomposition as a Unified Approach to the Specification of Neural Goal Functions

We argue that neural information processing crucially depends on the combination of multiple inputs to create the output of a processor. To account for this, we use a very recent extension of Shannon information theory, called partial information decomposition (PID). PID allows us to quantify the information that several inputs provide individually (unique information), redundantly (shared information) or only jointly (synergistic information) about the output.

http://openreview.net/pdf?id=r1aPbsFle Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

Recurrent neural networks have been very successful at predicting sequences of words in tasks such as language modeling. However, all such models are based on the conventional classification framework, where model is trained against one-hot targets, and each word is represented both as an input and as an output in isolation. This causes inefficiencies in learning both in terms of utilizing all of the information and in terms of the number of parameters needed to train. We introduce a novel theoretical framework that facilitates better learning in language modeling, and show that our framework leads to tying together the input embedding and the output projection matrices, greatly reducing the number of trainable variables.

Since cross-entropy and Kullback-Leibler divergence are equivalent when the target distribution is one-hot, we can think of the optimization of the conventional loss in an RNNLM as trying to minimize the distance between the model prediction distribution and the empirical target distribution. In the framework which we will introduce, we utilize Kullback-Leibler divergence as opposed to cross-entropy due to its intuitive interpretation as a distance between distributions, although the two are not equivalent in our framework.
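The equivalence claimed above follows from the identity KL(p||q) = CE(p, q) - H(p): for a one-hot target p, the entropy H(p) is zero, so KL and cross-entropy coincide. A quick numerical check (illustrative code, not from the paper):

```python
import math

def cross_entropy(p, q, eps=1e-12):
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def entropy(p, eps=1e-12):
    return -sum(pi * math.log(pi + eps) for pi in p if pi > 0)

def kl(p, q, eps=1e-12):
    # KL(p||q) = cross-entropy(p, q) - entropy(p)
    return cross_entropy(p, q, eps) - entropy(p, eps)

one_hot = [0.0, 1.0, 0.0]
q = [0.2, 0.7, 0.1]
print(abs(kl(one_hot, q) - cross_entropy(one_hot, q)) < 1e-9)  # True
```

With a soft (non-one-hot) target, H(p) > 0 and the two objectives differ by that constant, which is exactly why the paper's framework must choose between them.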

http://openreview.net/pdf?id=rkuDV6iex AN EMPIRICAL ANALYSIS OF DEEP NETWORK LOSS SURFACES

In this paper, we empirically investigate the geometry of the loss functions for state-of-the-art networks with multiple stochastic optimization methods. Our experiments show that different optimization methods find different minima, even when switching from one method to another very late during training. In addition, we found that the minima found by different optimization methods have different shapes, but that these minima are similar in terms of the most important measure, generalization accuracy. We confirmed that these minima appear to represent different functions as well, so we were not converging toward equivalent networks.

https://arxiv.org/abs/1611.05134 Cost-Sensitive Deep Learning with Layer-Wise Cost Estimation

We propose a novel framework that can be applied to deep neural networks with any structure to facilitate their learning of meaningful representations for cost-sensitive classification problems. Furthermore, the framework allows end-to-end training of deeper networks directly. The framework is designed by augmenting auxiliary neurons to the output of each hidden layer for layer-wise cost estimation, and including the total estimation loss within the optimization objective.

CSDNN starts with a regular DNN with fully-connected layers, but replaces the softmax layer at the end of the DNN by a cost-estimation layer. Each of the K neurons in the cost-estimation layer provides per-class cost estimation with regression instead of per-class probability estimation.

https://arxiv.org/pdf/1611.03852v2.pdf A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models

In particular, we demonstrate an equivalence between a sample-based algorithm for maximum entropy IRL and a GAN in which the generator’s density can be evaluated and is provided as an additional input to the discriminator. Interestingly, maximum entropy IRL is a special case of an energy-based model.

https://arxiv.org/pdf/1507.04888v3.pdf Maximum Entropy Deep Inverse Reinforcement Learning

This paper presents Maximum Entropy Deep IRL, a framework exploiting FCNNs for reward structure approximation in Inverse Reinforcement Learning. Neural networks lend themselves naturally to this task as they combine representational power with computational efficiency compared to state-of-the-art methods. Unlike prior art in this domain DeepIRL can therefore be applied in cases where complex reward structures need to be modelled for large state spaces. Moreover, training can be achieved effectively and efficiently within the popular Maximum Entropy IRL framework. A further advantage of DeepIRL lies in its versatility. Custom network architectures and types can be developed for any given task while exploiting the same cost function in training.

DISCO Nets: DISsimilarity COefficients Networks

https://arxiv.org/abs/1612.01397v1 Implicit Modeling – A Generalization of Discriminative and Generative Approaches

We propose a new modeling approach that is a generalization of generative and discriminative models. The core idea is to use an implicit parameterization of a joint probability distribution by specifying only the conditional distributions. The proposed scheme combines the advantages of both worlds – it can use powerful complex discriminative models as its parts, having at the same time better generalization capabilities. We thoroughly evaluate the proposed method for a simple classification task with artificial data and illustrate its advantages for real-world scenarios on a semantic image segmentation problem.

https://arxiv.org/abs/1612.02780 Improved generator objectives for GANs

We present a framework to understand GAN training as alternating density ratio estimation and approximate divergence minimization. This provides an interpretation for the mismatched GAN generator and discriminator objectives often used in practice, and explains the problem of poor sample diversity. We also derive a family of generator objectives that target arbitrary f-divergences without minimizing a lower bound, and use them to train generative image models that target either improved sample quality or greater sample diversity.

https://arxiv.org/pdf/math/0209021.pdf ON CHOOSING AND BOUNDING PROBABILITY METRICS

https://arxiv.org/abs/1604.08859 The Z-loss: a shift and scale invariant classification loss belonging to the Spherical Family

Furthermore, we show on a word language modeling task that it also outperforms the log-softmax with respect to certain ranking scores, such as top-k scores, suggesting that the Z-loss has the flexibility to better match the task loss. These qualities thus makes the Z-loss an appealing candidate to train very efficiently large output networks such as word-language models or other extreme classification problems. On the One Billion Word (Chelba et al., 2014) dataset, we are able to train a model with the Z-loss 40 times faster than the log-softmax and more than 4 times faster than the hierarchical softmax.

https://arxiv.org/pdf/1611.07573.pdf Relaxed Earth Mover’s Distances for Chain- and Tree-connected Spaces and their use as a Loss Function in Deep Learning

The Earth Mover’s Distance (EMD) computes the optimal cost of transforming one distribution into another, given a known transport metric between them. In deep learning, the EMD loss allows us to embed information during training about the output space structure like hierarchical or semantic relations. This helps in achieving better output smoothness and generalization. However EMD is computationally expensive. Moreover, solving EMD optimization problems usually requires complex techniques like lasso. These properties limit the applicability of EMD-based approaches in large scale machine learning.

https://arxiv.org/abs/1611.04076v2 Least Squares Generative Adversarial Networks

Regular GANs hypothesize the discriminator as a classifier with the sigmoid cross entropy loss function. This loss function, however, may lead to the vanishing gradient problem during the learning process. To overcome such problem, here we propose the Least Squares Generative Adversarial Networks (LSGANs) that adopt the least squares loss function for the discriminator. We show that minimizing the objective function of LSGAN yields minimizing the Pearson χ2 divergence.
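The least squares loss described above is easy to state concretely. A hedged sketch of the LSGAN objectives, using one common label choice from the paper (real target 1, fake target 0 for the discriminator; target 1 for the generator); the function names are illustrative:

```python
def d_loss_ls(d_real, d_fake):
    """Discriminator: push scores on real samples toward 1, on fakes toward 0."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)

def g_loss_ls(d_fake):
    """Generator: push discriminator scores on fakes toward 1."""
    return 0.5 * sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)
```

Unlike the sigmoid cross-entropy loss, the quadratic penalty keeps a nonzero gradient even for fake samples the discriminator already classifies correctly but that lie far from the real data, which is the vanishing-gradient fix the paper targets.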

https://arxiv.org/abs/1703.00573v1 Generalization and Equilibrium in Generative Adversarial Nets (GANs)

We introduce a new metric called neural net distance for which generalization does occur. We also show that an approximate pure equilibrium in the 2-player game exists for a natural training objective (Wasserstein).

Finally, the above theoretical ideas lead us to propose a new training protocol, MIX+GAN, which can be combined with any existing method. We present experiments showing that it stabilizes and improves some existing methods.

Suspecting that a pure equilibrium may not exist for all objectives, we recommend in practice our MIX+GAN protocol using a small mixture of discriminators and generators. Our experiments show it improves the quality of several existing GAN training methods. Finally, note that existence of an equilibrium does not imply that a simple algorithm (in this case, backpropagation) would find it easily. That still defies explanation.

https://arxiv.org/abs/1701.03077v3 A More General Robust Loss Function

We present a two-parameter loss function which can be viewed as a generalization of many popular loss functions used in robust statistics: the Cauchy/Lorentzian, Geman-McClure, Welsch, and generalized Charbonnier loss functions (and by transitivity the L2, L1, L1-L2, and pseudo-Huber/Charbonnier loss functions). We describe and visualize this loss, and document several of its useful properties.

https://arxiv.org/pdf/1703.06114v2.pdf Deep Sets

In this paper, we study the problem of designing objective functions for machine learning problems defined on finite sets. In contrast to traditional objective functions defined for machine learning problems operating on finite dimensional vectors, the new objective functions we propose are operating on finite sets and are invariant to permutations.

This family of functions has a special structure which enables us to design a deep network architecture that can operate on sets and which can be deployed on a variety of scenarios including both unsupervised and supervised learning tasks. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and image tagging.

https://arxiv.org/pdf/1702.08165.pdf Reinforcement Learning with Deep Energy-Based Policies

https://arxiv.org/abs/1611.05916v4 Squared Earth Mover's Distance-based Loss for Training Deep Neural Networks

The squared EMD loss uses the predicted probabilities of all classes and penalizes the miss-predictions according to a ground distance matrix that quantifies the dissimilarities between classes. We demonstrate that on datasets with strong inter-class relationships such as an ordering between classes, our exact squared EMD losses yield new state-of-the-art results. Furthermore, we propose a method to automatically learn this matrix using the CNN's own features during training. We show that our method can learn a ground distance matrix efficiently with no inter-class relationship priors and yield the same performance gain. Finally, we show that our method can be generalized to applications that lack strong inter-class relationships and still maintain state-of-the-art performance. Therefore, with limited computational overhead, one can always deploy the proposed loss function on any dataset over the conventional cross-entropy.
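When the classes have a natural ordering (a chain), the squared EMD loss admits a simple closed form as the squared difference of cumulative distributions; the general ground-distance-matrix version described above subsumes this special case. An illustrative sketch:

```python
def squared_emd(p, q):
    """Squared EMD between two distributions over ordered classes,
    computed as the sum of squared CDF differences."""
    cdf_p, cdf_q, loss = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        loss += (cdf_p - cdf_q) ** 2
    return loss

# Misplacing all probability mass on a nearby class costs less than on a
# distant class -- exactly the inter-class structure cross-entropy ignores.
near = squared_emd([1, 0, 0, 0], [0, 1, 0, 0])
far = squared_emd([1, 0, 0, 0], [0, 0, 0, 1])
print(near < far)  # True
```

Under plain cross-entropy both mistakes would score identically, since only the probability assigned to the true class matters.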

https://arxiv.org/pdf/1704.06191v1.pdf Softmax GAN

Softmax GAN is a novel variant of Generative Adversarial Network (GAN). The key idea of Softmax GAN is to replace the classification loss in the original GAN with a softmax cross-entropy loss in the sample space of one single batch. In the adversarial learning of N real training samples and M generated samples, the target of discriminator training is to distribute all the probability mass to the real samples, each with probability 1/M, and distribute zero probability to generated data. In the generator training phase, the target is to assign equal probability to all data points in the batch, each with probability 1/(M+N). While the original GAN is closely related to Noise Contrastive Estimation (NCE), we show that Softmax GAN is the Importance Sampling version of GAN. We further demonstrate with experiments that this simple change stabilizes GAN training.

https://arxiv.org/pdf/1705.05263.pdf Comparison of Maximum Likelihood and GAN-based training of Real NVPs

https://arxiv.org/abs/1605.07110 Deep Learning without Poor Local Minima

1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist “bad” saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers).

https://arxiv.org/abs/1705.07164 Relaxed Wasserstein with Applications to GANs

Generative Adversarial Networks (GANs) provide a versatile class of models for generative modeling. To improve the performance of machine learning models, there has recently been interest in designing objective functions based on Wasserstein distance rather than Jensen-Shannon (JS) divergence. In this paper, we propose a novel asymmetric statistical divergence called Relaxed Wasserstein (RW) divergence as a generalization of Wasserstein-L2 distance of order 2. We show that RW is dominated by Total Variation (TV) and Wasserstein-L2 distance, and establish continuity, differentiability, and duality representation of RW divergence. Finally, we provide a nonasymptotic moment estimate and a concentration inequality for RW divergence. Our experiments show that RWGANs with Kullback-Leibler (KL) divergence produce recognizable images with a ReLU Multi-Layer Perceptron (MLP) generator in fewer iterations, compared to Wasserstein-L1 GAN (WGAN).

https://arxiv.org/pdf/1705.09675v1.pdf Fisher GAN

Generative Adversarial Networks (GANs) are powerful models for learning complex distributions. Stable training of GANs has been addressed in many recent works which explore different metrics between distributions. In this paper we introduce Fisher GAN which fits within the Integral Probability Metrics (IPM) framework for training GANs. Fisher GAN defines a critic with a data dependent constraint on its second order moments. We show in this paper that Fisher GAN allows for stable and time efficient training that does not compromise the capacity of the critic, and does not need data independent constraints such as weight clipping. We analyze our Fisher IPM theoretically and provide an algorithm based on Augmented Lagrangian for Fisher GAN. We validate our claims on both image sample generation and semi-supervised classification using Fisher GAN.

https://arxiv.org/abs/1705.10743v1 The Cramer Distance as a Solution to Biased Wasserstein Gradients

The value of being sensitive to this geometry has been demonstrated, among others, in ordinal regression and generative modelling. In this paper we describe three natural properties of probability divergences that reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not possess the third. We provide empirical evidence suggesting that this is a serious issue in practice.

https://arxiv.org/pdf/1705.06366v1.pdf Automatic Goal Generation for Reinforcement Learning Agents

Instead, we propose a method that allows an agent to automatically discover the range of tasks that it is capable of performing. We use a generator network to propose tasks for the agent to try to achieve, specified as goal states. The generator network is optimized using adversarial training to produce tasks that are always at the appropriate level of difficulty for the agent. Our method thus automatically produces a curriculum of tasks for the agent to learn.

https://arxiv.org/abs/1706.04208 Hybrid Reward Architecture for Reinforcement Learning

This paper contributes towards tackling such challenging domains, by proposing a new method, called Hybrid Reward Architecture (HRA). HRA takes as input a decomposed reward function and learns a separate value function for each component reward function. Because each component typically only depends on a subset of all features, the overall value function is much smoother and can be easier approximated by a low-dimensional representation, enabling more effective learning.

https://arxiv.org/abs/1708.08819 Coulomb GANs: Provably Optimal Nash Equilibria via Potential Fields

Generative adversarial networks (GANs) evolved into one of the most successful unsupervised techniques for generating realistic images. Even though it has recently been shown that GAN training converges, GAN models often end up in local Nash equilibria that are associated with mode collapse or otherwise fail to model the target distribution. We introduce Coulomb GANs, which pose the GAN learning problem as a potential field, where generated samples are attracted to training set samples but repel each other. The discriminator learns a potential field while the generator decreases the energy by moving its samples along the vector (force) field determined by the gradient of the potential field. Through decreasing the energy, the GAN model learns to generate samples according to the whole target distribution and does not only cover some of its modes. We prove that Coulomb GANs possess only one Nash equilibrium which is optimal in the sense that the model distribution equals the target distribution. We show the efficacy of Coulomb GANs on a variety of image datasets. On LSUN and celebA, Coulomb GANs set a new state of the art and produce a previously unseen variety of different samples.

https://arxiv.org/abs/1706.00292 Sinkhorn-AutoDiff: Tractable Wasserstein Learning of Generative Models

This paper presents the first tractable computational method to train large scale generative models using an optimal transport loss, and tackles both these issues by relying on two key ideas: (a) entropic smoothing, which turns the original OT loss into one that can be computed using Sinkhorn fixed point iterations; (b) algorithmic (automatic) differentiation of these iterations. These two approximations result in a robust and differentiable approximation of the OT loss with streamlined GPU execution. Entropic smoothing generates a family of losses interpolating between Wasserstein (OT) and Maximum Mean Discrepancy (MMD), thus allowing to find a sweet spot leveraging the geometry of OT and the favorable high-dimensional sample complexity of MMD which comes with unbiased gradient estimates. The resulting computational architecture complements nicely standard deep network generative models by a stack of extra layers implementing the loss function.

https://arxiv.org/pdf/1711.01558v1.pdf Wasserstein Auto-Encoders

https://openreview.net/pdf?id=SJyEH91A- LEARNING WASSERSTEIN EMBEDDINGS

https://arxiv.org/pdf/1804.04268.pdf Incomplete Contracting and AI Alignment

https://arxiv.org/abs/1803.04585v3 Categorizing Variants of Goodhart's Law

http://bmax.im/LovaszSoftmax The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks

https://danilorezende.com/2018/07/12/short-notes-on-divergence-measures/