Transfer Learning aka Distillation

Intent

Train a new network using a previously trained network.

Problem

Larger and more accurate networks can be composed from smaller networks using the Ensemble pattern. However, the larger network may consume too many computational resources to perform inference efficiently.

Structure

<Diagram showing larger network as teacher to smaller network>

Solution

One can employ Transfer Learning to create a much smaller network using the larger network as a teacher.

Existing networks can also be leveraged to train a new network with a different structure.
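
As a concrete illustration of reusing an existing network, the sketch below (a minimal example assuming PyTorch and torchvision; the class count, optimizer settings and training loop are placeholders) freezes the transferred feature extractor and trains only a new classification head for the target task:

```python
# Minimal fine-tuning sketch (assumes PyTorch/torchvision; dataset, class
# count and hyperparameters are placeholders, not prescribed values).
import torch
import torch.nn as nn
from torchvision import models

num_target_classes = 10  # hypothetical new task

# Start from a network trained on a previous task (ImageNet here).
model = models.resnet18(pretrained=True)

# Freeze the transferred feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the head so the network matches the new task's label space.
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

# Only the new head is optimized; the transferred features are reused as-is.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Unfreezing some of the later layers (with a smaller learning rate) is the usual next step when the target dataset is large enough.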

Rationale

Known Uses

http://arxiv.org/abs/1511.06295 Policy Distillation

We present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient.

References

http://arxiv.org/abs/1503.02531 Distilling the Knowledge in a Neural Network

Hinton - Dark Knowledge

http://www.deeplearningbook.org/contents/representation.html 15.2 Transfer Learning and Domain Adaptation Transfer learning and domain adaptation refer to the situation where what has been learned in one setting is exploited to improve generalization in another setting.

http://www.ttic.edu/dl/dark14.pdf

http://arxiv.org/pdf/1503.02531v1.pdf

Model compression: transfer the knowledge learned from the ensemble models into a single smaller model to reduce test computation. Specialist networks: training models specialized on a confusable subset of the classes to reduce the time to train an ensemble.

The result on MNIST is surprisingly good, even when some digits are omitted. When they omitted all examples of the digit 3 during the transfer training, the distilled net still got 98.6% of the test 3s correct, even though 3 is, to that net, a mythical digit it has never seen.

The transfer training uses the same training set but with no dropout and no jitter. It is just vanilla backprop (with added soft targets).
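
That "vanilla backprop with added soft targets" can be written down in a few lines. A sketch of the usual distillation loss (PyTorch style; the teacher and student models, the temperature and the weighting are assumptions, not values from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style knowledge distillation loss (a sketch; T and alpha are
    placeholder hyperparameters)."""
    # Soft targets: teacher probabilities at temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # The KL term is scaled by T^2 so its gradients match the hard-label loss.
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Usage sketch: the teacher is frozen, the student is trained with plain backprop.
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
# loss.backward()
```

The temperature-softened teacher outputs carry the "dark knowledge" about which wrong classes are plausible, which is what lets the student learn something about a digit it has never seen.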

This appears to use ensemble models to train small neural networks. However, the notion of specialist networks may be required to provide some kind of encapsulation of functionality.

http://deepdish.io/2014/10/28/hintons-dark-knowledge/ specialist training

Specialist networks are a way of using dark knowledge to improve the performance of deep network models regardless of their underlying complexity. They are used in the setting where there are many different classes. As before, a deep network is trained over the data and each data point is assigned a target that corresponds to the temperature-adjusted softmax output. These softmax outputs are then clustered multiple times using k-means, and the resulting clusters indicate easily confusable data points that come from a subset of classes. Specialist networks are then trained only on the data in these clusters, using a restricted number of classes: they treat all classes not contained in the cluster as coming from a single “other” class. These specialist networks are trained using the alternating one-hot, temperature-adjusted technique. The ensemble constructed by combining the various specialist networks improves the overall network.

One technical hiccup created by the specialist networks is that they are trained using different classes than the full network, so combining the softmax outputs from multiple networks requires a combination trick. Essentially there is an optimization problem to solve: ensure that the catch-all “dustbin” class of each specialist network matches the sum of the corresponding softmax outputs. For example, if cars and cheetahs are grouped together into one class for your dog detector, you combine that network with your cars-versus-cheetahs specialist by ensuring that the output probabilities for cars and cheetahs sum to a probability similar to the catch-all output of the dog detector.
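
One way to obtain the confusable class subsets described above is to run k-means over per-class "confusion profiles" built from the teacher's soft outputs. A rough sketch under that assumption (scikit-learn; not the paper's exact procedure):

```python
import numpy as np
from sklearn.cluster import KMeans

def confusable_class_clusters(teacher_soft_outputs, labels, num_classes, k):
    """Group classes that the teacher tends to confuse (a sketch).
    teacher_soft_outputs: (N, num_classes) temperature-adjusted softmax outputs.
    labels: (N,) true class ids. Returns k arrays of class ids, one per cluster."""
    # Average soft output per true class: row c describes how examples of
    # class c are spread over all classes by the teacher.
    class_profiles = np.zeros((num_classes, num_classes))
    for c in range(num_classes):
        class_profiles[c] = teacher_soft_outputs[labels == c].mean(axis=0)
    # Classes with similar confusion profiles land in the same cluster;
    # each cluster becomes the label set of one specialist network.
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(class_profiles)
    return [np.where(clusters == i)[0] for i in range(k)]
```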

ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/torrey.handbook09.pdf

https://arxiv.org/pdf/1511.03643.pdf UNIFYING DISTILLATION AND PRIVILEGED INFORMATION

https://arxiv.org/abs/1503.02531 Distilling the Knowledge in a Neural Network We introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

http://arxiv.org/abs/1411.1792

How transferable are features in deep neural networks?

Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.

http://arxiv.org/abs/1606.04671v2 Progressive Neural Networks

Continual learning, the ability to accumulate and transfer knowledge to new domains, is a core characteristic of intelligent beings. Progressive neural networks are a stepping stone towards continual learning, and this work has demonstrated their potential through experiments and analysis across three RL domains, including Atari, which contains orthogonal or even adversarial tasks.

Depiction of a three-column progressive network. The first two columns on the left (dashed arrows) were trained on task 1 and task 2 respectively. The grey boxes labelled a represent the adapter layers (see text). A third column is added for the final task, having access to all previously learned features.
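
A simplified sketch of the lateral connections that caption describes: each new column receives the frozen previous columns' hidden activations through small adapter layers (the real architecture uses convolutional columns and additional scaling; the module and names here are illustrative):

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """One new column of a progressive net (simplified sketch). Previously
    trained columns are frozen; their hidden activations enter via adapters."""
    def __init__(self, in_dim, hidden_dim, out_dim, num_prev_columns):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, out_dim)
        # One adapter per previously trained column (the grey boxes 'a').
        self.adapters = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_prev_columns)]
        )

    def forward(self, x, prev_hiddens):
        h = torch.relu(self.layer1(x))
        # Lateral connections: add adapted features from the frozen columns.
        for adapter, prev_h in zip(self.adapters, prev_hiddens):
            h = h + adapter(prev_h.detach())  # detach: old columns stay frozen
        return self.layer2(h)
```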

http://www.dsi.unive.it/~srotabul/files/publications/ICML2016.pdf Dropout Distillation

Dropout Distillation. Rota Bulò, S., Porzi, L., & Kontschieder, P. (2016) [4]

Dropout is a regularization technique that was proposed to prevent neural networks from overfitting [23]. It drops units from the network randomly during training by setting their outputs to zero, thus reducing co-adaptation of the units. This procedure implicitly trains an ensemble of exponentially many smaller networks sharing the same parametrization. The predictions of these networks must then be averaged at test time, which is unfortunately intractable to compute precisely. But the averaging can be approximated by scaling the weights of a single network. However, this approximation may not produce sufficient accuracy in all cases. The authors introduce a better approximation method called dropout distillation that finds a predictor with minimal divergence from the ideal predictor by applying stochastic gradient descent. The distillation procedure can even be applied to networks already trained using dropout by utilizing unlabeled data. Their results on benchmark problems show consistent improvements over standard dropout.
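
To make the setting concrete: the exact dropout ensemble average is intractable, Monte-Carlo sampling approximates it, and distillation fits a single deterministic pass to that average. The sketch below follows that idea rather than the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def mc_dropout_targets(model, x, num_samples=20):
    """Approximate the intractable dropout ensemble average by sampling masks."""
    model.train()  # keep dropout active
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(x), dim=1) for _ in range(num_samples)]
        ).mean(dim=0)
    return probs

def dropout_distillation_step(model, optimizer, x):
    """One distillation step: fit the deterministic (eval-mode) forward pass
    to the Monte-Carlo dropout average. No labels are needed, so unlabeled
    data can be used, as noted above."""
    targets = mc_dropout_targets(model, x)
    model.eval()  # deterministic pass with the usual weight-scaling rule
    log_probs = F.log_softmax(model(x), dim=1)
    loss = F.kl_div(log_probs, targets, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```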

http://arxiv.org/abs/1511.06433v3 Blending LSTMs into CNNs

We show that even more accurate CNNs can be trained under the guidance of LSTMs using a variant of model compression, which we call model blending because the teacher and student models are similar in complexity but different in inductive bias. Blending further improves the accuracy of our CNN, yielding a computationally efficient model of accuracy higher than any of the other individual models. Examining the effect of “dark knowledge” in this model compression task, we find that less than 1% of the highest probability labels are needed for accurate model compression.

https://arxiv.org/abs/1606.07947v4 Sequence-Level Knowledge Distillation

In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015) that have proven successful for reducing the size of neural models in other domains to the problem of neural machine translation (NMT).

https://arxiv.org/abs/1610.04286 Sim-to-Real Robot Learning from Pixels with Progressive Nets

The progressive net approach is a general framework that enables reuse of everything from low-level visual features to high-level policies for transfer to new tasks, enabling a compositional, yet simple, approach to building complex skills. We present an early demonstration of this approach with a number of experiments in the domain of robot manipulation that focus on bridging the reality gap. Unlike other proposed approaches, our real-world experiments demonstrate successful task learning from raw visual input on a fully actuated robot manipulator. Moreover, rather than relying on model-based trajectory optimisation, the task learning is accomplished using only deep reinforcement learning and sparse rewards.

https://arxiv.org/abs/1610.05755v1 Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data

We demonstrate a generally applicable approach to providing strong privacy guarantees for training data. The approach combines, in a black-box fashion, multiple models trained with disjoint datasets, such as records from different subsets of users. Because they rely directly on sensitive data, these models are not published, but instead used as teachers for a student model. The student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters. The student's privacy properties can be understood both intuitively (since no single teacher and thus no single dataset dictates the student's training) and formally, in terms of differential privacy. These properties hold even if an adversary can not only query the student but also inspect its internal workings.
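
The aggregation step described above is essentially a noisy argmax over teacher votes. A sketch of just that step (the noise scale is a placeholder, not a calibrated privacy budget):

```python
import numpy as np

def noisy_teacher_label(teacher_predictions, num_classes, noise_scale=1.0, rng=None):
    """Aggregate per-teacher predicted classes for one example by noisy voting.
    teacher_predictions: array-like of class ids, one per teacher."""
    rng = rng or np.random.default_rng()
    votes = np.bincount(teacher_predictions, minlength=num_classes).astype(float)
    # Laplace noise on the vote counts hides the influence of any single
    # teacher (and hence of any single disjoint training partition).
    votes += rng.laplace(loc=0.0, scale=noise_scale, size=num_classes)
    return int(np.argmax(votes))

# The student is then trained on (public input, noisy_teacher_label) pairs and
# never sees the teachers or their private data directly.
```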

https://devblogs.nvidia.com/parallelforall/detectnet-deep-neural-network-object-detection-digits/

https://arxiv.org/abs/1610.04286 Sim-to-Real Robot Learning from Pixels with Progressive Nets

We propose using progressive networks to bridge the reality gap and transfer learned policies from simulation to the real world.

Depiction of a progressive network, left, and a modified progressive architecture used for robot transfer learning, right. The first column is trained on Task 1, in simulation, the second column is trained on Task 1 on the robot, and the third column is trained on Task 2 on the robot. Columns may differ in capacity, and the adapter functions (marked ‘a’) are not used for the output layers of this non-adversarial sequence of tasks.

https://arxiv.org/pdf/1610.09650v2.pdf Deep Model Compression: Distilling Knowledge from Noisy Teachers

We presented a method based on the teacher-student learning framework for deep model compression, considering both storage and runtime complexities. Our noise-based regularization method helped the shallow student model to do significantly better than the baseline teacher-student algorithm. The proposed method can also be viewed as simulating learning from multiple teachers, thus helping student models to get closer to the teacher’s performance.

http://openreview.net/pdf?id=HyenWc5gx Representation Stability as a Regularizer for Neural Network Transfer Learning

We propose a novel general purpose regularizer to address catastrophic forgetting in neural network sequential transfer learning.

Using logical rules to improve neural network models is a promising new direction for humans to efficiently contribute to increased model performance. Additionally, the large diversity of representations learned from multiple classifiers with the same target task but different source tasks seems to indicate there is potential for even greater gains when integrating multiple sources of knowledge transfer.

http://openreview.net/pdf?id=ryPx38qge A HYBRID NETWORK: SCATTERING AND CONVNET

This paper shows how, by combining prior and supervised representations, one can create architectures that lead to nearly state-of-the-art results on standard benchmarks, which means they perform as well as a deep network learned from scratch. We use scattering as a generic and fixed initialization of the first layers of a deep network, and learn the remaining layers in a supervised manner. We numerically demonstrate that deep hybrid scattering networks generalize better on small datasets than supervised deep networks. Scattering networks could help current systems to save computation time, while guaranteeing the stability to geometric transformations and noise of the first internal layers. We also show that the learned operators explicitly build invariances to geometrical variabilities, such as local rotation and translation, by analyzing the third layer of our architecture. We demonstrate that it is possible to replace the scattering transform by a standard deep network at the cost of having to learn more parameters and potentially adding instabilities. Finally, we release a new software, ScatWave, using GPUs for fast computations of a scattering network that is integrated in Torch. We evaluate our model on the CIFAR10 and CIFAR100 datasets.

http://openreview.net/pdf?id=Sks9_ajex PAYING MORE ATTENTION TO ATTENTION: IMPROVING THE PERFORMANCE OF CONVOLUTIONAL NEURAL NETWORKS VIA ATTENTION TRANSFER

By properly defining attention for convolutional neural networks, we can actually use this type of information in order to significantly improve the performance of a student CNN network by forcing it to mimic the attention maps of a powerful teacher network. To that end, we propose several novel methods of transferring attention.
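
A common definition of such an attention map is the channel-wise sum of squared activations, normalized per example; the transfer loss then penalizes the distance between student and teacher maps at matched layers. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def attention_map(feature_map):
    """Spatial attention map from a conv activation of shape (N, C, H, W):
    sum of squared channels, flattened and L2-normalized per example."""
    att = feature_map.pow(2).sum(dim=1).flatten(1)  # (N, H*W)
    return F.normalize(att, p=2, dim=1)

def attention_transfer_loss(student_feats, teacher_feats):
    """Sum of squared differences between normalized attention maps at matched
    layers (assumes the chosen layers share the same spatial resolution)."""
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        loss = loss + (attention_map(fs) - attention_map(ft).detach()).pow(2).mean()
    return loss
```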

http://openreview.net/pdf?id=SJZAb5cel A JOINT MANY-TASK MODEL: GROWING A NEURAL NETWORK FOR MULTIPLE NLP TASKS

We introduce such a joint many-task model together with a strategy for successively growing its depth to solve increasingly complex tasks. All layers include shortcut connections to both word representations and lower-level task predictions. We use a simple regularization term to allow for optimizing all model weights to improve one task’s loss without exhibiting catastrophic interference of the other tasks.

https://arxiv.org/pdf/1511.03979v6.pdf Representational Distance Learning for Deep Neural Networks

Representational spaces of the student and the teacher are characterized by representational distance matrices (RDMs). We propose representational distance learning (RDL), a stochastic gradient descent method that drives the RDMs of the student to approximate the RDMs of the teacher.
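
A minimal version of this idea: compute a pairwise-distance matrix over a batch of representations for both networks and penalize the mismatch (a sketch; the paper's exact distance measure and normalization may differ):

```python
import torch

def rdm(representations):
    """Representational distance matrix: pairwise Euclidean distances between
    the representations of the examples in a batch (returns an N x N matrix)."""
    flat = representations.flatten(1)
    return torch.cdist(flat, flat, p=2)

def rdl_loss(student_repr, teacher_repr):
    """Drive the student's RDM toward the teacher's RDM for the same batch."""
    return (rdm(student_repr) - rdm(teacher_repr).detach()).pow(2).mean()
```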

https://www.semanticscholar.org/paper/How-Transferable-are-Neural-Networks-in-NLP-Mou-Meng/02a77690f08d158e7840e40655892045ed830817 How Transferable are Neural Networks in NLP Applications?

We conducted two series of experiments on six datasets, showing that the transferability of neural NLP models depends largely on the semantic relatedness of the source and target tasks.

https://arxiv.org/abs/1609.07088v1 Learning Modular Neural Network Policies for Multi-Task and Multi-Robot Transfer

We show that neural network policies can be decomposed into “task-specific” and “robot-specific” modules, where the task-specific modules are shared across robots, and the robot-specific modules are shared across all tasks on that robot. This allows for sharing task information, such as perception, between robots and sharing robot information, such as dynamics and kinematics, between tasks. We exploit this decomposition to train mix-and-match modules that can solve new robot-task combinations that were not seen during training. Using a novel neural network architecture, we demonstrate the effectiveness of our transfer method for enabling zero-shot generalization with a variety of robots and tasks in simulation for both visual and non-visual tasks.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py

https://openreview.net/pdf?id=B16dGcqlx THIRD PERSON IMITATION LEARNING

In this paper, we present a method for unsupervised third person imitation learning. By using ideas from domain confusion, we are able to train an agent to correctly achieve a simple goal in a simple environment when it is provided a demonstration of a teacher achieving the same goal but from a different viewpoint. Crucially, the agent receives only these demonstrations, and is not provided a correspondence between teacher states and student states.

https://arxiv.org/abs/1411.3128 Deep Multi-Instance Transfer Learning

We present a new approach for transferring knowledge from groups to individuals that comprise them. We evaluate our method in text, by inferring the ratings of individual sentences using full-review ratings. This approach, which combines ideas from transfer learning, deep learning and multi-instance learning, reduces the need for laborious human labelling of fine-grained data when abundant labels are available at the group level.

In spite of the fact that deep multi-instance transfer learning requires much less supervision, it is able to obtain better sentiment predictions for sentences than a state-of-the-art supervised learning approach.

This work capitalises on the advances and success of deep learning to create a model that considers similarity between embeddings to solve the multi-instance learning problem. In addition, it demonstrates the value of transferring embeddings learned in deep models to reduce the problem of having to label individual data items when group labels are available.

https://arxiv.org/abs/1608.08614v2 What makes ImageNet good for transfer learning?

The tremendous success of ImageNet-trained deep features on a wide range of transfer tasks begs the question: what are the properties of the ImageNet dataset that are critical for learning good, general-purpose features? This work provides an empirical investigation of various facets of this question: Is more pre-training data always better? How does feature quality depend on the number of training examples per class? Does adding more object classes improve performance? For the same data budget, how should the data be split into classes? Is fine-grained recognition necessary for learning good features? Given the same number of training classes, is it better to have coarse classes or fine-grained classes? Which is better: more classes or more examples per class? To answer these and related questions, we pre-trained CNN features on various subsets of the ImageNet dataset and evaluated transfer performance on PASCAL detection, PASCAL action classification, and SUN scene classification tasks. Our overall findings suggest that most changes in the choice of pre-training data long thought to be critical do not significantly affect transfer performance.

In conclusion, while the answer to the titular question “What makes ImageNet good for transfer learning?” still lacks a definitive answer, our results have shown that a lot of “folk wisdom” on why ImageNet works well is not accurate.

The results presented in Table 2 show that having more images per class with fewer classes results in features that perform very slightly better on PASCAL-DET, whereas for SUN-CLS the performance is comparable across the two settings.

Furthermore, for PASCAL-ACT-CLS and SUN-CLS, fine-tuning on CNNs pre-trained with class set sizes of 918 and 753 actually results in better performance than using all 1000 classes. This may indicate that having too many classes for pre-training works against learning good generalizable features. Hence, when generating a dataset, one should be attentive to the nomenclature of the classes.

https://arxiv.org/ftp/arxiv/papers/1612/1612.03770.pdf Neurogenesis Deep Learning

Extending deep networks to accommodate new classes

https://arxiv.org/abs/1511.04508 Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks https://www.youtube.com/watch?v=oQr0gODUiZo

Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.

https://fc3696b9-a-62cb3a1a-s-sites.googlegroups.com/site/tlworkshop2015/Paper_10.pdf

Correlational Neural Networks

Common Representation Learning (CRL), wherein different descriptions (or views) of the data are embedded in a common subspace, is one way of achieving Transfer Learning. In this work we propose an AutoEncoder based approach called Correlational Neural Network (CorrNet), that explicitly maximizes correlation among the views when projected to the common subspace. Through experiments, we demonstrate that the proposed CorrNet is better than the above mentioned approaches with respect to its ability to learn correlated common representations that are useful for Transfer Learning.

http://www.iro.umontreal.ca/~bengioy/talks/TL_NIPS_workshop_12Dec2015.pdf

https://arxiv.org/abs/1611.04687v2 Intrinsic Geometric Information Transfer Learning on Multiple Graph-Structured Datasets

We attempt to advance deep learning for graph-structured data by incorporating another component, transfer learning. By transferring the intrinsic geometric information learned in the source domain, our approach can help us to construct a model for a new but related task in the target domain without collecting new data and without training a new model from scratch.

https://arxiv.org/abs/1605.06636v1 Deep Transfer Learning with Joint Adaptation Networks

Transfer learning is enabled in deep convolutional networks, where the dataset shifts may linger in multiple task-specific feature layers and the classifier layer. A set of joint adaptation networks are crafted to match the joint distributions of these layers across domains by minimizing the joint distribution discrepancy, which can be trained efficiently using back-propagation.

This paper presented a novel approach to deep transfer learning, which enables direct adaptation of joint distributions across domains. Different from previous methods, we eliminate the requirement for separate adaptations of marginal and conditional distributions, which are often subject to rather strong independence assumptions. The discrepancy between joint distributions of multiple features and labels can be computed by embedding the joint distributions in a tensor-product Hilbert space, which can be naturally implemented through most deep networks.

https://arxiv.org/abs/1511.06342v4 Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning

The ability to act in multiple environments and transfer previous knowledge to new situations can be considered a critical aspect of any intelligent agent. Towards this goal, we define a novel method of multitask and transfer learning that enables an autonomous agent to learn how to behave in multiple tasks simultaneously, and then generalize its knowledge to new domains. This method, termed “Actor-Mimic”, exploits the use of deep reinforcement learning and model compression techniques to train a single policy network that learns how to act in a set of distinct tasks by using the guidance of several expert teachers. We then show that the representations learnt by the deep policy network are capable of generalizing to new tasks with no prior expert guidance, speeding up learning in novel environments. Although our method can in general be applied to a wide range of problems, we use Atari games as a testing environment to demonstrate these methods.

http://www.maluuba.com/s/aaai-multi-task1.pdf Transfer Reinforcement Learning with Shared Dynamics

This article addresses a particular Transfer Reinforcement Learning (RL) problem: when dynamics do not change from one task to another, and only the reward function does.

http://www.cs.cmu.edu/~wcohen/postscript/iclr-2017-transfer.pdf TRANSFER LEARNING FOR SEQUENCE TAGGING WITH HIERARCHICAL RECURRENT NETWORKS

Recent papers have shown that neural networks obtain state-of-the-art performance on several different sequence tagging tasks. One appealing property of such systems is their generality, as excellent performance can be achieved with a unified architecture and without task-specific feature engineering. However, it is unclear if such systems can be used for tasks without large amounts of training data. In this paper we explore the problem of transfer learning for neural sequence taggers, where a source task with plentiful annotations (e.g., POS tagging on Penn Treebank) is used to improve performance on a target task with fewer available annotations (e.g., POS tagging for microblogs). We examine the effects of transfer learning for deep hierarchical recurrent networks across domains, applications, and languages, and show that significant improvement can often be obtained. These improvements lead to improvements over the current state-of-the-art on several well-studied tasks.

We observe that the following factors are crucial for the performance of our transfer learning approach: a) label abundance for the target task, b) relatedness between the source and target tasks, and c) the number of parameters that can be shared. In the future, it will be interesting to combine model-based transfer (as in this work) with resource-based transfer for cross-lingual transfer learning.

http://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf A Survey on Transfer Learning

This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression and clustering problems. In this survey, we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as co-variate shift. We also explore some potential future issues in transfer learning research.

https://arxiv.org/pdf/1703.06345v1.pdf Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks

Same paper as the hierarchical recurrent network transfer-learning entry above; an implementation is available at https://github.com/kimiyoung/transfer

http://sebastianruder.com/transfer-learning

https://arxiv.org/abs/1703.03400v1 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on a few-shot image classification benchmark, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.
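
The core of MAML fits in a few lines: take a gradient step on each task's support set, evaluate the adapted parameters on the query set, and backpropagate through the adaptation into the shared initialization. A simplified regression sketch (single inner step; the model, sizes and learning rates are placeholders, not values from the paper):

```python
import torch
import torch.nn.functional as F

# Simplified MAML sketch for a tiny functional MLP (regression).

def init_params(in_dim=1, hidden=40, out_dim=1):
    return [(torch.randn(hidden, in_dim) * 0.1).requires_grad_(),
            torch.zeros(hidden, requires_grad=True),
            (torch.randn(out_dim, hidden) * 0.1).requires_grad_(),
            torch.zeros(out_dim, requires_grad=True)]

def forward(params, x):
    w1, b1, w2, b2 = params
    return F.linear(torch.relu(F.linear(x, w1, b1)), w2, b2)

def maml_meta_step(params, tasks, inner_lr=0.01, meta_lr=0.001):
    """One meta-update. tasks: list of (x_support, y_support, x_query, y_query)."""
    meta_loss = 0.0
    for x_s, y_s, x_q, y_q in tasks:
        # Inner step: adapt the shared initialization to this task.
        support_loss = F.mse_loss(forward(params, x_s), y_s)
        grads = torch.autograd.grad(support_loss, params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer objective: post-adaptation performance on held-out query data.
        meta_loss = meta_loss + F.mse_loss(forward(adapted, x_q), y_q)
    meta_grads = torch.autograd.grad(meta_loss, params)
    with torch.no_grad():
        for p, g in zip(params, meta_grads):
            p -= meta_lr * g  # update the shared initialization
    return meta_loss.item()
```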

https://www.youtube.com/watch?v=GvfKHfcpD4I Spotlight Talk: Performance Guarantees for Transferring Representations

https://arxiv.org/abs/1704.03073 Data-efficient Deep Reinforcement Learning for Dexterous Manipulation

We introduce two extensions to the Deep Deterministic Policy Gradient algorithm (DDPG), a model-free Q-learning based method, which make it significantly more data-efficient and scalable. Our results show that by making extensive use of off-policy data and replay, it is possible to find control policies that robustly grasp objects and stack them. Further, our results hint that it may soon be feasible to train successful stacking policies by collecting interactions on real robots.

https://arxiv.org/abs/1704.03453v1 The Space of Transferable Adversarial Examples

We conclude with a formal study of the limits of transferability. We show (1) sufficient conditions on the data distribution that imply transferability for simple model classes and (2) examples of tasks for which transferability fails to hold. This suggests the existence of defenses making models robust to transferability attacks—even when the model is not robust to its own adversarial examples.

https://arxiv.org/pdf/1705.06273.pdf Transfer Learning for Named-Entity Recognition with Neural Networks

Recent approaches based on artificial neural networks (ANNs) have shown promising results for named-entity recognition (NER). In order to achieve high performances, ANNs need to be trained on a large labeled dataset. However, labels might be difficult to obtain for the dataset on which the user wants to perform NER: label scarcity is particularly pronounced for patient note de-identification, which is an instance of NER. In this work, we analyze to what extent transfer learning may address this issue. In particular, we demonstrate that transferring an ANN model trained on a large labeled dataset to another dataset with a limited number of labels improves upon the state-of-the-art results on two different datasets for patient note de-identification.

https://www.vicarious.com/img/icml2017-schemas.pdf Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics

In pursuit of efficient and robust generalization, we introduce the Schema Network, an object-oriented generative physics simulator capable of disentangling multiple causes of events and reasoning backward through causes to achieve goals. The richly structured architecture of the Schema Network can learn the dynamics of an environment directly from data. We compare Schema Networks with Asynchronous Advantage Actor-Critic and Progressive Networks on a suite of Breakout variations, reporting results on training efficiency and zero-shot generalization, consistently demonstrating faster, more robust learning and better transfer.

https://arxiv.org/abs/1707.01220 DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer

We introduce a new type of knowledge – cross sample similarities for model compression and acceleration. This knowledge can be naturally derived from deep metric learning model. To transfer them, we bring the learning to rank technique into deep metric learning formulation. We test our proposed DarkRank on the pedestrian re-identification task.

https://arxiv.org/pdf/1706.09789v1.pdf Two-Stage Synthesis Networks for Transfer Learning in Machine Comprehension

We develop a technique for transfer learning in machine comprehension (MC) using a novel two-stage synthesis network (SynNet). Given a high-performing MC model in one domain, our technique aims to answer questions about documents in another domain, where we use no labeled data of question-answer pairs. Using the proposed SynNet with a pretrained model on the SQuAD dataset, we achieve an F1 measure of 46.6% on the challenging NewsQA dataset, approaching performance of in-domain models (F1 measure of 50.0%) and outperforming the out-of-domain baseline by 7.6%, without use of provided annotations.

https://arxiv.org/abs/1707.04175 Distral: Robust Multitask Reinforcement Learning

We propose a new approach for joint training of multiple tasks, which we refer to as Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a “distilled” policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies.
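
The two coupling terms can be sketched as follows (only the distillation and regularization pieces; each worker's own RL loss and the entropy bonus of the full Distral objective are omitted here):

```python
import torch
import torch.nn.functional as F

def distral_coupling_losses(task_logits, shared_logits):
    """Sketch of the coupling terms for one worker over a batch of states.
    task_logits: logits of the worker's task policy; shared_logits: logits of
    the shared ('distilled') policy for the same states."""
    task_logp = F.log_softmax(task_logits, dim=1)
    shared_logp = F.log_softmax(shared_logits, dim=1)
    task_p = task_logp.exp()
    # Worker term: keep this task's policy close to the shared policy
    # (shared policy treated as fixed for this term).
    worker_term = (task_p * (task_logp - shared_logp.detach())).sum(dim=1).mean()
    # Shared term: distill the worker's behaviour into the shared policy
    # (worker treated as fixed); summed over workers this drives the shared
    # policy toward the centroid of the task policies.
    shared_term = -(task_p.detach() * shared_logp).sum(dim=1).mean()
    return worker_term, shared_term
```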

https://research.googleblog.com/2017/04/federated-learning-collaborative.html?m=1 Federated Learning: Collaborative Machine Learning without Centralized Training Data

Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices (like the Mobile Vision API and On-Device Smart Reply) by bringing model training to the device as well.

https://arxiv.org/abs/1710.02076v1 On the Effective Use of Pretraining for Natural Language Inference

We show that pretrained embeddings outperform both random and retrofitted ones in a large NLI corpus.

https://arxiv.org/pdf/1708.00630.pdf ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections

https://deepmind.com/blog/high-fidelity-speech-synthesis-wavenet/ Parallel WaveNet: Fast High-Fidelity Speech Synthesis

http://metalearning.ml/papers/metalearn17_furlanello.pdf Born Again Neural Networks

In this paper, we revisit knowledge distillation but with a different objective. Rather than compressing models, we train students that are parameterized identically to their parents. Surprisingly, these born-again networks (BANs) tend to outperform their teacher models.

In Marvin Minsky’s Society of Mind [11], the explanation of human development introduced the idea of a sequence of teaching selves. Minsky suggested that sudden spurts in intelligence during childhood may be due to longer and hidden training of a new “student” model under the guidance of the older self. In the same work, Minsky concluded that our perception of a long-term self is constructed by an ensemble of multiple generations of internal models, which we can use for guidance when the most current model falls short.

https://arxiv.org/abs/1803.00443 Knowledge Transfer with Jacobian Matching

In this paper, we first establish an equivalence between Jacobian matching and distillation with input noise, from which we derive appropriate loss functions for Jacobian matching. We then rely on this analysis to apply Jacobian matching to transfer learning by establishing equivalence of a recent transfer learning procedure to distillation. We then show experimentally on standard image datasets that Jacobian-based penalties improve distillation, robustness to noisy inputs, and transfer learning.

https://arxiv.org/abs/1803.10750v1 Adversarial Network Compression

(i) we propose an adversarial network compression approach to train the small student network to mimic the large teacher, without the need for labels during training; (ii) we introduce a regularization scheme to prevent a trivially-strong discriminator without reducing the network capacity and (iii) our approach generalizes on different teacher-student models.

https://arxiv.org/abs/1804.03758 Universal Successor Representations for Transfer Reinforcement Learning

The objective of transfer reinforcement learning is to generalize from a set of previous tasks to unseen new tasks. In this work, we focus on the transfer scenario where the dynamics among tasks are the same, but their goals differ. Although general value function (Sutton et al., 2011) has been shown to be useful for knowledge transfer, learning a universal value function can be challenging in practice. To attack this, we propose (1) to use universal successor representations (USR) to represent the transferable knowledge and (2) a USR approximator (USRA) that can be trained by interacting with the environment. Our experiments show that USR can be effectively applied to new tasks, and the agent initialized by the trained USRA can achieve the goal considerably faster than random initialization.

https://arxiv.org/abs/1804.03235 Large scale distributed neural network training through online distillation

Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted.

In this paper, we use codistillation to refer to distillation performed: 1. using the same architecture for all the models; 2. using the same dataset to train all the models; and 3. using the distillation loss during training before any model has fully converged.

It is somewhat paradoxical that bad models codistilling from each other can learn faster than models training independently. Somehow the mistakes made by the teacher model carry enough information to help the student model do better than the teacher, and better than just seeing the actual label in the data.

The distillation loss term ψ can be the squared error between the logits of the models, the KL divergence between the predictive distributions, or some other measure of agreement between the model predictions. In this work we use the cross entropy error treating the teacher predictive distribution as soft targets. In the beginning of training, the distillation term in the loss is not very useful or may even be counterproductive, so to maintain model diversity longer and to avoid a complicated loss function schedule we only enable the distillation term in the loss function once training has gotten off the ground.
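
Putting the pieces above together, a sketch of the per-model codistillation loss (the burn-in threshold and weighting are placeholder hyperparameters):

```python
import torch
import torch.nn.functional as F

def codistillation_loss(own_logits, stale_peer_logits, labels,
                        step, burn_in_steps=10000, weight=1.0):
    """Codistillation sketch for one of two identically structured models.
    stale_peer_logits come from an occasionally synchronized copy of the other
    model, so they can be computed without frequent weight transfers."""
    loss = F.cross_entropy(own_logits, labels)
    if step >= burn_in_steps:
        # Cross entropy against the peer's predictive distribution as soft
        # targets, enabled only after training has gotten off the ground.
        peer_probs = F.softmax(stale_peer_logits, dim=1).detach()
        distill = -(peer_probs * F.log_softmax(own_logits, dim=1)).sum(dim=1).mean()
        loss = loss + weight * distill
    return loss
```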

https://arxiv.org/abs/1804.08328v1 Taskonomy: Disentangling Task Transfer Learning

We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure including a solver that users can employ to devise efficient supervision policies for their use cases.

https://arxiv.org/pdf/1804.10172.pdf Capsule networks for low-data transfer learning

The generative capsule network uses what we call a memo architecture, which consists of convolving the images into the Digit Capsules, applying convolutional reconstruction, and classifying images based on the reconstruction.

https://arxiv.org/abs/1804.10332 Sim-to-Real: Learning Agile Locomotion For Quadruped Robots

The control policies are learned in a physics simulator and then deployed on real robots. In robotics, policies trained in simulation often do not transfer to the real world. We narrow this reality gap by improving the physics simulator and learning robust policies. We improve the simulation using system identification, developing an accurate actuator model and simulating latency. We learn robust controllers by randomizing the physical environments, adding perturbations and designing a compact observation space. We evaluate our system on two agile locomotion gaits: trotting and galloping. After learning in simulation, a quadruped robot can successfully perform both gaits in the real world.

https://arxiv.org/abs/1805.02152v1 Quantization Mimic: Towards Very Tiny CNN for Object Detection

https://arxiv.org/abs/1805.04770v1 Born Again Neural Networks

We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks.

https://arxiv.org/abs/1805.08974 Do Better ImageNet Models Transfer Better?

We achieve state-of-the-art performance on eight image classification tasks simply by fine-tuning state-of-the-art ImageNet architectures, outperforming previous results based on specialized methods for transfer learning. Finally, we observe that, on three small fine-grained image classification datasets, networks trained from random initialization perform similarly to ImageNet-pretrained networks. Together, our results show that ImageNet architectures generalize well across datasets, with small improvements in ImageNet accuracy producing improvements across other tasks, but ImageNet features are less general than previously suggested.

https://arxiv.org/abs/1803.11175 Universal Sentence Encoder

https://arxiv.org/pdf/1806.05662v1.pdf GLoMo: Unsupervisedly Learned Relational Graphs as Transferable Representations

This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units (e.g., words or pixels) from large-scale unlabeled data and transferring the graphs to downstream tasks.

https://arxiv.org/abs/1706.02596v2 Dynamic Integration of Background Knowledge in Neural NLU Systems

Common-sense or background knowledge is required to understand natural language, but in most neural natural language understanding (NLU) systems, the requisite background knowledge is indirectly acquired from static corpora. We develop a new reading architecture for the dynamic integration of explicit background knowledge in NLU models. A new task-agnostic reading module provides refined word representations to a task-specific NLU architecture by processing background knowledge in the form of free-text statements, together with the task-specific inputs. Strong performance on the tasks of document question answering (DQA) and recognizing textual entailment (RTE) demonstrate the effectiveness and flexibility of our approach. Analysis shows that our models learn to exploit knowledge selectively and in a semantically appropriate way.

https://github.com/IndicoDataSolutions/finetune Scikit-learn style model finetuning for NLP

https://github.com/IndicoDataSolutions/Enso Enso: An Open Source Library for Benchmarking Embeddings + Transfer Learning Methods http://enso.readthedocs.io/en/latest/

https://arxiv.org/abs/1809.05214 Model-Based Reinforcement Learning via Meta-Policy Optimization https://sites.google.com/view/mb-mpo

https://arxiv.org/abs/1810.05751 Policy Transfer with Strategy Optimization In this paper, we present a different approach that leverages domain randomization for transferring control policies to unknown environments. The key idea is that, instead of learning a single policy in the simulation, we simultaneously learn a family of policies that exhibit different behaviors.

https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/

https://arxiv.org/abs/1810.12894v1 Exploration by Random Network Distillation

https://arxiv.org/abs/1811.08883 Rethinking ImageNet Pre-training