Multi-Objective Learning

https://arxiv.org/abs/1610.02707v1 Multi-Objective Deep Reinforcement Learning

We propose Deep Optimistic Linear Support Learning (DOL) to solve high dimensional multi-objective decision problems where the relative importances of the objectives are not known a priori. Using features from the high-dimensional inputs, DOL computes the convex coverage set containing all potential optimal solutions of the convex combinations of the objectives.
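As a rough illustration of what a convex coverage set is, the sketch below keeps every candidate policy that is optimal for at least one linear weighting of two objectives. It sweeps a fixed grid of weights, which is a simplification and not the optimistic linear support procedure that DOL uses; the function name, the grid size, and the example returns are all assumptions made for illustration.

```python
import numpy as np

def convex_coverage_set(values, num_weights=101):
    """Indices of value vectors that are optimal for at least one
    linear scalarization weight on a grid (two-objective case only)."""
    values = np.asarray(values, dtype=float)
    kept = set()
    for w1 in np.linspace(0.0, 1.0, num_weights):
        w = np.array([w1, 1.0 - w1])       # convex combination of the objectives
        scalarized = values @ w            # one scalar return per candidate policy
        kept.add(int(np.argmax(scalarized)))
    return sorted(kept)

# Hypothetical two-objective returns for four candidate policies.
returns = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6], [0.4, 0.4]]
ccs = convex_coverage_set(returns)
```

The last policy is dominated under every weighting by the third one, so it never enters the set.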

https://arxiv.org/abs/1611.05397 Reinforcement Learning with Unsupervised Auxiliary Tasks

Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and a challenging suite of first-person, three-dimensional Labyrinth tasks leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.

https://arxiv.org/abs/1611.01673v2 Generative Multi-Adversarial Networks

Generative adversarial networks (GANs) are a framework for producing a generative model by way of a two-player minimax game. In this paper, we propose the Generative Multi-Adversarial Network (GMAN), a framework that extends GANs to multiple discriminators. In previous work, the successful training of GANs requires modifying the minimax objective to accelerate training early on. In contrast, GMAN can be reliably trained with the original, untampered objective. We explore a number of design perspectives with the discriminator role ranging from formidable adversary to forgiving teacher. Image generation tasks comparing the proposed framework to standard GANs demonstrate GMAN produces higher quality samples in a fraction of the iterations when measured by a pairwise GAM-type metric.

https://arxiv.org/pdf/1610.01945v2.pdf Connecting Generative Adversarial Networks and Actor-Critic Methods

Both generative adversarial networks (GAN) in unsupervised learning and actor-critic methods in reinforcement learning (RL) have gained a reputation for being difficult to optimize. Practitioners in both fields have amassed a large number of strategies to mitigate these instabilities and improve training. Here we show that GANs can be viewed as actor-critic methods in an environment where the actor cannot affect the reward. We review the strategies for stabilizing training for each class of models, both those that generalize between the two and those that are particular to that model. We also review a number of extensions to GANs and RL algorithms with even more complicated information flow. We hope that by highlighting this formal connection we will encourage both GAN and RL communities to develop general, scalable, and stable algorithms for multilevel optimization with deep networks, and to draw inspiration across communities.

https://deepmind.com/blog/reinforcement-learning-unsupervised-auxiliary-tasks/ Reinforcement learning with unsupervised auxiliary tasks https://arxiv.org/pdf/1611.05397.pdf

In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task.

https://openreview.net/forum?id=Bk8BvDqex Metacontrol for Adaptive Imagination-Based Optimization

Many machine learning systems are built to solve the hardest examples of a particular task, which often makes them large and expensive to run—especially with respect to the easier examples, which might require much less computation. For an agent with a limited computational budget, this “one-size-fits-all” approach may result in the agent wasting valuable computation on easy examples, while not spending enough on hard examples. Rather than learning a single, fixed policy for solving all instances of a task, we introduce a metacontroller which learns to optimize a sequence of “imagined” internal simulations over predictive models of the world in order to construct a more informed, and more economical, solution. The metacontroller component is a model-free reinforcement learning agent, which decides both how many iterations of the optimization procedure to run, as well as which model to consult on each iteration. The models (which we call “experts”) can be state transition models, action-value functions, or any other mechanism that provides information useful for solving the task, and can be learned on-policy or off-policy in parallel with the metacontroller. When the metacontroller, controller, and experts were trained with “interaction networks” (Battaglia et al., 2016) as expert models, our approach was able to solve a challenging decision-making problem under complex non-linear dynamics. The metacontroller learned to adapt the amount of computation it performed to the difficulty of the task, and learned how to choose which experts to consult by factoring in both their reliability and individual computational resource costs. This allowed the metacontroller to achieve a lower overall cost (task loss plus computational cost) than more traditional fixed policy approaches. These results demonstrate that our approach is a powerful framework for using rich forward models for efficient model-based reinforcement learning.

https://arxiv.org/pdf/1605.02651v1.pdf A Minimal Model for the Emergence of Cooperation in Randomly Growing Networks

https://arxiv.org/abs/1612.02605 Towards Information-Seeking Agents

We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment, for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.

https://arxiv.org/abs/1612.05159 Improving Scalability of Reinforcement Learning by Separation of Concerns

In this paper, we propose a framework for solving a single-agent task by using multiple agents, each focusing on different aspects of the task. This approach has two main advantages: 1) it allows for specialized agents for different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents. Our framework generalizes the traditional hierarchical decomposition, in which, at any moment in time, a single agent has control until it has solved its particular subtask. We illustrate our framework using a number of examples.

https://arxiv.org/abs/1604.06508 HIRL: Hierarchical Inverse Reinforcement Learning for Long-Horizon Tasks with Delayed Rewards

Reinforcement Learning (RL) struggles in problems with delayed rewards, and one approach is to segment the task into sub-tasks with incremental rewards. We propose a framework called Hierarchical Inverse Reinforcement Learning (HIRL), which is a model for learning sub-task structure from demonstrations. HIRL decomposes the task into sub-tasks based on transitions that are consistent across demonstrations. These transitions are defined as changes in local linearity w.r.t. a kernel function. Then, HIRL uses the inferred structure to learn reward functions local to the sub-tasks but also handle any global dependencies such as sequentiality. We have evaluated HIRL on several standard RL benchmarks: Parallel Parking with noisy dynamics, Two-Link Pendulum, 2D Noisy Motion Planning, and a Pinball environment. In the parallel parking task, we find that rewards constructed with HIRL converge to a policy with an 80% success rate in 32% fewer time-steps than those constructed with Maximum Entropy Inverse RL (MaxEnt IRL), and with partial state observation, the policies learned with IRL fail to achieve this accuracy while HIRL still converges. We further find that the rewards learned with HIRL are robust to environment noise where they can tolerate 1 stdev. of random perturbation in the poses in the environment obstacles while maintaining roughly the same convergence rate. We find that HIRL rewards can converge up to 6x faster than rewards constructed with IRL.

https://arxiv.org/pdf/1701.08734v1.pdf PathNet: Evolution Channels Gradient Descent in Super Neural Networks

For artificial general intelligence (AGI) it would be efficient if multiple users trained the same giant neural network, permitting parameter reuse, without catastrophic forgetting. PathNet is a first step in this direction. It is a neural network algorithm that uses agents embedded in the neural network whose task is to discover which parts of the network to re-use for new tasks. Agents are pathways (views) through the network which determine the subset of parameters that are used and updated by the forwards and backwards passes of the backpropagation algorithm. During learning, a tournament selection genetic algorithm is used to select pathways through the neural network for replication and mutation. Pathway fitness is the performance of that pathway measured according to a cost function. We demonstrate successful transfer learning; fixing the parameters along a path learned on task A and re-evolving a new population of paths for task B, allows task B to be learned faster than it could be learned from scratch or after fine-tuning. Paths evolved on task B re-use parts of the optimal path evolved on task A. Positive transfer was demonstrated for binary MNIST, CIFAR, and SVHN supervised learning classification tasks, and a set of Atari and Labyrinth reinforcement learning tasks, suggesting PathNets have general applicability for neural network training. Finally, PathNet also significantly improves the robustness to hyperparameter choices of a parallel asynchronous reinforcement learning algorithm (A3C).
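The tournament selection over pathways described above can be sketched roughly as follows. This is a toy version under stated assumptions: a pathway is a list of module indices per layer, the mutation rule (resample an index with a fixed probability) is a simplification chosen here, and `toy_fitness` is a hypothetical stand-in for the task performance that a real pathway would be scored on after training.

```python
import random

def mutate(path, num_modules, rate, rng):
    """Independently resample each module index with probability `rate`."""
    return [[rng.randrange(num_modules) if rng.random() < rate else m
             for m in layer] for layer in path]

def tournament_step(population, fitness, num_modules, rate, rng):
    """Binary tournament: the loser is overwritten by a mutated copy of the winner."""
    i, j = rng.sample(range(len(population)), 2)
    if fitness(population[i]) < fitness(population[j]):
        i, j = j, i                        # make i the winner
    population[j] = mutate(population[i], num_modules, rate, rng)
    return i, j

rng = random.Random(0)
# Hypothetical sizes: 8 pathways, 3 layers, 2 active modules per layer, 10 modules per layer.
population = [[[rng.randrange(10) for _ in range(2)] for _ in range(3)]
              for _ in range(8)]

def toy_fitness(path):
    """Toy score standing in for task performance: prefer low module indices."""
    return -sum(sum(layer) for layer in path)

best_before = max(toy_fitness(p) for p in population)
for _ in range(200):
    tournament_step(population, toy_fitness, 10, 0.1, rng)
best_after = max(toy_fitness(p) for p in population)
```

Because the winner of each tournament is never overwritten, the best fitness in the population cannot decrease from one step to the next.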

http://cims.nyu.edu/~sainbar/commnet/ Learning Multiagent Communication with Backpropagation

Many tasks in AI require the collaboration of multiple agents. Typically, the communication protocol between agents is manually specified and not altered during training. In this paper we explore a simple neural model, called CommNet, that uses continuous communication for fully cooperative tasks. The model consists of multiple agents and the communication between them is learned alongside their policy. We apply this model to a diverse set of tasks, demonstrating the ability of the agents to learn to communicate amongst themselves, yielding improved performance over non-communicative agents and baselines. In some cases, it is possible to interpret the language devised by the agents, revealing simple but effective strategies for solving the task at hand.

https://arxiv.org/abs/1612.01294 Message Passing Multi-Agent GANs

We show that we can obtain multi-agent GANs that communicate through message passing to achieve better image generation. The objectives of the individual agents in this framework are two fold: a co-operation objective and a competing objective. The co-operation objective ensures that the message sharing mechanism guides the other generator to generate better than itself while the competing objective encourages each generator to generate better than its counterpart.

https://arxiv.org/pdf/1607.00548.pdf Active Object Localization in Visual Situations

We describe a method for performing active localization of objects in instances of visual situations. A visual situation is an abstract concept (e.g., “a boxing match”, “a birthday party”, “walking the dog”, “waiting for a bus”) whose image instantiations are linked more by their common spatial and semantic structure than by low-level visual similarity. Our system combines given and learned knowledge of the structure of a particular situation, and adapts that knowledge to a new situation instance as it actively searches for objects. More specifically, the system learns a set of probability distributions describing spatial and other relationships among relevant objects. The system uses those distributions to iteratively sample object proposals on a test image, but also continually uses information from those object proposals to adaptively modify the distributions based on what the system has detected. We test our approach’s ability to efficiently localize objects, using a situation-specific image dataset created by our group. We compare the results with several baselines and variations on our method, and demonstrate the strong benefit of using situation knowledge and active context-driven localization. Finally, we contrast our method with several other approaches that use context as well as active search for object localization in images.

https://arxiv.org/abs/1702.05573 Collaborative Deep Reinforcement Learning for Joint Object Search

We examine the problem of joint top-down active search of multiple objects under interaction, e.g., person riding a bicycle, cups held by the table, etc. Such objects under interaction often can provide contextual cues to each other to facilitate more efficient search. By treating each detector as an agent, we present the first collaborative multi-agent deep reinforcement learning algorithm to learn the optimal policy for joint active object localization, which effectively exploits such beneficial contextual information. We learn inter-agent communication through cross connections with gates between the Q-networks, which is facilitated by a novel multi-agent deep Q-learning algorithm with joint exploitation sampling. We verify our proposed method on multiple object detection benchmarks. Not only does our model help to improve the performance of state-of-the-art active localization models, it also reveals interesting co-detection patterns that are intuitively interpretable.

https://arxiv.org/abs/1511.06342 Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning

This method, termed “Actor-Mimic”, exploits the use of deep reinforcement learning and model compression techniques to train a single policy network that learns how to act in a set of distinct tasks by using the guidance of several expert teachers.

https://openreview.net/forum?id=Hk8N3Sclg&noteId=Hk8N3Sclg MULTI-AGENT COOPERATION AND THE EMERGENCE OF (NATURAL) LANGUAGE

https://arxiv.org/abs/1702.06856v2 Robustness to Adversarial Examples through an Ensemble of Specialists

We are proposing to use an ensemble of diverse specialists, where speciality is defined according to the confusion matrix. Indeed, we observed that for adversarial instances originating from a given class, labeling tends to be done into a small subset of (incorrect) classes. Therefore, we argue that an ensemble of specialists should be better able to identify and reject fooling instances, with a high entropy (i.e., disagreement) over the decisions in the presence of adversaries. Experimental results obtained confirm that interpretation, opening a way to make the system more robust to adversarial examples through a rejection mechanism, rather than trying to classify them properly at any cost.
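The rejection idea can be illustrated with a minimal sketch: average the class distributions of the ensemble members and refuse to classify when the entropy of that average is high. The plain averaging, the threshold value, and the example probabilities are assumptions for illustration; the paper builds its specialists from the confusion matrix, which is not modelled here.

```python
import numpy as np

def ensemble_entropy(member_probs):
    """Entropy (in nats) of the averaged class distribution of an ensemble."""
    mean = np.mean(np.asarray(member_probs, dtype=float), axis=0)
    return float(-np.sum(mean * np.log(mean + 1e-12)))

def classify_or_reject(member_probs, threshold):
    """Return the predicted class, or None when the members disagree too much."""
    if ensemble_entropy(member_probs) > threshold:
        return None
    return int(np.argmax(np.mean(member_probs, axis=0)))

# Hypothetical per-member class probabilities for two inputs.
agree = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
disagree = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

accepted = classify_or_reject(agree, threshold=0.8)
rejected = classify_or_reject(disagree, threshold=0.8)
```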

https://arxiv.org/abs/1703.00573v1 Generalization and Equilibrium in Generative Adversarial Nets (GANs)

We introduce a new metric called neural net distance for which generalization does occur. We also show that an approximate pure equilibrium in the 2-player game exists for a natural training objective (Wasserstein).

Finally, the above theoretical ideas lead us to propose a new training protocol, MIX+GAN, which can be combined with any existing method. We present experiments showing that it stabilizes and improves some existing methods.

Suspecting that a pure equilibrium may not exist for all objectives, we recommend in practice our MIX+GAN protocol using a small mixture of discriminators and generators. Our experiments show it improves the quality of several existing GAN training methods. Finally, note that existence of an equilibrium does not imply that a simple algorithm (in this case, backpropagation) would find it easily. That still defies explanation.

https://arxiv.org/abs/1512.08575v1 Optimal Selective Attention in Reactive Agents

In this report we present the minimum-information principle for selective attention in reactive agents. We further motivate this approach by reducing the general problem of optimal control in POMDPs, to reactive control with complex observations. Lastly, we explore a newly discovered phenomenon of this optimization process - period doubling bifurcations. This necessitates periodic policies, and raises many more questions regarding stability, periodicity and chaos in optimal control.

https://arxiv.org/abs/1703.01161 FeUdal Networks for Hierarchical Reinforcement Learning

We introduce FeUdal Networks (FuNs): a novel architecture for hierarchical reinforcement learning. Our approach is inspired by the feudal reinforcement learning proposal of Dayan and Hinton, and gains power and efficacy by decoupling end-to-end learning across multiple levels – allowing it to utilise different resolutions of time. Our framework employs a Manager module and a Worker module. The Manager operates at a lower temporal resolution and sets abstract goals which are conveyed to and enacted by the Worker. The Worker generates primitive actions at every tick of the environment. The decoupled structure of FuN conveys several benefits – in addition to facilitating very long timescale credit assignment it also encourages the emergence of sub-policies associated with different goals set by the Manager. These properties allow FuN to dramatically outperform a strong baseline agent on tasks that involve long-term credit assignment or memorisation. We demonstrate the performance of our proposed system on a range of tasks from the ATARI suite and also from a 3D DeepMind Lab environment.

https://arxiv.org/pdf/1704.00756v1.pdf Multi-Advisor Reinforcement Learning

This article deals with a novel branch of Separation of Concerns, called Multi-Advisor Reinforcement Learning (MAd-RL), where a single-agent RL problem is distributed to n learners, called advisors. Each advisor tries to solve the problem with a different focus. Their advice is then communicated to an aggregator, which is in control of the system. For the local training, three off-policy bootstrapping methods are proposed and analysed: local-max bootstraps with the local greedy action, rand-policy bootstraps with respect to the random policy, and agg-policy bootstraps with respect to the aggregator's greedy policy. MAd-RL is positioned as a generalisation of Reinforcement Learning with Ensemble methods. An experiment is held on a simplified version of the Ms. Pac-Man Atari game. The results confirm the theoretical relative strengths and weaknesses of each method.
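The three bootstrapping targets named in the abstract can be written down for a single advisor as below. This is a sketch under assumptions: the aggregator is taken to sum the advisors' Q-values and act greedily, the random policy is taken to be uniform over actions, and all function names and numbers are hypothetical.

```python
import numpy as np

def aggregate_action(advisor_q_values):
    """Sum each advisor's Q-values per action and act greedily on the total."""
    total = np.sum(np.asarray(advisor_q_values, dtype=float), axis=0)
    return int(np.argmax(total)), total

def local_max_target(reward, next_q, gamma):
    """Bootstrap with the advisor's own greedy action."""
    return reward + gamma * float(np.max(next_q))

def rand_policy_target(reward, next_q, gamma):
    """Bootstrap with respect to a uniformly random policy."""
    return reward + gamma * float(np.mean(next_q))

def agg_policy_target(reward, next_q, gamma, agg_action):
    """Bootstrap with the action the aggregator would take greedily."""
    return reward + gamma * float(next_q[agg_action])

# Two hypothetical advisors scoring three actions in the next state.
next_qs = np.array([[1.0, 0.0, 0.5],
                    [0.0, 2.0, 0.5]])
agg_action, total = aggregate_action(next_qs)

# Targets for advisor 0 with local reward 1.0 and discount 0.9.
t_local = local_max_target(1.0, next_qs[0], 0.9)
t_rand = rand_policy_target(1.0, next_qs[0], 0.9)
t_agg = agg_policy_target(1.0, next_qs[0], 0.9, agg_action)
```

In this example the aggregator prefers an action that advisor 0 values least, so the three targets differ noticeably.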

https://arxiv.org/abs/1704.06676v1 Multi-Objective Deep Q-Learning with Subsumption Architecture

We propose an architecture in which separate DQNs are used to control the agent's behaviour with respect to particular objectives. In this architecture we use signal suppression, known from the (Brooks) subsumption architecture, to combine outputs of several DQNs into a single action. Our architecture enables the decomposition of the agent's behaviour into controllable and replaceable sub-behaviours learned by distinct modules. To evaluate our solution we used a game-like simulator in which an agent - provided with high-level visual input - pursues multiple objectives in a 2D world. Our solution provides benefits of modularity, while its performance is comparable to the monolithic approach.
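A subsumption-style suppression rule can be sketched as follows: modules are ordered by priority, and the highest-priority module that is currently active overrides the choices of all modules below it. How each DQN signals that it is active, and the two module names used here, are assumptions; the abstract does not specify these details.

```python
import numpy as np

def subsumption_action(modules):
    """modules: list of (q_values, active) ordered from lowest to highest priority.
    The highest-priority active module suppresses every module below it."""
    chosen = None
    for q_values, active in modules:
        if active:
            chosen = int(np.argmax(q_values))   # overrides any lower-priority choice
    return chosen

# Hypothetical Q-values from two per-objective networks over three actions.
q_collect = np.array([0.1, 0.9, 0.2])   # low priority: collect items
q_avoid = np.array([0.8, 0.1, 0.3])     # high priority: avoid hazards

calm = subsumption_action([(q_collect, True), (q_avoid, False)])
danger = subsumption_action([(q_collect, True), (q_avoid, True)])
```

Replacing one module leaves the others untouched, which is the modularity benefit the abstract points to.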

https://arxiv.org/abs/1511.08779 Multiagent Cooperation and Competition with Deep Reinforcement Learning

Agents trained under collaborative rewarding schemes find an optimal strategy to keep the ball in the game as long as possible. We also describe the progression from competitive to collaborative behavior. The present work demonstrates that Deep Q-Networks can become a practical tool for studying the decentralized learning of multiagent systems living in highly complex environments. https://github.com/NeuroCSUT/DeepMind-Atari-Deep-Q-Learner-2Player

https://arxiv.org/pdf/1705.06366v1.pdf Automatic Goal Generation for Reinforcement Learning Agents

We propose a method that allows an agent to automatically discover the range of tasks that it is capable of performing. We use a generator network to propose tasks for the agent to try to achieve, specified as goal states. The generator network is optimized using adversarial training to produce tasks that are always at the appropriate level of difficulty for the agent. Our method thus automatically produces a curriculum of tasks for the agent to learn.

https://arxiv.org/abs/1705.11192v1 Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols

We study a setting where two agents engage in playing a referential game and, from scratch, develop a communication protocol necessary to succeed in this game. Unlike previous work, we require that messages they exchange, both at train and test time, are in the form of a language (i.e. sequences of discrete symbols).

https://code.facebook.com/posts/1686672014972296 Deal or no deal? Training AI bots to negotiate

https://arxiv.org/pdf/1707.03300.pdf The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously

This paper introduces the Intentional Unintentional (IU) agent. This agent endows the deep deterministic policy gradients (DDPG) agent for continuous control with the ability to solve several tasks simultaneously. We show that the IU agent not only learns to solve many tasks simultaneously but it also learns faster than agents that target a single task at-a-time. In some cases, where the single task DDPG method completely fails, the IU agent successfully solves the task. To demonstrate this, we build a playroom environment using the MuJoCo physics engine, and introduce a grounded formal language to automatically generate tasks.

https://arxiv.org/abs/1705.08142 https://github.com/sebastianruder/sluice-networks Sluice networks: Learning what to share between loosely related tasks

To overcome this, we introduce Sluice Networks, a general framework for multi-task learning where trainable parameters control the amount of sharing – including which parts of the models to share. Our framework goes beyond and generalizes over previous proposals in enabling hard or soft sharing of all combinations of subspaces, layers, and skip connections. We perform experiments on three task pairs from natural language processing, and across seven different domains, using data from OntoNotes 5.0, and achieve up to 15% average error reductions over common approaches to multi-task learning. We analyze when the architecture is particularly helpful, as well as its ability to fit noise. We show that a) label entropy is predictive of gains in sluice networks, confirming findings for hard parameter sharing, and b) while sluice networks easily fit noise, they are robust across domains in practice.

https://arxiv.org/abs/1611.01796 Modular Multitask Reinforcement Learning with Policy Sketches

Sketches annotate tasks with sequences of named subtasks, providing information about high-level structural relationships among tasks but not how to implement them—specifically not providing the detailed guidance used by much previous work on learning policy abstractions for RL (e.g. intermediate rewards, subtask completion signals, or intrinsic motivations). To learn from sketches, we present a model that associates every subtask with a modular subpolicy, and jointly maximizes reward over full task-specific policies by tying parameters across shared subpolicies. Optimization is accomplished via a decoupled actor–critic training objective that facilitates learning common behaviors from multiple dissimilar reward functions. We evaluate the effectiveness of our approach in three environments featuring both discrete and continuous control, and with sparse rewards that can be obtained only after completing a number of high-level subgoals. Experiments show that using our approach to learn policies guided by sketches gives better performance than existing techniques for learning task-specific or shared policies, while naturally inducing a library of interpretable primitive behaviors that can be recombined to rapidly adapt to new tasks.

https://arxiv.org/abs/1707.04175 Distral: Robust Multitask Reinforcement Learning

We propose a new approach for joint training of multiple tasks, which we refer to as Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a “distilled” policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable—attributes that are critical in deep reinforcement learning.
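The "centroid" remark can be made concrete for a single state with tabular policies. If the shared policy pi_0 is chosen to minimize the sum of KL(pi_i || pi_0) over the task policies pi_i, the minimizer is their arithmetic mean. The sketch below checks this numerically; it covers only this one-state distillation step, not the full joint objective, and the example policies are made up.

```python
import numpy as np

def kl(p, q):
    """KL divergence KL(p || q) for strictly positive discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def distill_centroid(task_policies):
    """Minimizer of sum_i KL(pi_i || pi_0) for one state: the arithmetic mean."""
    return np.mean(np.asarray(task_policies, dtype=float), axis=0)

# Two hypothetical task policies over three actions in the same state.
task_policies = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
pi0 = distill_centroid(task_policies)

total_kl_centroid = sum(kl(p, pi0) for p in task_policies)
uniform = np.full(3, 1.0 / 3.0)
total_kl_uniform = sum(kl(p, uniform) for p in task_policies)
```

Any other candidate for the shared policy, such as the uniform one used for comparison, gives a total divergence at least as large.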

https://arxiv.org/abs/1706.02275 Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.

https://blog.openai.com/learning-to-model-other-minds/

https://arxiv.org/pdf/1707.06334.pdf Fully Decentralized Policies for Multi-Agent Systems: An Information Theoretic Approach

https://arxiv.org/pdf/1711.02301.pdf Can Deep Reinforcement Learning Solve Erdos-Selfridge-Spencer Games?

https://arxiv.org/abs/1711.01431v1 The Case for Meta-Cognitive Machine Learning: On Model Entropy and Concept Formation in Deep Learning

Machine learning is usually defined in behaviourist terms, where external validation is the primary mechanism of learning. In this paper, I argue for a more holistic interpretation in which finding more probable, efficient and abstract representations is as central to learning as performance. In other words, machine learning should be extended with strategies to reason over its own learning process, leading to so-called meta-cognitive machine learning. As such, the de facto definition of machine learning should be reformulated in these intrinsically multi-objective terms, taking into account not only the task performance but also internal learning objectives. To this end, we suggest a “model entropy function” to be defined that quantifies the efficiency of the internal learning processes. It is conjectured that the minimization of this model entropy leads to concept formation. Besides philosophical aspects, some initial illustrations are included to support the claims.

https://arxiv.org/abs/1710.03748v2 Emergent Complexity via Multi-Agent Competition

In this paper, we point out that a competitive multi-agent environment trained with self-play can produce behaviors that are far more complex than the environment itself. We also point out that such environments come with a natural curriculum, because for any skill level, an environment full of agents of this level will have the right level of difficulty. https://github.com/openai/multiagent-competition

https://arxiv.org/abs/1802.01282 Coordinated Exploration in Concurrent Reinforcement Learning

We consider a team of reinforcement learning agents that concurrently learn to operate in a common environment. We identify three properties - adaptivity, commitment, and diversity - which are necessary for efficient coordinated exploration and demonstrate that straightforward extensions to single-agent optimistic and posterior sampling approaches fail to satisfy them. As an alternative, we propose seed sampling, which extends posterior sampling in a manner that meets these requirements. Simulation results investigate how per-agent regret decreases as the number of agents grows, establishing substantial advantages of seed sampling over alternative exploration schemes.

https://arxiv.org/abs/1709.08071 Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems

The purpose of the present article is to provide a comprehensive survey of the salient modelling methods which can be found in the literature. The article concludes with a discussion of open problems which may form the basis for fruitful future research.

https://arxiv.org/pdf/1804.02808v1.pdf Latent Space Policies for Hierarchical Reinforcement Learning

First, each layer in the hierarchy can be trained with exactly the same algorithm. Second, by using an invertible mapping from latent variables to actions, each layer becomes invertible, which means that the higher layer can always perfectly invert any behavior of the lower layer. This makes it possible to train lower layers on heuristic shaping rewards, while higher layers can still optimize task-specific rewards with good asymptotic performance. Finally, our method has a natural interpretation as an iterative procedure for constructing graphical models that gradually simplify the task dynamics.
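A minimal sketch of why an invertible latent-to-action mapping lets a higher layer reproduce any lower-layer behavior. It assumes a scalar, state-conditioned affine bijection; the paper's actual network architecture is not reproduced here, and the function names and weights are hypothetical.

```python
import math

def lower_layer(state, z, w_mu=0.5, w_log_sigma=0.1):
    """Map a latent z to an action through a state-conditioned affine bijection."""
    mu = w_mu * state
    sigma = math.exp(w_log_sigma * state)  # strictly positive, so the map is invertible in z
    return mu + sigma * z

def lower_layer_inverse(state, action, w_mu=0.5, w_log_sigma=0.1):
    """Recover the latent z that makes the lower layer emit the given action."""
    mu = w_mu * state
    sigma = math.exp(w_log_sigma * state)
    return (action - mu) / sigma

state = 2.0
desired_action = -0.7

# The higher layer can reach any action by choosing the matching latent.
z = lower_layer_inverse(state, desired_action)
recovered = lower_layer(state, z)
```

Because every action has exactly one latent preimage, a lower layer trained on a shaping reward does not restrict which actions the higher layer can ultimately select.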

https://e-drexler.com/d/09/00/AgoricsPapers/agoricpapers/aos/aos.0.html Markets and Computation: Agoric Open Systems

https://arxiv.org/pdf/1802.05642.pdf The Mechanics of n-Player Differentiable Games

https://arxiv.org/abs/1806.10332v1 MONAS: Multi-Objective Neural Architecture Search using Reinforcement Learning

https://arxiv.org/abs/1807.07665v1 Multitask Reinforcement Learning for Zero-shot Generalization with Subtask Dependencies

Unlike existing multitask RL approaches that explicitly describe what the agent should do, a subtask graph in our problem only describes properties of subtasks and relationships among them, which requires the agent to perform complex reasoning to find the optimal sequence of subtasks to execute. To tackle this problem, we propose a neural subtask graph solver (NSS) which encodes the subtask graph using a recursive neural network. To overcome the difficulty of training, we propose a novel non-parametric gradient-based policy to pre-train our NSS agent and further finetune it through actor-critic method. The experimental results on two 2D visual domains show that our agent can perform complex reasoning to find a near-optimal way of executing the subtask graph and generalize well to the unseen subtask graphs. In addition, we compare our agent with a Monte-Carlo tree search (MCTS) method showing that (1) our method is much more efficient than MCTS and (2) combining MCTS with NSS dramatically improves the search performance.

https://arxiv.org/abs/1807.09936v1 Multi-Agent Generative Adversarial Imitation Learning

We propose a new framework for multi-agent imitation learning for general Markov games, where we build upon a generalized notion of inverse reinforcement learning. We further introduce a practical multi-agent actor-critic algorithm with good empirical performance. Our method can be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents.

https://arxiv.org/abs/1809.11044 Relational Forward Models for Multi-Agent Learning

https://arxiv.org/abs/1810.05587 Is multiagent deep reinforcement learning the answer or the question? A brief survey

https://arxiv.org/abs/1810.10096v1 Learning Representations in Model-Free Hierarchical Reinforcement Learning

https://arxiv.org/abs/1705.08971v2 Optimal Cooperative Inference

https://www.biorxiv.org/content/early/2017/09/01/183632 Clustering and compositionality of task representations in a neural network trained to perform many cognitive tasks https://github.com/gyyang/multitask