Imperfect-Information Games

http://web.cs.ucla.edu/~kaoru/theoretical-impediments.pdf Theoretical Impediments to Machine Learning

Current machine learning systems operate, almost exclusively, in a purely statistical mode, which puts severe theoretical limits on their performance. We consider the feasibility of leveraging counterfactual reasoning in machine learning tasks and identify areas where such reasoning could lead to major breakthroughs in machine learning applications.
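
As a toy illustration of what counterfactual reasoning adds over the purely statistical mode, here is a minimal Python sketch (our own two-variable structural causal model, not anything from the paper) of the abduction-action-prediction recipe:

<code python>
# Toy structural causal model: X = U1, Y = X xor U2.
def counterfactual_y(observed_x, observed_y, new_x):
    # Abduction: recover the exogenous noise from the observation (U2 = Y xor X).
    u2 = observed_y != observed_x
    # Action: intervene, setting X := new_x (cutting X's own mechanism).
    # Prediction: recompute Y under the intervention with the same noise.
    return new_x != u2

# "Y occurred and X was true; would Y still have occurred had X been false?"
print(counterfactual_y(observed_x=True, observed_y=True, new_x=False))  # False
</code>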

http://www.cs.ox.ac.uk/publications/publication11394-abstract.html Counterfactual Multi-Agent Policy Gradients

To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
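
The counterfactual baseline is easy to state concretely; a minimal NumPy sketch (names and shapes are our own, not the paper's implementation):

<code python>
import numpy as np

def coma_advantage(q_values, policy_probs, taken_action):
    """Counterfactual advantage for one agent.

    q_values:     Q(s, (u_-a, u'_a)) for each alternative action u'_a of
                  agent a, with the other agents' actions held fixed.
    policy_probs: agent a's current policy pi_a(u'_a | tau_a).
    taken_action: index of the action agent a actually took.
    """
    # Counterfactual baseline: marginalise out agent a's own action.
    baseline = np.dot(policy_probs, q_values)
    # Advantage of the taken joint action relative to that baseline.
    return q_values[taken_action] - baseline

print(coma_advantage(np.array([1.0, 0.0, 2.0]), np.array([0.2, 0.5, 0.3]), 2))  # 1.2
</code>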

https://arxiv.org/pdf/1803.10760.pdf Unsupervised Predictive Memory in a Goal-Directed Agent

An obvious requirement for handling partially observed tasks is access to extensive memory, but we show memory is not enough; it is critical that the right information be stored in the right format. We develop a model, the Memory, RL, and Inference Network (MERLIN), in which memory formation is guided by a process of predictive modeling. MERLIN facilitates the solution of tasks in 3D virtual reality environments for which partial observability is severe and memories must be maintained over long durations. Our model demonstrates a single learning agent architecture that can solve canonical behavioural tasks in psychology and neurobiology without strong simplifying assumptions about the dimensionality of sensory input or the duration of experiences.

We propose MERLIN, an integrated AI agent architecture that acts in partially observed virtual reality environments and stores information in memory based on different principles from existing end-to-end AI systems: it learns to process high-dimensional sensory streams, compress and store them, and recall events with less dependence on task reward. We bring together ingredients from external memory systems, reinforcement learning, and state estimation (inference) models and combine them into a unified system using inspiration from three ideas originating in psychology and neuroscience: predictive sensory coding, the hippocampal representation theory of Gluck and Myers, and the temporal context model and successor representation.
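
Schematically, the loop the abstract describes is: compress each observation into a latent, store it in an external memory, and read back by content similarity. A toy sketch (entirely our own construction; the encoder is a stand-in for networks that MERLIN trains with a predictive-modeling objective):

<code python>
import numpy as np

rng = np.random.default_rng(0)

def encode(observation):              # hypothetical stand-in encoder
    return np.tanh(observation[:16])  # compress a 64-d input to a 16-d latent

memory = []                           # list of stored latent rows

def read(query, k=3):
    # Content-based read: cosine similarity against every stored latent,
    # then a softmax-weighted average of the k best matches.
    M = np.stack(memory)
    sims = M @ query / (np.linalg.norm(M, axis=1) * np.linalg.norm(query) + 1e-8)
    top = np.argsort(sims)[-k:]
    w = np.exp(sims[top]) / np.exp(sims[top]).sum()
    return w @ M[top]

for _ in range(100):                  # toy episode
    z = encode(rng.normal(size=64))
    retrieved = read(z) if memory else np.zeros_like(z)
    # ...a policy would act on the concatenation [z, retrieved]...
    memory.append(z)
</code>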

https://www.arxiv-vanity.com/papers/1803.08460/ Towards Universal Representation for Unseen Action Recognition

This paper proposes a pipeline using a large-scale training source to achieve a Universal Representation (UR) that can generalise to a more realistic Cross-Dataset UAR (CD-UAR) scenario. We first address UAR as a Generalised Multiple-Instance Learning (GMIL) problem and discover ‘building-blocks’ from the large-scale ActivityNet dataset using distribution kernels. Essential visual and semantic components are preserved in a shared space to achieve the UR that can efficiently generalise to new datasets. Predicted UR exemplars can be improved by a simple semantic adaptation, and then an unseen action can be directly recognised using UR during the test.

https://arxiv.org/pdf/1804.09401.pdf Generative Temporal Models with Spatial Memory for Partially Observed Environments

In this work we introduce a novel action-conditioned generative model of such challenging environments. The model features a non-parametric spatial memory system in which we store learned, disentangled representations of the environment. Low-dimensional spatial updates are computed using a state-space model that makes use of knowledge on the prior dynamics of the moving agent, and high-dimensional visual observations are modelled with a Variational Auto-Encoder.

To our knowledge this is the first published work that builds a generative model for agents walking in an environment that, thanks to the separation of the dynamics and visual information in the DND memory, can coherently generate for hundreds of time steps in a scalable way.
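
The separation the authors describe, low-dimensional agent states as keys and high-dimensional visual latents as values, is easy to picture as a nearest-neighbour store. A toy sketch (our construction, loosely in the spirit of a DND memory, not the paper's code):

<code python>
import numpy as np

class SpatialMemory:
    def __init__(self):
        self.keys, self.values = [], []

    def store(self, position, visual_latent):
        self.keys.append(np.asarray(position, dtype=float))
        self.values.append(np.asarray(visual_latent, dtype=float))

    def retrieve(self, position, k=5):
        # Inverse-distance weighting over the k nearest stored positions,
        # so nearby viewpoints contribute most to the reconstructed latent.
        d = np.linalg.norm(np.stack(self.keys) - position, axis=1)
        nearest = np.argsort(d)[:k]
        w = 1.0 / (d[nearest] + 1e-6)
        w /= w.sum()
        return w @ np.stack(self.values)[nearest]

mem = SpatialMemory()
mem.store([0.0, 0.0], np.zeros(8))
mem.store([1.0, 0.0], np.ones(8))
print(mem.retrieve([0.9, 0.0]))  # dominated by the latent stored at (1, 0)
</code>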

https://arxiv.org/pdf/1805.08195.pdf Depth-Limited Solving for Imperfect-Information Games
Noam Brown, Tuomas Sandholm, Brandon Amos (submitted 21 May 2018)

A fundamental challenge in imperfect-information games is that states do not have well-defined values. As a result, depth-limited search algorithms used in single-agent settings and perfect-information games do not apply. This paper introduces a principled way to conduct depth-limited solving in imperfect-information games by allowing the opponent to choose among a number of strategies for the remainder of the game at the depth limit. Each one of these strategies results in a different set of values for leaf nodes. This forces an agent to be robust to the different strategies an opponent may employ. We demonstrate the effectiveness of this approach by building a master-level heads-up no-limit Texas hold'em poker AI that defeats two prior top agents using only a 4-core CPU and 16 GB of memory. Developing such a powerful agent would have previously required a supercomputer.
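
A drastically simplified sketch of the leaf-evaluation idea (ours, not the paper's algorithm, which embeds the opponent's choice into the subgame being re-solved): if the opponent may pick any of several continuation strategies at a depth-limit leaf, a robust zero-sum agent evaluates that leaf pessimistically.

<code python>
def robust_leaf_value(values_under_each_opponent_strategy):
    # Each entry is our value at this leaf if the opponent commits to one
    # particular continuation strategy for the remainder of the game; in a
    # zero-sum game a rational opponent picks the one worst for us.
    return min(values_under_each_opponent_strategy)

# One depth-limit leaf, three candidate opponent continuation strategies.
print(robust_leaf_value([0.8, -0.2, 0.4]))  # -0.2
</code>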

https://arxiv.org/abs/1808.10442 Application of Self-Play Reinforcement Learning to a Four-Player Game of Imperfect Information

We introduce a new virtual environment for simulating a card game known as "Big 2". This is a four-player game of imperfect information with a relatively complicated action space (being allowed to play 1, 2, 3, 4 or 5 card combinations from an initial starting hand of 13 cards). As such it poses a challenge for many current reinforcement learning methods. We then use the recently proposed "Proximal Policy Optimization" algorithm to train a deep neural network to play the game, purely learning via self-play, and find that it is able to reach a level which outperforms amateur human players after only a relatively short amount of training time and without needing to search a tree of future game states.
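
For reference, the core of the Proximal Policy Optimization update the entry relies on is the clipped surrogate objective; a minimal NumPy sketch (ours, detached from any Big 2 specifics):

<code python>
import numpy as np

def ppo_clip_loss(new_probs, old_probs, advantages, eps=0.2):
    # Probability ratio between the current and data-collecting policies.
    ratio = new_probs / old_probs
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # PPO maximises the minimum of the two surrogates; return the negation
    # so the value can be minimised by a standard optimiser.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

new_p = np.array([0.5, 0.3])   # pi_theta(a|s) for the sampled actions
old_p = np.array([0.4, 0.35])  # behaviour policy at collection time
adv   = np.array([1.2, -0.7])  # advantage estimates
print(ppo_clip_loss(new_p, old_p, adv))
</code>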

https://arxiv.org/abs/1809.04040 Solving Imperfect-Information Games via Discounted Regret Minimization

https://arxiv.org/abs/1809.07893v1 Solving Large Extensive-Form Games with Strategy Constraints

In this work we introduce a generalized form of Counterfactual Regret Minimization that provably finds optimal strategies under any feasible set of convex constraints.
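
Both of these build on Counterfactual Regret Minimization, whose per-infoset update is simple regret matching; a minimal sketch (ours) of that core step:

<code python>
import numpy as np

def regret_matching(cumulative_regrets):
    """Turn the cumulative counterfactual regrets at one information set
    into a strategy: play actions in proportion to their positive regret."""
    positive = np.maximum(cumulative_regrets, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    # No positive regret yet: fall back to uniform play.
    return np.full(len(cumulative_regrets), 1.0 / len(cumulative_regrets))

print(regret_matching(np.array([2.0, -1.0, 1.0])))  # [2/3, 0, 1/3]
</code>

The discounted variants mainly change how these cumulative regrets are weighted across iterations, down-weighting early, inaccurate ones; the constraint handling in the second paper is beyond this sketch.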

https://arxiv.org/abs/1807.10299v1 Variational Option Discovery Algorithms

First: we highlight a tight connection between variational option discovery methods and variational autoencoders, and introduce Variational Autoencoding Learning of Options by Reinforcement (VALOR), a new method derived from the connection. In VALOR, the policy encodes contexts from a noise distribution into trajectories, and the decoder recovers the contexts from the complete trajectories. Second: we propose a curriculum learning approach where the number of contexts seen by the agent increases whenever the agent's performance is strong enough (as measured by the decoder) on the current set of contexts. We show that this simple trick stabilizes training for VALOR and prior variational option discovery methods, allowing a single agent to learn many more modes of behavior than it could with a fixed context distribution. Finally, we investigate other topics related to variational option discovery, including fundamental limitations of the general approach and the applicability of learned options to downstream tasks.
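
The curriculum trick reduces to a one-line rule; a sketch of our reading of it (the growth factor and threshold here are illustrative choices, not the paper's exact hyperparameters):

<code python>
def update_num_contexts(K, decoder_accuracy, threshold=0.9, K_max=64):
    # Grow the context set only once the decoder can reliably recover the
    # current contexts from complete trajectories.
    if decoder_accuracy >= threshold and K < K_max:
        return min(int(1.5 * K) + 1, K_max)
    return K

K = 8
for acc in [0.5, 0.95, 0.7, 0.95]:
    K = update_num_contexts(K, acc)
    print(K)  # 8, 13, 13, 20
</code>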