https://arxiv.org/abs/1612.06915 AIVAT: A New Variance Reduction Technique for Agent Evaluation in Imperfect Information Games

In this paper, we introduce AIVAT, a low variance, provably unbiased value assessment tool that uses an arbitrary heuristic estimate of state value, as well as the explicit strategy of a subset of the agents. Unlike existing techniques which reduce the variance from chance events, or only consider game ending actions, AIVAT reduces the variance both from choices by nature and by players with a known strategy. The resulting estimator in no-limit poker can reduce the number of hands needed to draw statistical conclusions by more than a factor of 10.
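
The estimator is built from control-variate style corrections: at each decision made by chance or by a player whose strategy is known, subtract the heuristic value of the realized successor and add back the expectation of that value under the known action distribution. Each correction term has zero mean, so the overall estimate stays unbiased while much of the luck cancels. A minimal sketch of the correction (the function, argument names, and data layout are assumptions for illustration, not the paper's code):

<code python>
import numpy as np

def aivat_style_estimate(payoff, decision_points, value_fn):
    """Control-variate correction in the spirit of AIVAT (a minimal sketch, not
    the full algorithm).  `decision_points` holds, for each chance or
    known-strategy decision, the action distribution, the possible successor
    states, and the index of the action actually taken; `value_fn` is any
    heuristic state-value estimate."""
    estimate = payoff
    for action_probs, successors, taken in decision_points:
        realized = value_fn(successors[taken])
        expected = sum(p * value_fn(s) for p, s in zip(action_probs, successors))
        estimate += expected - realized   # zero-mean term: cancels luck, keeps unbiasedness
    return estimate
</code>

The better the heuristic `value_fn` tracks the true values, the more variance these corrections remove.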

https://arxiv.org/abs/1612.09596 Counterfactual Prediction with Deep Instrumental Variables Networks

This paper provides a recipe for combining ML algorithms to solve for causal effects in the presence of instrumental variables – sources of treatment randomization that are conditionally independent from the response. We show that a flexible IV specification resolves into two prediction tasks that can be solved with deep neural nets: a first-stage network for treatment prediction and a second-stage network whose loss function involves integration over the conditional treatment distribution. This Deep IV framework imposes some specific structure on the stochastic gradient descent routine used for training, but it is general enough that we can take advantage of off-the-shelf ML capabilities and avoid extensive algorithm customization. We outline how to obtain out-of-sample causal validation in order to avoid overfitting.
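
Concretely, the IV specification resolves into two stages: a first-stage model of the conditional treatment distribution given the instrument, and a second-stage outcome model trained against predictions averaged over treatments sampled from that first stage. A toy sketch with linear models standing in for the networks (the data-generating process, coefficients, and sample counts below are invented purely for illustration):

<code python>
import numpy as np

rng = np.random.default_rng(0)

# Toy data: instrument z shifts treatment t; unobserved confounder e affects both t and y.
n = 2000
z = rng.normal(size=n)
e = rng.normal(size=n)
t = 1.5 * z + e + rng.normal(scale=0.1, size=n)
y = 2.0 * t - 3.0 * e + rng.normal(scale=0.1, size=n)   # true causal effect of t is 2.0

# Stage 1: model p(t | z) as a Gaussian with mean a*z, fit by least squares.
a = np.dot(z, t) / np.dot(z, z)
sigma = np.std(t - a * z)

# Stage 2: fit an outcome model g(t) = w*t by minimising
#   E[ (y - E_{t' ~ p(t|z)}[ g(t') ])^2 ],
# approximating the inner expectation with Monte Carlo draws from stage 1.
w, lr = 0.0, 0.05
for _ in range(500):
    t_samples = a * z[:, None] + sigma * rng.normal(size=(n, 32))  # 32 draws per example
    t_bar = t_samples.mean(axis=1)
    grad = -2.0 * np.mean((y - w * t_bar) * t_bar)
    w -= lr * grad

# Recovers roughly the true effect of 2.0; a naive regression of y on t is
# biased down to about 1.1 by the confounder.
print(f"estimated causal coefficient: {w:.2f}")
</code>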

https://arxiv.org/pdf/1701.01724v1.pdf DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker

Artificial intelligence has seen a number of breakthroughs in recent years, with games often serving as significant milestones. A common feature of games with these successes is that they involve information symmetry among the players, where all players have identical information. This property of perfect information, though, is far more common in games than in real-world problems. Poker is the quintessential game of imperfect information, and it has been a longstanding challenge problem in artificial intelligence. In this paper we introduce DeepStack, a new algorithm for imperfect information settings such as poker. It combines recursive reasoning to handle information asymmetry, decomposition to focus computation on the relevant decision, and a form of intuition about arbitrary poker situations that is automatically learned from self-play games using deep learning. In a study involving dozens of participants and 44,000 hands of poker, DeepStack becomes the first computer program to beat professional poker players in heads-up no-limit Texas hold’em. Furthermore, we show this approach dramatically reduces worst-case exploitability compared to the abstraction paradigm that has been favored for over a decade.

https://arxiv.org/pdf/1605.03661v2.pdf Learning Representations for Counterfactual Inference

https://arxiv.org/abs/1701.01724v2 DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker

https://www.cs.cmu.edu/~sandholm/dynamicThresholding.aaai17.pdf Dynamic Thresholding and Pruning for Regret Minimization

Regret minimization is widely used in determining strategies for imperfect-information games and in online learning. In large games, computing the regrets associated with a single iteration can be slow. For this reason, pruning – in which parts of the decision tree are not traversed in every iteration – has emerged as an essential method for speeding up iterations in large games. The ability to prune is a primary reason why the Counterfactual Regret Minimization (CFR) algorithm using regret matching has emerged as the most popular iterative algorithm for imperfect-information games, despite its relatively poor convergence bound. In this paper, we introduce dynamic thresholding, in which a threshold is set at every iteration such that any action in the decision tree with probability below the threshold is set to zero probability. This enables pruning for the first time in a wide range of algorithms. We prove that dynamic thresholding can be applied to Hedge while increasing its convergence bound by only a constant factor in terms of number of iterations. Experiments demonstrate a substantial improvement in performance for Hedge as well as the excessive gap technique.
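
A minimal sketch of the mechanism applied to Hedge (the concrete threshold value below is an assumption for illustration; the paper derives the schedule under which the convergence bound worsens by only a constant factor):

<code python>
import numpy as np

def hedge_strategy(cumulative_regrets, eta):
    """Hedge: play each action with probability proportional to exp(eta * regret)."""
    logits = eta * cumulative_regrets
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def dynamic_threshold(strategy, threshold):
    """Zero out any action whose probability is below the threshold and
    renormalise; the subtrees under pruned actions need not be traversed
    this iteration."""
    pruned = np.where(strategy < threshold, 0.0, strategy)
    total = pruned.sum()
    return pruned / total if total > 0 else strategy

regrets = np.array([4.0, 3.9, -2.0, 0.5])
sigma = hedge_strategy(regrets, eta=1.0)
print(sigma)                                     # tiny mass on the two weak actions
print(dynamic_threshold(sigma, threshold=0.05))  # those actions are pruned this iteration
</code>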

https://www.cs.cmu.edu/~sganzfri/Thesis.pdf Computing Strong Game-Theoretic Strategies and Exploiting Suboptimal Opponents in Large Games

https://arxiv.org/pdf/1011.0686.pdf A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

In this paper, we propose a new iterative algorithm that trains a stationary deterministic policy and can be seen as a no-regret algorithm in an online learning setting. We show that any such no-regret algorithm, combined with additional reduction assumptions, must find a policy with good performance under the distribution of observations it induces in such sequential settings. We demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.
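
The iterative algorithm (DAgger) amounts to the following loop: roll out the current policy, ask the expert to label the states the learner actually visits, aggregate everything into one growing dataset, and retrain. A minimal sketch, where `env_rollout`, `expert_action`, and `train_classifier` are placeholder callables rather than anything from the paper:

<code python>
import numpy as np

def dagger(env_rollout, expert_action, train_classifier, n_iters=10):
    """Dataset-aggregation loop in the spirit of DAgger (a sketch).  Rolling out
    the learner's own policy and labelling the visited states with the expert's
    actions is what gives the no-regret reduction its guarantee."""
    states, actions = [], []
    policy = None                       # the first rollout can simply follow the expert
    for _ in range(n_iters):
        visited = env_rollout(policy if policy is not None else expert_action)
        states.extend(visited)
        actions.extend(expert_action(s) for s in visited)
        policy = train_classifier(np.array(states), np.array(actions))
    return policy
</code>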

https://github.com/clinicalml/cfrnet

cfrnet implements counterfactual regression (CFR) by learning balanced representations, as developed by Johansson, Shalit & Sontag (2016) and Shalit, Johansson & Sontag (2016). It is written in Python using TensorFlow and NumPy.

https://arxiv.org/abs/1702.06676v2 Counterfactual Control for Free from Generative Models

We introduce a method by which a generative model learning the joint distribution between actions and future states can be used to automatically infer a control scheme for any desired reward function, which may be altered on the fly without retraining the model. In this method, the problem of action selection is reduced to one of gradient descent on the latent space of the generative model, with the model itself providing the means of evaluating outcomes and finding the gradient, much like how the reward network in Deep Q-Networks (DQN) provides gradient information for the action generator. Unlike DQN or Actor-Critic, which are conditional models for a specific reward, using a generative model of the full joint distribution permits the reward to be changed on the fly. In addition, the generated futures can be inspected to gain insight into what the network 'thinks' will happen, and into what went wrong when outcomes deviate from prediction. https://github.com/arayabrain/GenerativeControl
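
A toy sketch of the action-selection step: hold a trained decoder fixed, define any reward on its predicted futures, and run gradient ascent in latent space. A linear decoder stands in for the generative model so the gradient is analytic; the dimensions, target, and step size are invented for illustration:

<code python>
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained generative model: a fixed linear map from latent code z
# to a predicted future state (in the paper this is a learned deep decoder).
W = rng.normal(size=(4, 2))            # future-state dimension 4, latent dimension 2

def decode(z):
    return W @ z

def reward(state, target):
    """Reward defined on predicted futures; it can be swapped without retraining
    because the generative model itself is reward-agnostic."""
    return -np.sum((state - target) ** 2)

def plan_latent(target, steps=200, lr=0.05):
    """Gradient ascent on the latent code whose decoded future maximises the
    current reward."""
    z = rng.normal(size=2)
    for _ in range(steps):
        grad = -2.0 * W.T @ (decode(z) - target)   # d reward / d z for the linear decoder
        z += lr * grad
    return z, decode(z)

target = np.array([1.0, 0.0, -1.0, 0.5])
z_star, future = plan_latent(target)
print(future, reward(future, target))   # decoded future approaches the target
</code>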

https://arxiv.org/abs/1704.03073 Data-efficient Deep Reinforcement Learning for Dexterous Manipulation

We introduce two extensions to the Deep Deterministic Policy Gradient algorithm (DDPG), a model-free Q-learning based method, which make it significantly more data-efficient and scalable. Our results show that by making extensive use of off-policy data and replay, it is possible to find control policies that robustly grasp objects and stack them. Further, our results hint that it may soon be feasible to train successful stacking policies by collecting interactions on real robots.
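
The data efficiency rests on reusing off-policy transitions through experience replay and bootstrapped Q-learning targets. A generic sketch of that machinery (standard DDPG-style components, not the paper's two extensions or its code):

<code python>
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Minimal experience replay: transitions gathered by any behaviour policy
    can be reused many times for off-policy updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(np.array, zip(*batch)))

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step targets r + gamma * Q'(s', mu'(s')) for a DDPG-style critic;
    next_q_values come from the target critic evaluated at the target actor's
    action."""
    return rewards + gamma * (1.0 - dones) * next_q_values
</code>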

https://arxiv.org/abs/1704.00773v1 A comparative study of counterfactual estimators

We provide a comparative study of several widely used off-policy estimators (Empirical Average, Basic Importance Sampling and Normalized Importance Sampling), detailing the different regimes where they are individually suboptimal. We then exhibit properties optimal estimators should possess. In the case where examples have been gathered using multiple policies, we show that fused estimators dominate basic ones but can still be improved.
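
For reference, the three basic estimators named above on logged bandit data (the toy numbers at the end only sanity-check the implementations; the true value of the target policy there is 0.92):

<code python>
import numpy as np

def empirical_average(rewards):
    """Empirical average: ignores the mismatch between the logging policy and
    the target policy, so it is biased whenever the two differ."""
    return np.mean(rewards)

def basic_importance_sampling(rewards, target_probs, logging_probs):
    """Basic IS: reweight each logged reward by pi_target(a|x) / pi_logging(a|x).
    Unbiased, but the variance grows with the weights."""
    weights = target_probs / logging_probs
    return np.mean(weights * rewards)

def normalized_importance_sampling(rewards, target_probs, logging_probs):
    """Normalized (self-normalized) IS: divide by the sum of the weights instead
    of n.  Slightly biased, but usually much lower variance."""
    weights = target_probs / logging_probs
    return np.sum(weights * rewards) / np.sum(weights)

# Toy check: rewards logged under a uniform policy, evaluated for a policy that
# prefers the first action.
rng = np.random.default_rng(0)
actions = rng.integers(0, 2, size=10_000)
rewards = np.where(actions == 0, 1.0, 0.2) + rng.normal(scale=0.05, size=10_000)
logging_probs = np.full(10_000, 0.5)
target_probs = np.where(actions == 0, 0.9, 0.1)
print(basic_importance_sampling(rewards, target_probs, logging_probs))       # ~ 0.92
print(normalized_importance_sampling(rewards, target_probs, logging_probs))  # ~ 0.92
</code>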

https://arxiv.org/abs/1705.02955v1 Safe and Nested Subgame Solving for Imperfect-Information Games

Unlike perfect-information games, imperfect-information games cannot be solved by decomposing the game into subgames that are solved independently. Instead, all decisions must consider the strategy of the game as a whole, and more computationally intensive algorithms are used. While it is not possible to solve an imperfect-information game exactly through decomposition, it is possible to approximate solutions, or improve existing strategies, by solving disjoint subgames. This process is referred to as subgame solving. We introduce subgame solving techniques that outperform prior methods both in theory and practice. We also show how to adapt them, and past subgame solving techniques, to respond to opponent actions that are outside the original action abstraction; this significantly outperforms the prior state-of-the-art approach, action translation. Finally, we show that subgame solving can be repeated as the game progresses down the tree, leading to lower exploitability. Subgame solving is a key component of Libratus, the first AI to defeat top humans in heads-up no-limit Texas hold'em poker.

We introduced a subgame solving technique for imperfect-information games that has stronger theoretical guarantees and better practical performance than prior subgame-solving methods. We presented results on exploitability of both safe and unsafe subgame solving techniques. We also introduced a method for nested subgame solving in response to the opponent’s off-tree actions, and demonstrated that this leads to dramatically better performance than the usual approach of action translation. This is, to our knowledge, the first time that exploitability of subgame solving techniques has been measured in large games. Finally, we demonstrated the effectiveness of these techniques in practice against top human professionals in the game of heads-up no-limit Texas hold’em poker, the main benchmark challenge for AI in imperfect-information games. In the 2017 Brains vs. AI competition, our AI Libratus became the first AI to reach the milestone of defeating top humans in heads-up no-limit Texas hold’em.
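
For contrast with nested subgame solving, this is roughly what the prior action-translation approach does when the opponent bets an amount outside the action abstraction (nearest-size rounding is a simplification; practical translation schemes use randomised or pseudo-harmonic mappings):

<code python>
def translate_action(bet_size, abstraction_bets):
    """Action translation (the prior approach): round an off-tree opponent bet
    to the nearest bet size in the abstraction and respond as if that bet had
    been made.  The rounding error is exactly what nested subgame solving
    avoids by solving a new subgame rooted at the actual bet."""
    return min(abstraction_bets, key=lambda b: abs(b - bet_size))

print(translate_action(bet_size=170, abstraction_bets=[100, 200, 400, 800]))  # -> 200
</code>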

https://arxiv.org/pdf/1705.08926v1.pdf Counterfactual Multi-Agent Policy Gradients

There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for cooperative multi-agent systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents’ policies.

https://arxiv.org/abs/1605.03661v2 Learning Representations for Counterfactual Inference

We propose a new algorithmic framework for counterfactual inference which brings together ideas from domain adaptation and representation learning. In addition to a theoretical justification, we perform an empirical comparison with previous approaches to causal inference from observational data. Our deep learning algorithm significantly outperforms the previous state-of-the-art.
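
The framework learns a shared representation under which treated and control populations look similar, then regresses factual outcomes from that representation with per-treatment heads. A minimal sketch of such an objective, with a plain mean-difference penalty standing in for the integral probability metrics (MMD, Wasserstein) used in the papers; all names and numbers are illustrative:

<code python>
import numpy as np

def balance_penalty(phi, treated):
    """Discrepancy between treated and control groups in representation space
    (a simple mean difference here)."""
    return np.sum((phi[treated].mean(axis=0) - phi[~treated].mean(axis=0)) ** 2)

def cfr_style_loss(phi, y, treated, w_treated, w_control, alpha=1.0):
    """Factual regression error from per-arm linear heads on the shared
    representation, plus alpha times the balance penalty."""
    y_hat = np.where(treated, phi @ w_treated, phi @ w_control)
    factual = np.mean((y - y_hat) ** 2)
    return factual + alpha * balance_penalty(phi, treated)

# Tiny usage example with random data.
rng = np.random.default_rng(0)
phi = rng.normal(size=(6, 3))
treated = np.array([True, True, True, False, False, False])
y = rng.normal(size=6)
print(cfr_style_loss(phi, y, treated, w_treated=np.zeros(3), w_control=np.zeros(3)))
</code>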

https://arxiv.org/abs/1608.06580 Communication complexity of approximate Nash equilibria

https://arxiv.org/abs/1708.03658 iTrace: An Implicit Trust Inference Method for Trust-aware Collaborative Filtering

http://www.cs.cmu.edu/~noamb/papers/17-NIPS-Safe.pdf Safe and Nested Subgame Solving for Imperfect-Information Games

http://web.cs.ucla.edu/~kaoru/theoretical-impediments.pdf Theoretical Impediments to Machine Learning

Current machine learning systems operate, almost exclusively, in a purely statistical mode, which puts severe theoretical limits on their performance. We consider the feasibility of leveraging counterfactual reasoning in machine learning tasks and identify areas where such reasoning could lead to major breakthroughs in machine learning applications.

http://www.cs.ox.ac.uk/publications/publication11394-abstract.html Counterfactual Multi-Agent Policy Gradients

To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
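
The counterfactual baseline for one agent can be read directly off the centralised critic's outputs, as in this small sketch (toy numbers, not the paper's code):

<code python>
import numpy as np

def coma_advantage(q_values, agent_policy, taken_action):
    """Counterfactual baseline for one agent, as described above: compare the
    joint-action value of the action actually taken with the expectation over
    that agent's alternatives while the other agents' actions stay fixed.
    q_values[a] is the centralised critic's estimate Q(s, (a, a_-i)) for each
    of the agent's actions, available from a single forward pass."""
    baseline = np.dot(agent_policy, q_values)    # E_{a'~pi_i}[ Q(s, (a', a_-i)) ]
    return q_values[taken_action] - baseline

q = np.array([1.0, 2.5, 0.5])     # critic values for the agent's three actions
pi = np.array([0.2, 0.5, 0.3])    # the agent's current policy
print(coma_advantage(q, pi, taken_action=1))   # 2.5 - 1.6 = 0.9
</code>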

https://arxiv.org/pdf/1803.10760.pdf Unsupervised Predictive Memory in a Goal-Directed Agent