Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

SoK: Applying Machine Learning in Security - A Survey

We examine the generalized system designs, underlying assumptions, measurements, and use cases in active research. Our examinations lead to 1) a taxonomy on ML paradigms and security domains for future exploration and exploitation, and 2) an agenda detailing open and upcoming challenges. Based on our survey, we also suggest a point of view that treats security as a game theory problem instead of a batch-trained ML problem.

Deep Learning for Predicting Human Strategic Behavior. Jason Hartford, James R. Wright, Kevin Leyton-Brown.

Semantics, Representations and Grammars of Deep Learning

Connecting Generative Adversarial Networks and Actor-Critic Methods

Both generative adversarial networks (GANs) in unsupervised learning and actor-critic methods in reinforcement learning (RL) have gained a reputation for being difficult to optimize. Practitioners in both fields have amassed a large number of strategies to mitigate these instabilities and improve training. Here we show that GANs can be viewed as actor-critic methods in an environment where the actor cannot affect the reward. We review the strategies for stabilizing training for each class of models, both those that generalize between the two and those that are particular to that model. We also review a number of extensions to GANs and RL algorithms with even more complicated information flow. We hope that by highlighting this formal connection we will encourage both GAN and RL communities to develop general, scalable, and stable algorithms for multilevel optimization with deep networks, and to draw inspiration across communities.

Adversarial synapses: Hebbian/anti-Hebbian learning optimizes min-max objectives

Here, we derive neural networks from principled min-max objectives: by minimizing with respect to neural activity and feedforward synaptic weights, and maximizing with respect to lateral synaptic weights. The min-max nature of the objective is reflected in the antagonism between Hebbian feedforward and anti-Hebbian lateral learning in derived networks. We prove that the only stable fixed points of the network dynamics correspond to the principal subspace projection (PSP) or the principal subspace whitening (PSW). Finally, from the min-max objectives we derive novel formulations of dimensionality reduction using fractional matrix exponents.

The Morphospace of Consciousness

Then, building on insights from cognitive robotics, we ask what function consciousness serves, and interpret it as an evolutionary game-theoretic strategy. We distinguish four forms of consciousness, based on embodiment: biological, synthetic, group (resulting from group interactions) and simulated consciousness (embodied by virtual agents within a simulated reality). Such a taxonomy is useful for studying comparative signatures of consciousness across domains, in order to highlight design principles necessary to engineer conscious machines.

Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step

We find evidence that divergence minimization may not be an accurate characterization of GAN training.

The Mechanics of n-Player Differentiable Games

The key result is to decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games. Basic experiments show SGA is competitive with recently proposed algorithms for finding local Nash equilibria in GANs, while also being applicable to, and having guarantees in, much more general games.

Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

A Multi-perspective Approach To Anomaly Detection For Self-aware Embodied Agents

What game are we playing? End-to-end learning in normal and extensive form games

A mathematical theory of resources

We prove some general theorems about how resource theories can be constructed from theories of processes wherein there is a special class of processes that are implementable at no cost and which define the means by which the costly states and processes can be interconverted one to another.

AlphaX: eXploring Neural Architectures with Deep Neural Networks and Monte Carlo Tree Search

AlphaX also generates the training data for Meta-DNN, so the learning of Meta-DNN is end-to-end. In searching for NASNet-style architectures, AlphaX found several promising architectures with up to 1% higher accuracy than NASNet using only 17 GPUs for 5 days, demonstrating up to 23.5x speedup over the original search for NASNet, which used 500 GPUs for 4 days.

A0C: Alpha Zero in Continuous Action Space
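The 23.5x figure quoted in the AlphaX abstract is consistent with a simple comparison of GPU-days, assuming (this is our reading, not something the abstract states) that the speedup is measured as total GPU-days of the original NASNet search divided by total GPU-days of the AlphaX search:

```latex
\[
\text{speedup} \approx \frac{500 \times 4}{17 \times 5} = \frac{2000}{85} \approx 23.5
\]
```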

This paper presents the necessary theoretical extensions of Alpha Zero to deal with continuous action space. We also provide some preliminary experiments on the Pendulum swing-up task, empirically showing the feasibility of our approach. Thereby, this work provides a first step towards the application of iterated search and learning in domains with a continuous action space.

GANGs: Generative Adversarial Network Games

The size of these games precludes exact solution methods, so we define resource-bounded best responses (RBBRs), and a resource-bounded Nash Equilibrium (RB-NE) as a pair of mixed strategies such that neither G nor C can find a better RBBR. The RB-NE solution concept is richer than the notion of 'local Nash equilibria' in that it captures not only failures of escaping local optima of gradient descent, but applies to any approximate best response computations, including methods with random restarts. To validate our approach, we solve GANGs with the Parallel Nash Memory algorithm, which provably monotonically converges to an RB-NE.

Two-Player Games for Efficient Non-Convex Constrained Optimization

The Lagrangian can be interpreted as a two-player game played between a player who seeks to optimize over the model parameters, and a player who wishes to maximize over the Lagrange multipliers. We propose a non-zero-sum variant of the Lagrangian formulation that can cope with non-differentiable (even discontinuous) constraints, which we call the “proxy-Lagrangian”. The first player minimizes external regret in terms of easy-to-optimize “proxy constraints”, while the second player enforces the original constraints by minimizing swap regret. For this new formulation, as for the Lagrangian in the non-convex setting, the result is a stochastic classifier. For both the proxy-Lagrangian and Lagrangian formulations, however, we prove that this classifier, instead of having unbounded size, can be taken to be a distribution over no more than m+1 models (where m is the number of constraints). This is a significant improvement in practical terms.

AlphaSeq: Sequence Discovery with Deep Reinforcement Learning

Stable Opponent Shaping in Differentiable Games
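The two-player reading of the Lagrangian in the abstract of Two-Player Games for Efficient Non-Convex Constrained Optimization can be sketched on a toy problem. This is the ordinary Lagrangian game only, not the proxy-Lagrangian, and the problem is convex and deterministic, so no stochastic classifier arises. The objective `f`, the constraint `g`, the step size, and the function name `lagrangian_game` are all assumptions chosen for illustration: minimize f(x) = (x - 2)^2 subject to g(x) = x - 1 <= 0, with one player running gradient descent on x and the other running projected gradient ascent on the multiplier.

```python
def lagrangian_game(steps=2000, lr=0.05):
    """Simultaneous descent-ascent on L(x, lam) = (x - 2)**2 + lam * (x - 1)."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        grad_x = 2.0 * (x - 2.0) + lam   # dL/dx, used by the minimizing player
        grad_lam = x - 1.0               # dL/dlam, used by the maximizing player
        x_next = x - lr * grad_x
        lam_next = max(0.0, lam + lr * grad_lam)  # project onto lam >= 0
        x, lam = x_next, lam_next
    return x, lam

x_star, lam_star = lagrangian_game()
print(x_star, lam_star)
```

For this toy problem the constrained optimum is x = 1 with multiplier 2, since the unconstrained minimizer x = 2 violates the constraint and stationarity requires 2(x - 2) + lam = 0 at x = 1. The iterates settle near that point.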