https://arxiv.org/pdf/1703.04529.pdf Task-based End-to-end Model Learning

As machine learning techniques have become more ubiquitous, it has become common to see machine learning prediction algorithms operating within some larger process. However, the criteria by which we train machine learning algorithms often differ from the ultimate criteria on which we evaluate them. This paper proposes an end-to-end approach for learning probabilistic machine learning models within the context of stochastic programming, in a manner that directly captures the ultimate task-based objective for which they will be used. We then present two experimental evaluations of the proposed approach, one as applied to a generic inventory stock problem and the second to a real-world electrical grid scheduling task. In both cases, we show that the proposed approach can outperform both a traditional modeling approach and a purely black-box policy optimization approach.

https://arxiv.org/abs/1611.03824 Learning to Learn for Global Optimization of Black Box Functions

We present a learning to learn approach for training recurrent neural networks to perform black-box global optimization. In the meta-learning phase we use a large set of smooth target functions to learn a recurrent neural network (RNN) optimizer, which is either a long-short term memory network or a differentiable neural computer. After learning, the RNN can be applied to learn policies in reinforcement learning, as well as other black-box learning tasks, including continuous correlated bandits and experimental design. We compare this approach to Bayesian optimization, with emphasis on the issues of computation speed, horizon length, and exploration-exploitation trade-offs.

https://arxiv.org/abs/1606.03152 Policy Networks with Two-Stage Training for Dialogue Systems

First, we show that, on summary state and action spaces, deep Reinforcement Learning (RL) outperforms Gaussian Processes methods.

We show that a deep RL method based on an actor-critic architecture can exploit a small amount of data very efficiently. Indeed, with only a few hundred dialogues collected with a handcrafted policy, the actor-critic deep learner is considerably bootstrapped from a combination of supervised and batch RL.