http://openreview.net/pdf?id=r1VGvBcxl REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU

We introduce a hybrid CPU/GPU version of the Asynchronous Advantage Actor-Critic (A3C) algorithm, currently the state-of-the-art method in reinforcement learning for various gaming tasks.
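
As a rough illustration of hybrid CPU/GPU execution, here is a hedged sketch in which CPU-side agent threads queue their observations and a single predictor thread evaluates the policy on the whole batch, as a GPU would. The queue-based batching shown is an assumption for illustration, not necessarily the paper's architecture.

```python
import queue
import threading
import numpy as np

# Hedged illustration (an assumption about one way to batch CPU-side requests for a
# GPU policy): agent threads push observations onto a shared queue; a predictor
# thread drains the queue and evaluates the policy on the whole batch at once.

prediction_queue = queue.Queue()

def agent(n_requests=100):
    for _ in range(n_requests):
        obs = np.random.randn(4)                 # stand-in for an environment state
        reply = queue.Queue(maxsize=1)
        prediction_queue.put((obs, reply))
        action = reply.get()                     # wait for the batched prediction
        # ... step the environment with `action` and collect training data ...

def predictor(policy, batch_size=8):
    while True:
        batch = [prediction_queue.get()]
        while len(batch) < batch_size and not prediction_queue.empty():
            batch.append(prediction_queue.get())
        obs = np.stack([b[0] for b in batch])
        actions = policy(obs)                    # one batched (GPU-style) forward pass
        for (_, reply), a in zip(batch, actions):
            reply.put(a)

if __name__ == "__main__":
    policy = lambda obs: np.argmax(obs, axis=1)  # dummy policy for the sketch
    threading.Thread(target=predictor, args=(policy,), daemon=True).start()
    workers = [threading.Thread(target=agent) for _ in range(4)]
    for w in workers: w.start()
    for w in workers: w.join()
```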

http://openreview.net/pdf?id=Sk8J83oee GENERATIVE ADVERSARIAL PARALLELIZATION

We propose Generative Adversarial Parallelization (GAP), a framework in which many GANs or their variants are trained simultaneously, exchanging their discriminators. This eliminates the tight coupling between a generator and discriminator, leading to improved convergence and improved coverage of modes.
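
As an illustration of the decoupling idea, here is a minimal Python sketch assuming hypothetical generators, discriminators, and a train_gan_step routine (none of which come from the paper): every swap_every steps the discriminators are randomly reassigned among the generators.

```python
import random

# Hypothetical sketch of discriminator exchange: K generator/discriminator pairs
# train as usual, but every `swap_every` steps the discriminators are randomly
# reassigned among the generators, breaking the tight coupling of each pair.
# `train_gan_step(generator, discriminator)` is assumed to perform one ordinary
# GAN update and is not defined here.

def gap_training(generators, discriminators, train_gan_step,
                 num_steps=10_000, swap_every=100):
    assert len(generators) == len(discriminators)
    pairing = list(range(len(generators)))    # pairing[i] = index of the D used by G_i
    for step in range(num_steps):
        for i, g in enumerate(generators):
            train_gan_step(g, discriminators[pairing[i]])
        if (step + 1) % swap_every == 0:
            random.shuffle(pairing)           # exchange discriminators between GANs
    return generators, discriminators
```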

https://arxiv.org/abs/1412.6651 Deep learning with Elastic Averaging SGD

We study the problem of stochastic optimization for deep learning in a parallel computing environment under communication constraints. We propose a new algorithm in this setting in which the communication and coordination of work among concurrent processes (local workers) is based on an elastic force that links the parameters they compute with a center variable stored by a parameter server (master). The algorithm enables the local workers to perform more exploration, i.e., it allows the local variables to fluctuate further from the center variable by reducing the amount of communication between the local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide a stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more commonly used parallelization method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose a momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. The asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and, furthermore, is very communication efficient.
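
For concreteness, here is a minimal synchronous sketch of the elastic-averaging idea on a toy objective. The grad function, the communication period tau, and all hyper-parameter values are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Minimal synchronous EASGD-style sketch on a toy problem (illustrative only).
# Each local worker i keeps parameters x[i]; every tau steps the elastic term
# rho*(x[i] - x_center) pulls the workers toward the center variable and the
# center variable toward the workers.

def easgd(grad, x_init, num_workers=4, steps=1000, eta=0.05, rho=0.1, tau=10):
    rng = np.random.default_rng(0)
    x_center = np.array(x_init, dtype=float)
    x = [x_center + 0.1 * rng.standard_normal(x_center.shape)
         for _ in range(num_workers)]
    for t in range(steps):
        if t % tau == 0:                                      # communicate every tau steps
            diff = sum(xi - x_center for xi in x)
            for i in range(num_workers):
                x[i] = x[i] - eta * rho * (x[i] - x_center)   # elastic pull on workers
            x_center = x_center + eta * rho * diff            # elastic pull on center
        for i in range(num_workers):
            x[i] = x[i] - eta * grad(x[i])                    # local (stochastic) gradient step
    return x_center

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
print(easgd(lambda x: x, x_init=[5.0, -3.0]))
```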

https://arxiv.org/abs/1611.09726v1 Gossip training for deep learning

We address the issue of speeding up the training of convolutional networks. We study a distributed method adapted to stochastic gradient descent (SGD). The parallel optimization setup uses several threads, each applying individual gradient descent steps on a local variable. We propose a new way to share information between the different threads, inspired by gossip algorithms and exhibiting good consensus convergence properties. Our method, called GoSGD, has the advantage of being fully asynchronous and decentralized. We compare our method to the recent EASGD of \cite{elastic} on CIFAR-10 and show encouraging results.

Our algorithm has several advantages compared to other methods. First, it is fully asynchronous and decentralized, avoiding any kind of idling, and the exchanges are pairwise and benefit from the fast PCI communication channel. Second, there are interesting theoretical aspects to discuss: it is possible to derive a consensus convergence rate for many gossip algorithms. It would be useful to extend this analysis to GoSGD in order to measure the sensitivity of gossip averaging to the additional gradient updates. This would provide insights for optimizing the exchange frequency and keeping it as low as possible without degrading the consensus between threads too much.
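
A simplified, illustrative sketch of pairwise gossip averaging combined with local SGD steps follows; it shows the general gossip mechanism discussed above, not the exact GoSGD update rule, and all names and hyper-parameters are assumptions.

```python
import numpy as np

# Illustrative gossip-SGD sketch (simplified; not the exact GoSGD update).
# Each worker takes local gradient steps and, with probability p after each step,
# picks a random peer; the two then average their parameters pairwise.

def gossip_sgd(grad, x_init, num_workers=4, steps=1000, eta=0.05, p=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = [np.array(x_init, dtype=float) + 0.1 * rng.standard_normal(len(x_init))
         for _ in range(num_workers)]
    for _ in range(steps):
        for i in range(num_workers):
            x[i] = x[i] - eta * grad(x[i])           # local SGD step
            if rng.random() < p:                     # gossip with a random peer
                j = int(rng.integers(num_workers))
                if j != i:
                    avg = 0.5 * (x[i] + x[j])        # pairwise averaging
                    x[i], x[j] = avg.copy(), avg.copy()
    return sum(x) / num_workers

# Example: minimize f(x) = 0.5 * ||x||^2 (gradient is x).
print(gossip_sgd(lambda x: x, x_init=[4.0, -2.0]))
```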

https://research.googleblog.com/2017/04/federated-learning-collaborative.html Federated Learning: Collaborative Machine Learning without Centralized Training Data

https://arxiv.org/abs/1602.05629 Communication-Efficient Learning of Deep Networks from Decentralized Data

This rich data is often privacy sensitive, large in quantity, or both, which may preclude logging it to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices and learns a shared model by aggregating locally computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate that the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x compared to synchronized stochastic gradient descent.
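
A minimal sketch of iterative model averaging in this spirit is shown below; the client/server interfaces and the local_update routine are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Hedged sketch of federated averaging: in each round the server samples a fraction
# of clients, each trains the current model on its own local data, and the server
# replaces the global model with the average of the returned models, weighted by
# local dataset size. `local_update(weights, data) -> new_weights` is assumed to
# run a few epochs of SGD on the client's data and is not defined here.

def federated_averaging(global_w, clients, local_update, rounds=10, frac=0.5, seed=0):
    """clients: list of (x_data, y_data) tuples, one per device."""
    rng = np.random.default_rng(seed)
    for _ in range(rounds):
        m = max(1, int(frac * len(clients)))
        chosen = rng.choice(len(clients), size=m, replace=False)
        updates, sizes = [], []
        for k in chosen:
            data = clients[k]
            updates.append(local_update(np.copy(global_w), data))  # local training
            sizes.append(len(data[0]))
        sizes = np.array(sizes, dtype=float)
        # weighted average of the locally trained models
        global_w = sum(w * (n / sizes.sum()) for w, n in zip(updates, sizes))
    return global_w
```

The communication saving comes from each client doing many local optimization steps per round, so far fewer rounds (i.e., fewer model exchanges) are needed than with synchronized SGD.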

https://arxiv.org/abs/1705.0486 Efficient Parallel Methods for Deep Reinforcement Learning

We propose a novel framework for efficient parallelization of deep reinforcement learning algorithms, enabling these algorithms to learn from multiple actors on a single machine. The framework is algorithm agnostic and can be applied to on-policy, off-policy, value-based, and policy-gradient-based algorithms. Given its inherent parallelism, the framework can be efficiently implemented on a GPU, allowing the use of powerful models while significantly reducing training time. We demonstrate the effectiveness of our framework by implementing an advantage actor-critic algorithm on a GPU, using on-policy experiences and employing synchronous updates. Our algorithm achieves state-of-the-art performance on the Atari domain after only a few hours of training. Our framework thus opens the door to much faster experimentation on demanding problem domains. https://github.com/alfredvc/paac
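
A schematic of the synchronous, batched actor loop is sketched below; the envs, policy, and update interfaces are hypothetical, and episode resets and the actual actor-critic loss are omitted for brevity.

```python
import numpy as np

# Schematic of a synchronous, batched actor loop (the parallelism pattern, not the
# PAAC code). All n_envs environments are stepped in lockstep so that action
# selection and the learning update operate on batched tensors, which is what makes
# a GPU implementation efficient. Episode resets are omitted for brevity.

def synchronous_actor_loop(envs, policy, update, n_steps=5, iterations=1000):
    obs = np.stack([env.reset() for env in envs])          # (n_envs, obs_dim)
    for _ in range(iterations):
        batch = []
        for _ in range(n_steps):
            actions, values = policy(obs)                   # one batched forward pass
            results = [env.step(a) for env, a in zip(envs, actions)]
            next_obs = np.stack([r[0] for r in results])
            rewards = np.array([r[1] for r in results])
            dones = np.array([r[2] for r in results])
            batch.append((obs, actions, rewards, values, dones))
            obs = next_obs
        update(batch)                                       # single synchronous update
    return policy
```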

https://arxiv.org/abs/1808.07217v1 Don't Use Large Mini-Batches, Use Local SGD

https://arxiv.org/abs/1810.01021v1 Large batch size training of neural networks with adversarial training and second-order information

Our method allows one to increase the batch size and learning rate automatically, based on Hessian information. This significantly reduces the number of parameter updates and achieves superior generalization performance, without the need to tune any additional hyper-parameters. Finally, we show that a block Hessian can be used to approximate the trend of the full Hessian, reducing the overhead of using second-order information. These improvements are useful for reducing NN training time in practice.
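
A hedged sketch of how curvature information could drive such a schedule: the dominant Hessian eigenvalue is estimated by power iteration on Hessian-vector products (approximated here with finite differences of the gradient), and a hypothetical rule scales the batch size and learning rate when the curvature flattens. The adaptation rule is an illustrative assumption, not the paper's algorithm.

```python
import numpy as np

# Illustrative curvature probe: power iteration on Hessian-vector products, where
# H v is approximated by a central finite difference of the gradient.

def top_hessian_eigenvalue(grad, w, iters=20, eps=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)  # approx. Hessian-vector product
        lam = float(v @ hv)
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam

def adapt_batch_and_lr(lam_prev, lam_now, batch_size, lr, factor=2, max_batch=8192):
    # Hypothetical rule: when the sharpest curvature direction flattens out,
    # scale the batch size and learning rate up together.
    if lam_now < 0.5 * lam_prev and batch_size < max_batch:
        return batch_size * factor, lr * factor
    return batch_size, lr
```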