Hardware Acceleration

http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2016/05/reagen_isca16.pdf Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators

http://pages.saclay.inria.fr/olivier.temam/files/eval/supercomputer.pdf DaDianNao: A Machine-Learning Supercomputer

https://arxiv.org/pdf/1603.07400.pdf A Reconfigurable Low Power High Throughput Architecture for Deep Network Training

http://www.ece.ubc.ca/~aamodt/papers/Cnvlutin.ISCA2016.pdf Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing

http://openreview.net/pdf?id=By14kuqxx Bit-Pragmatic Deep Neural Network Computing

PRA improves performance by 3.1x over the DaDianNao (DaDN) accelerator Chen et al. (2014) and by 3.5x when DaDN uses an 8-bit quantized representation Warden (2016). DaDN was reported to be 300x faster than commodity graphics processors.

To the best of our knowledge, Pragmatic is the first DNN accelerator that exploits not only the per-layer precision requirements of CNNs but also the essential bit information content of the activation values. While this work targeted high-performance implementations, Pragmatic's core approach should be applicable to other hardware accelerators. We have investigated Pragmatic only for inference and only with image classification convolutional neural networks. Applying the same concept to other network types, and to layers other than convolutional ones, is desirable but left for future work. It would also be interesting to study how the Pragmatic concepts can be applied to more general-purpose accelerators or even graphics processors.
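The "essential bit content" idea can be illustrated in software, even though Pragmatic itself is a hardware design: a bit-serial engine must process every bit of every activation, while Pragmatic only processes the bits that are 1. A rough sketch of the potential savings (this is our own illustration, not the paper's model; post-ReLU activations are assumed to be small non-negative 8-bit values):

```python
# Illustrative only: estimate what fraction of activation bits are 'essential'
# (i.e. equal to 1). A naive bit-serial engine does work proportional to all
# bits; an engine like Pragmatic does work proportional to the 1-bits only.
import numpy as np

def essential_bit_ratio(activations, bits=8):
    """Fraction of bits set to 1 across a uint8 activation tensor."""
    a = activations.astype(np.uint8)
    ones = sum(int(((a >> b) & 1).sum()) for b in range(bits))
    return ones / (a.size * bits)

rng = np.random.default_rng(0)
# Post-ReLU activations are sparse and small-valued, so most bits are zero.
acts = np.maximum(rng.normal(0.0, 16.0, size=10000), 0.0).astype(np.uint8)
ratio = essential_bit_ratio(acts)
print(f"essential bits: {ratio:.1%} of all activation bits")
```

For a typical sparse, small-valued activation distribution the ratio is well below one half, which is the headroom a zero-bit-skipping design can exploit.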


http://openreview.net/pdf?id=HkNRsU5ge Sigma-Delta Quantized Networks

Deep neural networks can be obscenely wasteful. When processing video, a convolutional network expends a fixed amount of computation for each frame with no regard to the similarity between neighbouring frames. As a result, it ends up repeatedly doing very similar computations. To put an end to such waste, we introduce Sigma-Delta networks. With each new input, each layer in this network sends a discretized form of its change in activation to the next layer. Thus the amount of computation that the network does scales with the amount of change in the input and layer activations, rather than the size of the network. We introduce an optimization method for converting any pre-trained deep network into an optimally efficient Sigma-Delta network, and show that our algorithm, if run on the appropriate hardware, could cut at least an order of magnitude from the computational cost of processing video data.
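The mechanism described above — each layer transmitting a discretized change in activation rather than the activation itself — can be sketched for a single linear layer. This is a hypothetical simplification of the paper's scheme (the function name, step size, and state layout are our own): the layer integrates quantized input changes and updates a cached output, so the work per frame scales with how much the input changed.

```python
# Simplified Sigma-Delta-style update for one linear layer: instead of
# recomputing W @ x for every frame, transmit the quantized *change* in the
# input and update a running output. Cheap when consecutive frames are similar.
import numpy as np

def sigma_delta_step(W, state, x_new, step=0.05):
    """Update the cached output using only the discretized input change."""
    delta = x_new - state["x_quant"]
    delta_q = np.round(delta / step) * step   # discretize the change
    state["x_quant"] += delta_q               # integrate quantized changes
    state["y"] += W @ delta_q                 # mostly-zero delta => little work
    return state["y"], int(np.count_nonzero(delta_q))

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))
x0 = rng.normal(size=8)
state = {"x_quant": np.zeros(8), "y": np.zeros(4)}
sigma_delta_step(W, state, x0)                   # first frame: full cost
y, nnz = sigma_delta_step(W, state, x0 + 0.01)   # similar frame: tiny change
```

After the second step, `y` tracks `W @ x0` to within the quantization step, while most components of the change round to zero and cost nothing.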

https://arxiv.org/abs/1609.02053 Fast and Efficient Asynchronous Neural Computation with Adapting Spiking Neural Networks

https://arxiv.org/abs/1608.06049v1 Local Binary Convolutional Neural Networks

We propose local binary convolution (LBC), an efficient alternative to convolutional layers in standard convolutional neural networks (CNN). The design principles of LBC are motivated by local binary patterns (LBP). The LBC layer comprises a set of fixed, sparse, pre-defined binary convolutional filters that are not updated during the training process, a non-linear activation function, and a set of learnable linear weights. The linear weights combine the activated filter responses to approximate the corresponding activated filter responses of a standard convolutional layer.
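The three ingredients listed above — fixed sparse binary filters, a non-linearity, and learnable linear mixing weights — can be sketched in NumPy. The shapes and names here are our own illustration, not the paper's code; a real LBC layer would use 1x1 convolutions over multi-channel feature maps:

```python
# Minimal single-channel sketch of a local binary convolution (LBC) layer:
# fixed sparse {-1, 0, +1} filters (never trained), ReLU, then a learnable
# linear combination of the filter responses.
import numpy as np

def lbc_layer(x, binary_filters, linear_w):
    """x: (H, W) input; binary_filters: (M, k, k) fixed; linear_w: (M,) learned."""
    k = binary_filters.shape[-1]
    patches = np.lib.stride_tricks.sliding_window_view(x, (k, k))
    responses = np.einsum("hwij,mij->mhw", patches, binary_filters)
    responses = np.maximum(responses, 0.0)              # non-linear activation
    return np.einsum("mhw,m->hw", responses, linear_w)  # learned linear mix

rng = np.random.default_rng(2)
M, k = 6, 3
# Fixed sparse binary filters: mostly zeros with a few +/-1 entries.
filters = rng.choice([-1.0, 0.0, 1.0], size=(M, k, k), p=[0.15, 0.7, 0.15])
w = rng.normal(size=M)            # only these weights would be trained
x = rng.normal(size=(8, 8))
out = lbc_layer(x, filters, w)    # valid convolution: (6, 6) output
```

Because the binary filters are fixed and sparse, the expensive convolutions reduce to sparse additions and subtractions; only the small vector `w` carries learnable parameters.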

https://arxiv.org/abs/1705.00125 Cnvlutin2: Ineffectual-Activation-and-Weight-Free Deep Neural Network Computing

https://arxiv.org/pdf/1703.09039.pdf Efficient Processing of Deep Neural Networks: A Tutorial and Survey

https://arxiv.org/pdf/1803.06333.pdf Snap Machine Learning

Our library, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern distributed systems. This allows us to effectively leverage available network, memory and heterogeneous compute resources. On a terabyte-scale publicly available dataset for click-through-rate prediction in computational advertising, we demonstrate the training of a logistic regression classifier in 1.53 minutes, a 46x improvement over the fastest reported performance.

https://arxiv.org/abs/1808.02513 Rethinking Numerical Representations for Deep Neural Networks

We show that inference using these custom numeric representations on production-grade DNNs, including GoogLeNet and VGG, achieves an average speedup of 7.6x with less than 1% degradation in inference accuracy relative to a state-of-the-art baseline platform representing the most sophisticated hardware using single-precision floating point.