http://arxiv.org/pdf/1406.3332v2.pdf Convolutional Kernel Networks

we have preferred to use L-BFGS-B on 300 000 pairs of randomly selected training data points, and initialize W with the K-means algorithm. L-BFGS-B is a parameter-free state-of-the-art batch method, which is not as fast as SGD but much easier to use. We always run the L-BFGS-B algorithm for 4 000 iterations, which seems to ensure convergence to a stationary point.
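
Rough sketch of the procedure quoted above: K-means initialization of W, then batch L-BFGS-B on pairs of points. The excerpt does not state the objective, so the loss here is an assumption: squared error between a Gaussian kernel on each pair (x, y) and a finite sum over p anchor points w_l with weights eta_l, which is how I recall the paper's approximation (check against the PDF). The function name `fit_kernel_approx`, its parameters, and the toy sizes are hypothetical; the paper's 300 000 pairs and 4 000 iterations are scaled down, and gradients are left to SciPy's finite differences rather than derived analytically.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.optimize import minimize


def sq_dists(A, B):
    # Pairwise squared Euclidean distances, shape (len(A), len(B)).
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)


def fit_kernel_approx(X, Y, p=4, sigma=1.0, maxiter=50):
    """Fit W (p x d) and eta (p,) on pairs (X[i], Y[i]) with L-BFGS-B.

    Assumed objective (not given in the excerpt):
      mean_i ( exp(-|x_i - y_i|^2 / (2 sigma^2))
               - sum_l eta_l exp(-|x_i - w_l|^2 / sigma^2)
                             exp(-|y_i - w_l|^2 / sigma^2) )^2
    """
    n, d = X.shape
    target = np.exp(-((X - Y) ** 2).sum(axis=1) / (2.0 * sigma ** 2))

    # K-means initialization of W on the pooled points, as in the excerpt.
    W0, _ = kmeans2(np.vstack([X, Y]), p, minit="points")
    eta0 = np.full(p, 1.0 / p)  # arbitrary starting weights (assumption)
    theta0 = np.concatenate([W0.ravel(), eta0])

    def loss(theta):
        W = theta[: p * d].reshape(p, d)
        eta = theta[p * d:]
        ex = np.exp(-sq_dists(X, W) / sigma ** 2)
        ey = np.exp(-sq_dists(Y, W) / sigma ** 2)
        pred = (ex * ey) @ eta
        return np.mean((target - pred) ** 2)

    # No step size to tune; only an iteration cap is set.
    res = minimize(loss, theta0, method="L-BFGS-B",
                   options={"maxiter": maxiter})
    W = res.x[: p * d].reshape(p, d)
    eta = res.x[p * d:]
    return W, eta, loss(theta0), res.fun
```

Toy usage: draw X from a Gaussian, set Y to a noisy copy of X, call `fit_kernel_approx(X, Y)`, and compare the returned initial and final loss values.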

http://www.thespermwhale.com/jaseweston/icml2016/icml2016-memnn-tutorial.pdf Memory Networks

https://www.opendatascience.com/blog/introduction-deep-learning-for-chatbots-part-1