Deep learning is interesting in many ways. But when you try to run it across thousands of cores to train models with millions of parameters, the problem becomes both more interesting and more complex.
Google ran an interesting experiment: training a deep network with millions of parameters on thousands of CPUs. The goal was to train on very large datasets without limiting the form of the model.
The paper describes DistBelief, a framework created for distributed parallel computing applied to deep learning training. Some of the things the framework manages by itself are:
The framework automatically parallelises computation in each machine using all available cores, and manages communication, synchronisation and data transfer between machines during both training and inference.
I couldn't find much information about it beyond what is written in the paper.
They applied two algorithms: SGD (Stochastic Gradient Descent) and L-BFGS. These algorithms usually work well, but they don't scale to very large datasets. That is why the authors introduce some modifications to them. The paper gives more details about the optimisations to both algorithms, which you may find interesting.
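To make the idea of a distributed SGD more concrete, here is a minimal single-process sketch, loosely in the spirit of the asynchronous variant the paper calls Downpour SGD. It is not DistBelief code: the names ParameterServer, train and make_shards, the toy linear model and all hyperparameters are my own assumptions for illustration. Each simulated worker owns one shard of the data, computes a gradient on a possibly stale copy of the parameters, pushes it to a shared parameter server and then fetches the fresh values.

```python
import random


class ParameterServer:
    """Hypothetical stand-in for a shared parameter store (not the DistBelief API)."""

    def __init__(self, n_params, learning_rate):
        self.params = [0.0] * n_params
        self.learning_rate = learning_rate

    def fetch(self):
        # Workers receive a copy, which can become stale while others push updates.
        return list(self.params)

    def push(self, gradient):
        # Apply each incoming gradient immediately, without waiting for other workers.
        for i, g in enumerate(gradient):
            self.params[i] -= self.learning_rate * g


def gradient(params, batch):
    """Gradient of the mean squared error for a toy linear model y = w * x + b."""
    w, b = params
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    n = len(batch)
    return [gw / n, gb / n]


def make_shards(n_workers=4, n_points=200, seed=1):
    """Generate noiseless points on y = 3x + 1 and split them across workers."""
    rng = random.Random(seed)
    data = [(x, 3.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(n_points))]
    return [data[i::n_workers] for i in range(n_workers)]


def train(shards, steps=3000, batch_size=8, learning_rate=0.02, seed=0):
    rng = random.Random(seed)
    server = ParameterServer(2, learning_rate)
    # One local parameter copy per worker; it is only refreshed after that worker pushes.
    local = [server.fetch() for _ in shards]
    for _ in range(steps):
        # Picking workers at random simulates them finishing at different times.
        worker = rng.randrange(len(shards))
        batch = rng.sample(shards[worker], batch_size)
        grad = gradient(local[worker], batch)  # computed on a possibly stale copy
        server.push(grad)
        local[worker] = server.fetch()
    return server.params


shards = make_shards()
w, b = train(shards)
print(f"w = {w:.3f}, b = {b:.3f}")
```

In a real system the workers would run on separate machines and the parameters themselves would be sharded across several servers; the sketch only shows why training can still make progress even though no worker ever waits for the others.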
I found the idea of applying distributed parallel computing to such algorithms on very large datasets really interesting.