Horizontal scaling for model training #1053

Open
mcdoker18 opened this issue Oct 28, 2019 · 1 comment
@mcdoker18
Contributor

For now, we can only scale resources for model training vertically. We need to do research and determine how we can improve this.
Subtasks:

  • Create a list of frameworks that provide a way to do distributed training:
    1. Horovod
    2. ...
  • Can we use the frameworks above with MLflow?
  • Describe a new training API or extend the current API.
  • Develop an MVP with distributed training.
@mcdoker18 mcdoker18 added the Spike and improvement labels Oct 28, 2019
@mcdoker18 mcdoker18 added this to the 1.0.0 milestone Oct 28, 2019
@mcdoker18 mcdoker18 modified the milestones: 1.0.0, 1.1 Oct 28, 2019
@vlad-tokarev
Contributor

Distributed ML research

Questions

  1. How should a Legion user interact with the ModelTrainingScaling feature?
    1. What API should the user use?
      1. Should they use a unified Legion API for scaling?
      2. Or should they rely on a framework-specific scaling API?
        Examples (a framework-specific sketch follows this list):
        • TF distributed
        • Sklearn partial_fit
        • Horovod
    2. How many changes to their code is it acceptable to ask of the user in order to run training in a scaled way?
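
To make the distinction concrete, here is a minimal sketch of what a framework-specific scaling API looks like in user code, using TF's distribution strategy as the example (the model and data are illustrative placeholders):

```python
import tensorflow as tf

# Framework-specific: the user calls TF's own distribution API directly.
strategy = tf.distribute.MirroredStrategy()  # data-parallel replication across local GPUs
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="sgd", loss="mse")

x = tf.random.normal((128, 4))
y = tf.random.normal((128, 1))
model.fit(x, y, epochs=1)  # runs replicated across the visible devices
```

A unified Legion API would instead leave the training code unchanged and move the scaling decision into the training manifest/resource definition.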

General concerns about ML scaling process

  1. Scaling ML algorithms
    1. First of all, we need to realize that not all algorithms and libraries can support scaling
    2. The algorithm itself must support scaling; after that, the library that implements it must also support scaling
    3. For example (as I understand it): almost all NN methods and XGBoost could be scaled by distributing calculations,
      but not the linear regression method
  2. Two approaches to scaling models
    1. Parallelization (boost training speed with multiple workers)
      1. Gradient Averaging
        Separates gradient calculation from gradient application, so that the calculations can be divided across
        node pools. Obviously it can only be used for models trained with gradient descent, since it combines the results
        of multiple independent workers. This approach is used, for example, by the native TF distributed framework and by Horovod
      2. I have not found information about distributed computations for the very popular Sklearn library.
        This library only supports parallelization on a single node via the Python joblib library
    2. Incremental learning (boost resource usage and speed by stream data processing)
      1. A technique that allows an ML algorithm to receive data incrementally. Not all algorithms support this technique
      2. Sklearn models support the partial_fit method for incremental learning,
        so a Legion user can write a notebook that consumes data streams from big-data storage (see the sketch below)
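
A minimal sketch of the incremental-learning path, assuming a scikit-learn estimator that supports partial_fit and a hypothetical read_chunks generator standing in for a reader that streams batches from external storage:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def read_chunks(n_chunks=10, chunk_size=256, n_features=20):
    """Hypothetical stand-in for a reader that streams batches from big-data storage."""
    for _ in range(n_chunks):
        X = np.random.rand(chunk_size, n_features)
        y = np.random.randint(0, 2, size=chunk_size)
        yield X, y


clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit needs the full set of classes up front

for X_chunk, y_chunk in read_chunks():
    # The model is updated chunk by chunk, so the whole dataset never has to fit in memory.
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```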

Tools

  1. Horovod
    • Introduced by Uber
    • Uber was using the TF distributed approach but faced two problems:
      • The TF distributed tooling does not use GPU resources in an optimal way
      • The TF distributed tooling has a verbose API that introduces a lot of new concepts into an existing ML codebase
    • Relies on the MPI framework under the hood
    • Relies on ring-allreduce
    • Supports primarily NNs; does not support XGBoost (a minimal usage sketch follows this list)
  2. rabit
    • A library from the XGBoost authors: a Reliable Allreduce and Broadcast Interface for distributed machine learning
  3. Dask
  4. Amazon SageMaker
    This is the AWS abstraction for running TF models with either Horovod or the native TF approach
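
For reference, a minimal Horovod-with-Keras training sketch; the model and data are placeholders, and only the Horovod-specific calls matter here:

```python
# Launch with, e.g.: horovodrun -np 4 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per worker/GPU

# Pin each process to its own GPU, if any are visible.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model and data; real code would build the user's model and dataset here.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale the learning rate with the number of workers
opt = hvd.DistributedOptimizer(opt)               # gradient averaging via ring-allreduce
model.compile(optimizer=opt, loss="mse")

x = tf.random.normal((256, 10))
y = tf.random.normal((256, 1))
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights from rank 0
model.fit(x, y, epochs=1, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```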

Problems

  1. ML training is actually run by a toolchain integration (TI), not by Legion itself. For example,
    the MLflow integration runs training using one of MLflow's backends (local conda runner, Databricks, k8s); our integration only uses the local backend.
    Because the Legion scaling feature probably should not depend directly on a TI, do we need to introduce a new abstraction level for running training? (A sketch of where the backend choice currently lives follows.)
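
For context, a minimal sketch of how an MLflow Projects run picks its backend today; the project URI and parameters are illustrative only:

```python
import mlflow

# The toolchain integration effectively makes a call like this; only the "local"
# backend is used today, which is where the single-node limitation comes from.
submitted_run = mlflow.projects.run(
    uri="https://github.com/mlflow/mlflow-example",  # illustrative example project
    parameters={"alpha": 0.5},
    backend="local",  # other MLflow backends: "databricks", "kubernetes"
)
submitted_run.wait()
```

A Legion-level abstraction for "run this training in a scaled way" would sit above this call, so that the choice of backend and of distribution framework is not tied to a particular toolchain integration.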
