Horizontal scaling for model training #1053

Open
mcdoker18 opened this issue Oct 28, 2019 · 1 comment
@mcdoker18
Contributor

For now, we can only scale resources for model training vertically. We need to do research and determine how we can improve this.
Subtasks:

  • Create a list of frameworks that provide a way to do distributed training:
    1. Horovod
    2. ...
  • Can we use the frameworks above with MLflow?
  • Describe a new training API or extend the current API.
  • Develop an MVP with distributed training.
@mcdoker18 mcdoker18 added the Spike and improvement labels Oct 28, 2019
@mcdoker18 mcdoker18 added this to the 1.0.0 milestone Oct 28, 2019
@mcdoker18 mcdoker18 modified the milestones: 1.0.0, 1.1 Oct 28, 2019
@vlad-tokarev
Contributor

Distributed ML research

Questions

  1. How should a Legion user interact with the ModelTrainingScaling feature?
    1. What API should the user use?
      1. Should they use a unified Legion API for scaling?
      2. Or should they rely on a framework-specific scaling API?
        Examples (a framework-specific sketch follows this list):
        • TF distributed
        • Sklearn partial_fit
        • Horovod
    2. How many changes to their code is it acceptable to ask of the user in order to run training in a scaled way?
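
To make the distinction concrete, here is a minimal sketch of what a framework-specific scaling API looks like in user code, using TF's distribution strategy as the example (the model and data are illustrative placeholders):

```python
import tensorflow as tf

# Framework-specific: the user calls TF's own distribution API directly.
strategy = tf.distribute.MirroredStrategy()  # data-parallel replication across local GPUs
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="sgd", loss="mse")

x = tf.random.normal((128, 4))
y = tf.random.normal((128, 1))
model.fit(x, y, epochs=1)  # runs replicated across the visible devices
```

A unified Legion API would instead leave the training code unchanged and move the scaling decision into the training manifest/resource definition.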

General concerns about ML scaling process

  1. Scaling ML algorithms
    1. First of all, we need to realize that not all algorithms and libraries can support scaling
    2. The algorithm itself must support scaling; after that, the library that implements it must also support scaling
    3. For example (as I understand it): almost all NN methods and XGBoost could be scaled by distributing calculations,
      but not the linear regression method
  2. Two approaches to scaling models
    1. Parallelization (boost training speed with multiple workers)
      1. Gradient Averaging
        Separates gradient calculation from gradient application, so that the calculations can be divided across
        node pools. Obviously it can only be used for models trained with gradient descent, since it combines the results
        of multiple independent workers. This approach is used, for example, by the native TF distributed framework and by Horovod
      2. I have not found information about distributed computations for the very popular Sklearn library.
        This library only supports parallelization on a single node via the Python joblib library
    2. Incremental learning (boost resource usage and speed by stream data processing)
      1. A technique that allows an ML algorithm to receive data incrementally. Not all algorithms support this technique
      2. Sklearn models support the partial_fit method for incremental learning,
        so a Legion user can write a notebook that consumes data streams from big-data storage (see the sketch below)
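
A minimal sketch of the incremental-learning path, assuming a scikit-learn estimator that supports partial_fit and a hypothetical read_chunks generator standing in for a reader that streams batches from external storage:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def read_chunks(n_chunks=10, chunk_size=256, n_features=20):
    """Hypothetical stand-in for a reader that streams batches from big-data storage."""
    for _ in range(n_chunks):
        X = np.random.rand(chunk_size, n_features)
        y = np.random.randint(0, 2, size=chunk_size)
        yield X, y


clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit needs the full set of classes up front

for X_chunk, y_chunk in read_chunks():
    # The model is updated chunk by chunk, so the whole dataset never has to fit in memory.
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```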

Tools

  1. Horovod
    • Introduced by Uber
    • Uber was using the TF distributed approach but faced two problems:
      • The TF distributed tooling does not use GPU resources in an optimal way
      • The TF distributed tooling has a verbose API that introduces a lot of new concepts into an existing ML codebase
    • Relies on the MPI framework under the hood
    • Relies on ring-allreduce
    • Supports primarily NNs; does not support XGBoost (a minimal usage sketch follows this list)
  2. rabit
    • A library from the XGBoost authors: a Reliable Allreduce and Broadcast Interface for distributed machine learning
  3. Dask
  4. Amazon SageMaker
    This is the AWS abstraction for running TF models with either Horovod or the native TF approach
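
For reference, a minimal Horovod-with-Keras training sketch; the model and data are placeholders, and only the Horovod-specific calls matter here:

```python
# Launch with, e.g.: horovodrun -np 4 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per worker/GPU

# Pin each process to its own GPU, if any are visible.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model and data; real code would build the user's model and dataset here.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale the learning rate with the number of workers
opt = hvd.DistributedOptimizer(opt)               # gradient averaging via ring-allreduce
model.compile(optimizer=opt, loss="mse")

x = tf.random.normal((256, 10))
y = tf.random.normal((256, 1))
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights from rank 0
model.fit(x, y, epochs=1, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```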

Problems

  1. ML training is actually run by a toolchain integration (TI), not by Legion itself. For example,
    the MLflow integration runs training using one of MLflow's backends (local conda runner, Databricks, k8s); our integration only uses the local backend.
    Because the Legion scaling feature probably should not depend directly on a TI, do we need to introduce a new abstraction level for running training? (A sketch of where the backend choice currently lives follows.)
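
For context, a minimal sketch of how an MLflow Projects run picks its backend today; the project URI and parameters are illustrative only:

```python
import mlflow

# The toolchain integration effectively makes a call like this; only the "local"
# backend is used today, which is where the single-node limitation comes from.
submitted_run = mlflow.projects.run(
    uri="https://github.com/mlflow/mlflow-example",  # illustrative example project
    parameters={"alpha": 0.5},
    backend="local",  # other MLflow backends: "databricks", "kubernetes"
)
submitted_run.wait()
```

A Legion-level abstraction for "run this training in a scaled way" would sit above this call, so that the choice of backend and of distribution framework is not tied to a particular toolchain integration.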
