How should a Legion user interact with the ModelTrainingScaling feature?
Which API should the user use?
Should they use a unified Legion API for scaling?
Or should they rely on a framework-specific scaling API?
Examples:
TF distributed
Sklearn partial_fit
Horovod
How many changes to their code is it desirable for the user to make in order to run training in a scaled way?
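For a sense of scale: with a framework-specific API such as Horovod, the required edits are typically a handful of lines added to an otherwise ordinary training script. A rough sketch (the model and data are placeholders, not Legion code; lines marked "HVD" are the Horovod-specific additions):

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # HVD: start the Horovod runtime (one process per worker)
# (HVD: a real script would also pin each worker to its own GPU here)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# HVD: scale the learning rate by the number of workers and wrap the optimizer
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

# HVD: make sure all workers start from the same initial weights
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# Placeholder data; a real job would feed each worker its own data shard
x_train = np.random.rand(256, 20).astype("float32")
y_train = np.random.randint(0, 10, size=(256,))

model.fit(x_train, y_train, epochs=5, callbacks=callbacks)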
General concerns about the ML scaling process
Scaling ML algorithms
First of all, we need to realize that not all algorithms and libraries can support scaling.
The algorithm itself must support scaling, and then the library that implements it must support scaling as well.
For example (as I understand it): almost all NN methods and XGBoost can be scaled by distributing the calculations,
but the linear regression method cannot.
Two approaches for scaling models
Parallelization (boost training speed by using multiple workers)
Gradient Averaging
This is used to separate gradient calculation from gradient application, so that the calculations can be divided across node pools and the results of multiple independent workers can be combined. Obviously, it can only be used for models trained with the gradient descent method. For example, this approach is used by the TF native distributed framework and by Horovod.
However, I have not found any information about distributed computations for the very popular Sklearn library.
This library only supports parallelization on a single node via the Python joblib library.
Incremental learning (improve resource usage and speed by processing data as a stream)
This technique allows an ML algorithm to receive data incrementally. Not all algorithms support it.
Sklearn models support the partial_fit method for incremental learning,
so a Legion user can write a notebook that consumes data streams from big-data storages.
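A minimal sketch of that pattern with scikit-learn's partial_fit (the data stream is simulated with in-memory batches here; in practice the batches would be read from the big-data storage):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()           # any estimator that implements partial_fit
classes = np.array([0, 1])      # partial_fit needs the full label set up front

def batches():
    """Simulated data stream; in practice this would read chunks from storage."""
    rng = np.random.default_rng(0)
    for _ in range(20):
        X = rng.normal(size=(100, 5))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

for X_batch, y_batch in batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)  # incremental update per batch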
There is an approach (described in the XGBoost docs) for scaling training with the XGBoost library.
Amazon SageMaker
This is an AWS abstraction for running TF models with either Horovod or the TF native distributed approach.
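For comparison with the Horovod sketch above, the "TF native" approach looks roughly like the following (a sketch only, assuming TF 2.x; the model and data are placeholders, and the cluster layout would come from the TF_CONFIG environment variable set by the launcher, e.g. SageMaker or Kubernetes):

import numpy as np
import tensorflow as tf

# Reads the cluster layout from TF_CONFIG; with no TF_CONFIG it runs single-worker.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Model and optimizer must be created inside the strategy scope so that
    # variables are mirrored and gradients are aggregated across workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder data; a real job would feed each worker sharded input data.
x = np.random.rand(512, 8).astype("float32")
y = np.random.rand(512, 1).astype("float32")
model.fit(x, y, epochs=2, batch_size=64)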
Problems
ML training is actually run by a toolchain integration (TI), not by Legion itself. For example,
the MLflow integration runs training using one of its backends (local conda runner, Databricks, k8s); our integration only uses the local backend.
Therefore, since the Legion scaling feature probably should not depend directly on a TI, do we need to introduce a new abstraction level for running training?
For now, we can only scale resources for model training vertically. We need to do research and determine how to improve this.
Sub tasks: