Learning-curves is a Python module that extends sklearn's learning curve feature. It helps you visualize the learning curves of your models:
Learning curves give an opportunity to diagnose bias and variance in supervised learning models, and also to visualize how the training set size influences model performance (more information here).
Such plots help you answer the following questions:
- Would my model perform better with more data?
- Can I train my model with less data without reducing accuracy?
- Is my training/validation set biased?
- What is the best model for my data?
- What is the perfect training size for tuning parameters?
Learning-curves will also help you fit the learning curve in order to extrapolate it and find its saturation value.
This module is still under development. Therefore it is recommended to use:
$ pip install git+https://github.com/H4dr1en/learning-curves#egg=learning-curves
To create learning curve plots, you can start with the following lines:
import learning_curves as LC
lc = LC.LearningCurve()
lc.get_lc(estimator, X, Y)
where `estimator` implements `fit(X, Y)` and `predict(X)` (sklearn interface).
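For context, a minimal self-contained sketch of the underlying scikit-learn machinery that learning-curves extends; the `estimator`, `X`, `Y` names mirror the snippet above, and the dataset and estimator choices here are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# Synthetic data and a simple estimator implementing fit/predict
X, Y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
estimator = DecisionTreeRegressor(random_state=0)

# Train/validation scores at increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    estimator, X, Y, train_sizes=np.linspace(0.1, 1.0, 5), cv=3
)
print(train_sizes.shape, train_scores.shape, val_scores.shape)
```

Plotting the mean of `val_scores` against `train_sizes` gives the raw curve that `LearningCurve` wraps and renders for you.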
Output:
In this example, the green curve suggests that adding more data to the training set is unlikely to improve the model's accuracy. The green curve also shows a saturation near 0.7. We can easily fit a function to this curve:
lc.plot(predictor="best")
Output:
Here we used a predefined function, `pow`, to fit the green curve. The R2 score (0.999) is very close to 1, meaning that the fit is optimal. We can therefore use this curve to extrapolate the evolution of accuracy with the training set size.
This also tells us how much data we should use to train our model to maximize performance and accuracy.
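The idea behind such a predictor can be sketched with plain scipy: fit a saturating power law to accuracy-vs-training-size points, then read off the plateau. The functional form and the synthetic data below are assumptions for illustration, not the module's exact `pow` definition:

```python
import numpy as np
from scipy.optimize import curve_fit

def pow_law(x, a, b, c):
    # Approaches the plateau `a` as x grows (assumed saturating form)
    return a - b * x ** (-c)

# Synthetic accuracy points saturating near 0.7
sizes = np.array([50, 100, 200, 400, 800, 1600], dtype=float)
acc = 0.7 - 0.9 * sizes ** (-0.5)

params, _ = curve_fit(pow_law, sizes, acc, p0=[1.0, 1.0, 0.5], maxfev=10000)
a, b, c = params
print(f"estimated saturation: {a:.3f}")
```

Evaluating `pow_law` at training sizes beyond the data is exactly the extrapolation step: it predicts the accuracy you can expect before you collect the extra data.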
- Write your own predictors
- Find the best Predictor
- Compare learning curves of various models
- Extrapolate learning curve using multiple instances
- Evaluate extrapolation using MSE validation
- Evaluate and compare your models' scalability
- Save and load LearningCurve instances
The documentation is available here.
Some functions have a `function_name_cust` equivalent. Calling the function without the `_cust` suffix will internally call the `_cust` version with default parameters (such as the data points of the learning curves). Thanks to `kwargs`, you can pass exactly the same parameters to both functions.
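A minimal sketch of this convention, using hypothetical names (`plot_cust` / `plot` here are illustrative, not the module's actual pair):

```python
def plot_cust(x, y, title="Learning curve", **kwargs):
    """Fully parameterized variant: the caller supplies the data points."""
    return {"x": x, "y": y, "title": title, **kwargs}

class LearningCurveLike:
    """Stand-in for an object holding learning-curve data points."""

    def __init__(self):
        # Default data points stored on the instance
        self.x, self.y = [1, 2, 3], [0.5, 0.6, 0.65]

    def plot(self, **kwargs):
        # The suffix-less method forwards to the _cust variant, filling in
        # the instance's own data; **kwargs pass straight through, so both
        # functions accept the same keyword parameters.
        return plot_cust(self.x, self.y, **kwargs)

lc = LearningCurveLike()
print(lc.plot(title="My curve"))
```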
PRs, bug reports and improvement suggestions are welcome :)