How can I obtain the data to plot the learning curve from CLI output? #5639
-
I'm referring to this type of plot: https://scikit-learn.org/stable/modules/learning_curve.html or this: https://rstudio-conf-2020.github.io/dl-keras-tf/notebooks/learning-curve-diagnostics.nb.html. What I want to determine is where we are at with the training: whether the model could benefit from more training data, and whether it's underfitting or overfitting. Or is there another way you'd suggest to determine these things using what is available from the CLI train output?
-
A learning curve is basically a plot of the performance on a development set versus the amount of training data you used. You can simulate this by calling the training CLI several times, each time on a larger subset of your training data, and recording the dev accuracy of each run. You can notice overfitting when the training accuracy is still improving (training loss is decreasing), but at the same time the development accuracy is remaining stable or even dropping. If the development accuracy is still improving when you add all training examples, that signals that your model could benefit from even more annotated data.
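If it helps, here's a minimal sketch of that loop. It assumes spaCy v2's JSON training format (a top-level list of documents) and the v2 CLI signature `spacy train <lang> <output_dir> <train_path> <dev_path>`; the file names and the `en` language code are placeholders to adapt to your own setup:

```python
# Sketch: build the data points for a learning curve by training on
# increasing subsets of the corpus. Paths, file names, and the language
# code are assumptions -- adjust them for your own project.
import json
import random
import subprocess
from pathlib import Path

TRAIN_PATH = Path("train.json")   # full training corpus (placeholder name)
DEV_PATH = Path("dev.json")       # dev corpus (placeholder name)

docs = json.loads(TRAIN_PATH.read_text(encoding="utf8"))
random.seed(0)
random.shuffle(docs)  # shuffle once so every subset is a random sample

for fraction in (0.25, 0.5, 0.75, 1.0):
    n = int(len(docs) * fraction)
    subset_path = Path(f"train_{n}.json")
    subset_path.write_text(json.dumps(docs[:n]), encoding="utf8")
    # Each run writes its models to its own output directory; afterwards,
    # read the dev score of each run and plot it against n.
    subprocess.run(
        ["python", "-m", "spacy", "train", "en", f"model_{n}",
         str(subset_path), str(DEV_PATH)],
        check=True,
    )
```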
-
@svlandeg Thank you, I have some follow-up questions. I did the following, so I just wanted to verify I was using this correctly:
Then I plotted the results, and it looks something like this:
-
I think there are two aspects to your question. I focused on "if the model could benefit from more training data" - and for that I would advise plotting dev accuracies versus training sizes, as that helps you understand how much the model is still improving (on an independent dev set) as you add data.
In proper ML lingo, "learning curve" probably refers to the curve of training loss vs. dev accuracy over time. You want to stop learning (run no more epochs) when you start seeing overfitting.
So basically these are two different curves, determining two different hyperparameters: one for the size of the required training dataset, and one for the ideal number of epochs.
I use "dev set" throughout instead of "test set" because I believe you should have an independent dev dataset to determine these hyperparameters on. Once you've determined all these hyperparameters and trained your model with the best ones, you can then use yet another set of data - the actual test set - to measure how well your final model works / generalizes on truly unseen data. If you don't do this, you could be overfitting on the dev set.
And yes, you can use F-score instead of accuracy. Which one is most appropriate depends on the type of ML problem.
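To make the first curve concrete, here's a minimal matplotlib sketch; the numbers are made-up placeholders, just to show the shape of the plot:

```python
# Sketch: plot dev accuracy against training-set size. The values below
# are hypothetical placeholders -- substitute the sizes and dev scores
# you collected from your own training runs.
import matplotlib.pyplot as plt

train_sizes = [250, 500, 750, 1000]    # number of training examples per run
dev_scores = [0.71, 0.78, 0.81, 0.82]  # dev accuracy (or F-score) per run

plt.plot(train_sizes, dev_scores, marker="o")
plt.xlabel("Number of training examples")
plt.ylabel("Dev accuracy")
plt.title("Dev accuracy vs. training size")
plt.show()

# If the curve is still clearly rising at the right edge, more annotated
# data will likely help; if it has flattened, extra data buys little.
```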
Why don't you just run the `evaluate` CLI command on (a subset of) your training data if you really need a training F-score/accuracy? (See the sketch at the end of this reply.)
Anyway, I guess most of these topics are really general ML questions and not so much specific to spaCy. It might make more sense to post them on a different forum with a larger community. That would also help us keep this tracker focused on bug reports and specific feature requests.
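A minimal sketch of the `evaluate` idea, assuming spaCy v2; the model directory and data path are hypothetical placeholders:

```python
# Sketch: run spaCy's `evaluate` CLI on the training data to obtain a
# training F-score/accuracy that is directly comparable to the dev
# metrics. Point the paths at your own trained model and training file.
import subprocess

subprocess.run(
    ["python", "-m", "spacy", "evaluate", "model_best", "train.json"],
    check=True,
)
```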
-
@svlandeg Ok, thanks, your advice is appreciated. Sorry, I just want to clarify one last thing. You said: "You have an estimate on how well the training dataset is being fitted with the loss - you don't necessarily need the training F-score/accuracy." OK, that's true, but during training I don't see the loss for the dev set in the console output or in the JSON files.
-
True: by default the script prints the training loss and the dev F-score. So basically any "loss" metric is calculated on the training dataset, and the other metrics are calculated on the dev set. You can't directly compare the two with this script (unless you run additional evaluations like you suggested), but you may not need that. You can monitor the training loss (which needs to go down) vs. the dev F-score (which needs to go up).
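For example, here's a minimal sketch of that monitoring, assuming you've copied the per-epoch training loss and dev F-score from the console output into two lists (the values are made-up placeholders):

```python
# Sketch: monitor training loss (should go down) against dev F-score
# (should go up) across epochs. Replace the placeholder values with the
# numbers printed by the training CLI.
import matplotlib.pyplot as plt

epochs = list(range(1, 9))
train_loss = [9.1, 6.4, 4.8, 3.9, 3.2, 2.8, 2.5, 2.3]    # hypothetical
dev_f = [0.55, 0.68, 0.74, 0.78, 0.80, 0.81, 0.81, 0.80]  # hypothetical

fig, ax_loss = plt.subplots()
ax_loss.plot(epochs, train_loss, color="tab:red", marker="o")
ax_loss.set_xlabel("Epoch")
ax_loss.set_ylabel("Training loss")

ax_f = ax_loss.twinx()  # second y-axis sharing the same x-axis
ax_f.plot(epochs, dev_f, color="tab:blue", marker="s")
ax_f.set_ylabel("Dev F-score")

plt.title("Training loss vs. dev F-score per epoch")
plt.show()

# Once the dev F-score plateaus or drops while the loss keeps falling,
# you're overfitting: stop training around that epoch.
```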