Finetuning #19
Comments
Hi, it's possible to resume training from a checkpoint (so it's the same functionality as fine-tuning), but it's not possible to fine-tune the original GPT-2 model, because the tokenizer is different.
I am currently looking at these models to fine-tune; they were trained with this repo, or at least a fork of it. So I guess simply resuming with the TF version would suffice.
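(To make the tokenizer point concrete: the data for this repo is encoded with a sentencepiece model, while OpenAI's original GPT-2 uses its own byte-level BPE vocabulary, so token ids and the embedding rows they index don't line up. A minimal sketch, assuming the sp-model.model file that appears later in this thread:)

```python
# Why the original GPT-2 checkpoint can't simply be fine-tuned here:
# text is encoded with a sentencepiece vocabulary, not OpenAI's byte-level BPE,
# so the same text maps to different token ids under the two tokenizers.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('sp-model.model')  # the sentencepiece model used with this repo
print(sp.encode_as_ids('Guten Tag'))
# These ids index this vocabulary's embedding matrix; OpenAI's BPE would give
# different ids, so the pretrained embeddings would not match up.
```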
Oh nice, thanks for sharing the link. Then yes, fine-tuning should work.
These are PyTorch models, which is good, because the TF code is not really supported, while the PyTorch code is better developed and maintained.
Cool, will try on the weekend, thanks for the blazing fast responses 🥇
hey, cool - thanks for trying out my GPT-2 models! would be happy to hear your feedback on these. the larger GPT-2 model is still training, so if you want I can provide an updated model this week which should have slightly lower loss than the one released so far.
Hey @gooofy, this would be very cool, please do!
Hi, it's me again. I am not sure this is the right thread to follow up, so feel free to move it/let me know.
And this is a small part of the size mismatch errors:
Right, on each invocation you'll need to set all hyperparameters, and the error is indeed due to a hyperparameter mismatch. I think the correct hyperparameters should be in the params.json file in the model directory.
Is there a CLI for the hyperparams? I can't seem to find one.
Yes, it's defined implicitly via the fire library, so all arguments of the main function can be passed as command-line flags.
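(A minimal sketch of how a fire-based CLI exposes hyperparameters: every keyword argument of the entry-point function becomes a --flag automatically. The argument names here are illustrative, not necessarily the repo's exact signature.)

```python
import fire


def main(run_path, corpus_path, sp_model_path,
         n_embed=768, n_head=12, n_layer=12, batch_size=2):
    # build the model with these hyperparameters and train it
    print(run_path, corpus_path, sp_model_path,
          n_embed, n_head, n_layer, batch_size)


if __name__ == '__main__':
    # fire turns main()'s arguments into CLI flags, e.g.
    #   python main.py de345-root data/encoded-de sp-model.model --n_embed=1024 --n_head=16 --n_layer=24
    fire.Fire(main)
```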
I guess resuming is also implicit, whenever there are *.pt files in the model directory; furthermore, params.json is overwritten on each invocation with the current parameters.
Indeed, resuming is implicit here: lines 139 to 140 in fa3f529,
and right, the params.json file will be overwritten, which is not great.
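(Roughly, the behaviour being described looks like the sketch below; it is illustrative, not the repo's actual code: a checkpoint in the run directory triggers an implicit resume, and params.json is rewritten on every invocation.)

```python
import json
from pathlib import Path

import torch


def prepare_run(run_path: Path, model, optimizer, params: dict) -> int:
    # params.json is (re)written with the current hyperparameters on every
    # invocation, overwriting whatever was saved there before
    (run_path / 'params.json').write_text(json.dumps(params, indent=2))

    # resuming is implicit: if a checkpoint exists, training continues from it
    ckpt_path = run_path / 'model.pt'  # hypothetical checkpoint name
    if ckpt_path.exists():
        state = torch.load(ckpt_path, map_location='cpu')
        model.load_state_dict(state['state_dict'])      # key names are assumptions
        optimizer.load_state_dict(state['optimizer'])
        return state.get('seen_tokens', 0)
    return 0
```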
Hey, I am running into some further problems trying to resume from the big German model, even after I set the params. I would appreciate any help. Also, again, I am running the latest version from @gooofy's fork:
Hmm, I see; this looks related to gradient checkpointing (which I didn't get a chance to try yet). I wonder if it will work if you disable it? Could be something else as well, hard to tell, sorry.
here is the command line I am using for training this model - does this help?

```
gpt-2 de345-root data/encoded-de sp-model.model --n_embed=1024 --n_head=16 --n_layer=24 --batch_size=3 --gradient_checkpointing --save_every=5000
```

params.json: {
I have now disabled the gradient checkpointing, and I get stuck at the same place, but no error this time:
just a wild guess: maybe you're using a different torch version?
Aahh, I am indeed.
Just wondering, which CUDA version do you use? 10.0?
yes, 10.0
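(For anyone comparing environments, the installed torch version and the CUDA build it ships with can be printed directly:)

```python
import torch

print(torch.__version__)          # e.g. 1.1.0
print(torch.version.cuda)         # e.g. 10.0
print(torch.cuda.is_available())  # True if a usable GPU is detected
```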
new release has finished uploading, available here: https://zamia.org/brain/ - trained for 4.5 epochs on a 27GB text corpus
Hi,
@Stamenov I wonder if this could be some bug in the resume code, I didn't test it that much. Does the progress bar jump to 7% immediately, or does it get there after some time? There is no error message printed, right? Can you check the exit code? Also I wonder if training from scratch works for you (to narrow down the issue)?
@lopuhin It does take some time to get there; it also just jumps to 7% from 0, after using my GPU for some time (reported using nvidia-smi) and after briefly showing a "0/3 validation" progress bar just below the overall progress bar. Training from scratch works, but with the default params only. With the ones from the German model, as supplied by @gooofy, I get a CUDA out of memory error. Maybe this is related? Are there any additional logs or information stored while the training is going on that I could check? EDIT: Reducing the batch size to 1 shows the same behaviour, except the progress bar now shows 96% and quits.
what GPU model are you using? my settings are aimed at 11/12GB models (1080ti / titan x)
I tried it with K80 and Tesla V100, same results.
Okay, I think I was able to debug why this code does not work for fine-tuning (at least in my case).
uh, wow, nice find! congrats, you got to the bottom of this :) haven't had a chance to look deeper into this one, but maybe we could add a --finetune command line option which would simply make load_model() set seen_tokens back to zero?
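(Something along these lines is what the --finetune option would amount to; a sketch assuming the checkpoint stores a seen_tokens counter, with the other key names being illustrative:)

```python
import torch


def load_model(model_path, model, optimizer, finetune: bool = False) -> int:
    state = torch.load(model_path, map_location='cpu')
    model.load_state_dict(state['state_dict'])        # key names are assumptions
    optimizer.load_state_dict(state['optimizer'])
    seen_tokens = state.get('seen_tokens', 0)
    if finetune:
        # reset the counter so training on the new corpus starts from zero
        # instead of the trainer thinking the run is already nearly complete
        seen_tokens = 0
    return seen_tokens
```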
Hi, this sounds great - did you implement the --finetune flag?
Quite some time has passed, but I just wonder if you were able to find a solution? Also, I'm not sure whether line 185 in the code is still the same one you were referring to back then?
So I've tried fine-tuning the German model by just setting the seen tokens back to 0, as was suggested.
But the trained model performs much worse, not at all like it did before.
@SaschaStenger Were you able to fix the problem? I am also using the model by @gooofy and I am unable to fine-tune it. I will apply your patch if you were able to succeed.
Sorry, so far I haven't been able to. But I'm still very interested in a solution and will look into it again and post any solution that I might find.
@SaschaStenger I am trying a few things. Will surely let you know if all goes well.
Thank you @hafsabukhary. I wanted to ask whether any of your approaches have been fruitful.
@SaschaStenger I used the old main.py from https://github.com/gooofy/transformer-lm/tree/master/lm.
This way training continues. You have to use the default parameters of the German model (e.g. vocab_size).
Hi,
I am trying to fine-tune on a specific dataset. Shouldn't the fine-tuned model have the same grammatical quality as the German model 346 from the beginning, and incorporate the new vocabulary better with every iteration? Has anybody found a solution for this problem? Does fine-tuning work for you? Thank you.
I'm having similar issues.
Made another fine-tuning test at around 970 epochs. Now it sometimes seems to overfit, generating sentences that are the same as in the corpus I use (a 3.1 MB .txt), and at other times it just sticks random snippets together without any sense.
Hi,
just wondering: since you are basing the TF train.py on nshepperd's fine-tuning script, I was wondering if this code also supports fine-tuning, or are models trained here from scratch fine-tunable with nshepperd's train.py?
Best regards.