
Extending IndicBART or IndicBERT #57

singhakr opened this issue Jan 31, 2023 · 7 comments


singhakr commented Jan 31, 2023

I basically want to pre-train models from scratch, including the tokenizer, for the languages included in IndicBART and IndicBERT plus some more languages, so as to build something like IndicBARTExt and IndicBERTExt.

While going through some related issues, I noticed that there are some conventions about language codes, and that it is possible to reuse pre-existing language codes for new languages.

Is there some way to add new language codes for, say, IndicBART/IndicBERT without much change to the Python code called from the pre-training shell script, or will it require considerable changes?


singhakr commented Feb 1, 2023

Or, if I can continue pre-training the IndicBART and IndicBERT models on some more Indian languages, preferably with additional language codes, that would be even better. I have a vague idea of how to do it, but not exactly in terms of the toolkit code. Any help or pointers would be appreciated.

prajdabre (Owner) commented

There is already an IndicBERTv2 which supports 24 Indic languages: https://huggingface.co/ai4bharat/IndicBERTv2-SS
AFAIK it was trained from scratch.

As for IndicBARTv2, one is in the works and should be out soon.

Regarding what you want to do, you will need to figure out the following:

  1. Extend the AlbertTokenizer with additional language tokens.
  2. Corresponding to the IDs of the language indicator tokens, extend the embedding matrix of the IndicBART model checkpoint (the .pt or .bin file containing the weights).

This will involve no change to the codebase.
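
A minimal sketch of step 1 using the plain Hugging Face transformers API (not the YANMTT code path); the new codes <2bho> and <2mag> below are purely illustrative placeholders that follow the <2xx> pattern of IndicBART's existing language tokens:

```python
# Sketch: extend the IndicBART AlbertTokenizer with new language indicator tokens.
# "<2bho>" and "<2mag>" are hypothetical codes mimicking the existing "<2hi>"/"<2en>" style.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
)

new_lang_tokens = ["<2bho>", "<2mag>"]
num_added = tokenizer.add_tokens(new_lang_tokens, special_tokens=True)

# These IDs are the rows that step 2 must add to the embedding matrix.
print(num_added, tokenizer.convert_tokens_to_ids(new_lang_tokens))

tokenizer.save_pretrained("indicbart-ext-tokenizer")  # hypothetical output directory
```

Whether YANMTT picks up a tokenizer saved this way without further changes is something to verify against the toolkit.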


singhakr commented Feb 1, 2023

The URL you included is giving a 404 error. Is there some other place where it may be described?

> Extend the AlbertTokenizer with additional language tokens.

I should be able to do this.

> Corresponding to the IDs of the language indicator tokens, extend the embedding matrix of the IndicBART model checkpoint (the .pt or .bin file containing the weights).

For this, can I simply load the pre-trained model and call the training script or function again on more data, or are there other points I should keep in mind?

And do the language codes matter, or can one simply reuse existing codes? I remember reading in the paper that languages are not distinguished during training so as to allow zero-shot learning, but I guess the code will matter when actually trying to translate.

BTW, I also want to fine-tune on some basic NLP tasks like POS tagging and NER. Will the pre-training differ for bilingual and monolingual tasks? There are, I think, two different scripts for monolingual and translation tasks.


singhakr commented Feb 2, 2023

The URL is working now. I have data for three languages; two of them are not in the list of 24, and for one I may have some more data.

For IndicBERT, I have posted an issue on their repository.

For IndicBART, I am not clear on how to use the language codes when continuing pre-training or when fine-tuning.

Should the pre-training be different for multilingual parallel corpora and multilingual monolingual corpora? Or will only the fine-tuning be different? In either case, how should I proceed properly? I have not worked with BERT before.

prajdabre (Owner) commented

> The URL you included is giving a 404 error. Is there some other place where it may be described?
>
> > Extend the AlbertTokenizer with additional language tokens.
>
> I should be able to do this.
>
> > Corresponding to the IDs of the language indicator tokens, extend the embedding matrix of the IndicBART model checkpoint (the .pt or .bin file containing the weights).
>
> For this, can I simply load the pre-trained model and call the training script or function again on more data, or are there other points I should keep in mind?

You will have to look into how to resize the embedding layer, and you will have to hack YANMTT for this. Take a look here for hints: https://discuss.huggingface.co/t/adding-new-tokens-while-preserving-tokenization-of-adjacent-tokens/12604/3
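
For reference, one way this can look with the plain Hugging Face API is sketched below; it is only a rough illustration (the language codes and output paths are placeholders, and how YANMTT expects to load the enlarged checkpoint is something you would need to check in the toolkit):

```python
# Sketch: grow IndicBART's embedding matrix so newly added language tokens get
# (randomly initialized) embedding rows, then save the enlarged checkpoint
# as a starting point for continued pre-training.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
)
tokenizer.add_tokens(["<2bho>", "<2mag>"], special_tokens=True)  # hypothetical new codes

model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART")
model.resize_token_embeddings(len(tokenizer))  # appends rows for the new token IDs

model.save_pretrained("indicbart-ext")                    # writes pytorch_model.bin
torch.save(model.state_dict(), "indicbart-ext/model.pt")  # or a raw .pt state dict
```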

> And do the language codes matter, or can one simply reuse existing codes? I remember reading in the paper that languages are not distinguished during training so as to allow zero-shot learning, but I guess the code will matter when actually trying to translate.

You can reuse them, but it's a hacky solution.

> BTW, I also want to fine-tune on some basic NLP tasks like POS tagging and NER. Will the pre-training differ for bilingual and monolingual tasks? There are, I think, two different scripts for monolingual and translation tasks.

YANMTT is not designed for basic NLP tasks; it is designed for NLG tasks. However, you can treat the NLP task as an NLG task and see what happens. One script is for pre-training; however, the fine-tuning script can also be used indirectly for pre-training. I'm planning to retire the pre-training script since the latter can already do everything.
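
As a toy illustration of treating a tagging task as generation (the sentence, tag set, and file names below are invented for the example; YANMTT's expected data format may differ):

```python
# Toy sketch: cast POS tagging as a text-to-text task by serializing the tag
# sequence as the target "sentence" the model must generate.
examples = [
    ("राम ने खाना खाया", ["PROPN", "ADP", "NOUN", "VERB"]),  # "Ram ate food"
]

with open("train.src", "w", encoding="utf-8") as src, \
     open("train.tgt", "w", encoding="utf-8") as tgt:
    for sentence, tags in examples:
        src.write(sentence + "\n")
        tgt.write(" ".join(tags) + "\n")  # the model "translates" words into tags
```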

prajdabre (Owner) commented

> The URL is working now. I have data for three languages; two of them are not in the list of 24, and for one I may have some more data.
>
> For IndicBERT, I have posted an issue on their repository.
>
> For IndicBART, I am not clear on how to use the language codes when continuing pre-training or when fine-tuning.

For fine-tuning, you can take a look here: https://github.com/AI4Bharat/indic-bart
For continued pre-training, you can reuse the commands in the aforementioned repo, except that your source and target corpora are the same monolingual corpus files.
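
For instance, the data preparation for continued pre-training could look like the sketch below (the file names and language codes are placeholders; the actual command-line options should be taken from the indic-bart/YANMTT examples):

```python
# Sketch: for continued (denoising) pre-training, the source and target sides
# point at the same monolingual text; the noising is done by the toolkit.
import shutil

for lang in ["bho", "mag"]:           # hypothetical new language codes
    mono = f"mono.{lang}"             # one sentence per line
    shutil.copy(mono, f"train.src.{lang}")
    shutil.copy(mono, f"train.tgt.{lang}")  # target == source
```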

> Should the pre-training be different for multilingual parallel corpora and multilingual monolingual corpora? Or will only the fine-tuning be different? In either case, how should I proceed properly? I have not worked with BERT before.

Rather than jumping into continued pre-training, I recommend that you first get used to pre-training models from scratch with YANMTT. Once you have familiarized yourself with this, things will get easier. Look into the examples folder for help.

prajdabre (Owner) commented

Also, I recommend going through the issues section; most of your questions can be answered there:

#37
#36
#34
#33
#54
#31
#17
#5
