Question generation for non-English languages #5
What language are you considering? Try looking up Cross-Lingual Natural Language Inference (XNLI) and Cross-Lingual Question Answering (MLQA) to fine-tune miniLM. If you require something different, consider procuring your own dataset for fine-tuning.
Hello,
I'm just having some trouble creating the top layer to make seq2seq generation work. If you could explain in some detail how I can create it on my own, that would be great.
Why do you have to create the top layer? Are you using the miniLM code that is available?
I want to understand it and rebuild it on my own; right now the seq2seq code of miniLM is not adapted to the multilingual version. Thanks for your replies, I really appreciate your help.
If you’re looking to customize the question generation component, take a look at
However, what aspects of your multilingual approach would require adaptation?
The main problem is that the multilingual version is not as good as the native one, and NLG is a data-hungry task, as you know.
My impression is that you could use more training data for what you're trying to achieve. Am I missing something?
Hi @secsrexion, how is your progress with non-English question generation? We are also interested in the Chinese-language QG task, and wondering how much work we might have to put in to adapt the code provided by artitw. BTW, great work and thanks @artitw for sharing the code! Have you published your work anywhere?
Hello @thusithaC. Generally, from what I found, using a multilingual version of UNILM is a bad choice due to the lack of a rich, reliable dataset in the training process; I was getting 80% of each phrase marked as UNK by the tokenizer.
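A quick way to quantify the UNK problem described above is to measure what fraction of a tokenized phrase maps to the unknown token. This is a minimal, self-contained sketch using a toy vocabulary; in practice you would run the same check with the model's own tokenizer, and the names here are illustrative only:

```python
# Toy WordPiece-style lookup to illustrate the check; a real run would use
# the model's actual tokenizer instead of this hand-built vocab.
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}

def encode(words, vocab, unk="[UNK]"):
    """Map each word to its vocab id, falling back to the unknown id."""
    return [vocab.get(w, vocab[unk]) for w in words]

def unk_rate(ids, unk_id=0):
    """Fraction of token ids equal to the unknown id."""
    return sum(i == unk_id for i in ids) / max(len(ids), 1)

# An out-of-vocabulary Arabic word becomes [UNK]; 1 of 3 tokens is unknown.
ids = encode(["the", "قطة", "sat"], vocab)
print(round(unk_rate(ids), 2))  # 0.33
```

If this rate is high on a representative sample of your corpus (as the 80% reported above), the tokenizer's vocabulary simply does not cover the language, and no amount of fine-tuning will recover the lost text.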
@artitw I was facing troubles with the multilingual version and the quality of the dataset. Now I'm trying to develop a reliable dataset for Arabic question/answer pairs, and I'm searching for a way to train a new native version of UNILM. Any ideas?
@secsrexion Thanks for the reply. I saw your post on the UNILM GitHub as well :) I wonder whether the quality issue you face is because the multilingual model is based on "miniLM", i.e. the smaller model, whereas this code base is based on the full English UNILM model, which is vastly superior?
Hi
Hi, |
@thusithaC @secsrexion @jacampo I am looking into making a multilingual model to see if and how it can be done. As @secsrexion pointed out, the low amounts of data need to be addressed. I will keep you all updated.
Awesome! thanks,
Thusitha Chandrapala
An alternative solution, if you need something immediately, is to translate to English and then use this model.
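The translate-then-generate workaround suggested here can be sketched as a thin orchestration layer. The helper below is hypothetical: `translate` and `generate_question` are caller-supplied placeholders standing in for whatever MT system and English QG model you plug in; only the wiring is shown.

```python
def multilingual_qg(passage, answer, lang, translate, generate_question):
    """Generate a question in `lang` by routing through English.

    `translate(text, src, tgt)` and `generate_question(passage, answer)`
    are caller-supplied callables, so any MT system and any English
    question-generation model can be plugged in.
    """
    en_passage = translate(passage, src=lang, tgt="en")
    en_answer = translate(answer, src=lang, tgt="en")
    en_question = generate_question(en_passage, en_answer)
    # Translate the generated English question back to the source language.
    return translate(en_question, src="en", tgt=lang)

# Smoke test with an identity "translation" and a canned generator.
identity = lambda text, src, tgt: text
question = multilingual_qg(
    "Paris is the capital of France.", "Paris", "fr",
    translate=identity,
    generate_question=lambda p, a: "What is the capital of France?",
)
print(question)  # What is the capital of France?
```

The obvious trade-off is that translation errors compound in both directions, so output quality is bounded by the MT system as well as the English QG model.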
@artitw Ok, thanks. The translation approach could be interesting too.
@jacampo which BERT model are you referring to? If it uses WordPiece tokenization, I cannot think of any differences in the code used.
I found one in Spanish: https://github.com/dccuchile/beto You use BertForSeq2SeqDecoder, right? What is the difference between that and BertModel or BertForPreTraining? Sorry for disturbing you with so many questions. Edit: Ok, I see you start with bert-base-cased, so my question is resolved. It is a lot of information at once; do you recommend a simple guide to understand the models and how to use them?
@jacampo glad you figured it out. Yes, it is indeed confusing. It sounds like you would find a fine-tuning guide useful. I can think about how that might be done. In the meantime, if you find something that works please share back here.
@artitw Thanks, I'll let you know.
Anyone want to work together on this? I've started an approach to multilingual question generation and summarization but have not had enough time to run experiments. I could provide some guidance for anyone interested in collaborating, as long as the work is contributed back to open source here. The approach would be based on cross-lingual models, as I describe here: https://www.youtube.com/watch?v=caZLVcJqsqo
Multilingual question generation is now available. Check out the latest version.
Hello,
I hope you are doing fine. Firstly, thank you for your contributions on question generation; I have a question, if I may ask.
I'm trying to build a question generation system for a non-English language. I was planning to use UNILM (the miniLM multilingual version) because BERT is not really built for text generation. Since you have experience with this, how do you suggest doing it, and am I following a good path?
Thank you in advance for your appreciated help!