Adding models for new languages master thread #3056
-
Currently, I am working on https://github.com/howl-anderson/Chinese_models_for_SpaCy to support a Chinese language model for spaCy. It's working pretty well for me, so how can I get my model into the official repository? Any suggestions?
-
When it comes to the Romanian language, there is a particularity which I am not sure has been discussed elsewhere. Due to historical development, most keyboards in Romania do not have the Romanian characters with diacritics (ă, â, î, ș, ț). They have thus been substituted with their base letters (a, i, s, t) in most of the written language, often also in official documents, and I'd say in the majority of the online data (no citation available). Looking at the RO language data, I can observe that most of the data is written using the diacritics, but there are exceptions (e.g. "aceeasi" instead of "aceeași" in the list of the STOP_WORDS). This may be an issue for lemmatization and machine understanding. Here's an example where Romanians usually understand this adaptation well, while I'm not sure what the case is in the spaCy context:
Thus the options:
I believe the first option is easier and better, and users may adjust the corpora used in training accordingly. What's your say?
-
@howl-anderson Thanks for your work on the Chinese model! We have a license to the OntoNotes 5 data, so I can use your scripts to convert the corpus and try to get Chinese added to our model training pipeline. In order to support a model officially, we need to have the model training with our scripts; otherwise the binary would go out of date when we made changes to the library. But from your scripts, this looks like it should be fairly easy. One thing that would probably be good to try is using a different treebank instead of the UD_Chinese corpus. For instance, we should be able to run a dependency converter on the OntoNotes 5 Chinese parts. We also have a license to the Penn Chinese Treebank. The reason to convert another corpus is that the UD_Chinese corpus is licensed CC BY-NC, so we won't be able to distribute a commercially-friendly Chinese model if we use that data to train the tagger and parser. For OntoNotes 5 and the Penn Chinese Treebank, we'd be able to release MIT-licensed models, like we do for English and German. So, we need to find a good dependency converter that works with Chinese, and run it over the treebank. I'm not sure whether the Stanford converter works with Chinese; if so, it'd be a good choice. We could also check the ClearNLP converter, the MALT converter, and the CoNLL-09 converter.
-
@ursachi Thanks for raising this. Orthographic and dialect variation is something we need to pay more attention to. Perhaps a good solution would be to provide a function that restores the diacritics if they're missing? I'm not sure how difficult this would be, but it might be easy for common words. If so, this would be a useful utility that people could use as a pre-process. For the stop list, I'd be happy to have duplicate versions in the stop words, with and without diacritics. I'd say the same for the tokenizer exceptions. The lemma lookup tables already get rather large, so it seems like we probably want to handle this inside a lemmatizer function instead of duplicating the data there. More generally, there's the question of how to handle this for statistical models. A model trained on a corpus with diacritics will not perform well on text without the diacritics, and vice versa. One solution is to apply a data augmentation process, so that the model sees both types of text. We would need a function that takes the training data and returns two versions: one with diacritics and one without, with the same gold-standard analyses in both cases.
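As a minimal sketch of that augmentation function (the diacritics mapping and the `(text, annotations)` example format here are assumptions, not anything spaCy ships):

```python
# Sketch of the augmentation idea: emit each training example twice, once
# as-is and once with Romanian diacritics replaced by their base letters,
# keeping the same gold-standard annotations. Because the replacement is
# one-to-one, character offsets in the annotations stay valid.
RO_DIACRITICS = str.maketrans("ăâîșțĂÂÎȘȚ", "aaistAAIST")

def augment_diacritics(train_data):
    for text, annotations in train_data:
        yield text, annotations
        stripped = text.translate(RO_DIACRITICS)
        if stripped != text:
            yield stripped, annotations
```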
-
Thank you @honnibal, I have a license for the OntoNotes 5 data too, and I will try to get a license for the Penn Chinese Treebank. Meanwhile, I will check and update my scripts so they integrate more easily with yours. Let's keep in touch; I will keep you informed.
-
@howl-anderson Could you have a look at constituency-to-dependency conversion scripts? If we can run the UD scripts on another corpus, that might be good. Alternatively, some other converter would be a good option. The Chinese corpora tend to be distributed as constituency parses, but spaCy needs to learn dependencies.
-
@honnibal Absolutely! I am very happy to participate in this project. When I get the results, I'll let you know.
-
I have an experimental release with a UD-based Hungarian model – would this be interesting for the community?
-
@oroszgy This looks really good! I'll take a look at adding the data files for this to the model training pipeline. Currently I just need to update the machine image that has the corpora, to add new datasets. I need to make an update for Norwegian as well. In theory, once the data files are added, it should be pretty simple to publish the model. We need to have the pipeline training the model though, rather than just getting the artifact from you --- otherwise we can't retrain when we make code changes etc.
-
@honnibal let me know if there is anything I can help with.
-
@honnibal Just to keep you informed: I found http://nlp.cs.lth.se/software/treebank_converter/ "The LTH Constituent-to-Dependency Conversion Tool for Penn-style Treebanks", which looks like a promising tool for converting treebanks to CoNLL format. Since I still cannot get a Chinese treebank corpus, I cannot test it yet; I will continue trying to get a licensed Chinese treebank and will keep you informed.
-
Can we add a basic Marathi tokenizer as well? It's a language very close to Hindi except for a few extra words and stem suffixes. The stemmer could be ported from here and here; the latter was adapted from the same paper you mentioned for the Hindi support. The stem suffixes mentioned in the latter are the following (although not a complete list, in tandem with the first resource this should cover a huge part):
A list of basic stop words is available here, while the numbers are available here. Should I put in a PR?
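For illustration, here's a toy sketch of such a suffix-stripping stemmer; the suffix sample below is hypothetical, not the list from the linked resources:

```python
# Toy sketch of a longest-match suffix-stripping stemmer for Marathi.
# MARATHI_SUFFIXES is a tiny hypothetical sample, not the real list.
MARATHI_SUFFIXES = ["ने", "ला", "ची", "चा", "चे"]

def stem(word, suffixes=tuple(sorted(MARATHI_SUFFIXES, key=len, reverse=True))):
    for suffix in suffixes:
        # Only strip when a non-trivial stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word
```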
-
@Shashi456 Yes, that sounds good! 👍
-
Hi, I'm trying to train NER for the Lithuanian language. What I have already done:
I have a few questions:
Below is the training data; the trained NER should recognize these entities, right? So far, when testing the trained model, I get no entities.
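The data itself isn't shown above, but for reference, spaCy's NER training examples in the v2-style format pair a text with character-offset entity spans; the Lithuanian sentence and offsets below are illustrative placeholders, not the user's actual data:

```python
# Illustrative placeholder: each example is (text, {"entities": [...]}),
# where each entity is (start_char, end_char, label) with an exclusive end.
TRAIN_DATA = [
    ("Vilnius yra Lietuvos sostinė.",
     {"entities": [(0, 7, "GPE"), (12, 20, "GPE")]}),
]
```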
-
The model should learn to generalise based on those examples – the training process expects to see lots of examples and generalise from them, not see a handful of examples and memorise them. So ideally, you want a few thousand examples or more. For reference, the English model was trained on 2 million words. One strategy would be to take the Lithuanian UD corpus you trained on and label it for named entities. This is how the new Greek model for spaCy was trained, by the way.
-
Hello,
-
Hi! I am starting to study spaCy v3, in order to retrain our custom models with the latest version. The new pipeline templates look very powerful and friendly – thanks for the great work.
-
Hi,
-
We are preparing the pull requests for improving the tokenization and lemmatization of the Catalan models, and we are almost there, but we are at a loss when preparing the attribute_ruler patterns to handle the exceptions for token attributes needed for contractions. We copied the JSON format that we obtained from the existing patterns by doing nlp.get_pipe("attribute_ruler").patterns. We then tried to load them at initialization from our configuration file with: [initialize.components.attribute_ruler.tag_map] When we tried to train, we got: ✘ Error validating initialization settings in ... Maybe it is a very basic question, but we can't find documentation on how this file should be formatted; when we load our patterns as in https://spacy.io/api/attributeruler#add_patterns, all's well. What are we doing wrong? Thanks for the help. Hope we make it before you release version 3.1.2. Carlos
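For comparison, a minimal sketch of the pattern format that the linked `add_patterns` docs expect; the pipeline name and the Catalan contraction pattern are illustrative placeholders:

```python
import spacy

nlp = spacy.load("ca_core_news_sm")  # placeholder pipeline name
ruler = nlp.get_pipe("attribute_ruler")
# Each entry pairs Matcher token patterns with the attributes to set on the match
ruler.add_patterns([
    {"patterns": [[{"ORTH": "del"}]], "attrs": {"POS": "ADP"}},
])
```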
-
Is there an example somewhere of a project that you used to build the core models? I didn't find it in spacy-projects, and I would need an example of how you train a morphologizer+parser on one corpus and NER on another, and then merge them into one model inside a project. My main problem is how to get the tok2vec components right, as well as when to add the attribute ruler and lemmatizer to the pipeline.
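As a sketch of one way to do the merging step, spaCy v3 lets you source a trained component from another pipeline via `add_pipe`; the package names here are placeholders:

```python
import spacy

# Placeholder package names: one pipeline trained on the treebank, one on NER
nlp = spacy.load("xx_parser_tagger_model")
ner_source = spacy.load("xx_ner_model")

# Copy the trained NER component across. Note that a sourced component that
# was trained with a tok2vec listener also needs its embedding layer resolved
# (e.g. via nlp.replace_listeners), or it must have its own internal tok2vec.
nlp.add_pipe("ner", source=ner_source, last=True)
```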
-
Hi, I have time on my hands and would like to add Latvian language support to spaCy. (I know, it's in xx already.) I note that the promising link to "adding languages" now leads to "language features", and that page doesn't seem to help me. The discussion in this thread seems to come from people who are already well past the first hurdles and somehow know what they are doing. I'd love to find some resource saying:
Has anybody written a how-to on adding a new language that I can find somewhere on the web? Thanks for any pointers!
-
May I ask if the process of adding a new language model mainly consists of finding/creating a good dataset with a commercial-friendly license and training on those datasets with spaCy?
-
Hi, I am in the process of deciding what should be included in the Swedish pipeline, and I saw that the components in the core models vary a lot between different languages.
-
Draft language models for Croatian. Hi, I don't know the Croatian language, but I'm willing to extend to Croatian some text evaluation functions concerning text complexity and readability, already developed for English, Italian, Greek and Spanish on top of spaCy in an open-source collaborative learning platform, inside an Erasmus+ project. First, I generated a few components with spaCy 3.0.5, starting from the conllu corpora at https://github.com/UniversalDependencies/UD_Croatian-SET. Then, I created a POS-based lemmatizer by exploiting the inflectional lexicon hrLex 1.3 at the Slovenian CLARIN repository, https://www.clarin.si/repository/xmlui/handle/11356/1232. The final pipeline components were: ['lemmatizer', 'tok2vec', 'tagger', 'morphologizer', 'parser']. I hope someone will continue my work, refine it and possibly extend the model with an NER component; I'm available to transfer my work notes.
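For anyone picking this up, here is a toy sketch of the lookup-table step described above; the hrLex file name and column layout are assumptions:

```python
# Toy sketch: turn an inflectional lexicon with (form, lemma, POS) rows into
# the POS-keyed lookup table a POS-based lemmatizer can consult.
import csv
from collections import defaultdict

lookup = defaultdict(dict)  # {pos: {form: lemma}}
with open("hrLex_1.3.tsv", encoding="utf8") as f:
    for row in csv.reader(f, delimiter="\t"):
        form, lemma, pos = row[0], row[1], row[2]
        lookup[pos][form.lower()] = lemma
```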
-
Hi. I just extended the README page in https://github.com/gtoffoli/commons-language/tree/master/nlp/spacy_custom/hr, starting to add a detailed trace of the actions performed. Also, I've read the documentation on sourcing components from existing pipelines, but I didn't fully understand it. Till now I have only followed instructions referring to spaCy CLI commands such as convert, init fill-config, debug data and train. Should I open a Python shell (or write a Python script) and run the training through API functions?
-
Thank you Adriane and Paul. I was able to use … With an updated version of my …
Please remember that hr500k 1.0 is the source repository from which the UD_Croatian-SET listed on the UD site was also derived; only, the latter doesn't retain the NE annotation. On this occasion, I realised that Nikola Ljubešić, the maintainer of hr500k 1.0, is also the curator of the hrLex inflectional lexicon, which I had previously used to generate lookup tables for a POS-based lemmatiser; more generally, he seems to be a very influential and helpful researcher.
-
Hi @adrianeboyd and @svlandeg 🙂, I've noticed that spaCy now has support for Setswana [lang/tn]. More specifically, Setswana shares a lot of similarities with Northern Sotho. Would it at all be possible to take a transfer-learning-like approach to developing fully-fledged models for both languages in parallel? And if so, how do I get started? I'd love to do the same for Afrikaans in the future as well.
-
For anyone seeing this thread now: in the future, please open a new Discussion if you have a proposal for new language support. When this thread started, Discussions didn't exist on GitHub, and keeping everything in one thread made it easier to manage. However, with Discussions it's fine to open lots of threads, and that way we can make sure notifications go just to the people working on any particular language.
-
EDIT 2022-07-11: For new language additions, please open a new thread in Discussions instead of commenting on this one.
This thread bundles discussion around adding pre-trained models for new languages (and improving the existing language data). A lot of information and discussion has been spread over various different issues (usually specific to the language), which made it more difficult to get an overview.
See here for the available pre-trained models, and this page for all languages currently available in spaCy. Languages marked as "alpha support" usually only include tokenization rules and various other rules and language data.
How to go from alpha support to a pre-trained model
The process requires the following steps and components:
- A tag map for the treebank, mapping its fine-grained tags to coarse-grained tags like `NOUN` and optional morphological features.
- A converter for the corpus, e.g. the `spacy convert` command that takes `.conllu` files and outputs spaCy's JSON format. See here for an example of a training pipeline with data conversion. Corpora can have very subtle formatting differences, so it's important to check that they can be converted correctly.
- Running `spacy train` to train a new model (see the sketch below).

With our new internal model training infrastructure, it's now much easier for us to integrate new pipelines and train new models.
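To make that convert-and-train loop concrete, here's a minimal sketch driving the documented CLI from Python, assuming a v2-style `spacy train` signature; all paths and the language code `xx` are placeholders:

```python
import subprocess
from pathlib import Path

# Convert the treebank's .conllu files into spaCy's training format
for split in ("train", "dev"):
    out_dir = Path(f"converted/{split}")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["python", "-m", "spacy", "convert",
         f"corpus/{split}.conllu", str(out_dir),
         "--converter", "conllu"],
        check=True,
    )

# Train a tagger/parser model from the converted data (v2-style CLI:
# language code, output directory, training data, development data)
subprocess.run(
    ["python", "-m", "spacy", "train", "xx", "models",
     "converted/train", "converted/dev"],
    check=True,
)
```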
Ideas for how to get involved
Contributing to the models isn't always easy, because there are a lot of different things to consider, and a big part of it comes down to sourcing suitable data and running experiments. But here are a few ideas for things that can move us forward:
1️⃣ Difficulty: good for beginners
📖 Relevant documentation: Adding languages, Tokenization, Test suite Readme
2️⃣ Difficulty: advanced
- Create a tag map, keyed by the fine-grained tag (`token.tag_`, e.g. `"NNS"`), mapped to the coarse-grained tag (`token.pos_`, e.g. `"NOUN"`) and other morphological features. The tags in the tag map should be the tags used by the treebank.
- Convert a treebank with `spacy convert` and run `spacy train` to train the model. See here for an example. (Note that most corpora don't come with NER annotations, so you'll usually only be able to train the tagger and parser.) It might work out-of-the-box straight away – or it might require some more formatting and pre-processing. Finding this out will be very helpful. You can share your results and the reproducible commands to use in this thread.
- Help with the feature coming in `v2.1.0` – pre-training a language model similar to BERT/Elmo/ULMFiT etc. (see 💫 Add experimental ULMFit/BERT/Elmo-like pretraining #2931). We only need the cleaned, raw text – for example as a `.txt` or `.jsonl` file (see the sketch below).

When using other resources, make sure the data license is compatible with spaCy's MIT license and ideally allows commercial use (since many people use spaCy commercially). Examples of suitable licenses are CC, Apache, MIT. Examples of unsuitable licenses are CC BY-NC, CC BY-SA, (A)GPL.
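As an illustration, such a `.jsonl` file is just one JSON object per line with a `"text"` field; the file name and sample sentences below are placeholders:

```python
# Write raw text as JSONL: one {"text": ...} object per line.
import json

texts = ["First raw sentence.", "Second raw sentence."]  # placeholder text
with open("raw_text.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```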
📖 Relevant documentation: Adding languages, Training via the CLI
If you have questions, feel free to leave a comment here. We'll also be updating this post with more tasks and ideas as we go.
[EDIT, February 2021: since we have the discussions board on GitHub, there is a whole forum on language support where you can create a new thread to discuss language-specific collaborations, issues, progress, etc.]