Adding models for new languages master thread #3056
-
Currently, I am working on https://github.com/howl-anderson/Chinese_models_for_SpaCy to support a Chinese language model for spaCy. It's working pretty well for me, so how can I get my model into the official repository? Any suggestions?
-
When it comes to the Romanian language, there is a particularity which I am not sure has been discussed elsewhere. Due to historical development, most keyboards in Romania do not have the Romanian characters with diacritics (ă, â, î, ș, ț). They have thus been substituted with their base letters (a, i, s, t) in most of the written language, often also in official documents, and I'd say in the majority of the online data (no citation available). Looking at the RO language data, I can observe that most of the data is written using the diacritics, but there are exceptions (e.g. "aceeasi" instead of "aceeași" in the list of the STOP_WORDS). This may be an issue for lemmatization and machine understanding. Here's an example where Romanians usually understand this adaptation well, while I'm not sure what the case is in the spaCy context:
Thus the options:
I believe the first option is easier and better, and users may adjust the corpora used in training accordingly. What's your say?
-
@howl-anderson Thanks for your work on the Chinese model! We have a license to the OntoNotes 5 data, so I can use your scripts to convert the corpus and try to get Chinese added to our model training pipeline. In order to support a model officially, we need to have the model training with our scripts; otherwise the binary would go out of date when we made changes to the library. But from your scripts, this looks like it should be fairly easy. One thing that would probably be good to try is using a different treebank instead of the UD_Chinese corpus. For instance, we should be able to run a dependency converter on the OntoNotes 5 Chinese parts. We also have a license to the Penn Chinese Treebank. The reason to convert another corpus is that the UD_Chinese corpus is licensed CC BY-NC, so we won't be able to distribute a commercially-friendly Chinese model if we use that data to train the tagger and parser. For OntoNotes 5 and the Penn Chinese Treebank, we'd be able to release MIT-licensed models, like we do for English and German. So, we need to find a good dependency converter that works with Chinese, and run it over the treebank. I'm not sure whether the Stanford converter works with Chinese; if so, it'd be a good choice. We could also check the ClearNLP converter, the MALT converter, and the CoNLL-09 converter.
-
@ursachi Thanks for raising this. Orthographic and dialect variation is something we need to pay more attention to. Perhaps a good solution would be to provide a function that restores the diacritics if they're missing? I'm not sure how difficult this would be, but it might be easy for common words. If so, this would be a useful utility that people could use as a pre-process. For the stop list, I'd be happy to have duplicate versions in the stop words, with and without diacritics. I'd say the same for the tokenizer exceptions. The lemma lookup tables already get rather large, so it seems like we probably want to handle this inside a lemmatizer function instead of duplicating the data there. More generally, there's the question of how to handle this for statistical models. A model trained on a corpus with diacritics will not perform well on text without the diacritics, and vice versa. One solution is to apply a data augmentation process, so that the model sees both types of text. We would need a function that takes the training data and returns two versions: one with diacritics and one without, with the same gold-standard analyses in both cases.
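As a minimal sketch of that augmentation function (the diacritics mapping and the `(text, annotations)` example format here are assumptions, not anything spaCy ships):

```python
# Sketch of the augmentation idea: emit each training example twice, once
# as-is and once with Romanian diacritics replaced by their base letters,
# keeping the same gold-standard annotations. Because the replacement is
# one-to-one, character offsets in the annotations stay valid.
RO_DIACRITICS = str.maketrans("ăâîșțĂÂÎȘȚ", "aaistAAIST")

def augment_diacritics(train_data):
    for text, annotations in train_data:
        yield text, annotations
        stripped = text.translate(RO_DIACRITICS)
        if stripped != text:
            yield stripped, annotations
```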
-
Thank you @honnibal, I have a license for the OntoNotes 5 data too, and I will try to get a license for the Penn Chinese Treebank. Meanwhile, I will check and update my scripts so they integrate more easily with yours. Let's keep in touch; I will keep you informed.
-
@howl-anderson Could you have a look at constituency-to-dependency conversion scripts? If we can run the UD scripts on another corpus, that might be good. Alternatively, some other converter would be a good option. The Chinese corpora tend to be distributed as constituency parses, but spaCy needs to learn dependencies.
-
@honnibal Absolutely! I am very happy to participate in this project. When I get the results, I'll let you know.
-
I have an experimental release with a UD-based Hungarian model – would this be interesting for the community?
-
@oroszgy This looks really good! I'll take a look at adding the data files for this to the model training pipeline. Currently I just need to update the machine image that has the corpora, to add new datasets. I need to make an update for Norwegian as well. In theory, once the data files are added, it should be pretty simple to publish the model. We need to have the pipeline training the model though, rather than just getting the artifact from you --- otherwise we can't retrain when we make code changes etc.
-
@honnibal let me know if there is anything I can help with.
-
@honnibal Just to keep you informed: I found http://nlp.cs.lth.se/software/treebank_converter/ "The LTH Constituent-to-Dependency Conversion Tool for Penn-style Treebanks", which looks like a promising tool for converting treebanks to CoNLL format. Since I still cannot get a Chinese treebank corpus, I cannot test it yet; I will continue trying to get a licensed Chinese treebank and will keep you informed.
-
Can we add a basic Marathi tokenizer as well? It's a language very close to Hindi except for a few extra words and stem suffixes. The stemmer could be ported from here and here; the latter was adapted from the same paper you mentioned for the Hindi support. The stem suffixes mentioned in the latter are the following (although not a complete list, in tandem with the first resource this should cover a huge part):
A list of basic stop words is available here, while the numbers are available here. Should I put in a PR?
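For illustration, here's a toy sketch of such a suffix-stripping stemmer; the suffix sample below is hypothetical, not the list from the linked resources:

```python
# Toy sketch of a longest-match suffix-stripping stemmer for Marathi.
# MARATHI_SUFFIXES is a tiny hypothetical sample, not the real list.
MARATHI_SUFFIXES = ["ने", "ला", "ची", "चा", "चे"]

def stem(word, suffixes=tuple(sorted(MARATHI_SUFFIXES, key=len, reverse=True))):
    for suffix in suffixes:
        # Only strip when a non-trivial stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word
```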
-
@Shashi456 Yes, that sounds good! 👍
-
Hi, I'm trying to train NER for the Lithuanian language. What I have already done:
I have a few questions:
Below is the training data; the trained NER should recognize these entities, right? So far, when testing the trained model, I get no entities.
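The data itself isn't shown above, but for reference, spaCy's NER training examples in the v2-style format pair a text with character-offset entity spans; the Lithuanian sentence and offsets below are illustrative placeholders, not the user's actual data:

```python
# Illustrative placeholder: each example is (text, {"entities": [...]}),
# where each entity is (start_char, end_char, label) with an exclusive end.
TRAIN_DATA = [
    ("Vilnius yra Lietuvos sostinė.",
     {"entities": [(0, 7, "GPE"), (12, 20, "GPE")]}),
]
```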
-
The model should learn to generalise based on those examples – the training process expects to see lots of examples and generalise from them, not see a handful of examples and memorise them. So ideally, you want a few thousand examples or more. For reference, the English model was trained on 2 million words. One strategy would be to take the Lithuanian UD corpus you trained on and label it for named entities. This is how the new Greek model for spaCy was trained, by the way.
-
Hello,
-
Hi! I am starting to study spaCy v3, in order to retrain our custom models with the latest version. The new pipeline templates look very powerful and friendly – thanks for the great work.
-
Hi,
-
We are preparing the pull requests for improving the tokenization and lemmatization of the Catalan models, and we are almost there, but we are at a loss when preparing the attribute_ruler patterns to handle the exceptions for token attributes needed for contractions. We copied the JSON format that we obtained from the existing patterns by doing nlp.get_pipe("attribute_ruler").patterns. We then tried to load them at initialization from our configuration file with: [initialize.components.attribute_ruler.tag_map] When we tried to train, we got: ✘ Error validating initialization settings in ... Maybe it is a very basic question, but we can't find documentation on how this file should be formatted; when we load our patterns as in https://spacy.io/api/attributeruler#add_patterns, all's well. What are we doing wrong? Thanks for the help. Hope we make it before you release version 3.1.2. Carlos
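For comparison, a minimal sketch of the pattern format that the linked `add_patterns` docs expect; the pipeline name and the Catalan contraction pattern are illustrative placeholders:

```python
import spacy

nlp = spacy.load("ca_core_news_sm")  # placeholder pipeline name
ruler = nlp.get_pipe("attribute_ruler")
# Each entry pairs Matcher token patterns with the attributes to set on the match
ruler.add_patterns([
    {"patterns": [[{"ORTH": "del"}]], "attrs": {"POS": "ADP"}},
])
```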
-
Is there an example somewhere of a project that you used to build the core models? I didn't find it in spacy-projects, and I would need an example of how you train a morphologizer+parser on one corpus and NER on another, and then merge them into one model inside a project. My main problem is how to get the tok2vec components right, as well as when to add the attribute ruler and lemmatizer to the pipeline.
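As a sketch of one way to do the merging step, spaCy v3 lets you source a trained component from another pipeline via `add_pipe`; the package names here are placeholders:

```python
import spacy

# Placeholder package names: one pipeline trained on the treebank, one on NER
nlp = spacy.load("xx_parser_tagger_model")
ner_source = spacy.load("xx_ner_model")

# Copy the trained NER component across. Note that a sourced component that
# was trained with a tok2vec listener also needs its embedding layer resolved
# (e.g. via nlp.replace_listeners), or it must have its own internal tok2vec.
nlp.add_pipe("ner", source=ner_source, last=True)
```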
-
Hi, I have time on my hands and would like to add Latvian language support to spaCy. (I know, it's in xx already.) I note that the promising link to "adding languages" now leads to "language features", and that page doesn't seem to help me. The discussion in this thread seems to come from people who are already well past the first hurdles and somehow know what they are doing. I'd love to find some resource saying:
Has anybody written a how-to on adding a new language that I can find somewhere on the web? Thanks for any pointers!
-
May I ask if the process of adding a new language model mainly consists of finding/creating a good dataset with a commercial-friendly license and training on those datasets with spaCy?
-
Hi, I am in the process of deciding what should be included in the Swedish pipeline, and I saw that the components in the core models vary a lot between different languages.
-
Draft language models for Croatian. Hi, I don't know the Croatian language, but I'm willing to extend to Croatian some text evaluation functions concerning text complexity and readability, already developed for English, Italian, Greek and Spanish on top of spaCy in an open-source collaborative learning platform, inside an Erasmus+ project. First, I generated a few components with spaCy 3.0.5, starting from the conllu corpora at https://github.com/UniversalDependencies/UD_Croatian-SET. Then, I created a POS-based lemmatizer by exploiting the inflectional lexicon hrLex 1.3 at the Slovenian CLARIN repository, https://www.clarin.si/repository/xmlui/handle/11356/1232. The final pipeline components were: ['lemmatizer', 'tok2vec', 'tagger', 'morphologizer', 'parser']. I hope someone will continue my work, refine it and possibly extend the model with an NER component; I'm available to transfer my work notes.
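For anyone picking this up, here is a toy sketch of the lookup-table step described above; the hrLex file name and column layout are assumptions:

```python
# Toy sketch: turn an inflectional lexicon with (form, lemma, POS) rows into
# the POS-keyed lookup table a POS-based lemmatizer can consult.
import csv
from collections import defaultdict

lookup = defaultdict(dict)  # {pos: {form: lemma}}
with open("hrLex_1.3.tsv", encoding="utf8") as f:
    for row in csv.reader(f, delimiter="\t"):
        form, lemma, pos = row[0], row[1], row[2]
        lookup[pos][form.lower()] = lemma
```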
-
Hi. I just extended the README page in https://github.com/gtoffoli/commons-language/tree/master/nlp/spacy_custom/hr, starting to add a detailed trace of the actions performed. Also, I've read the documentation on sourcing components from existing pipelines, but I didn't fully understand it. Till now I have only followed instructions referring to spaCy CLI commands such as convert, init fill-config, debug data and train. Should I open a Python shell (or write a Python script) and run the training through API functions?
-
Thank you Adriane and Paul. I was able to use … With an updated version of my …
Please remember that hr500k 1.0 is the source repository from which the UD_Croatian-SET listed on the UD site was also derived; only, the latter doesn't retain the NE annotation. On this occasion, I realised that Nikola Ljubešić, the maintainer of hr500k 1.0, is also the curator of the hrLex inflectional lexicon, which I had previously used to generate lookup tables for a POS-based lemmatiser; more generally, he seems to be a very influential and helpful researcher.
-
Hi @adrianeboyd and @svlandeg 🙂, I've noticed that spaCy now has support for Setswana [lang/tn]. More specifically, Setswana shares a lot of similarities with Northern Sotho. Would it at all be possible to take a transfer-learning-like approach to developing fully-fledged models for both languages in parallel? And if so, how do I get started? I'd love to do the same for Afrikaans in the future as well.
-
For anyone seeing this thread now: in the future, please open a new Discussion if you have a proposal for new language support. When this thread started, Discussions didn't exist on GitHub, and keeping everything in one thread made it easier to manage. However, with Discussions it's fine to open lots of threads, and that way we can make sure notifications go just to the people working on any particular language.
-
EDIT 2022-07-11: For new language additions, please open a new thread in Discussions instead of commenting on this one.
This thread bundles discussion around adding pre-trained models for new languages (and improving the existing language data). A lot of information and discussion has been spread over various different issues (usually specific to the language), which made it more difficult to get an overview.
See here for the available pre-trained models, and this page for all languages currently available in spaCy. Languages marked as "alpha support" usually only include tokenization rules and various other rules and language data.
How to go from alpha support to a pre-trained model
The process requires the following steps and components:
- A tag map for the treebank, mapping its fine-grained tags to coarse-grained tags like `NOUN` and optional morphological features.
- A converter for the corpus, e.g. the `spacy convert` command that takes `.conllu` files and outputs spaCy's JSON format. See here for an example of a training pipeline with data conversion. Corpora can have very subtle formatting differences, so it's important to check that they can be converted correctly.
- Running `spacy train` to train a new model (see the sketch below).

With our new internal model training infrastructure, it's now much easier for us to integrate new pipelines and train new models.
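To make that convert-and-train loop concrete, here's a minimal sketch driving the documented CLI from Python, assuming a v2-style `spacy train` signature; all paths and the language code `xx` are placeholders:

```python
import subprocess
from pathlib import Path

# Convert the treebank's .conllu files into spaCy's training format
for split in ("train", "dev"):
    out_dir = Path(f"converted/{split}")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["python", "-m", "spacy", "convert",
         f"corpus/{split}.conllu", str(out_dir),
         "--converter", "conllu"],
        check=True,
    )

# Train a tagger/parser model from the converted data (v2-style CLI:
# language code, output directory, training data, development data)
subprocess.run(
    ["python", "-m", "spacy", "train", "xx", "models",
     "converted/train", "converted/dev"],
    check=True,
)
```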
Ideas for how to get involved
Contributing to the models isn't always easy, because there are a lot of different things to consider, and a big part of it comes down to sourcing suitable data and running experiments. But here are a few ideas for things that can move us forward:
1️⃣ Difficulty: good for beginners
📖 Relevant documentation: Adding languages, Tokenization, Test suite Readme
2️⃣ Difficulty: advanced
- Create a tag map, keyed by the fine-grained tag (`token.tag_`, e.g. `"NNS"`), mapped to the coarse-grained tag (`token.pos_`, e.g. `"NOUN"`) and other morphological features. The tags in the tag map should be the tags used by the treebank.
- Convert a treebank with `spacy convert` and run `spacy train` to train the model. See here for an example. (Note that most corpora don't come with NER annotations, so you'll usually only be able to train the tagger and parser.) It might work out-of-the-box straight away – or it might require some more formatting and pre-processing. Finding this out will be very helpful. You can share your results and the reproducible commands to use in this thread.
- Help with the feature coming in `v2.1.0` – pre-training a language model similar to BERT/Elmo/ULMFiT etc. (see 💫 Add experimental ULMFit/BERT/Elmo-like pretraining #2931). We only need the cleaned, raw text – for example as a `.txt` or `.jsonl` file (see the sketch below).

When using other resources, make sure the data license is compatible with spaCy's MIT license and ideally allows commercial use (since many people use spaCy commercially). Examples of suitable licenses are CC, Apache, MIT. Examples of unsuitable licenses are CC BY-NC, CC BY-SA, (A)GPL.
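As an illustration, such a `.jsonl` file is just one JSON object per line with a `"text"` field; the file name and sample sentences below are placeholders:

```python
# Write raw text as JSONL: one {"text": ...} object per line.
import json

texts = ["First raw sentence.", "Second raw sentence."]  # placeholder text
with open("raw_text.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```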
📖 Relevant documentation: Adding languages, Training via the CLI
If you have questions, feel free to leave a comment here. We'll also be updating this post with more tasks and ideas as we go.
[EDIT, February 2021: since we have the discussions board on GitHub, there is a whole forum on language support where you can create a new thread to discuss language-specific collaborations, issues, progress, etc.]