Training a Turkish model #2617

Nadedic · 2018-08-01T07:13:14Z

Basically what the title suggests. Issue #1490 seems to list some steps, but gives no clue about how to actually build a language model.

I could use some guidance if anyone has used or trained one.
Thanks

ines · 2018-08-03T11:38:56Z

Maintainer

The Turkish language data has been making good progress, but we haven't tried training any models yet, and I haven't heard of any experiments from the community either.

For #1490, I specifically selected languages for which Universal Dependencies has published data with suitable licenses. See here for the Turkish treebank: https://github.com/UniversalDependencies/UD_Turkish-IMST

This section in the docs has an example of using spaCy's converter and spacy train to train a model from a UD dataset. Before you can start, you'll probably also need to add a tag map that maps the part-of-speech tags in the dataset to spaCy's POS symbols. If you (or anyone else) end up experimenting with a Turkish model, definitely keep us updated on the results. If we have a training pipeline and confirmed results, it'll be much easier for us to add it to our official spaCy models.
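As a rough illustration of the tag map step, here is a minimal sketch that reads CoNLL-U text (the format UD treebanks are distributed in) and maps each treebank-specific tag (XPOS, column 5) to the universal tag (UPOS, column 4) it most often appears with. The sample lines are made up for illustration and are not taken from UD_Turkish-IMST, `build_tag_map` is a hypothetical helper, and the `{"pos": ...}` value shape is an assumption that should be checked against the tag map format of the spaCy version in use.

```python
from collections import Counter, defaultdict

# Made-up CoNLL-U lines for illustration only; real input would be read
# from a treebank file. Columns are tab-separated:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE_CONLLU = """\
# text = Kitap güzel.
1\tKitap\tkitap\tNOUN\tNoun\t_\t2\tnsubj\t_\t_
2\tgüzel\tgüzel\tADJ\tAdj\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\tPunc\t_\t2\tpunct\t_\t_
"""


def build_tag_map(conllu_text):
    """Map each XPOS tag to the UPOS tag it co-occurs with most often."""
    counts = defaultdict(Counter)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        upos, xpos = cols[3], cols[4]
        counts[xpos][upos] += 1
    # Assumed output shape: {treebank_tag: {"pos": universal_tag}}
    return {xpos: {"pos": c.most_common(1)[0][0]} for xpos, c in counts.items()}


tag_map = build_tag_map(SAMPLE_CONLLU)
for tag, attrs in sorted(tag_map.items()):
    print(tag, attrs)
```

Picking the most frequent UPOS per XPOS is only a starting point; tags that map to several universal tags would need to be reviewed by hand.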

selcukakbas · 2018-08-03T20:16:28Z

Hi, I have been working on Turkish text classification for a few years. All of it is in R. #rstats

I saw that spaCy has great models for the major languages, so why not train one for Turkish?
But I don't know what we would need to train a model as good as the English one.

I can contribute a collection of words with their class (noun, verb, etc.).
I also have a database of 3,500 Turkish e-books that I use for NLP purposes.

@DuyguA @cbilgili

honnibal · 2018-08-05T11:56:07Z

Maintainer

@selcukakbas Training requires an annotated corpus -- we need examples of the words in context. Just the list of words and possible tags isn't enough.
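To make the point about context concrete, here is a small sketch with a few hand-written tagged sentences. The sentences and tags are made up for illustration and do not come from any treebank. The Turkish form "yüz" can be a numeral ("hundred"), a noun ("face"), or a verb ("swim"), so a word list alone cannot say which tag applies in a given sentence.

```python
from collections import defaultdict

# Hand-written examples (assumed, not from a real corpus): each sentence is
# a list of (word, universal POS tag) pairs.
annotated = [
    [("Yüz", "NUM"), ("kişi", "NOUN"), ("geldi", "VERB")],   # "A hundred people came"
    [("Bu", "DET"), ("yüz", "NOUN"), ("tanıdık", "ADJ")],    # "This face is familiar"
    [("Denizde", "NOUN"), ("yüz", "VERB")],                  # "Swim in the sea"
]


def tags_per_word(sentences):
    """Collect the set of tags each lowercased word form appears with."""
    seen = defaultdict(set)
    for sentence in sentences:
        for word, tag in sentence:
            seen[word.lower()].add(tag)
    return seen


# Word forms that received more than one tag across the examples.
ambiguous = {
    word: sorted(tags)
    for word, tags in tags_per_word(annotated).items()
    if len(tags) > 1
}
print(ambiguous)
```

A tagger has to learn from the surrounding words which reading is the right one, which is why the training data needs whole annotated sentences.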

We've got a fair few people working with Turkish to various degrees, so I expect support to steadily improve. Turkish is a relatively difficult language though, as the morphology is quite rich, which spaCy currently doesn't do a great job on.

DuyguA · 2018-08-05T15:58:03Z

Hi all,
@selcukakbas in order to process Turkish statistically, one needs to process subwords, as subwords 'compose' the meanings of words. Turkish suffixes carry a huge amount of information, which one needs to make use of when designing the statistical algorithm.

I have an upcoming conference paper on a self-attentive, subword-based neural Turkish POS tagger. I'll make a separate repo; the code will be in PyTorch. Once it's ready, I'll try to integrate it into spaCy.
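As a very rough illustration of subword information, the sketch below extracts word-final character sequences as features. This is a crude stand-in for real morphological segmentation, and it is not the self-attentive model from the paper mentioned in this comment. `suffix_features` is a hypothetical helper. The example word "evlerimizden" ("from our houses") segments as ev-ler-imiz-den, so its last three characters happen to coincide with the ablative suffix "-den".

```python
def suffix_features(word, max_len=4):
    """Return the word-final character sequences of length 1..max_len.

    Character suffixes only approximate Turkish morphemes: they ignore
    vowel harmony variants and real morpheme boundaries.
    """
    longest = min(max_len, len(word))
    return [word[-n:] for n in range(1, longest + 1)]


for word in ["evlerimizden", "ev"]:
    print(word, suffix_features(word))
```

A statistical tagger could use features like these alongside the full word form, so that unseen inflected forms still share evidence with forms seen in training.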

I also write blog posts on Turkish from time to time:
https://medium.com/@duygu.altinok12/turkish-nlp-a-gentle-introduction-2b33e694dd78

For any questions please feel free to ping me!
