Feedback on alpha Finnish, Korean and Swedish trained pipelines #10624

---

The model keys (e.g.

---

Thanks for finally bringing a Korean language model to spaCy 👏

As a Korean speaker, I would expect each segmented word to have its own PoS. In this example, only 코스피 should be PROPN. I wonder whether the current behavior is as intended.

```python
# The sentence roughly means: "KOSPI falls as caution intensifies ahead of
# the US Federal Open Market Committee (FOMC) release."
doc = nlp_ko('코스피가 미국 연방공개시장위원회(FOMC) 공개를 앞두고 경계감이 강화되면서 하락')
for key in doc:
    print(key.text, key.pos_, key.lemma_)
for ent in doc.ents:
    print(ent.text, ent.label_)
```

Thank you for making spaCy even better!

---

Hello @adrianeboyd and team, thanks for releasing a pretrained Korean model and for inviting feedback on it! We've recently put spaCy models into production for tokenization and lemmatization of Chinese, Japanese, and Korean. In the process we've found that lemmatization in Korean doesn't do quite what we expected, and we would love any thoughts you might have on the matter.

**Our use case** (tried to zoom in to the relevant details here): Our product allows users to set up "qualifier" words that serve as prerequisites for matching an inbound query to a particular chatbot flow.

**How we expect it to work:** We use lemmatization to allow matching of queries where the form present in the query is different from the form provided in the qualifier. For agglutinative languages we expect lemmatization to strip off particles/suffixes and produce the root form of the word, roughly as in the sketch below. Here's an example using the spaCy Finnish pipeline (

**What we see in Korean:** Because the lemmas provided by

For the moment we're getting around this by manually breaking down the tokens further by checking their
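Roughly, the matching we do looks like this (a simplified sketch; the qualifier and query strings and the choice of `fi_core_news_sm` are just an illustration, not our actual production setup):

```python
import spacy

nlp = spacy.load("fi_core_news_sm")

def lemmas(text):
    """Return the set of lowercased lemmas for a text."""
    return {token.lemma_.lower() for token in nlp(text)}

# A qualifier matches if any of its lemmas also appears among the query's lemmas,
# so different inflections of the same word should still match.
qualifiers = ["laskusta"]           # one inflected form of "lasku" (bill)
query = "kerro minulle laskuista"   # a different inflection in the query

query_lemmas = lemmas(query)
matched = [q for q in qualifiers if lemmas(q) & query_lemmas]
print(matched)
```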

---

I'm also currently working on a Korean-related project using spaCy. At first this "+" subsegmentation confused me as well, but once I understood how Korean works, it makes much more sense. The concept of what counts as an entity you want to treat independently differs from our usual Western understanding of languages. I came across a fairly recent paper that might be very useful for understanding the linguistic problems in analytical tasks on Korean; it also proposes a new approach, including an open corpus. I think this could help improve the spaCy model quite a bit. For now, I'll try to add a pipeline component to spaCy in our project that shapes the output into something like what is presented in "Figure 3" of said paper.

Paper: https://raw.githubusercontent.com/openkorpos/openkorpos/main/openkorpos.pdf
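Roughly what I have in mind is a small post-processing component along these lines (the component and extension names are my own, and it assumes the lemmatizer joins morphemes with "+", e.g. 코스피+가; this is a sketch, not the approach from the paper):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

# Custom extension holding the "+"-separated sub-lemmas of each token.
Token.set_extension("sub_lemmas", default=None, force=True)

@Language.component("split_plus_lemmas")
def split_plus_lemmas(doc):
    # Split each token's lemma on "+" so the morphemes can be used separately.
    for token in doc:
        token._.sub_lemmas = token.lemma_.split("+")
    return doc

nlp = spacy.load("ko_core_news_sm")
nlp.add_pipe("split_plus_lemmas", last=True)

doc = nlp("코스피가 하락")
for token in doc:
    print(token.text, token.lemma_, token._.sub_lemmas)
```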

---

We're adding trained pipelines for Finnish, Korean and Swedish for spaCy v3.3 and would welcome feedback on alpha versions before the final release!
These pipelines feature floret vectors, which use character n-grams internally so that there are no OOV tokens. We're particularly excited to use floret vectors to improve the performance of the `md` and `lg` models for agglutinative languages like Finnish and Korean.

The pipelines also include a new trainable lemmatizer, which we're planning to use in a number of v3.3 pipelines to replace lower-quality lookup lemmatizers.

Any and all feedback is welcome, also privately to [email protected].
### Try out a pipeline
To try out a new pipeline, upgrade to spaCy v3.3.0.dev0:
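For example, pinning the exact dev version (assuming a standard pip setup):

```
pip install spacy==3.3.0.dev0
```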
(Actually, this spaCy upgrade would happen automatically in the background with `pip install MODEL_URL`, but it's better not to be surprised!)

And then install the pipeline you'd like to test with:
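For example, using one of the wheel URLs listed below (here the small Korean pipeline):

```
pip install https://github.com/explosion/spacy-models/releases/download/ko_core_news_sm-3.3.0a0/ko_core_news_sm-3.3.0a0-py3-none-any.whl
```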
(Note that `spacy download` won't work for these models until the official v3.3.0 release.)

### Finnish

- `fi_core_news_sm`: https://github.com/explosion/spacy-models/releases/download/fi_core_news_sm-3.3.0a0/fi_core_news_sm-3.3.0a0-py3-none-any.whl
- `fi_core_news_md`: https://github.com/explosion/spacy-models/releases/download/fi_core_news_md-3.3.0a0/fi_core_news_md-3.3.0a0-py3-none-any.whl
- `fi_core_news_lg`: https://github.com/explosion/spacy-models/releases/download/fi_core_news_lg-3.3.0a0/fi_core_news_lg-3.3.0a0-py3-none-any.whl

### Korean

Note: `mecab-ko` is not required!

- `ko_core_news_sm`: https://github.com/explosion/spacy-models/releases/download/ko_core_news_sm-3.3.0a0/ko_core_news_sm-3.3.0a0-py3-none-any.whl
- `ko_core_news_md`: https://github.com/explosion/spacy-models/releases/download/ko_core_news_md-3.3.0a0/ko_core_news_md-3.3.0a0-py3-none-any.whl
- `ko_core_news_lg`: https://github.com/explosion/spacy-models/releases/download/ko_core_news_lg-3.3.0a0/ko_core_news_lg-3.3.0a0-py3-none-any.whl

### Swedish

- `sv_core_news_sm`: https://github.com/explosion/spacy-models/releases/download/sv_core_news_sm-3.3.0a0/sv_core_news_sm-3.3.0a0-py3-none-any.whl
- `sv_core_news_md`: https://github.com/explosion/spacy-models/releases/download/sv_core_news_md-3.3.0a0/sv_core_news_md-3.3.0a0-py3-none-any.whl
- `sv_core_news_lg`: https://github.com/explosion/spacy-models/releases/download/sv_core_news_lg-3.3.0a0/sv_core_news_lg-3.3.0a0-py3-none-any.whl

The full descriptions and performance evaluations are on the linked release pages for each pipeline.
### Differences with floret vectors

If you're running the trained pipeline on new texts and working with `Doc` objects, you shouldn't notice any difference with floret vectors vs. default vectors.

If you use vectors for similarity comparisons, there are a few differences, mainly because a floret pipeline doesn't include any kind of frequency-based word list similar to the list of in-vocabulary vector keys with default vectors (see the short sketch after this list):

- If your workflow iterates over the vector keys, you'd need to rely on an external word list instead.
- `Vectors.most_similar` is not supported because there's no fixed list of vectors to compare your vectors to.
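To make the practical difference concrete, here's a small sketch (the specific Finnish words are just an illustration): with floret vectors even rare or unseen inflected forms get a usable vector, while there is simply no fixed key list to iterate over or to query with `most_similar`.

```python
import spacy

nlp = spacy.load("fi_core_news_md")  # pipeline with floret vectors

# Even a rare, heavily inflected form gets a non-zero vector, because the
# vector is assembled from character n-grams rather than looked up in a
# fixed table of known words.
lex = nlp.vocab["epätodennäköisimmilläkään"]
print(lex.vector_norm)

# Similarity between Doc/Span/Token objects works exactly as before.
doc1 = nlp("kissa istuu matolla")
doc2 = nlp("koira makaa lattialla")
print(doc1.similarity(doc2))
```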
### Korean details

We're particularly interested in feedback for Korean, where we have less internal expertise and the wide range of options for word segmentation makes it tricky to assemble a single trained pipeline.
For the new `ko_core_news_*` pipelines, the tokenization is based on the whitespace+punctuation segmentation from UD Korean Kaist. This means that the pipelines use the standard spaCy tokenizer and you don't need to have `mecab-ko` or `natto-py` installed at all.

The NER component is trained on the NER dataset from KLUE, which uses character segmentation. In order to use this annotation in the same pipeline as UD Korean Kaist, we snap the NER annotation to the nearest token boundaries.
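This isn't necessarily the exact procedure used during training, but the idea of snapping character-level spans to token boundaries can be illustrated with the public `Doc.char_span` API and `alignment_mode="expand"` (the character offsets and the "OG" label below are a made-up example):

```python
import spacy

nlp = spacy.load("ko_core_news_sm")
doc = nlp("코스피가 하락했다")

# Hypothetical character-level annotation covering only "코스피" (characters 0-3),
# which doesn't line up with the whitespace token "코스피가".
start_char, end_char, label = 0, 3, "OG"

# alignment_mode="expand" grows the span outward to the nearest token
# boundaries, so the resulting span covers the whole token "코스피가".
span = doc.char_span(start_char, end_char, label=label, alignment_mode="expand")
print(span.text, span.label_)
```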
As a result, the entity spans won't look identical to KLUE:
Original KLUE annotation:
Using `ko_core_news_md`: