Feedback on alpha Finnish, Korean and Swedish trained pipelines #10624

---

The model keys (e.g.

---

Thanks for finally bringing a Korean language model to spaCy 👏

As a Korean speaker, I would expect each segmented word to have its own PoS. In this example, only 코스피 should be PROPN. I wonder whether the current behavior is as intended.

```python
# The sentence roughly means: "KOSPI falls as caution intensifies ahead of
# the US Federal Open Market Committee (FOMC) release."
doc = nlp_ko('코스피가 미국 연방공개시장위원회(FOMC) 공개를 앞두고 경계감이 강화되면서 하락')
for key in doc:
    print(key.text, key.pos_, key.lemma_)
for ent in doc.ents:
    print(ent.text, ent.label_)
```

Thank you for making spaCy even better!

---

Hello @adrianeboyd and team, thanks for releasing a pretrained Korean model and for inviting feedback on it! We've recently put spaCy models into production for tokenization and lemmatization of Chinese, Japanese, and Korean. In the process we've found that lemmatization in Korean doesn't do quite what we expected, and we would love any thoughts you might have on the matter.

**Our use case** (tried to zoom in to the relevant details here): Our product allows users to set up "qualifier" words that serve as prerequisites for matching an inbound query to a particular chatbot flow.

**How we expect it to work:** We use lemmatization to allow matching of queries where the form present in the query is different from the form provided in the qualifier. For agglutinative languages we expect lemmatization to strip off particles/suffixes and produce the root form of the word, roughly as in the sketch below. Here's an example using the spaCy Finnish pipeline (

**What we see in Korean:** Because the lemmas provided by

For the moment we're getting around this by manually breaking down the tokens further by checking their
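Roughly, the matching we do looks like this (a simplified sketch; the qualifier and query strings and the choice of `fi_core_news_sm` are just an illustration, not our actual production setup):

```python
import spacy

nlp = spacy.load("fi_core_news_sm")

def lemmas(text):
    """Return the set of lowercased lemmas for a text."""
    return {token.lemma_.lower() for token in nlp(text)}

# A qualifier matches if any of its lemmas also appears among the query's lemmas,
# so different inflections of the same word should still match.
qualifiers = ["laskusta"]           # one inflected form of "lasku" (bill)
query = "kerro minulle laskuista"   # a different inflection in the query

query_lemmas = lemmas(query)
matched = [q for q in qualifiers if lemmas(q) & query_lemmas]
print(matched)
```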

---

I'm also currently working on a Korean-related project using spaCy. At first this "+" subsegmentation confused me as well, but once I understood how Korean works, it makes much more sense. The concept of what counts as an entity you want to treat independently differs from our usual Western understanding of languages. I came across a fairly recent paper that might be very useful for understanding the linguistic problems in analytical tasks on Korean; it also proposes a new approach, including an open corpus. I think this could help improve the spaCy model quite a bit. For now, I'll try to add a pipeline component to spaCy in our project that shapes the output into something like what is presented in "Figure 3" of said paper.

Paper: https://raw.githubusercontent.com/openkorpos/openkorpos/main/openkorpos.pdf
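Roughly what I have in mind is a small post-processing component along these lines (the component and extension names are my own, and it assumes the lemmatizer joins morphemes with "+", e.g. 코스피+가; this is a sketch, not the approach from the paper):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

# Custom extension holding the "+"-separated sub-lemmas of each token.
Token.set_extension("sub_lemmas", default=None, force=True)

@Language.component("split_plus_lemmas")
def split_plus_lemmas(doc):
    # Split each token's lemma on "+" so the morphemes can be used separately.
    for token in doc:
        token._.sub_lemmas = token.lemma_.split("+")
    return doc

nlp = spacy.load("ko_core_news_sm")
nlp.add_pipe("split_plus_lemmas", last=True)

doc = nlp("코스피가 하락")
for token in doc:
    print(token.text, token.lemma_, token._.sub_lemmas)
```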

---

We're adding trained pipelines for Finnish, Korean and Swedish for spaCy v3.3 and would welcome feedback on alpha versions before the final release!
These pipelines feature floret vectors, which use character n-grams internally so that there are no OOV tokens. We're particularly excited to use floret vectors to improve the performance of the `md` and `lg` models for agglutinative languages like Finnish and Korean.

The pipelines also include a new trainable lemmatizer, which we're planning to use in a number of v3.3 pipelines to replace lower-quality lookup lemmatizers.

Any and all feedback is welcome, also privately to [email protected].
### Try out a pipeline
To try out a new pipeline, upgrade to spaCy v3.3.0.dev0:
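For example, pinning the exact dev version (assuming a standard pip setup):

```
pip install spacy==3.3.0.dev0
```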
(Actually, this spaCy upgrade would happen automatically in the background with `pip install MODEL_URL`, but it's better not to be surprised!)

And then install the pipeline you'd like to test with:
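For example, using one of the wheel URLs listed below (here the small Korean pipeline):

```
pip install https://github.com/explosion/spacy-models/releases/download/ko_core_news_sm-3.3.0a0/ko_core_news_sm-3.3.0a0-py3-none-any.whl
```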
(Note that `spacy download` won't work for these models until the official v3.3.0 release.)

### Finnish

- `fi_core_news_sm`: https://github.com/explosion/spacy-models/releases/download/fi_core_news_sm-3.3.0a0/fi_core_news_sm-3.3.0a0-py3-none-any.whl
- `fi_core_news_md`: https://github.com/explosion/spacy-models/releases/download/fi_core_news_md-3.3.0a0/fi_core_news_md-3.3.0a0-py3-none-any.whl
- `fi_core_news_lg`: https://github.com/explosion/spacy-models/releases/download/fi_core_news_lg-3.3.0a0/fi_core_news_lg-3.3.0a0-py3-none-any.whl

### Korean

Note: `mecab-ko` is not required!

- `ko_core_news_sm`: https://github.com/explosion/spacy-models/releases/download/ko_core_news_sm-3.3.0a0/ko_core_news_sm-3.3.0a0-py3-none-any.whl
- `ko_core_news_md`: https://github.com/explosion/spacy-models/releases/download/ko_core_news_md-3.3.0a0/ko_core_news_md-3.3.0a0-py3-none-any.whl
- `ko_core_news_lg`: https://github.com/explosion/spacy-models/releases/download/ko_core_news_lg-3.3.0a0/ko_core_news_lg-3.3.0a0-py3-none-any.whl

### Swedish

- `sv_core_news_sm`: https://github.com/explosion/spacy-models/releases/download/sv_core_news_sm-3.3.0a0/sv_core_news_sm-3.3.0a0-py3-none-any.whl
- `sv_core_news_md`: https://github.com/explosion/spacy-models/releases/download/sv_core_news_md-3.3.0a0/sv_core_news_md-3.3.0a0-py3-none-any.whl
- `sv_core_news_lg`: https://github.com/explosion/spacy-models/releases/download/sv_core_news_lg-3.3.0a0/sv_core_news_lg-3.3.0a0-py3-none-any.whl

The full descriptions and performance evaluations are on the linked release pages for each pipeline.
### Differences with floret vectors

If you're running the trained pipeline on new texts and working with `Doc` objects, you shouldn't notice any difference with floret vectors vs. default vectors.

If you use vectors for similarity comparisons, there are a few differences, mainly because a floret pipeline doesn't include any kind of frequency-based word list similar to the list of in-vocabulary vector keys with default vectors (see the short sketch after this list):

- If your workflow iterates over the vector keys, you'd need to rely on an external word list instead.
- `Vectors.most_similar` is not supported because there's no fixed list of vectors to compare your vectors to.
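To make the practical difference concrete, here's a small sketch (the specific Finnish words are just an illustration): with floret vectors even rare or unseen inflected forms get a usable vector, while there is simply no fixed key list to iterate over or to query with `most_similar`.

```python
import spacy

nlp = spacy.load("fi_core_news_md")  # pipeline with floret vectors

# Even a rare, heavily inflected form gets a non-zero vector, because the
# vector is assembled from character n-grams rather than looked up in a
# fixed table of known words.
lex = nlp.vocab["epätodennäköisimmilläkään"]
print(lex.vector_norm)

# Similarity between Doc/Span/Token objects works exactly as before.
doc1 = nlp("kissa istuu matolla")
doc2 = nlp("koira makaa lattialla")
print(doc1.similarity(doc2))
```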
### Korean details

We're particularly interested in feedback for Korean, where we have less internal expertise and the wide range of options for word segmentation makes it tricky to assemble a single trained pipeline.
For the new `ko_core_news_*` pipelines, the tokenization is based on the whitespace+punctuation segmentation from UD Korean Kaist. This means that the pipelines use the standard spaCy tokenizer and you don't need to have `mecab-ko` or `natto-py` installed at all.

The NER component is trained on the NER dataset from KLUE, which uses character segmentation. In order to use this annotation in the same pipeline as UD Korean Kaist, we snap the NER annotation to the nearest token boundaries.
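This isn't necessarily the exact procedure used during training, but the idea of snapping character-level spans to token boundaries can be illustrated with the public `Doc.char_span` API and `alignment_mode="expand"` (the character offsets and the "OG" label below are a made-up example):

```python
import spacy

nlp = spacy.load("ko_core_news_sm")
doc = nlp("코스피가 하락했다")

# Hypothetical character-level annotation covering only "코스피" (characters 0-3),
# which doesn't line up with the whitespace token "코스피가".
start_char, end_char, label = 0, 3, "OG"

# alignment_mode="expand" grows the span outward to the nearest token
# boundaries, so the resulting span covers the whole token "코스피가".
span = doc.char_span(start_char, end_char, label=label, alignment_mode="expand")
print(span.text, span.label_)
```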
As a result, the entity spans won't look identical to KLUE:
Original KLUE annotation:
Using `ko_core_news_md`: