additional language dependencies #33

Open
balmas opened this issue Jan 29, 2021 · 3 comments
@balmas
Member

balmas commented Jan 29, 2021

#31 identified a tokenizer error with Chinese due to a missing dependency.

The spaCy documentation lists additional dependencies for a number of languages at https://spacy.io/usage/models#languages:

Japanese: Unidic, Mecab, SudachiPy
Russian: pymorphy2
Ukrainian: pymorphy2
Thai: pythainlp
Korean: mecab-ko, mecab-ko-dic, natto-py
Vietnamese: Pyvi

@irina060981 if you can confirm the Chinese fix works (and the Dockerfile fix too), maybe you can add these dependencies too?
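
One way to see which of the extra packages listed above are present in a given environment is to check whether their modules can be imported. This is only a minimal sketch: the language codes and the import names in `EXTRA_MODULES` (for example `natto` for natto-py, `MeCab` for the Mecab Python binding) are assumptions and may not match the packages the image actually installs, and `missing_modules` is a hypothetical helper, not part of spaCy.

```python
from importlib.util import find_spec

# Assumed mapping from language code to the import names of the extra Python
# packages listed above. mecab-ko and mecab-ko-dic are system-level installs,
# not Python modules, so only natto-py is checked for Korean.
EXTRA_MODULES = {
    "ja": ["unidic", "MeCab", "sudachipy"],
    "ru": ["pymorphy2"],
    "uk": ["pymorphy2"],
    "th": ["pythainlp"],
    "ko": ["natto"],
    "vi": ["pyvi"],
}


def missing_modules(lang, is_available=None):
    """Return the assumed extra modules for `lang` that cannot be found."""
    if is_available is None:
        # find_spec returns None when a top-level module is not installed.
        is_available = lambda name: find_spec(name) is not None
    return [name for name in EXTRA_MODULES.get(lang, []) if not is_available(name)]


# Report, per language, which assumed modules are missing in this environment.
for lang in EXTRA_MODULES:
    print(lang, missing_modules(lang) or "all importable")
```

A module being importable does not guarantee that spaCy can use it (version mismatches can still fail at runtime), so this only narrows down where to look.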

@irina060981
Member

@balmas - I spent all day adding these libraries, and here are my results.
I was able to add

  • Russian support
  • Ukrainian support (besides the libraries, a misspelling in the code had to be fixed)
  • Thai support
  • Vietnamese support

I ran into problems with Japanese and Korean.

Japanese needs
Unidic, Mecab, SudachiPy

I was able to find versions of Unidic and Mecab for our environment.

But I didn't find a version of SudachiPy that works with Cython,
and I was not able to install all the requirements: flake8, flake8-import-order, flake8-builtins

There is a compiled Cython library, https://github.com/polm/fugashi, but it wraps MeCab rather than SudachiPy,
and spaCy requires the sudachipy module (according to the error).

Korean needs
mecab-ko, mecab-ko-dic, natto-py

I was able to install natto-py,
but mecab-ko and mecab-ko-dic both failed with specific errors.

I could continue with it tomorrow. It is really difficult to build the container during my evening/night; it takes much more time.
I hope the load on the Docker resources will be lower in my morning.

@balmas, how much time do you think it is worth spending on Korean and Japanese support?

@balmas
Member Author

balmas commented Feb 1, 2021

> @balmas, how much time do you think it is worth spending on Korean and Japanese support?

@irina060981 let's not worry about those for the moment. Thanks.

@monzug

monzug commented Mar 15, 2021

Telugu and Sanskrit also give a 500 error. See attachment.

[Attachment: Screen Shot 2021-03-15 at 2 43 03 PM]
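
To narrow down which languages fail before a request ever reaches the service, a small smoke test could try to build a blank spaCy pipeline for each language code and record the outcome. This is a rough sketch under assumptions: the thread does not confirm that the 500 errors come from pipeline creation, the codes `te` (Telugu) and `sa` (Sanskrit) are assumed to exist in the installed spaCy version, and `smoke_test` is a hypothetical helper rather than anything in this project.

```python
def smoke_test(lang_codes, make_pipeline=None, text="test"):
    """Try to create and run a pipeline for each code; map code -> outcome."""
    if make_pipeline is None:
        # Imported lazily so the helper can be exercised without spaCy installed.
        import spacy
        make_pipeline = spacy.blank
    results = {}
    for code in lang_codes:
        try:
            nlp = make_pipeline(code)  # may raise if the language or a dependency is missing
            nlp(text)                  # run the tokenizer once on a short string
            results[code] = "ok"
        except Exception as exc:       # keep going so every language gets reported
            results[code] = f"{type(exc).__name__}: {exc}"
    return results


# Example call (requires spaCy to be installed):
# for code, outcome in smoke_test(["ru", "uk", "th", "vi", "ja", "ko", "te", "sa"]).items():
#     print(code, outcome)
```

Whether a missing dependency surfaces at pipeline creation or only at a later step (lemmatization, for instance) depends on the spaCy version, so an "ok" here does not rule out failures further along.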
