Install fugashi, unidic, unidic-lite, and ipadic as dependencies to MLServer HuggingFace to support hosting Japanese language models #1506

Open
jbauer2718 opened this issue Dec 8, 2023 · 3 comments · May be fixed by #1511

Comments

@jbauer2718

Because Japanese mixes phonetic scripts with Chinese characters, tokenizers for Japanese models need special algorithms and dictionaries. A popular example is the BERT Japanese model:

https://huggingface.co/transformers/v4.11.3/_modules/transformers/models/bert_japanese/tokenization_bert_japanese.html

Without these dependencies, mlserver_huggingface/common.py raises an error when it tries to load the tokenizer for the pipeline.
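
A quick way to see which of these packages are absent from a given environment is sketched below. It is only an illustration, not part of MLServer: the helper name `missing_modules` is hypothetical, and the list assumes the usual import names for the four packages in the title (`unidic-lite` is imported as `unidic_lite`).

```python
from importlib.util import find_spec

# Import names assumed for the PyPI packages named in the issue title.
# Note that the unidic-lite distribution is imported as unidic_lite.
JAPANESE_TOKENIZER_MODULES = ["fugashi", "unidic", "unidic_lite", "ipadic"]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(JAPANESE_TOKENIZER_MODULES)
    if missing:
        print("Not installed: " + ", ".join(missing))
    else:
        print("All Japanese tokenizer dependencies are importable.")
```

The check only looks the modules up without importing them, so it does not confirm that the dictionaries themselves are usable; a non-empty list simply points at the packages to install before loading a Japanese tokenizer.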

To reproduce, use any Japanese model. Here is an example.

@jbauer2718
Author

If someone adds me as a contributor, I am happy to fix this issue and write a test for it.

@sakoush
Member

sakoush commented Dec 11, 2023

@jbauer2718 many thanks for reporting this issue and offering to fix it. You can create a PR based on changes from your fork and we can look at it.

jbauer2718 linked a pull request Dec 12, 2023 that will close this issue
@jbauer2718
Author

Hey @sakoush, I just added the above-linked PR for the team's review.
