Update the logic of misspell identification #44
Concerning the logic: is this a viable response?

```python
>>> doc = nlp("This is a majour mistaken.")
>>> print(doc._.outcome_spellCheck)
This is a fact mistaken.
>>> doc = nlp("This is a majour mistake.")
>>> print(doc._.outcome_spellCheck)
This is a major mistake.
>>> doc = nlp("This is a majour mistakes.")
>>> print(doc._.outcome_spellCheck)
This is a for mistakes.
>>> doc = nlp("This is a majour misstake.")
>>> print(doc._.outcome_spellCheck)
This is a minor story.
```
That is not the desired response, but it follows from the current logic. If you want to improve accuracy, please try passing the vocab file (see `contextualSpellCheck/contextualSpellCheck.py`, lines 34 to 35 at commit `15b30eb`). This will help the model prevent false positives. Feel free to open a PR with a fix!
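For illustration, a minimal sketch of that suggestion, assuming the spaCy v3 factory name `"contextual spellchecker"` and a `vocab_path` option matching the code referenced above; the vocab file path is hypothetical:

```python
# Minimal sketch: supply an extended vocab file so valid words are not
# flagged as misspelt. "my_extended_vocab.txt" is a hypothetical path, and
# the vocab_path option is assumed to match the code referenced above.
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "contextual spellchecker",
    config={"vocab_path": "./my_extended_vocab.txt"},
)

doc = nlp("This is a majour mistake.")
print(doc._.outcome_spellCheck)
```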
One side-effect of using the current transformers tokenizer logic is that it supports multilingual models by default. Beyond that I am not sure, but I think different languages might require different spell checkers, given language-specific nuances.
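As a quick illustration of that side-effect (not code from the thread), a multilingual BERT tokenizer covers many languages with a single `vocab.txt`:

```python
# One tokenizer, one vocab file, many languages.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("This is a mistake."))
print(tokenizer.tokenize("これは間違いです。"))  # Japanese handled by the same vocab
```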
As mentioned in the README, let's say you want to perform spell correction on a Japanese sentence. Below is some code contributed to the repo for the Japanese language: `contextualSpellCheck/examples/ja_example.py` (lines 4 to 13 at commit `f8cbeb8`), in the contextualSpellCheck examples folder. I hope this answers your question @kshitij12345. Please feel free to provide ideas or references if you find something I might have missed here!
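For readers without the file at hand, a sketch along those lines; the pipeline and model names here are illustrative assumptions, not a quote of `ja_example.py`:

```python
# Sketch of Japanese spell checking: a Japanese spaCy pipeline plus a
# Japanese BERT model. Names below are illustrative assumptions.
import spacy

nlp = spacy.load("ja_core_news_sm")
nlp.add_pipe(
    "contextual spellchecker",
    config={"model_name": "cl-tohoku/bert-base-japanese-whole-word-masking"},
)

doc = nlp("これは大きな間違いです。")
print(doc._.outcome_spellCheck)
```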
Hi @R1j1t, first of all, congratulations on your Contextual Spell Checker (CSC) based on spaCy and BERT (a transformer model). As I was searching for this kind of tool, I tested your CSC, and I can give the following feedback: could you consider exploring another type of transformer model such as T5 (or ByT5), which has a seq2seq architecture (a BERT-like encoder and a GPT-like decoder), allowing input and output sentences of different lengths?
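To make the suggestion concrete, a sketch of the seq2seq idea using the transformers API; `"t5-spell-correct"` is a hypothetical fine-tuned checkpoint, not a published model:

```python
# Seq2seq correction: the model generates the corrected sentence directly,
# so input and output lengths can differ (unlike BERT's fill-mask approach).
# "t5-spell-correct" is a hypothetical checkpoint name.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-spell-correct")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-spell-correct")

inputs = tokenizer("This is a majour misstake.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```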
Hey @piegu, first off, I want to thank you for your feedback. It feels terrific to have contributors, and even more so ones who help in shaping the logic! When I started this project, I wanted the library to be generalized for multiple languages, hence the spaCy and BERT approach. I created tasks for myself (#44, #40), and I would like to read more on these topics, but lately I have been occupied with my day job and have had to limit my contributions. Regarding your 2nd point, I agree it is something I did not know, as pointed out in the comment by sgugger.

I would still want to depend on transformer models, as that adds multilingual support. I will try to experiment with your suggestions and think of a solution for the same. Hope you like the project. Feel free to contribute!
I noticed that part of the logic of
Will changing |
**Is your feature request related to a problem? Please describe.**
The current logic of misspelling identification relies on `vocab.txt` from the transformer model. BERT tokenisers break less common words into subwords and subsequently store only the sub-words in `vocab.txt`. Hence a perfectly valid word might not be present in `vocab.txt` and be identified as misspelt.
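To see the problem concretely (an illustration, not code from the repo), a WordPiece tokenizer splits a rare but valid word, so the full word never appears in `vocab.txt`:

```python
# A valid but uncommon word gets split into subwords, so a membership
# check against vocab.txt wrongly flags it as a misspelling.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("ostentatious"))  # subword pieces, e.g. ['os', '##tent', ...]
print("ostentatious" in tokenizer.vocab)   # False -> flagged as misspelt
```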
**Describe the solution you'd like**
Still not clear; I need to look into some papers on this.
**Describe alternatives you've considered**
The alternative I can think of right now is two-fold:
**Additional context**
#30 explosion/spaCy#3994