Update the logic of misspell identification #44

Open
R1j1t opened this issue Dec 21, 2020 · 10 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@R1j1t
Owner

R1j1t commented Dec 21, 2020

Is your feature request related to a problem? Please describe.
The current logic of misspelling identification relies on vocab.txt from the transformer model. BERT tokenisers break less common words into subwords and store only those subwords in vocab.txt. Hence the original word might not be present in vocab.txt and will be identified as misspelt even when it is spelt correctly.
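
To make the problem concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting over a toy vocabulary. The function `wordpiece_tokenize` and the contents of `toy_vocab` are illustrative assumptions, not the tokenizer or vocabulary shipped with any real model.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (assumption): the whole word "tokenisation" is absent.
toy_vocab = {"token", "##isation", "##s", "the", "spell"}
word = "tokenisation"
pieces = wordpiece_tokenize(word, toy_vocab)
in_vocab = word in toy_vocab
print(pieces, in_vocab)
```

The word is spelt correctly and the tokenizer can represent it, yet a plain membership test against the vocabulary fails, which is exactly the false positive described above.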

Describe the solution you'd like
Still not clear, need to look into some papers on this.

Describe alternatives you've considered
I can think of two alternatives right now:

  • ask the user to provide a list of such words and append it to the vocab.txt from the transformer model
  • if the proposed change is ##x, check the edit distance from the detokenised form of that word + the previous word
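
One possible reading of the second alternative can be sketched as follows. This is not the library's implementation: `edit_distance` and `within_edit_distance` are hypothetical helpers, and comparing "previous word + flagged token" against "previous word + detokenised candidate" is an assumption about what was meant.

```python
def edit_distance(a, b):
    """Levenshtein distance using two rows of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def within_edit_distance(misspelt, previous_word, candidate, max_edit_dist=2):
    """Accept a candidate if it is close enough to the observed text."""
    if candidate.startswith("##"):
        # Assumption: a "##x" candidate continues the previous word,
        # so both sides are joined with it before comparing.
        observed = previous_word + misspelt
        proposed = previous_word + candidate[2:]
    else:
        observed, proposed = misspelt, candidate
    return edit_distance(observed, proposed) <= max_edit_dist


print(within_edit_distance("ign", "play", "##ing"))
print(within_edit_distance("majour", "a", "major"))
print(within_edit_distance("majour", "a", "story"))
```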

Additional context
#30 explosion/spaCy#3994

@R1j1t R1j1t added enhancement New feature or request help wanted Extra attention is needed labels Dec 21, 2020
@letconex

letconex commented Dec 22, 2020

Concerning the logic: Is this a viable response?

>>> doc = nlp("This is a majour mistaken.")
>>> print(doc._.outcome_spellCheck)
This is a fact mistaken.
>>> doc = nlp("This is a majour mistake.")
>>> print(doc._.outcome_spellCheck)
This is a major mistake.
>>> doc = nlp("This is a majour mistakes.")
>>> print(doc._.outcome_spellCheck)
This is a for mistakes.
>>> doc = nlp("This is a majour misstake.")
>>> print(doc._.outcome_spellCheck)
This is a minor story.

@R1j1t
Owner Author

R1j1t commented Dec 23, 2020

That is not the desired response, but it follows from the current logic. If you want to improve accuracy, please try passing a vocab file:

vocab_path (str, optional): Vocabulary file path to be used by the
model. Defaults to "".

This will help the model avoid false positives. Feel free to open a PR with a fix!
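
As a rough sketch of how such a vocab file could be prepared, the snippet below merges a base token list with extra whole words and writes one entry per line. The helper `write_extended_vocab`, the sample tokens, and the one-entry-per-line file layout are assumptions for illustration; check the package's own documentation for the format it expects.

```python
import tempfile
from pathlib import Path


def write_extended_vocab(base_tokens, extra_words, out_path):
    """Write base tokens plus extra words, one per line, without duplicates."""
    seen = set()
    merged = []
    for token in list(base_tokens) + list(extra_words):
        token = token.strip()
        if token and token not in seen:
            seen.add(token)
            merged.append(token)
    Path(out_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return merged


# Toy data (assumption): a few base tokens and some domain-specific words.
base_tokens = ["[PAD]", "[UNK]", "major", "mistake", "##s"]
extra_words = ["mistaken", "mistakes", "major"]
out_path = Path(tempfile.mkdtemp()) / "extended_vocab.txt"
merged = write_extended_vocab(base_tokens, extra_words, out_path)
print(merged)
# The resulting file path is what would be supplied as vocab_path.
```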

@kshitij12345

One side effect of using the current transformers tokenizer logic is that it supports multilingual models by default. Otherwise, I am not sure, but I think different languages might require different spell checkers to account for language-specific nuances.

@R1j1t
Owner Author

R1j1t commented Jan 9, 2021

As mentioned in the README

This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using BERT model.

So let's say you want to perform spell correction on a Japanese sentence:

  1. Provide the Japanese spaCy model: this will break the sentence into tokens. As this model is trained on Japanese, it knows the nuances of the language (better than the English model).
  2. Provide the Japanese BERT model (from the transformer models): this will provide the candidate words for an OOV word. Note that the vocabulary considered here is that of the transformer model, not the spaCy model.

Below is some code contributed to the repo for Japanese language:

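import spacy
from contextualSpellCheck.contextualSpellCheck import ContextualSpellCheck

# Japanese spaCy model for tokenisation
nlp = spacy.load("ja_core_news_sm")
# Japanese BERT model supplies the candidate words
checker = ContextualSpellCheck(
    model_name="cl-tohoku/bert-base-japanese-whole-word-masking",
    max_edit_dist=2,
)
# spaCy v2-style registration of the component instance
nlp.add_pipe(checker)
doc = nlp("しかし大勢においては、ここような事故はウィキペディアの拡大には影響を及ぼしていない。")
print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)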

contextualSpellCheck examples folder

I hope this answers your question, @kshitij12345. Please feel free to share ideas or references if you think I have missed something here!

@stale

stale bot commented Feb 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added wontfix This will not be worked on and removed wontfix This will not be worked on labels Feb 8, 2021
@stale

stale bot commented Mar 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix This will not be worked on label Mar 11, 2021
@stale stale bot closed this as completed Mar 18, 2021
@R1j1t R1j1t mentioned this issue Mar 31, 2021
@R1j1t R1j1t reopened this Aug 22, 2021
@stale stale bot removed the wontfix This will not be worked on label Aug 22, 2021
@stale

stale bot commented Sep 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix This will not be worked on label Sep 21, 2021
@R1j1t R1j1t removed the wontfix This will not be worked on label Sep 24, 2021
@piegu

piegu commented Oct 5, 2021

The current logic of misspell identification relies on vocab.txt from the transformer model. Tokenizers break less common words into subwords, and hence the original whole word might not be present as-is in vocab.txt

Hi @R1j1t,

First of all, congratulations on your Contextual Spell Checker (CSC) based on spaCy and BERT (transformer model).

As I'm searching for this kind of tool, I tested your CSC and I can give the following feedback:

  1. Your CSC is a universal spell checker, as it is possible to download the spaCy and BERT models of a language other than English. For example, this is my code for using your CSC in Portuguese in a Colab notebook:
# Installation
!pip install -U pip setuptools wheel
!pip install -U spacy
!pip install contextualSpellCheck

# spaCy model in Portuguese
spacy_model = "pt_core_news_md" # 48MB, or "pt_core_news_sm" (20MB), or "pt_core_news_lg"  (577MB)
!python -m spacy download {spacy_model} 

# BERT model in Portuguese
model_name = "neuralmind/bert-base-portuguese-cased" # or "neuralmind/bert-large-portuguese-cased"

# Importation and instantiation of the spaCy model
import spacy
import contextualSpellCheck
nlp = spacy.load(spacy_model)

# Download BERT model and add contextual spellchecker to the spaCy model
nlp.add_pipe(
    "contextual spellchecker",
    config={
        "model_name": model_name,
        "max_edit_dist": 2,
    },
);

# Sentence with errors ("milões" instead of "milhões")
sentence = "A receita foi de $ 9,4 milões em comparação com o ano anterior de $ 2,7 milões."

# Get sentence with corrections (if errors found by CSC)
doc = nlp(sentence)
print(f'({doc._.performed_spellCheck}) {doc._.outcome_spellCheck}')

# (True) A receita foi de $ 9,4 milhões em comparação com o ano anterior de $ 2,7 milhões.
  2. Your CSC is a unigram spell checker, as it uses the [MASK] token of a BERT model to replace a word flagged as misspelt with a token from the BERT tokenizer vocab (see post). That means that your CSC cannot correct a bigram error, for example (see the following example).
sentence = "a horta abdominal" # the correct sentence in Portuguese is "aorta abdominal"
doc = nlp(sentence)
print(f'({doc._.performed_spellCheck}) {doc._.outcome_spellCheck}')

# (False) 
# the CSC did not find corrected words with an edit distance < max_edit_dist
  3. Your CSC is a word corrector that replaces non-vocab words with tokens from the BERT tokenizer vocab (if their edit distance is below max_edit_dist). That is the true issue, I think (i.e., using a BERT model). In fact, with BERT models, I do not see how your CSC will be able to correct words instead of replacing them. It is true that you can pass an arbitrarily large vocab file that will allow it to detect most misspelt words but, as already said, your CSC will only be able to replace them with one token of the BERT tokenizer vocab (a token is not necessarily a word in the WordPiece BERT tokenizer, which uses subwords as tokens). This means that a "solution" would be to use fine-tuned BERT models with a gigantic vocabulary (in order to have whole words instead of subwords). Unfortunately, this kind of fine-tuning would require a huge corpus of texts. And even so, your CSC would remain a unigram spell checker.
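
The bigram limitation from the second point can be illustrated with a small sketch. It joins each pair of adjacent tokens and looks for vocabulary words close to the joined form, which is something a one-token-at-a-time checker never attempts. The helper `bigram_merge_candidates` and the toy vocabulary are hypothetical, and `difflib`'s similarity ratio is used here as a stand-in for an edit-distance threshold.

```python
import difflib


def bigram_merge_candidates(tokens, vocab, cutoff=0.85):
    """Suggest single words that resemble two adjacent tokens joined together."""
    suggestions = {}
    for left, right in zip(tokens, tokens[1:]):
        joined = left + right
        # Exclude the two tokens themselves so only new words are proposed.
        pool = [w for w in vocab if w not in (left, right)]
        matches = difflib.get_close_matches(joined, pool, n=3, cutoff=cutoff)
        if matches:
            suggestions[(left, right)] = matches
    return suggestions


# Toy vocabulary (assumption): every token in the sentence is a valid word.
toy_vocab = ["a", "horta", "aorta", "abdominal"]
tokens = "a horta abdominal".split()
suggestions = bigram_merge_candidates(tokens, toy_vocab)
print(suggestions)
```

Generating the candidate is only half of the problem: since "a" and "horta" are both valid words, deciding whether "aorta" is the better reading still needs context, for example by scoring both sentences with a language model.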

Could you consider exploring another type of transformer model, such as T5 (or ByT5), which has a seq2seq architecture (a BERT-like encoder plus a GPT-like decoder) and so allows the input and output sentences to have different lengths?

@R1j1t
Owner Author

R1j1t commented Oct 10, 2021

Hey @piegu, first off, I want to thank you for your feedback. It feels terrific to have contributors, and even more so ones who help shape the logic! When I started this project, I wanted the library to generalise to multiple languages, hence the spaCy and BERT approach. I created tasks for myself (#44, #40), and I would like to read more on these topics. But lately, I have been occupied with my day job and have limited my contributions to contextualSpellCheck.

Regarding your 2nd point, I agree, and it is something I did not know. As pointed out in the comment by sgugger:

For this task, you need to either use a different model (coded yourself as it's not present in the library) or have your training set contain one [MASK] per token you want to mask. For instance if you want to mask all the tokens corresponding to one word (a technique called whole-word masking) what is typically done in training scripts is to replace all parts of one word by [MASK]. For pseudogener tokenized as pseudo, ##gene, that would mean having [MASK] [MASK].
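
The whole-word masking idea in the quote can be sketched in a few lines: every piece of the target word is replaced by its own [MASK]. The helper `mask_whole_word` and the example sentence are illustrative assumptions and are not taken from any training script.

```python
def mask_whole_word(pieces_per_word, target_index, mask_token="[MASK]"):
    """Replace every WordPiece of one word with a mask token."""
    masked = []
    for index, pieces in enumerate(pieces_per_word):
        if index == target_index:
            # One mask per piece, so the model can predict each subword.
            masked.extend([mask_token] * len(pieces))
        else:
            masked.extend(pieces)
    return masked


# Each inner list holds the pieces of one word (toy tokenisation, assumption).
pieces_per_word = [["a"], ["pseudo", "##gene"], ["is"], ["inactive"]]
masked = mask_whole_word(pieces_per_word, target_index=1)
print(masked)
```

The catch for spell checking is that the number of masks has to be chosen before prediction, while the number of pieces in the correct word is not known in advance.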

I would still want to depend on transformer models, as it adds the functionality of multilingual support. I will try to experiment with your suggestions and try to think of a solution myself for the same.

Hope you like the project. Feel free to contribute!

@R1j1t R1j1t added bug Something isn't working and removed enhancement New feature or request labels Oct 10, 2021
@wanglc02

wanglc02 commented Feb 6, 2024

I noticed that part of the logic of misspell_identify is:

        misspell = []
        for token in docCopy:
            if (
                (token.text.lower() not in self.vocab)

Will changing token.text.lower() into token.lemma_.lower() improve accuracy? According to https://spacy.io/api/lemmatizer, "as of v3.0, the Lemmatizer is a standalone pipeline component that can be added to your pipeline, and not a hidden part of the vocab that runs behind the scenes. This makes it easier to customize how lemmas should be assigned in your pipeline." So, the __contains__ method of self.vocab will not convert a token to its base form; we have to get the base form from token.lemma_.
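
A toy comparison of the two lookups is sketched below. spaCy exposes the lemma string as `token.lemma_`; the `ToyToken` class, the hand-assigned lemmas, and `flag_misspellings` are stand-ins invented for this sketch so that it runs without a spaCy model, and they are not the library's misspell_identify.

```python
class ToyToken:
    """Minimal stand-in for a spaCy token with text and lemma_ attributes."""

    def __init__(self, text, lemma_):
        self.text = text
        self.lemma_ = lemma_


def flag_misspellings(tokens, vocab, use_lemma=False):
    """Return the surface forms whose lookup key is missing from the vocab."""
    flagged = []
    for token in tokens:
        key = token.lemma_ if use_lemma else token.text
        if key.lower() not in vocab:
            flagged.append(token.text)
    return flagged


# Toy vocabulary and lemmas (assumptions, assigned by hand).
vocab = {"this", "is", "be", "a", "major", "mistake"}
tokens = [
    ToyToken("This", "this"),
    ToyToken("is", "be"),
    ToyToken("a", "a"),
    ToyToken("majour", "majour"),
    ToyToken("mistakes", "mistake"),
]
by_text = flag_misspellings(tokens, vocab)
by_lemma = flag_misspellings(tokens, vocab, use_lemma=True)
print(by_text, by_lemma)
```

In this toy case the lemma lookup stops the inflected form "mistakes" from being flagged. Whether that carries over to real pipelines depends on what the lemmatizer returns for misspelt input, which is not guaranteed to be meaningful.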


5 participants