Update the logic of misspell identification #44

Open
@R1j1t

Description

Is your feature request related to a problem? Please describe.
The current misspelling identification logic relies on vocab.txt from the transformer model. BERT tokenisers split less common words into sub-words and store only those sub-words in vocab.txt. As a result, a correctly spelt word may be absent from vocab.txt and be wrongly identified as misspelt.
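
To make the failure mode concrete, here is a minimal sketch of greedy longest-match WordPiece splitting over a toy vocabulary. The `wordpiece` helper, `toy_vocab`, and the example word are assumptions made up for illustration; they are not the contents of any real vocab.txt, nor the code this project uses.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no sub-word covers this position
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary: it holds the sub-words, but not the full word.
toy_vocab = {"the", "play", "##ing", "token", "##iser", "##s"}

word = "tokenisers"
print(wordpiece(word, toy_vocab))  # sub-words the tokeniser would produce
print(word in toy_vocab)           # a plain vocab lookup on the whole word
```

The word is spelt correctly and tokenises cleanly, yet a whole-word lookup in the vocabulary fails, which is the situation described above.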

Describe the solution you'd like
Still not clear; I need to look into some papers on this.

Describe alternatives you've considered
The alternatives I can think of right now are twofold:

  • ask the user to provide a list of such words and append them to the vocab.txt from the transformers model
  • if the proposed change is a sub-word of the form ##x, check the edit distance from the detokenised form of that sub-word joined with the previous word
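
Both alternatives could be sketched roughly as follows. This is one possible reading of the two bullets, not an agreed design: `extend_vocab`, `candidate_distance`, and the example strings are hypothetical names and data, and the Levenshtein routine is a plain-Python stand-in for whatever edit-distance library the project uses.

```python
def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def extend_vocab(vocab, user_words):
    """Alternative 1: treat user-supplied words as known vocabulary."""
    return set(vocab) | {w.strip() for w in user_words if w.strip()}


def candidate_distance(original, prev_piece, candidate):
    """Alternative 2: rejoin a ##x candidate with the previous piece,
    then compare the detokenised word with the original token."""
    if candidate.startswith("##"):
        candidate = prev_piece + candidate[2:]
    return edit_distance(original, candidate)


vocab = extend_vocab({"token", "##iser"}, ["tokeniser", "  "])
print("tokeniser" in vocab)

# Comparing against the bare sub-word overstates the distance;
# comparing against the rejoined word does not.
print(edit_distance("tokenisr", "##iser"))
print(candidate_distance("tokenisr", "token", "##iser"))
```

The first alternative only changes the lookup set, so it leaves the model's tokeniser untouched; the second changes how candidates are ranked, so it would apply wherever candidates are currently compared by edit distance.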

Additional context
#30 explosion/spaCy#3994

Labels: bug, help wanted

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions