Description
Is your feature request related to a problem? Please describe.
The current misspelling-identification logic relies on the vocab.txt file shipped with the transformer model. BERT tokenisers break less common words into subwords and store only those subwords in vocab.txt. Hence the original word may be absent from vocab.txt and be incorrectly identified as misspelt.
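To make the failure mode concrete, here is a minimal sketch of WordPiece-style greedy longest-match tokenisation. The toy vocab below is an assumption standing in for a real vocab.txt (which holds ~30k entries); a valid but less common word splits into subwords, so a whole-word membership check wrongly flags it:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword splitting, as in BERT tokenisers."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are ##-prefixed
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Toy stand-in for vocab.txt (an assumption, not the real file).
vocab = {"token", "##ising", "##iser", "the", "spell"}

# "tokenising" is a real word, but it is stored only as subwords,
# so a vocab-membership check would flag it as misspelt.
print(wordpiece_tokenize("tokenising", vocab))  # -> ['token', '##ising']
print("tokenising" in vocab)                    # -> False
```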
Describe the solution you'd like
Not clear yet; I need to look into some papers on this.
Describe alternatives you've considered
The alternatives I can think of right now are two-fold:
- ask the user to provide a list of such words and append them to the vocab.txt from the transformers model
- if the proposed change is a subword (##x), check the edit distance against the detokenised form of that word joined with the previous word
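The second alternative could be sketched roughly as below. This is a hedged illustration, not the issue's final design: the function names are made up, and a plain dynamic-programming Levenshtein distance stands in for whatever edit-distance library is actually used.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def subword_distance(prev_piece, candidate_piece, original_word):
    """If the candidate piece is a subword (##x), detokenise it together
    with the preceding piece before measuring edit distance."""
    if candidate_piece.startswith("##"):
        detokenised = prev_piece + candidate_piece[2:]
    else:
        detokenised = candidate_piece
    return levenshtein(detokenised, original_word)

# e.g. the tokeniser split "tokenising" into 'token' + '##ising';
# comparing the rejoined form against the original gives distance 0,
# so the word would not be treated as a misspelling.
print(subword_distance("token", "##ising", "tokenising"))  # -> 0
```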
Additional context
#30 explosion/spaCy#3994