Description
Is your feature request related to a problem? Please describe.
The current misspelling-identification logic relies on the vocab.txt file shipped with the transformer model. BERT tokenisers break less common words into subwords and store only those subwords in vocab.txt. Hence the original word may be absent from vocab.txt and be incorrectly identified as misspelt.
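To make the failure mode concrete, here is a minimal sketch of WordPiece-style greedy longest-match tokenisation. The toy vocab below is an assumption standing in for a real vocab.txt (which holds ~30k entries); a valid but less common word splits into subwords, so a whole-word membership check wrongly flags it:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword splitting, as in BERT tokenisers."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are ##-prefixed
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Toy stand-in for vocab.txt (an assumption, not the real file).
vocab = {"token", "##ising", "##iser", "the", "spell"}

# "tokenising" is a real word, but it is stored only as subwords,
# so a vocab-membership check would flag it as misspelt.
print(wordpiece_tokenize("tokenising", vocab))  # -> ['token', '##ising']
print("tokenising" in vocab)                    # -> False
```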
Describe the solution you'd like
Not clear yet; I need to look into some papers on this.
Describe alternatives you've considered
The alternatives I can think of right now are two-fold:
- ask the user to provide a list of such words and append them to the vocab.txt from the transformers model
- if the proposed change is a subword (##x), check the edit distance against the detokenised form of that word joined with the previous word
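The second alternative could be sketched roughly as below. This is a hedged illustration, not the issue's final design: the function names are made up, and a plain dynamic-programming Levenshtein distance stands in for whatever edit-distance library is actually used.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def subword_distance(prev_piece, candidate_piece, original_word):
    """If the candidate piece is a subword (##x), detokenise it together
    with the preceding piece before measuring edit distance."""
    if candidate_piece.startswith("##"):
        detokenised = prev_piece + candidate_piece[2:]
    else:
        detokenised = candidate_piece
    return levenshtein(detokenised, original_word)

# e.g. the tokeniser split "tokenising" into 'token' + '##ising';
# comparing the rejoined form against the original gives distance 0,
# so the word would not be treated as a misspelling.
print(subword_distance("token", "##ising", "tokenising"))  # -> 0
```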
Additional context
#30 explosion/spaCy#3994