-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Not detecting word correctly? #126
Comments
The pyenchant tokenizer does not consider digits as being part of the character set making up valid words, so they are removed from the tokens it returns. There is an option to pass additional valid characters to the tokenizer, but that is not exposed by the sphinx integration layer. I don't think it would help in your case, because the word cannot start or end with those extra characters (the feature is meant to handle situations like apostrophes). You could write a new tokenizer that did include digits in the set of characters making up valid words, but that might be extreme for your case. I think for that to work properly you would need to have the tokenizer accepted in pyenchant so that its name can be used with get_tokenizer(). It looks like pybind11 is the name of a package. The PyPI package name filter won't work because the trailing 11 are removed so the package name won't match the token. What I usually do in cases like that is either add the base package name to the spelling list or add literal markup around the name in the text to make the spell checker ignore it. |
Hi, PyEnchant maintainer here.
I concur. Also, writing a tokenizer is rarely fun.
I think this is the way to go :) |
Thanks for the advice, @dmerejkowsky, and for maintaining PyEnchant! |
I just want to add that when it sees words like P.S. Here's MyST Markdown code snippet that I have: ## [{sub-ref}`release`] - as of {sub-ref}`today` _{subscript}`/UNRELEASED DRAFT/`_ |
Just realized that the misspelled sequences may also be in the middle of the commit hash: |
The pyenchant tokenizer is basic in what it's capable of, but involves a lot of scaffolding code. Trying to further leverage that code is not a good way forward. A proper tokenizer solution would take advantage of user dictionaries to extend tokenization. Something like Aho-Corasick, and if it fails, backtracking and falling back on the current solution of language-specific lexeme matching and passing to enchant. EDIT: Depending on the performance, |
What would need to happen and (and in which repo) to get this to work? I'll happily code something if it means that "doesn't" and "ad-hoc" officially become words again |
Looks like maybe the blocker is here pyenchant/pyenchant#286 @dmerejkowsky any chance you'd be able to take a look at that issue? |
See #147 also, but yes, we need a supportable API to set the word separators. |
Thanks for being so responsive. Too bad that it seems like pyenchant is kind of in coast mode now |
I think the main issue is that none of the maintainers of either project have a real need for the feature, so we're prioritizing other things. There's some advice in this ticket and the PR I linked that could lead someone to implement something and propose it to PyEnchant. It would also be possible to wrap the existing tokenizer and recombine split contractions (for example, recognize "don" and "t" as consecutive tokens and emit "don't" as one token). It's a bit hacky, but ought to work, and that sort of implementation could live independently of either project, and we could provide an interface for configuring it through Sphinx. |
Interesting ideas, thank you for the insight. A PR against pyenchant would definitely be best, seems like maybe they need some maintenance help getting through issues/PRs though (any chance you'd be interested, if Dimitri offered? You're active and sure know your tokenization stuff:) ). I could volunteer to code something (seems like stevepiercy might have offered as well) but just want to make sure not to duplicate efforts between a wrapper and a PR without getting some input from pyenchant first |
I'm not taking on new projects right now, so you won't be duplicating anything I'm doing. If you're interested, coordinating with the other folks who have expressed interest is a good idea. |
Actually, I'm the sole maintainer of PyEnchant right now. This was not so bad for a long time because I was using PyEnchant very regularly - it's how I implemented a better spell checker for Kakoune. But nowadays I use yet another spellchecker for Kakoune called skyspell, which means I no longer use PyEnchant ... So yeah, it would be best for PyEnchant (and its downstream projects like this one) if someone would volunteer to help maintain PyEnchant. Feel free to contact me for the details. |
Thanks, @dmerejkowsky |
I have the word
pybind11
in my text and have addedpybind11
to my spelling list. However, it seems to ignore the trailing numbers and think the word ispybind
only, creating a spelling error. I could addpybind
to the spelling list, but thenpybind12
would be valid, which isn't ideal. Why aren't the trailing numbers considered to be a part of the word? I didn't see any configuration option about changing that behavior.The text was updated successfully, but these errors were encountered: