Misspelled word suggestions #45

Open
tgross35 opened this issue Jan 2, 2023 · 5 comments

@tgross35

tgross35 commented Jan 2, 2023

It may be good to provide a SuggestionConfig or similar struct that could be passed as an argument to our .suggest function (or similar). There are some different functionalities we could use:

  • REP: simple replacements based on the affix config file
  • Phonetic replacements from the affix config file. These seem rarer and are likely useful only for text-to-speech
  • Levenshtein distance
  • TryChars: candidates that differ from the word by a single character, drawn from the TRY characters in the affix config file
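
As a rough illustration of the Levenshtein option, here is a minimal sketch. The `SuggestionConfig` fields (`max_distance`, `max_results`) and the `suggest_by_distance` helper are hypothetical names invented for this example, not part of the crate, and a linear scan over the dictionary is only practical for small word lists.

```rust
/// Hypothetical configuration; the field names are illustrative only.
#[derive(Debug, Clone)]
pub struct SuggestionConfig {
    /// Largest edit distance at which a dictionary word still counts as a suggestion.
    pub max_distance: usize,
    /// Maximum number of suggestions to return.
    pub max_results: usize,
}

/// Two-row dynamic-programming Levenshtein distance, computed over `char`s.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut cur = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        cur[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur[j + 1] = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1) // deletion
                .min(cur[j] + 1); // insertion
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[b.len()]
}

/// Scan a word list and keep entries within `cfg.max_distance`,
/// ordered by distance and then alphabetically.
pub fn suggest_by_distance<'a>(
    word: &str,
    dict: &[&'a str],
    cfg: &SuggestionConfig,
) -> Vec<&'a str> {
    let mut scored: Vec<(usize, &'a str)> = dict
        .iter()
        .map(|&w| (levenshtein(word, w), w))
        .filter(|&(d, _)| d <= cfg.max_distance)
        .collect();
    scored.sort();
    scored
        .into_iter()
        .take(cfg.max_results)
        .map(|(_, w)| w)
        .collect()
}
```

Calling `suggest_by_distance("hav", &["have", "hive", "halve", "cat"], &cfg)` with `max_distance: 1` would keep only `"have"`, since the other entries are two edits away.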

Possible API (from #16)

fn check_with_suggestions(&self, s: &str) -> Suggestions<'_>

enum Suggestions<'a> {
    Correct,
    // Suggested words, borrowed from the dictionary
    Incorrect(Vec<&'a str>),
}
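
To show how a caller might use an API of this shape, here is a minimal, self-contained sketch. The `Dictionary` type, its `new` constructor, and the "differs by exactly one character" rule are assumptions made for the example. They stand in for whatever the crate ends up doing internally.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
pub enum Suggestions<'a> {
    Correct,
    /// Suggested words, borrowed from the dictionary.
    Incorrect(Vec<&'a str>),
}

/// Hypothetical dictionary type: a sorted set of known words.
pub struct Dictionary {
    words: BTreeSet<String>,
}

impl Dictionary {
    pub fn new(words: &[&str]) -> Self {
        Self {
            words: words.iter().map(|w| w.to_string()).collect(),
        }
    }

    /// Returns `Correct` for known words; otherwise returns the dictionary
    /// words that differ from `s` in exactly one character position.
    pub fn check_with_suggestions(&self, s: &str) -> Suggestions<'_> {
        if self.words.contains(s) {
            return Suggestions::Correct;
        }
        let found = self
            .words
            .iter()
            .map(String::as_str)
            .filter(|w| differs_by_one_char(w, s))
            .collect();
        Suggestions::Incorrect(found)
    }
}

/// True when both words have the same length and exactly one differing character.
fn differs_by_one_char(a: &str, b: &str) -> bool {
    a.chars().count() == b.chars().count()
        && a.chars().zip(b.chars()).filter(|(x, y)| x != y).count() == 1
}
```

Returning borrowed `&str` values ties the result to the dictionary's lifetime, which avoids allocating a `String` per suggestion. The trade-off is that only words already stored in the dictionary can be returned this way.
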
@tgross35

tgross35 commented Jan 2, 2023

We should probably look at how Hunspell does this.

@tgross35

tgross35 commented Jan 2, 2023

This site describes how Hunspell works https://zverok.space/blog/2021-01-28-spellchecker-5.html

  1. Change the word to the uppercase (see also “Word case” sub-section below);
  2. Replace common misspellings, like “f”→“ph” and vice versa, defined by REP table from aff-file;
  3. Split the word in two parts in every position (with space or dash), to be tested as a single dictionary entry, like “ad hoc” (see also step 13 below);
  4. Replace related chars, like “a”, “å”, “ä”, defined by MAP table from aff-file;
  5. Swap every two adjacent letters,
    oh, and for 4- and 5-letter words also try two swaps: “ahev” → “have”;
  6. Swap two non-adjacent letters (up to distance 4);
  7. Replace every letter with the adjacent on the keyboard, e.g. “miraclw” → “miracle”. The keyboard layout is defined by KEY directive in aff-file;
    and, on the same step, with the capitalized version of the character (“paris” → “Paris”, but not vice versa), also considered as a possible keyboard-related error;
  8. Remove every letter in turn;
  9. Insert every letter from the language’s alphabet (defined by TRY directive in aff-file) into every position;
  10. Move every letter forward and backward into all possible positions;
  11. Replace every letter with every other letter from the language’s alphabet;
  12. Find a duplicated pair of letters and remove it: “chicicken” → “chicken”;
  13. Split the word in two in every position, to be tested as two separate words (see also step 3 above).
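
A few of these steps can be sketched as plain candidate generators. The code below covers only steps 5 (single adjacent swaps), 8, and 9, and filters the candidates against a word set. It is an illustrative sketch, not Hunspell's implementation: the function names are made up for this example, and `try_chars` stands in for the alphabet given by the TRY directive.

```rust
use std::collections::BTreeSet;

/// Step 5 (single swaps only): swap every pair of adjacent letters.
pub fn adjacent_swaps(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    (0..chars.len().saturating_sub(1))
        .map(|i| {
            let mut c = chars.clone();
            c.swap(i, i + 1);
            c.into_iter().collect::<String>()
        })
        .collect()
}

/// Step 8: remove every letter in turn.
pub fn removals(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    (0..chars.len())
        .map(|i| {
            chars
                .iter()
                .enumerate()
                .filter(|&(j, _)| j != i)
                .map(|(_, &c)| c)
                .collect::<String>()
        })
        .collect()
}

/// Step 9: insert every character of `try_chars` into every position.
pub fn insertions(word: &str, try_chars: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut out = Vec::new();
    for i in 0..=chars.len() {
        for t in try_chars.chars() {
            let mut c = chars.clone();
            c.insert(i, t);
            out.push(c.into_iter().collect::<String>());
        }
    }
    out
}

/// Run the generators in step order and keep candidates found in `dict`,
/// dropping duplicates while preserving first-seen order.
pub fn suggest(word: &str, try_chars: &str, dict: &BTreeSet<&str>) -> Vec<String> {
    let mut seen = BTreeSet::new();
    adjacent_swaps(word)
        .into_iter()
        .chain(removals(word))
        .chain(insertions(word, try_chars))
        .filter(|c| dict.contains(c.as_str()) && seen.insert(c.clone()))
        .collect()
}
```

Running the generators in the listed order means results from earlier steps (such as a swap) come before results from later ones (such as an insertion), which gives a simple ranking without any scoring.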

@tgross35

tgross35 commented Jan 2, 2023

@cpu

cpu commented Oct 17, 2023

👋 I would be interested in using this crate to replace hunspell-rs/hunspell-sys with a memory safe alternative, but my use case requires misspelled word suggestions. Just leaving a comment here in case knowing it would be useful to a downstream project helps motivate the work.

Thank you!

@tgross35

@cpu Thanks for the feedback! I do indeed have prototype suggestions working, but some changes are needed before they are reliable :) In particular, I probably have to unblock #54 first.
