diff --git a/README.md b/README.md
index ca23cc6..1b5af39 100644
--- a/README.md
+++ b/README.md
@@ -52,12 +52,12 @@ if __name__ == "__main__":
     main()
 ```
 
-`Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I highly recommend. The `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the
+`Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I recommend using. The `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the
 dedicated Toki Pona tokenizer `word_tokenize_tok`.
 
 ## Development
 
 1. Install [pdm](https://github.com/pdm-project/pdm)
-1. `pdm sync --dev`
+1. `pdm install --dev`
 1. Open any file you like!
 
 ## FAQ
@@ -68,4 +68,26 @@ The intent is to show our methodology to the Unicode Consortium, particularly to
 
 After our proposal has been examined and a result given by the committee, I will translate this file and library into Toki Pona, with a note left behind for those who do not understand it.
 
-### Why aren't any of the specific
+### What's the deal with the tokenizers?
+
+The Toki Pona tokenizer `word_tokenize_tok` is very specific: it always separates writing characters from punctuation, while leaving contiguous punctuation contiguous. This is a level of precision that NLTK's English tokenizer does not want for several reasons; for example, English words can contain "punctuation" characters such as apostrophes and hyphens.
+
+Toki Pona doesn't have any mid-word symbols when rendered in the Latin alphabet, so a more aggressive tokenizer is highly desirable.
+
+The other tokenizers are provided as a comparison case more than anything. I do not recommend their use.
+
+### Aren't there a lot of false positives?
+
+Yes. It's up to you to use this tool responsibly: do your best to clean the input first, and, better yet, apply stronger filters before weaker ones. For now, here's a list of relevant false positives:
+
+- `ProperName` will errantly match text in languages without a capital/lowercase distinction, artificially inflating the scores.
+- `Alphabetic` will match a _lot_ of undesirable text: it essentially allows any word spelled with 14 letters of the English alphabet.
+
+### Don't some of the cleaners/filters conflict?
+
+Yes. Some do:
+
+- `ConsecutiveDuplicates` may errantly change a word's validity. For example, "manna" is phonotactically invalid in Toki Pona, but would become "mana", which is valid.
+- `ConsecutiveDuplicates` will not work correctly with syllabaries (writing systems whose characters each represent a whole syllable, typically a consonant-vowel pair).
+
+You'll notice that a _lot_ of these issues stem from applying Latin-alphabet filters to non-Latin text. Working on it!
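The tokenizing behavior described in the FAQ above (letters split from punctuation, punctuation runs kept contiguous) can be sketched with a single regular expression. This is an illustrative sketch only, not the library's actual `word_tokenize_tok` implementation:

```python
import re

def tokenize_sketch(text: str) -> list[str]:
    # Runs of letters become word tokens; runs of non-letter,
    # non-space characters stay together as single punctuation tokens.
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]+", text)

print(tokenize_sketch("toki! sina pona ala pona?"))
# → ['toki', '!', 'sina', 'pona', 'ala', 'pona', '?']
```

Note that an English tokenizer cannot be this aggressive: the same rule would split "don't" into three tokens.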
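To see why `Alphabetic` over-matches, here is a minimal sketch assuming the filter simply checks that every letter belongs to Toki Pona's 14-letter alphabet (the function name is illustrative, not the library's API):

```python
# Toki Pona's 14 letters in the Latin alphabet.
TOKI_PONA_LETTERS = set("aeijklmnopstuw")

def looks_alphabetic(word: str) -> bool:
    # True when every character is a valid Toki Pona letter.
    return all(c in TOKI_PONA_LETTERS for c in word.lower())

print(looks_alphabetic("kili"))  # Toki Pona word → True
print(looks_alphabetic("stop"))  # English word, but passes anyway → True
```

Any English word that happens to avoid the other 12 letters slips through, which is where the inflated scores come from.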
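The `ConsecutiveDuplicates` pitfall above can be demonstrated with a naive duplicate-collapsing cleaner (again a sketch, not the library's exact implementation):

```python
import re

def collapse_duplicates(word: str) -> str:
    # Replace any run of the same character with a single copy.
    return re.sub(r"(.)\1+", r"\1", word)

# "manna" is phonotactically invalid in Toki Pona, but the cleaner
# turns it into "mana", which is valid.
print(collapse_duplicates("manna"))  # → "mana"
```

This is why ordering matters: running a cleaner like this before a phonotactic filter can launder invalid words into valid-looking ones.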