o suli e sona ("expand the knowledge")
gregdan3 committed May 3, 2024
1 parent 1ea6431 commit faa3781
Showing 1 changed file, README.md, with 25 additions and 3 deletions.
@@ -52,12 +52,12 @@ if __name__ == "__main__":
main()
```

- `Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I highly recommend. The `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the
+ `Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I recommend using. The `Tokenizers` module contains several other word tokenizers, but their performance will be worse than that of the dedicated Toki Pona tokenizer `word_tokenize_tok`.

## Development

1. Install [pdm](https://github.com/pdm-project/pdm)
- 1. `pdm sync --dev`
+ 1. `pdm install --dev`
1. Open any file you like!

## FAQ
@@ -68,4 +68,26 @@ The intent is to show our methodology to the Unicode Consortium, particularly to

After our proposal has been examined and a result given by the committee, I will translate this file and library into Toki Pona, with a note left behind for those who do not understand it.

- ### Why aren't any of the specific
+ ### What's the deal with the tokenizers?

The Toki Pona tokenizer `word_tokenize_tok` is deliberately strict: it always separates writing characters from punctuation, and it keeps contiguous punctuation contiguous. This is a level of precision that NLTK's English tokenizer avoids for several reasons, one being that English words can contain "punctuation" characters such as apostrophes.

Toki Pona doesn't have any mid-word symbols when rendered in the Latin alphabet, so a more aggressive tokenizer is highly desirable.
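
To make that behavior concrete, here is a minimal regex sketch of the splitting rule described above. It is an illustration only, not the actual implementation of `word_tokenize_tok`; the function name and pattern are assumptions of mine.

```python
import re

# Illustration only: split runs of letters apart from runs of punctuation,
# while keeping contiguous punctuation together.
TOKEN_RE = re.compile(r"[a-zA-Z]+|[^a-zA-Z\s]+")


def toy_tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)


print(toy_tokenize("toki! sina pilin seme???"))
# ['toki', '!', 'sina', 'pilin', 'seme', '???']
print(toy_tokenize("don't"))
# ['don', "'", 't'], which is exactly the behavior NLTK's English tokenizer avoids
```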

The other tokenizers are provided mostly for comparison; I do not recommend using them.

### Aren't there a lot of false positives?

Yes. It's up to you to use this tool responsibly: clean your input as best you can, and apply stronger filters before weaker ones. For now though, here's a list of relevant false positives:

- `ProperName` will errantly match text in languages without a capital/lowercase distinction, artificially inflating the scores.
- `Alphabetic` will match a _lot_ of undesirable text; it essentially allows any word built from 14 letters of the English alphabet (see the sketch after this list).
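
To make the `Alphabetic` caveat concrete, here is a toy letters-only check against Toki Pona's 14-letter alphabet. It is a sketch of the failure mode, not the library's actual `Alphabetic` filter, and the names in it are assumptions.

```python
# Illustration only: a letters-only check reduced to set membership against
# Toki Pona's 14 letters. Ordinary English words slip right through.
TOKI_PONA_LETTERS = set("aeijklmnopstuw")


def passes_alphabetic(word: str) -> bool:
    return all(letter in TOKI_PONA_LETTERS for letter in word.lower())


for word in ["potato", "salute", "antelope", "xylophone"]:
    print(word, passes_alphabetic(word))
# potato True / salute True / antelope True / xylophone False
```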

### Don't some of the cleaners/filters conflict?

Yes. Some do:

- `ConsecutiveDuplicates` may errantly change a word's validity. For example, "manna" is phonotactically invalid in Toki Pona, but would become "mana", which is valid (see the sketch after this list).
- `ConsecutiveDuplicates` will not work correctly with syllabaries (writing systems like alphabets, but where each character represents a consonant-vowel pair).
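
To make the first point concrete, here is a toy sketch of the kind of duplicate-collapsing `ConsecutiveDuplicates` performs. It is not the library's actual cleaner; the regex approach and function name are assumptions.

```python
import re

# Illustration only: collapse runs of the same character down to one,
# in the spirit of ConsecutiveDuplicates.
def collapse_duplicates(word: str) -> str:
    return re.sub(r"(.)\1+", r"\1", word)


print(collapse_duplicates("mooooku"))  # 'moku', the intended use case
print(collapse_duplicates("manna"))    # 'mana', an invalid word becomes valid
```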

You'll notice that a _lot_ of these are troubles with applying Latin-alphabet filters to non-Latin text. Working on it!
