o suli e sona ("expand the knowledge")
gregdan3 committed May 3, 2024
1 parent 1ea6431 commit faa3781
Showing 1 changed file, README.md, with 25 additions and 3 deletions.
@@ -52,12 +52,12 @@ if __name__ == "__main__":
main()
```

- `Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I highly recommend. The `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the
+ `Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I recommend using. The `Tokenizers` module contains several other word tokenizers, but their performance will be worse than that of the dedicated Toki Pona tokenizer `word_tokenize_tok`.

## Development

1. Install [pdm](https://github.com/pdm-project/pdm)
- 1. `pdm sync --dev`
+ 1. `pdm install --dev`
1. Open any file you like!

## FAQ
@@ -68,4 +68,26 @@ The intent is to show our methodology to the Unicode Consortium, particularly to

After our proposal has been examined and a result given by the committee, I will translate this file and library into Toki Pona, with a note left behind for those who do not understand it.

- ### Why aren't any of the specific
+ ### What's the deal with the tokenizers?

The Toki Pona tokenizer `word_tokenize_tok` is deliberately strict: it always separates writing characters from punctuation, and it keeps contiguous punctuation contiguous. This is a level of precision that NLTK's English tokenizer avoids for several reasons, one being that English words can contain "punctuation" characters such as apostrophes.

Toki Pona doesn't have any mid-word symbols when rendered in the Latin alphabet, so a more aggressive tokenizer is highly desirable.
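
To make that behavior concrete, here is a minimal regex sketch of the splitting rule described above. It is an illustration only, not the actual implementation of `word_tokenize_tok`; the function name and pattern are assumptions of mine.

```python
import re

# Illustration only: split runs of letters apart from runs of punctuation,
# while keeping contiguous punctuation together.
TOKEN_RE = re.compile(r"[a-zA-Z]+|[^a-zA-Z\s]+")


def toy_tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)


print(toy_tokenize("toki! sina pilin seme???"))
# ['toki', '!', 'sina', 'pilin', 'seme', '???']
print(toy_tokenize("don't"))
# ['don', "'", 't'], which is exactly the behavior NLTK's English tokenizer avoids
```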

The other tokenizers are provided mostly for comparison; I do not recommend using them.

### Aren't there a lot of false positives?

Yes. It's up to you to use this tool responsibly: clean your input as best you can, and apply stronger filters before weaker ones. For now though, here's a list of relevant false positives:

- `ProperName` will errantly match text in languages without a capital/lowercase distinction, artificially inflating the scores.
- `Alphabetic` will match a _lot_ of undesirable text; it essentially allows any word built from 14 letters of the English alphabet (see the sketch after this list).
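
To make the `Alphabetic` caveat concrete, here is a toy letters-only check against Toki Pona's 14-letter alphabet. It is a sketch of the failure mode, not the library's actual `Alphabetic` filter, and the names in it are assumptions.

```python
# Illustration only: a letters-only check reduced to set membership against
# Toki Pona's 14 letters. Ordinary English words slip right through.
TOKI_PONA_LETTERS = set("aeijklmnopstuw")


def passes_alphabetic(word: str) -> bool:
    return all(letter in TOKI_PONA_LETTERS for letter in word.lower())


for word in ["potato", "salute", "antelope", "xylophone"]:
    print(word, passes_alphabetic(word))
# potato True / salute True / antelope True / xylophone False
```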

### Don't some of the cleaners/filters conflict?

Yes. Some do:

- `ConsecutiveDuplicates` may errantly change a word's validity. For example, "manna" is phonotactically invalid in Toki Pona, but would become "mana", which is valid (see the sketch after this list).
- `ConsecutiveDuplicates` will not work correctly with syllabaries (writing systems like alphabets, but where each character represents a consonant-vowel pair).
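
To make the first point concrete, here is a toy sketch of the kind of duplicate-collapsing `ConsecutiveDuplicates` performs. It is not the library's actual cleaner; the regex approach and function name are assumptions.

```python
import re

# Illustration only: collapse runs of the same character down to one,
# in the spirit of ConsecutiveDuplicates.
def collapse_duplicates(word: str) -> str:
    return re.sub(r"(.)\1+", r"\1", word)


print(collapse_duplicates("mooooku"))  # 'moku', the intended use case
print(collapse_duplicates("manna"))    # 'mana', an invalid word becomes valid
```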

You'll notice that a _lot_ of these are troubles with applying Latin-alphabet filters to non-Latin text. Working on it!
