Commit

o sin e sona ilo
gregdan3 committed May 5, 2024
1 parent ab14c3d commit 33908a5
Showing 1 changed file with 29 additions and 23 deletions.
52 changes: 29 additions & 23 deletions README.md
@@ -6,9 +6,9 @@ This library, "Language Knowledge," helps you identify whether a message is in T

I wrote it with a variety of scraps and lessons learned from a prior project, [ilo pi toki pona taso, "toki-pona-only tool"](https://github.com/gregdan3/ilo-pi-toki-pona-taso). That tool will be rewritten to use this library shortly.

-If you've ever worked on a similar project, you know the question "is this message in [language]" is not a consistent one- the environment, time, preferences of the speaker, and much more, can all alter whether a given message is "in toki pona," and this applies to essentially any language.
+If you've ever worked on a similar project, you know the question "is this message in [language]" is not a consistent one- the environment, time, preferences of the speaker, and much more, can all alter whether a given message is "in" any specific language, and this question applies to Toki Pona too.

-This project "solves" that complex problem by offering a highly configurable and incredibly lazy parser
+This project "solves" that complex problem by offering a highly configurable parser, so you can tune it to your preferences and goals.

## Quick Start

@@ -22,28 +22,11 @@ pdm add sonatoki
Then get started with a script along these lines:

```py
-from sonatoki.Filters import (
-    Numerics,
-    Syllabic,
-    NimiLinku,
-    Alphabetic,
-    ProperName,
-    Punctuations,
-)
-from sonatoki.Scorers import SoftScaling
-from sonatoki.Cleaners import ConsecutiveDuplicates
-from sonatoki.Tokenizers import word_tokenize_tok
-from sonatoki.Preprocessors import URLs, DiscordEmotes
+from sonatoki.ilo import Ilo
+from sonatoki.Configs import PrefConfig

 def main():
-    ilo = Ilo(
-        preprocessors=[URLs, DiscordEmotes],
-        ignoring_filters=[Numerics, Punctuations],
-        scoring_filters=[NimiLinku, Syllabic, ProperName, Alphabetic],
-        cleaners=[ConsecutiveDuplicates],
-        scorer=SoftScaling,
-        tokenizer=word_tokenize_tok,
-    )
+    ilo = Ilo(**PrefConfig)
     ilo.is_toki_pona("imagine how is touch the sky") # False
     ilo.is_toki_pona("o pilin insa e ni: sina pilin e sewi") # True
     ilo.is_toki_pona("I Think I Can Evade Detection") # False
@@ -52,7 +35,30 @@ if __name__ == "__main__":
     main()
```

-`Ilo` is highly configurable by design, so I recommend exploring the `Preprocessors`, `Filters`, and `Scorers` modules. The `Cleaners` module only contains one cleaner, which I recommend using. The `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the dedicated Toki Pona tokenizer `word_tokenize_tok`.
+Or if you'd prefer to configure on your own:
+
+```py
+from copy import deepcopy
+from sonatoki.ilo import Ilo
+from sonatoki.Configs import BaseConfig
+from sonatoki.Filters import NimiPuAle, Phonotactic, ProperName
+from sonatoki.Scorers import SoftPassFail
+
+def main():
+    config = deepcopy(BaseConfig)
+    config["scoring_filters"].extend([NimiPuAle, Phonotactic, ProperName])
+    config["scorer"] = SoftPassFail
+
+    ilo = Ilo(**config)
+    ilo.is_toki_pona("mu mu!") # True
+    ilo.is_toki_pona("mi namako e moku mi") # True
+    ilo.is_toki_pona("ma wulin") # False
+
+if __name__ == "__main__":
+    main()
+```
+
+`Ilo` is highly configurable by necessity, so I recommend looking through the premade configs in `Configs` as well as the individual `Preprocessors`, `Filters`, and `Scorers`. The `Cleaners` module only contains one cleaner, which I recommend always using. Similarly, the `Tokenizers` module contains several other word tokenizers, but their performance will be worse than the dedicated Toki Pona tokenizer `WordTokenizerTok`.

## Development

0 comments on commit 33908a5
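
A note on the custom-configuration example this commit adds to the README: it calls `deepcopy(BaseConfig)` before extending `config["scoring_filters"]`. The sketch below illustrates why that copy probably matters, using a plain dictionary as a stand-in. `BASE_CONFIG`, its keys' initial values, and the `customize` helper are hypothetical and are not sonatoki's API; the filter and scorer names are borrowed from the README only as string labels.

```python
from copy import copy, deepcopy

# Hypothetical stand-in for a premade config such as BaseConfig.
# The real config's contents may differ; only the dict-of-lists shape is assumed.
BASE_CONFIG = {
    "scoring_filters": [],
    "scorer": "default-scorer",
}


def customize(base, extra_filters, scorer, *, deep=True):
    """Build a customized config from `base` (illustrative helper, not part of sonatoki)."""
    # A deep copy duplicates the nested list; a shallow copy shares it with `base`.
    config = deepcopy(base) if deep else copy(base)
    config["scoring_filters"].extend(extra_filters)
    config["scorer"] = scorer
    return config


# Deep copy: the shared base config is left untouched.
safe = customize(BASE_CONFIG, ["NimiPuAle", "ProperName"], "SoftPassFail")
base_filters_after_deep = list(BASE_CONFIG["scoring_filters"])

# Shallow copy: extending the nested list leaks into the base config.
leaky = customize(BASE_CONFIG, ["Phonotactic"], "SoftPassFail", deep=False)
base_filters_after_shallow = list(BASE_CONFIG["scoring_filters"])

print(base_filters_after_deep)
print(base_filters_after_shallow)
```

With a shallow copy, every later `Ilo(**BaseConfig)`-style use of the base would see the extra filter, while reassigning a top-level key such as `"scorer"` stays local to the copy in both cases.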