roadmap of cleaning perseus #30

Open
jankounchained opened this issue Feb 7, 2023 · 0 comments
Labels
data enhancement New feature or request

Comments


jankounchained commented Feb 7, 2023

Perseus idiosyncrasies

Lemmas
Root and suffix are sometimes dash-separated.
This happens only for VERBs.
Learn the reason for this & remove dashes (possibly also suffixes if it makes sense).
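
A minimal sketch of the dash-removal step, assuming lemma and UPOS have already been read from the treebank. The function name `strip_lemma_dashes` and the placeholder lemma are hypothetical. Suffix removal is left out on purpose, since the reason for the dashes is still unknown.

```python
def strip_lemma_dashes(lemma: str, upos: str) -> str:
    """Remove dash separators from VERB lemmas; leave everything else untouched."""
    # Guard against a lemma that is itself just a dash.
    if upos == "VERB" and len(lemma) > 1:
        return lemma.replace("-", "")
    return lemma


# "root-suffix" is a placeholder, not a real Perseus lemma.
print(strip_lemma_dashes("root-suffix", "VERB"))  # rootsuffix
print(strip_lemma_dashes("root-suffix", "NOUN"))  # root-suffix
```

Restricting the change to VERBs keeps any legitimate dashes in other parts of speech intact until the cause is understood.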

Beta code encoding errors
Make sure that every character is a valid Ancient Greek letter.
For example, ἀλλ̓ should be ἀλλ’
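
One way to approach this check, as a sketch only: it assumes the error in the example is a combining comma above (U+0313) attached to a consonant, where an apostrophe (U+2019) was intended. The helper names `fix_elision` and `suspicious_chars` are hypothetical, and the consonant list is an assumption that may need adjusting.

```python
import unicodedata

COMBINING_PSILI = "\u0313"  # COMBINING COMMA ABOVE (smooth breathing)
APOSTROPHE = "\u2019"       # RIGHT SINGLE QUOTATION MARK
# Consonants that cannot carry a breathing mark (rho is left out on purpose).
CONSONANTS = set("βγδζθκλμνξπστφχψς")


def fix_elision(token: str) -> str:
    """Replace a breathing mark stranded on a consonant with an apostrophe."""
    out = []
    for ch in unicodedata.normalize("NFC", token):
        if ch == COMBINING_PSILI and out and out[-1].lower() in CONSONANTS:
            out.append(APOSTROPHE)
        else:
            out.append(ch)
    return "".join(out)


def suspicious_chars(token: str) -> list[str]:
    """List characters that are neither Greek (by Unicode name) nor the apostrophe."""
    return [
        ch
        for ch in unicodedata.normalize("NFC", token)
        if ch != APOSTROPHE and not unicodedata.name(ch, "").startswith("GREEK")
    ]


bad = "\u1f00\u03bb\u03bb\u0313"  # the example token from above, written with escapes
print(suspicious_chars(bad))               # flags the stray U+0313
print(suspicious_chars(fix_elision(bad)))  # nothing left to flag
```

`suspicious_chars` only flags candidates for review. Legitimate combining marks that have no precomposed form would also be reported, so its output should be inspected rather than deleted automatically.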

Compatibility with Proiel

XPOS
XPOS tags just contain the combination of UPOS & FEATS; no new information is introduced. Our model should not learn them or use them as labels.
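
If the data is stored as CoNLL-U, XPOS can be blanked before training. A rough sketch, assuming the standard ten tab-separated columns with XPOS in the fifth; `blank_xpos` is a hypothetical helper and the token row is a placeholder.

```python
def blank_xpos(line: str) -> str:
    """Return a CoNLL-U line with the XPOS column set to '_'."""
    # Comments and blank sentence separators pass through unchanged.
    if not line.strip() or line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return line  # not a token row; leave it alone
    cols[4] = "_"    # XPOS is the fifth column
    return "\t".join(cols)


row = "1\tform\tlemma\tVERB\tXTAG\tMood=Ind\t0\troot\t_\t_"
print(blank_xpos(row))
```

Token rows come back without a trailing newline, so the caller needs to add one when writing the file.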

Morphological features
Proiel has both more features and more possible values per feature.
How do we avoid learning Proiel-specific features when training on their data?
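
One possible answer, sketched under assumptions: restrict the FEATS column of the Proiel data to the feature names and values that also occur in Perseus. The function `filter_feats` is hypothetical, and the `allowed` inventory below is made up for illustration; the real one would have to be collected from the Perseus treebank.

```python
def filter_feats(feats: str, allowed: dict[str, set[str]]) -> str:
    """Keep only Name=Value pairs whose name and value appear in `allowed`."""
    if feats == "_":
        return feats
    kept = []
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        # Multi-valued entries such as "Case=Acc,Dat" are dropped by this check.
        if value in allowed.get(name, set()):
            kept.append(pair)
    return "|".join(kept) if kept else "_"


# Illustrative inventory only, not taken from either treebank.
allowed = {"Case": {"Nom", "Acc"}, "Number": {"Sing", "Plur"}}
print(filter_feats("Case=Nom|Number=Sing|Polarity=Neg", allowed))  # Case=Nom|Number=Sing
```

Dropping a pair loses information rather than mapping it, so where a Proiel value has a clear Perseus equivalent a conversion table may be the better choice.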

POS
Perseus doesn't have PROPN, while Proiel doesn't have PUNCT.

  • Does training on Proiel make our senter worse?
  • Should we convert PROPN to NOUN? Or are we learning PROPN well enough to keep it?
@jankounchained jankounchained added enhancement New feature or request data labels Feb 7, 2023