tokens beginning with parentheses are treated entirely as punctuation #3

Closed
jtauber opened this issue Feb 1, 2021 · 5 comments

Comments

@jtauber

jtauber commented Feb 1, 2021

e.g. from John 1.38 in MorphGNT SBLGNT I get:

('(ὃ', 'u--------')
@chrisdrymon
Owner

Oh that's a great catch! I wondered how it tagged some punctuation wrong in the confusion matrix. It'll be fixed in the next commit.

@jcuenod

jcuenod commented Feb 8, 2021

Maybe allow the use of a custom tokenizer?
I found in Shepherd of Hermas (27.3.1). There are also colons in there but they're in the Latin sections.
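
A minimal sketch of what a custom-tokenizer hook could look like. Everything here is hypothetical: `tag_with`, `bracket_tokenizer`, and `fake_tag` are illustrative names, not part of the tagger's actual API, and `fake_tag` only stands in for the real model.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical tokenizer: emits each bracket as its own token and keeps
# every other run of non-space, non-bracket characters together.
def bracket_tokenizer(text: str) -> List[str]:
    return re.findall(r"[()\[\]]|[^\s()\[\]]+", text)

# Stand-in for the real tagger (an assumption for this sketch): brackets get
# the punctuation tag seen in the report above, everything else a placeholder.
def fake_tag(token: str) -> str:
    return "u--------" if token in {"(", ")", "[", "]"} else "other"

# Hypothetical hook: the caller chooses the tokenizer, and plain
# whitespace splitting remains the default.
def tag_with(
    text: str,
    tag_token: Callable[[str], str],
    tokenizer: Callable[[str], List[str]] = str.split,
) -> List[Tuple[str, str]]:
    return [(tok, tag_token(tok)) for tok in tokenizer(text)]

print(tag_with("(ὃ", fake_tag, bracket_tokenizer))
```

With the default whitespace split, `(ὃ` stays a single token; with `bracket_tokenizer` the bracket and the word reach the tagger separately.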

@jtauber
Author

jtauber commented Feb 8, 2021

Personally, I would just preprocess. In many cases the text will be in XML or some other format anyway, so it will require preprocessing. My run on John's Gospel involved preprocessing (although in that case it was concatenating an existing tokenization into a single string for the book).
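
As a rough illustration of the preprocessing route, the sketch below pads punctuation with spaces before the text is handed to a tagger, so a whitespace-based tokenizer sees `(` and `ὃ` as separate tokens. The helper name `separate_punctuation` and the exact punctuation set are assumptions for this example, not anything the tagger provides.

```python
import re

# Assumed punctuation set for this sketch: brackets, period, comma,
# the Greek question mark (;), colon, and the ano teleia (U+00B7 / U+0387).
# The elision apostrophe is deliberately left attached to its word.
PUNCT = re.compile(r"([()\[\].,;:\u00b7\u0387])")

def separate_punctuation(text: str) -> str:
    """Put spaces around each punctuation mark, then collapse whitespace."""
    spaced = PUNCT.sub(r" \1 ", text)
    return " ".join(spaced.split())

print(separate_punctuation("(ὃ λέγεται)"))
```

Because this only rewrites the input string, it works without any change to the tagger itself; the trade-off is that the punctuation set has to be tuned per text.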

@jcuenod

jcuenod commented Feb 8, 2021

That's fair, although using custom tokenizers seems pretty common practice in ML.

@jtauber
Author

jtauber commented Feb 8, 2021

Yes, but that's in large part because they're dealing with much more text and aren't as interested in spending a lot of time on any one text (unlike us :-))
