tokens beginning with parentheses are treated entirely as punctuation #3
Comments
Oh that's a great catch! I wondered how it tagged some punctuation wrong in the confusion matrix. It'll be fixed in the next commit.
Maybe allow the use of a custom tokenizer?
Personally, I would just preprocess. In many cases the text will be in XML or some other format anyway, so it will require preprocessing regardless. My run on John's Gospel involved preprocessing (although in that case it meant concatenating an existing tokenization into a single string for the book).
That's fair, although using custom tokenizers seems pretty common practice in ML.
Yes, but that's in large part because they're dealing with much more text and aren't as interested in spending a lot of time on any one text (unlike us :-))
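For anyone going the preprocessing route discussed above, here is a minimal sketch of what that could look like for this particular bug: splitting punctuation such as a leading parenthesis off the word it is attached to before the text is handed to the tagger. The function name `split_punctuation` and the regular expression are assumptions made for illustration only; nothing here is part of this project's code, and it is not necessarily how the fix in the next commit works.

```python
import re
import unicodedata

# Hypothetical helper, not part of the project under discussion.
# A token is either a run of word characters or one punctuation character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def split_punctuation(text):
    """Return text with punctuation separated from words by single spaces.

    The text is normalized to NFC first so that accented letters are, where
    a precomposed form exists, single word characters rather than a base
    letter followed by combining marks (which \\w would not match).
    """
    normalized = unicodedata.normalize("NFC", text)
    return " ".join(_TOKEN_RE.findall(normalized))


print(split_punctuation("(word in parentheses), then more."))
# ( word in parentheses ) , then more .
```

One caveat with this sketch: it splits every punctuation character, so an elision apostrophe inside a Greek word would also be separated. A real preprocessing step would probably need an exception list for that.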
e.g. from John 1.38 in MorphGNT SBLGNT I get: