tokens beginning with parentheses are treated entirely as punctuation #3

jtauber · 2021-02-01T12:46:16Z

e.g. from John 1.38 in MorphGNT SBLGNT I get:

('(ὃ', 'u--------')


chrisdrymon · 2021-02-01T19:04:32Z

Oh that's a great catch! I wondered how it tagged some punctuation wrong in the confusion matrix. It'll be fixed in the next commit.

jcuenod · 2021-02-08T18:31:25Z

Maybe allow the use of a custom tokenizer?
I found … in Shepherd of Hermas (27.3.1). There are also colons in there but they're in the Latin sections.

jtauber · 2021-02-08T18:37:12Z

Personally, I would just preprocess. In many cases the text will be in XML or some other format anyway, so it will require preprocessing regardless. My run on John's Gospel involved preprocessing (although in that case it meant concatenating an existing tokenization into a single string for the book).
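For anyone landing here, a minimal sketch of that kind of preprocessing, assuming the downstream tokenizer splits on whitespace. The helper name `separate_punctuation` and the `SPLIT_CHARS` set are hypothetical and not part of this project; adjust the character set to your text (it deliberately leaves apostrophes alone so elided forms stay intact).

```python
import re

# Hypothetical character set: brackets, common stops, the ellipsis, and both
# encodings of the Greek ano teleia and question mark. Not taken from the tagger.
SPLIT_CHARS = "()[].,;·…\u0387\u037e"
_PATTERN = re.compile("([" + re.escape(SPLIT_CHARS) + "])")


def separate_punctuation(text):
    """Pad each listed punctuation mark with spaces, then collapse whitespace."""
    padded = _PATTERN.sub(r" \1 ", text)
    return " ".join(padded.split())


if __name__ == "__main__":
    # The token from the report becomes two whitespace-separated tokens.
    print(separate_punctuation("(ὃ").split())
```

Running the string through this before tagging means a token like `(ὃ` reaches the tagger as `(` and `ὃ`, so the word itself is no longer at risk of being labelled as punctuation.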

jcuenod · 2021-02-08T19:24:58Z

That's fair, although using custom tokenizers seems pretty common practice in ML.

jtauber · 2021-02-08T19:34:31Z

Yes, but that's in large part because they're dealing with much more text and aren't as interested in spending a lot of time on any one text (unlike us :-))

chrisdrymon closed this as completed Feb 1, 2021
