How to handle unglossed words? #158

fmatter · 2022-11-13T18:52:47Z

Quite often, people do not gloss words such as personal names, place names, or unparsable words, so some words may be present only in Primary_Text and not in Analyzed_Word or Gloss.

The most transparent way to store an example like that in CLDF is to have an empty list item in these two columns:

Primary_Text: "x y Person z"
Analyzed_Word: "x\ty\t\tz" (["x","y",None,"z"] once read by pycldf)
Gloss: "xg\tyg\t\tzg" (["xg","yg",None,"zg"])

This passes validation, but not all tooling copes with it: cldf createdb, for example, fails with TypeError: sequence item 1: expected str instance, NoneType found. As a workaround, I've been doing things like ex["Analyzed_Word"] = ["" if x is None else x for x in ex["Analyzed_Word"]] in initializedb.py scripts.
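
A minimal sketch of the failure and the workaround, without a real dataset. The row is a hand-written dict shaped like the lists above, blank_missing is a hypothetical helper name, and the join call only stands in for whatever string joining happens downstream (an assumption, not a trace of the createdb code):

```python
# Hand-written row, shaped like the lists pycldf is described as returning above.
example = {
    "Primary_Text": "x y Person z",
    "Analyzed_Word": ["x", "y", None, "z"],
    "Gloss": ["xg", "yg", None, "zg"],
}


def blank_missing(items):
    """Replace None items with empty strings (hypothetical helper)."""
    return ["" if item is None else item for item in items]


# Joining a list that contains None raises a TypeError.
join_error = None
try:
    "\t".join(example["Analyzed_Word"])
except TypeError as err:
    join_error = str(err)

# After the workaround, the aligned columns can be joined again.
cleaned = {col: blank_missing(example[col]) for col in ("Analyzed_Word", "Gloss")}
serialized = "\t".join(cleaned["Analyzed_Word"])
```

Note that this only hides the gap: the empty string still carries no information about why the word is unglossed.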

Should empty items in a gloss column raise an error upon validation? If yes, is the way to handle unglossed words simply to leave them out (i.e. "x\ty\tz", read as ["x","y","z"])? Or, if empty items are allowed, would it be OK for pycldf to yield "" instead of None (i.e. "x\ty\t\tz", read as ["x","y","","z"])?


xrotwang · 2022-11-14T07:56:31Z

Hm, the most transparent practice I've seen in this regard is using an ellipsis … (ideally the Unicode character U+2026, not three dots ...) in both Analyzed_Word and Gloss. Admittedly, this is often applied inconsistently, e.g. by leaving out the ellipsis in the Gloss. But from my point of view, recommending this practice would also raise awareness of the fact that the ellipsis is part of the example and must be considered for consistency.
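
As a rough illustration of what "considered for consistency" could mean, here is a sketch that checks whether the ellipsis appears at the same positions in both aligned lists. The function names are made up for this example and are not part of pycldf:

```python
ELLIPSIS = "\u2026"  # the single Unicode character, not three dots


def ellipsis_positions(items):
    """Indexes at which an item is exactly the ellipsis character."""
    return [i for i, item in enumerate(items) if item == ELLIPSIS]


def is_consistent(words, glosses):
    """True if both lists align and mark unglossed words at the same places."""
    return (
        len(words) == len(glosses)
        and ellipsis_positions(words) == ellipsis_positions(glosses)
    )


# "x y Person z" with the name left unglossed in both columns.
words = ["x", "y", ELLIPSIS, "z"]
glosses = ["xg", "yg", ELLIPSIS, "zg"]

# The inconsistent variant mentioned above: ellipsis dropped from the gloss.
short_glosses = ["xg", "yg", "zg"]

consistent = is_consistent(words, glosses)
inconsistent = is_consistent(words, short_glosses)
```

Here consistent is True and inconsistent is False, because the shortened gloss list no longer aligns with the analyzed words.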

fmatter · 2022-11-14T16:42:17Z

That's a very reasonable solution, works for me.

Should None in tab-delimited columns raise a validation error?

xrotwang · 2022-11-14T16:55:20Z

> Should None in tab-delimited columns raise a validation error?

Yes, I would say so. After all, one of the main reasons for using ellipsis for unglossed words is that we get lists of str for both aligned properties.
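
A sketch of what such a check could look like, written against plain dicts rather than the actual pycldf validation code. The function name and the default column tuple are assumptions made for this example:

```python
def find_none_items(row, columns=("Analyzed_Word", "Gloss")):
    """Return (column, index) pairs for every None item in the aligned columns.

    An empty result means every item is present, so both columns can be
    treated as lists of str.
    """
    problems = []
    for column in columns:
        for index, item in enumerate(row.get(column) or []):
            if item is None:
                problems.append((column, index))
    return problems


# Row with empty list items, as in the original question.
bad_row = {
    "Analyzed_Word": ["x", "y", None, "z"],
    "Gloss": ["xg", "yg", None, "zg"],
}

# Row following the ellipsis recommendation.
good_row = {
    "Analyzed_Word": ["x", "y", "\u2026", "z"],
    "Gloss": ["xg", "yg", "\u2026", "zg"],
}

bad_problems = find_none_items(bad_row)
good_problems = find_none_items(good_row)
```

A validator could report each returned pair as an error or a warning, depending on how strict the check is meant to be.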

xrotwang · 2022-11-14T16:57:22Z

> Yes, I would say so. After all, one of the main reasons for using ellipsis for unglossed words is that we get lists of str for both aligned properties.

Maybe we could keep some sort of backwards compatibility (with somewhat undefined behaviour) by converting None to ellipsis upon reading.
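
The conversion could be as small as the following sketch. It operates on an already parsed list, and fill_unglossed is a hypothetical name, not an existing pycldf function:

```python
ELLIPSIS = "\u2026"


def fill_unglossed(items):
    """Replace None items with the ellipsis character, leaving strings untouched."""
    return [ELLIPSIS if item is None else item for item in items]


# Legacy data with empty list items ...
legacy_words = ["x", "y", None, "z"]
legacy_glosses = ["xg", "yg", None, "zg"]

# ... becomes lists of str with an explicit marker for the unglossed word.
words = fill_unglossed(legacy_words)
glosses = fill_unglossed(legacy_glosses)
```

Applied to both aligned columns, this yields lists of str only, so downstream code that joins the items no longer needs its own None handling.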

xrotwang mentioned this issue Nov 15, 2022

Recommendation for handling of unglossed words in Examples cldf/cldf#134

Closed
