How to handle unglossed words? #158

Open
fmatter opened this issue Nov 13, 2022 · 4 comments

@fmatter
Contributor

fmatter commented Nov 13, 2022

Quite often, people do not gloss certain words, such as person names, place names, or unparsable words, so some words may be present only in Primary_Text and not in Analyzed_Word or Gloss.

The most transparent way to store an example like that in CLDF is to have an empty list item in these two columns:

Primary_Text: "x y Person z"
Analyzed_Word: "x\ty\t\tz" (["x","y",None,"z"] once read by pycldf)
Gloss: "xg\tyg\t\tzg" (["xg","yg",None,"zg"])

This passes validation, but cldf createdb, for example, fails (TypeError: sequence item 1: expected str instance, NoneType found), so I've been working around it in initializedb.py scripts with things like ex["Analyzed_Word"] = ["" if x is None else x for x in ex["Analyzed_Word"]].

Should empty items in a gloss column raise an error upon validation? If yes, is the way to handle unglossed words to simply leave them out (i.e. "x\ty\tz", read as ["x","y","z"])? Or, if empty items are allowed, would it be OK for pycldf to yield "" instead of None (i.e. "x\ty\t\tz", read as ["x","y","","z"])?
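
For readers who want to reproduce the problem without a dataset, here is a minimal sketch in plain Python. The helper split_items is hypothetical: it only imitates the behaviour described above (empty items in a tab-separated cell are read as None) and is not a pycldf function.

```python
def split_items(cell, sep="\t"):
    """Hypothetical stand-in for reading a separated cell: empty items become None."""
    return [item if item else None for item in cell.split(sep)]


analyzed = split_items("x\ty\t\tz")  # the unglossed word leaves an empty slot

# Joining the list back into a string fails because of the None item.
try:
    "\t".join(analyzed)
    message = ""
except TypeError as err:
    message = str(err)

# The workaround from the comment above: replace None with an empty string.
cleaned = ["" if x is None else x for x in analyzed]
rejoined = "\t".join(cleaned)
```

The workaround keeps the two columns the same length, but the empty string still carries no hint that a word was deliberately left unglossed.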

@xrotwang
Contributor

Hm, the most transparent practice I've seen in this regard is using an ellipsis (ideally the Unicode character U+2026, not three dots ...) in both Analyzed_Word and Gloss. Admittedly, this is also often used very inconsistently, e.g. by leaving out the ellipsis in the Gloss. But from my point of view, recommending this practice would also raise awareness of the fact that the ellipsis is part of the example and must be considered for consistency.
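
Applied to the example from the first comment, the ellipsis convention could look roughly like this. This is a sketch in plain Python, not pycldf code; the variable names are only illustrative.

```python
ELLIPSIS = "\u2026"  # the single Unicode character, not three dots

primary_text = "x y Person z"
# Cell values as they would be written to the tab-separated columns.
analyzed_word = "\t".join(["x", "y", ELLIPSIS, "z"])
gloss = "\t".join(["xg", "yg", ELLIPSIS, "zg"])

# Splitting the cells again yields plain lists of str, with no empty items.
words = analyzed_word.split("\t")
glosses = gloss.split("\t")

# Both aligned columns have one item per word of the primary text.
aligned = len(words) == len(glosses) == len(primary_text.split())
```

The unglossed name then occupies an explicit slot in both columns, so dropping it from only one of them becomes a visible inconsistency.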

@fmatter
Contributor Author

fmatter commented Nov 14, 2022

That's a very reasonable solution, works for me.

Should None in tab-delimited columns raise a validation error?

@xrotwang
Contributor

> Should None in tab-delimited columns raise a validation error?

Yes, I would say so. After all, one of the main reasons for using ellipsis for unglossed words is that we get lists of str for both aligned properties.
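
A rough sketch of what such a check could look like, in plain Python. The function check_aligned is hypothetical and not part of pycldf; the extra length comparison is an assumption that goes slightly beyond what is proposed here.

```python
def check_aligned(analyzed_word, gloss):
    """Hypothetical check: reject None items and lists of different length."""
    for name, items in (("Analyzed_Word", analyzed_word), ("Gloss", gloss)):
        for index, item in enumerate(items):
            if item is None:
                raise ValueError(
                    f"{name}: item {index} is empty; use an ellipsis (U+2026) instead"
                )
    # Assumption: aligned columns should also have the same number of items.
    if len(analyzed_word) != len(gloss):
        raise ValueError("Analyzed_Word and Gloss differ in length")


# An example following the ellipsis convention passes silently.
check_aligned(["x", "y", "\u2026", "z"], ["xg", "yg", "\u2026", "zg"])

# An example with empty items is rejected.
try:
    check_aligned(["x", "y", None, "z"], ["xg", "yg", None, "zg"])
    rejected = False
except ValueError:
    rejected = True
```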

@xrotwang
Contributor

> Yes, I would say so. After all, one of the main reasons for using ellipsis for unglossed words is that we get lists of str for both aligned properties.

Maybe we could keep some sort of backwards compatibility (with somewhat undefined behaviour) by converting None to ellipsis upon reading.
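
As a sketch of that idea in plain Python: the helper none_to_ellipsis is hypothetical, and where such a conversion would actually hook into reading is left open.

```python
ELLIPSIS = "\u2026"


def none_to_ellipsis(items):
    """Hypothetical read-time normalisation: replace None items with an ellipsis."""
    return [ELLIPSIS if item is None else item for item in items]


# A list as read from an older dataset with an empty item.
legacy = ["x", "y", None, "z"]
normalised = none_to_ellipsis(legacy)

# The normalised list contains only str, so joining no longer fails.
joined = "\t".join(normalised)
```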
