Error in entry for 更 #2

Open
bahducoup opened this issue Nov 24, 2022 · 2 comments

@bahducoup

更 kæŋ¹ kaːŋ˥ - kaŋ˨˦ t͡ɕĩŋ˩˩ kɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ kĩ˥/kẽ˥ - -

When this row is split on '\t', kɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ is treated as a single token.
The tʰimɤ sənsɤ portion of this token seems to be erroneously included in the row.

I think it would be a good idea to check why these characters were included in the dataset and verify that there are no similar errors in the rest of the dataset.
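
One way to look for similar rows is sketched below. It assumes the dataset is tab-separated with `/` between pronunciation variants, and it uses internal whitespace as the only warning sign, which is a heuristic and not a rule the project defines. The function name `find_suspect_fields` is made up for illustration, and the sample row is rebuilt from the entry above with explicit tabs.

```python
def find_suspect_fields(lines, sep="\t", variant_sep="/"):
    """Yield (line_number, column_index, variant) for variants with internal whitespace."""
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(sep)
        for col, field in enumerate(fields):
            for variant in field.split(variant_sep):
                # A single transcription is assumed to contain no spaces.
                if any(ch.isspace() for ch in variant.strip()):
                    yield line_no, col, variant


# The row from this issue, rebuilt with explicit tabs (column layout assumed).
row = "\t".join([
    "更", "kæŋ¹", "kaːŋ˥", "-", "kaŋ˨˦", "t͡ɕĩŋ˩˩",
    "kɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ", "kĩ˥/kẽ˥", "-", "-",
])

suspects = list(find_suspect_fields([row]))
for line_no, col, variant in suspects:
    print(f"line {line_no}, column {col}: {variant!r}")
```

To scan the whole dataset, pass an open file handle in place of `[row]`. A stray annotation that happens to be a single word would not be caught by this check.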

@kalvinchang
Member

this shouldn't affect the reconstruction cuz in the dataloader, we take the first pronunciation variant (kɤŋ˥ in this case)

thanks for catching this tho!!
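
For reference, the "take the first variant" behaviour can be sketched roughly as follows. This is not the actual dataloader code; `first_variant` is a hypothetical helper, and it assumes variants are separated by `/`.

```python
def first_variant(field, variant_sep="/"):
    """Return the first pronunciation variant of a field (hypothetical helper)."""
    return field.split(variant_sep)[0].strip()


# The Mandarin field from the row reported in this issue.
mandarin = "kɤŋ˥/t͡ɕiŋ˥/tʰxɤ tʰimɤ sənsɤ"
print(first_variant(mandarin))
```

Because the stray text sits in the last variant, it never reaches the reconstruction under this assumption. A row where the annotation landed in the first variant would still be affected.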

@kalvinchang
Member

the issue is that the romanized version (the pre-parsed version on Wiktionary) shows something like "gēng/jīng/the time sense" for Mandarin

we need to remove extra annotations for Mandarin in the Wiktionary parsing script
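
A rough sketch of what that cleanup could look like, before the romanization is converted to IPA. It is not taken from the parsing script: `strip_annotations` is a hypothetical name, and the rule (drop any `/`-separated part that contains whitespace) is an assumption that only works when the annotation is more than one word.

```python
def strip_annotations(romanized, variant_sep="/"):
    """Drop '/'-separated parts that contain whitespace (assumed to be notes, not readings)."""
    parts = [part.strip() for part in romanized.split(variant_sep)]
    kept = [p for p in parts if p and not any(ch.isspace() for ch in p)]
    return variant_sep.join(kept)


# The Mandarin romanization quoted above.
cleaned = strip_annotations("gēng/jīng/the time sense")
print(cleaned)
```

A stricter version could check each part against a list of valid pinyin syllables, which would also catch single-word annotations.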
