ATF files parsing issue #66

Open
hohilwik opened this issue Jan 30, 2023 · 3 comments

Comments

hohilwik commented Jan 30, 2023

I am exploring machine translation for Sumerian and trying to parse ATF files with existing parsers rather than writing my own: pyoracc, cdli/atf2tei, and even the parser.py that was added to this repo in a previous pull request. None of them works correctly; all of them throw errors. Is something wrong with the corpus? If not, how can I fix it without manually digging out every problem?

After fixing a lot of "?" marks at the end of @ broken or other signifiers, most of the remaining problems are empty entries. Is there any way to fix this?
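For reference, the two clean-up steps described above could be automated with a small pre-processing pass before handing the file to a parser. This is only a minimal sketch under stated assumptions, not what any of the parsers above do: it assumes each text starts with an `&` line, that a trailing "?" on an `@` line is unwanted, and that an "empty entry" is one with no lines other than `&`, `#`, `@`, and `$` lines. The function names (`split_entries`, `has_text`, `clean_atf`) are hypothetical.

```python
import re


def split_entries(atf_text):
    """Split a corpus dump into entries; each "&" line starts a new entry (assumption)."""
    entries, current = [], []
    for line in atf_text.splitlines():
        if line.startswith("&") and current:
            entries.append(current)
            current = []
        current.append(line)
    if current:
        entries.append(current)
    return entries


def has_text(entry):
    """True if the entry has a non-blank line that is not an &, #, @ or $ line.

    Assumption: this is what "non-empty" means here; adjust to taste.
    """
    return any(line.strip() and line[0] not in "&#@$" for line in entry)


def clean_atf(atf_text):
    """Strip trailing "?" from @-lines and drop entries without text lines."""
    kept = []
    for entry in split_entries(atf_text):
        if not has_text(entry):
            continue  # skip empty entries instead of letting the parser fail on them
        for line in entry:
            if line.startswith("@"):
                # Assumption: the trailing "?" on structure lines is unwanted.
                line = re.sub(r"\s*\?+\s*$", "", line)
            kept.append(line)
    return "\n".join(kept) + "\n"


sample = (
    "&P000001 = Example A\n#atf: lang sux\n@tablet\n@obverse?\n1. 1(disz) udu\n"
    "&P000002 = Example B\n#atf: lang sux\n@tablet\n"
)
cleaned = clean_atf(sample)
```

With the made-up sample above, the second entry is dropped because it has no text lines, and `@obverse?` becomes `@obverse`. Dropping entries silently loses information, so logging the skipped `&` lines may be preferable on a real corpus.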

epageperron (Member) commented Jan 30, 2023

Hi Shirley,
The corpus isn't perfectly clean. If you want a cleaner version than this one, you can use the API client (https://github.com/cdli-gh/framework-api-client) to fetch the most recent texts; the server URL is https://cdli.mpiwg-berlin.mpg.de/. If you want to fix parts of the corpus, you are welcome to open an account at https://cdli.mpiwg-berlin.mpg.de/register and submit change suggestions, which we will review and integrate.

epageperron (Member) commented

Also, please feel free to try out our translation pipeline: https://github.com/cdli-gh/Sumerian-Translation-Pipeline

hohilwik (Author) commented

I am actually working on improving that pipeline. Thanks for the suggestion, though! I'll look into the API to get the cleaner version, and I'll submit changes whenever I find an error.

Thanks a lot for the info.
