Simplify validations #115

Open
zuphilip opened this issue Jan 1, 2020 · 2 comments
Labels
enhancement Any enhancement on the software itself (excluding new transformations)

Comments

@zuphilip
Member

zuphilip commented Jan 1, 2020

Usually, I don't remember the exact name of the schema to validate against, e.g.

ocr-validate page-2019-07-15 input.xml

is hard to remember. On the other hand, it is usually easy to detect the exact version by inspecting the first lines with the stylesheet definition.

Thus, I suggest simplifying the validation, e.g. so that we can also use

ocr-validate page input.xml

which will then check whether the input file is valid against the stylesheet given at the beginning. Even

ocr-validate input.xml

could work for XML files, with maybe some simple guessing for the others (HTML -> hOCR, JSON -> GCV).
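To make the guessing idea concrete, here is a minimal sketch in Python, chosen purely for illustration and not meant to reflect how ocr-validate is implemented. The function name `guess_format_family` and the returned labels are hypothetical; only the mappings (XML, HTML -> hOCR, JSON -> GCV) come from the suggestion above. It looks at the beginning of the file content rather than the file extension.

```python
def guess_format_family(head):
    """Guess a coarse format family from the first characters of a file.

    `head` is a string holding the beginning of the file. The labels
    "gcv", "hocr" and "xml" are illustrative, not real schema names.
    """
    # Drop a byte order mark and leading whitespace before sniffing.
    text = head.lstrip("\ufeff \t\r\n")
    if text.startswith(("{", "[")):
        return "gcv"   # JSON -> assume Google Cloud Vision output
    if "<html" in text.lower():
        return "hocr"  # HTML -> assume hOCR (checked before generic XML)
    if text.startswith("<"):
        return "xml"   # generic XML, the schema still has to be detected
    return None        # unknown, the user must name the schema


print(guess_format_family('{"responses": []}'))
```

The HTML check runs before the generic XML check because hOCR files may start with an XML declaration as well.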

I am not yet sure whether it is afterwards still useful to have the option to specify the exact stylesheet instead of simply any PAGE version, i.e. whether to make these simplifications additional rather than replacing the old ones with them.

zuphilip added the enhancement label (Any enhancement on the software itself, excluding new transformations) on Jan 1, 2020
@kba
Collaborator

kba commented Jan 2, 2020

> whether it is afterwards still useful to have the option to specify the exact stylesheet instead of simply any PAGE version

I would leave that option and optionally automate. Note that such automation requires reading and parsing the XML twice, once for the schema detection and once for the actual validation. For bulk processing this should be avoidable.
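One way to picture "leave that option and optionally automate" is a small dispatch step that only asks for detection when no exact schema was given. The sketch below is in Python for illustration only. `plan_validation` and the `exact_schemas` parameter are hypothetical names, and the set of known schema names is passed in rather than assumed.

```python
def plan_validation(args, exact_schemas):
    """Classify the three invocation forms discussed in this issue.

    Returns (schema_or_family, path, needs_detection).
    `exact_schemas` is the caller-supplied set of exact schema names.
    """
    if len(args) == 1:
        # Form: ocr-validate input.xml -> everything must be detected.
        return (None, args[0], True)
    if len(args) != 2:
        raise ValueError("expected [SCHEMA] FILE")
    name, path = args
    if name in exact_schemas:
        # Form: ocr-validate page-2019-07-15 input.xml -> no extra read.
        return (name, path, False)
    # Form: ocr-validate page input.xml -> family known, version detected.
    return (name, path, True)


known = {"page-2019-07-15"}
print(plan_validation(["page-2019-07-15", "input.xml"], known))
print(plan_validation(["page", "input.xml"], known))
```

With this split, bulk runs that pass the exact schema never pay for the additional detection pass.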

@zuphilip
Member Author

zuphilip commented Jan 2, 2020

I am fine with an additional option and leaving the more specific ones as well. 👍

However, note that we might be able to detect the exact version of a PAGE file more easily, e.g. by looking only at the first few lines and searching for a regex match, similar to https://github.com/zotero/translators/blob/master/MARCXML.js#L41-L50 .
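A minimal sketch of such a regex check, written in Python for illustration only. It assumes that the PAGE namespace URI with its date suffix appears within the first few lines, and that the schema name can be formed as `page-` plus that date, as in `page-2019-07-15` above. The function name `detect_page_version` and the `max_lines` parameter are hypothetical.

```python
import re
from itertools import islice

# Assumption: the PAGE namespace URI ends in the schema date.
PAGE_NS = re.compile(
    r"http://schema\.primaresearch\.org/PAGE/gts/pagecontent/(\d{4}-\d{2}-\d{2})"
)


def detect_page_version(lines, max_lines=10):
    """Return e.g. "page-2019-07-15" if found in the first lines, else None.

    `lines` is any iterable of strings, for example an open text file,
    so only the beginning of the file is read and nothing is parsed.
    """
    for line in islice(lines, max_lines):
        match = PAGE_NS.search(line)
        if match:
            return "page-" + match.group(1)
    return None


sample = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">',
]
print(detect_page_version(sample))
```

For a real file, the same function could be called with an open file object, e.g. `with open("input.xml", encoding="utf-8") as f: detect_page_version(f)`, which avoids a full XML parse for the detection step.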
