Structured page counts are not valid in schema.org #19
Comments
Here is some R code to normalize structured page counts:
Normalizing structured page counts is not as straightforward as it first looks, for many reasons: spelling variations, ambiguous cases, terms and stopwords from multiple languages, and the handling of various exceptions. Moreover, many documents have only cover-page information, which gives a misleading page count estimate if converted directly. Anyway, the R code cited above is essentially ready, backed by unit tests and extensive manual checking, and it cleans up page counts for the complete Fennica catalog.
@antagomir Thanks, looks really useful! The whole point of this pipeline is to stitch together existing tools instead of reinventing the wheel. Probably I just need to implement some glue code, e.g. a filter that can take N-Triples with structured page counts from stdin and output normalized page counts on stdout, using your normalization function behind the scenes.
Similar things could be considered for the other fields as well.
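A rough sketch of what such glue code might look like in Python, just to make the idea concrete. Everything here is an assumption rather than part of the pipeline: the `http://schema.org/` namespace for the predicate, the names `filter_line` and `run`, and above all `placeholder_normalize`, which only sums Arabic numbers and stands in for a call to the real R normalization function. The sketch replaces the structured literal with an `xsd:integer` and passes all other lines through unchanged.

```python
import re
import sys

# Assumed predicate IRI; adjust if the data uses a different namespace.
PREDICATE = "<http://schema.org/numberOfPages>"
XSD_INTEGER = "<http://www.w3.org/2001/XMLSchema#integer>"

# Matches: subject, the page-count predicate, a quoted literal
# (optionally followed by a datatype or language tag), and the final dot.
TRIPLE_RE = re.compile(
    r'^(?P<subject>\S+)\s+'
    + re.escape(PREDICATE)
    + r'\s+"(?P<value>(?:[^"\\]|\\.)*)"\S*\s*\.\s*$'
)


def placeholder_normalize(value):
    """Hypothetical stand-in for the real normalizer.

    Sums Arabic numbers only and ignores Roman numerals; returns None
    when the literal contains no digits.
    """
    numbers = re.findall(r"[0-9]+", value)
    return sum(int(n) for n in numbers) if numbers else None


def filter_line(line, normalize=placeholder_normalize):
    """Rewrite one N-Triples line, or return it unchanged."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return line  # not a page-count triple
    count = normalize(match.group("value"))
    if count is None:
        return line  # could not normalize; keep the original value
    return f'{match.group("subject")} {PREDICATE} "{count}"^^{XSD_INTEGER} .\n'


def run(stream_in, stream_out, normalize=placeholder_normalize):
    """Copy N-Triples from stream_in to stream_out, normalizing page counts."""
    for line in stream_in:
        stream_out.write(filter_line(line, normalize))
```

Calling `run(sys.stdin, sys.stdout)` at the end of the script would turn it into a stdin-to-stdout filter, and the `normalize` argument is where a wrapper around the R function could be plugged in.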
The proposed materialExtent property in Schema.org could be used to represent the original, structured page count. Still, the normalized page count is probably much more useful for most analysis purposes. We could simply provide them both.
Yes, I agree; both may be needed. Hopefully we can soon get a bit more active on this again.
Our MARC records have structured page counts, e.g. `vii, 89, 31 s.`. However, Schema.org only defines a single integer field, `schema:numberOfPages`, so the structured values are not really valid Schema.org.

Maybe we should convert those structured counts into a single number? It can't be done easily in SPARQL (Roman numerals!), but a relatively simple filter script (e.g. in Python) could do it.
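To illustrate the conversion, here is a minimal Python sketch that sums the comma-separated page sequences, Roman numerals included. The function names `roman_to_int` and `naive_page_count` are made up for this example, and summing is only an assumption about what "a single number" should mean. It does not try to handle spelling variants, ambiguous abbreviations (a bare `l.` would be read as the Roman numeral 50), or other exceptions.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
ROMAN_RE = re.compile(r"[ivxlcdm]+", re.IGNORECASE)
ARABIC_RE = re.compile(r"[0-9]+")


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'vii' or 'XIV' to an integer."""
    total = 0
    previous = 0
    # Walk right to left; a smaller value before a larger one is subtracted.
    for char in reversed(numeral.lower()):
        value = ROMAN_VALUES[char]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def naive_page_count(extent):
    """Sum the page sequences in a string like 'vii, 89, 31 s.'.

    Returns None when no page sequence is recognised.
    """
    total = 0
    found = False
    for part in extent.split(","):
        tokens = part.split()
        if not tokens:
            continue
        # Only the first token of each part is treated as a number;
        # trailing units such as 's.' are ignored.
        token = tokens[0].strip("[]().")
        if ARABIC_RE.fullmatch(token):
            total += int(token)
            found = True
        elif ROMAN_RE.fullmatch(token):
            total += roman_to_int(token)
            found = True
    return total if found else None


print(naive_page_count("vii, 89, 31 s."))  # 7 + 89 + 31 = 127
```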