Structured page counts are not valid in schema.org #19

Open
osma opened this issue Jan 27, 2017 · 6 comments
@osma
Member

osma commented Jan 27, 2017

Our MARC records have structured page counts, e.g. "vii, 89, 31 s.". However, Schema.org only defines a single integer field, schema:numberOfPages, so the structured values are not really valid Schema.org.

Maybe we should convert those structured counts into a single number? It can't be done in SPARQL easily (Roman numerals!) but a relatively simple filter script (e.g. Python) could do it.
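A minimal sketch of such a filter's core, assuming the structured value is a comma-separated list of Arabic or Roman numerals followed by an optional unit such as "s." or "p.". The names `roman_to_int` and `total_pages` are hypothetical, and the sketch is deliberately naive: it returns `None` for anything it does not recognize instead of guessing, and it does not reject malformed Roman numerals.

```python
import re
from typing import Optional

# Values of the individual Roman numeral letters.
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
ROMAN_RE = re.compile(r"[ivxlcdm]+")
ARABIC_RE = re.compile(r"[0-9]+")
# Assumed trailing unit markers: "s." (Finnish "sivua"), "p.", "pp.".
UNIT_RE = re.compile(r"\s+(?:s|p|pp)\.?$", re.IGNORECASE)


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral to an integer (lenient, no validation)."""
    total = 0
    previous = 0
    # Read right to left: a smaller letter before a larger one is subtracted.
    for char in reversed(numeral.lower()):
        value = ROMAN_VALUES[char]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def total_pages(extent: str) -> Optional[int]:
    """Sum the segments of a structured page count, or return None."""
    total = 0
    for segment in extent.split(","):
        segment = UNIT_RE.sub("", segment.strip())
        if ARABIC_RE.fullmatch(segment):
            total += int(segment)
        elif ROMAN_RE.fullmatch(segment.lower()):
            total += roman_to_int(segment)
        else:
            # Unknown pattern: refuse to guess.
            return None
    return total
```

With these assumptions, `total_pages("vii, 89, 31 s.")` returns 127 (7 + 89 + 31), while a value such as "noin 100 s." yields `None`.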

@osma osma added the bug label Jan 27, 2017
@osma
Member Author

osma commented Jan 31, 2017

Here is some R code to normalize structured page counts:
https://github.com/rOpenGov/bibliographica/blob/master/R/estimate_pages.R

@antagomir

Normalizing structured page counts is not as straightforward a task as it first looks. The reasons are many: spelling variations, ambiguous cases, terms or stopwords from multiple languages, and the handling of various exceptions. Moreover, many documents have only cover page information, which gives a misleading page count estimate if converted directly. Anyway, the R code cited above is essentially ready, backed up by unit tests and extensive manual checking, and cleans up page counts for the complete Fennica catalog.

@osma
Member Author

osma commented Feb 1, 2017

@antagomir Thanks, looks really useful! The whole point of this pipeline is to stitch together existing tools instead of reinventing the wheel. Probably I just need to implement some glue code, e.g. a filter that can take N-Triples with structured page counts from stdin and output normalized page counts on stdout, using your normalization function behind the scenes.
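A rough sketch of what such glue code might look like in Python, assuming one triple per line and the predicate IRI `http://schema.org/numberOfPages`. The function names are hypothetical, and `plain_number_only` is only a conservative stand-in for the real normalization function: it handles a single Arabic number with an optional unit and leaves everything else untouched.

```python
import re
import sys
from typing import Callable, Iterable, Iterator, Optional

NUMBER_OF_PAGES = "<http://schema.org/numberOfPages>"
XSD_INTEGER = "<http://www.w3.org/2001/XMLSchema#integer>"

# Matches: subject, the numberOfPages predicate, a quoted literal
# (with an optional language tag or datatype), and the final dot.
TRIPLE_RE = re.compile(
    r'^(?P<subject>\S+)\s+' + re.escape(NUMBER_OF_PAGES)
    + r'\s+"(?P<value>(?:[^"\\]|\\.)*)"\S*\s*\.\s*$'
)


def plain_number_only(extent: str) -> Optional[int]:
    """Stand-in normalizer: accepts e.g. "89 s.", rejects anything else."""
    match = re.fullmatch(r"\s*([0-9]+)\s*(?:s|p)?\.?\s*", extent)
    return int(match.group(1)) if match else None


def filter_triples(
    lines: Iterable[str], normalize: Callable[[str], Optional[int]]
) -> Iterator[str]:
    """Rewrite numberOfPages literals; pass all other lines through."""
    for line in lines:
        match = TRIPLE_RE.match(line)
        if match is None:
            yield line
            continue
        pages = normalize(match.group("value"))
        if pages is None:
            # Could not normalize: keep the original triple unchanged.
            yield line
            continue
        subject = match.group("subject")
        yield f'{subject} {NUMBER_OF_PAGES} "{pages}"^^{XSD_INTEGER} .\n'


def main() -> None:
    """Read N-Triples from stdin and write the filtered result to stdout."""
    sys.stdout.writelines(filter_triples(sys.stdin, plain_number_only))
```

Calling `main()` from an `if __name__ == "__main__":` guard would turn this into a command-line filter. Keeping the normalizer as a parameter means the stand-in can be swapped for a wrapper around the existing normalization code without touching the triple handling.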

@antagomir

Similar things could be considered for the other fields as well.

@osma
Member Author

osma commented Oct 2, 2017

The proposed materialExtent property in Schema.org could be used to represent the original, structured page count. Still, the normalized page count is probably much more useful for most analysis purposes. We could simply provide them both.
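To illustrate providing both, here is a small sketch that emits two N-Triples lines for one record: the original string under materialExtent and the normalized integer under numberOfPages. The function name `page_count_triples`, the example subject IRI, and the exact property IRI `http://schema.org/materialExtent` are assumptions made for illustration.

```python
from typing import List

SCHEMA = "http://schema.org/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def page_count_triples(
    subject_iri: str, structured_extent: str, normalized_pages: int
) -> List[str]:
    """Return N-Triples lines carrying both forms of the page count."""
    # Escape backslashes and double quotes for the N-Triples string literal.
    escaped = structured_extent.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f'<{subject_iri}> <{SCHEMA}materialExtent> "{escaped}" .',
        f'<{subject_iri}> <{SCHEMA}numberOfPages> '
        f'"{normalized_pages}"^^<{XSD_INTEGER}> .',
    ]
```

For example, `page_count_triples("http://example.org/book/1", "vii, 89, 31 s.", 127)` keeps the cataloguer's original statement for display while giving analysis tools a plain integer.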

@antagomir

Yes, and both may be needed. I agree. Hopefully we can soon get a bit more active on this again.

@osma osma added this to the Medium term milestone Nov 16, 2017