Linking Term and Controlled Vocabs on one JSON schema #273
@sharifX Thanks for your thoughtful questions. You have identified some important issues and hopefully we can use some of your suggestions to improve the way we make our term metadata available. One thing that I should note is that the issues you raise here also apply to several controlled vocabularies that are in use for Darwin Core terms. So any improvements that we make should be incorporated TDWG-wide, which thus far means AC and DwC, but in the future may include other vocabularies such as Latimer Core and TCS.
I will provide a bit of historical background, which may help clarify how we ended up in the place we are now. Ratifying controlled vocabularies is a relatively new thing in TDWG, with the first ones ratified only 4 years ago. When trying to figure out how to define a controlled vocabulary, we had to recognize that most users were assuming that a controlled vocabulary simply meant a list of controlled value strings. However, in TDWG, we have the convention that all terms in any vocabulary should have core properties like label and definition, with those being distinct from the controlled value string in the specific case of controlled vocabularies. We also provide machine-readable metadata in the form of RDF for all terms, recognizing that many users don't understand what that is or pay attention to it.
For controlled value terms, it seemed appropriate to use SKOS in the machine-readable representations, because that seemed consistent with the design of SKOS and because SKOS provided some properties that were useful for describing and linking the controlled vocabulary terms. However, although there are plenty of examples of how to construct SKOS concept schemes, there weren't any that we could find that were linked to properties whose values should be taken from the concept scheme.
So we did not find any property such as the
The examples that you give from the subjectOrientation controlled vocabulary are more complex than some of the other controlled vocabularies. We've used SKOS concept schemes and collections in some of the controlled vocabularies to organize the terms in useful ways in vocabularies where we needed them. But neither concept schemes nor collections correspond directly to the vocabularies. In most cases, there is a single concept scheme within each vocabulary. However, in one case (the format controlled vocabulary http://rs.tdwg.org/ac/doc/format/), there are two concept schemes in the vocabulary: one for organizing concepts by file extensions and one for organizing Internet media (MIME) types. The original specification for Audiovisual Core said that either could be used to describe the format, so we supported both by having two concept schemes and then declaring which concepts in the two schemes were equivalent. So one can't say in all cases that the vocabularies are equivalent to the concept schemes. That's why the vocabularies themselves have an IRI identifier that is distinct from the concept scheme identifiers. As with all TDWG terms, the controlled vocabulary terms are linked to their containing vocabularies using
The case of the SKOS collections is so far unique to the subjectPart and subjectOrientation controlled vocabularies. There are two reasons why the collections are described separately from the rest of the term metadata. One reason is practical. The term metadata unrelated to collections is managed in CSV files that are similar to all other controlled vocabularies, and that JSON-LD is generated by a script that is common to all controlled vocabularies. The metadata related to the SKOS collections are in CSV files that are unique to those vocabularies. There is specific software that is used to generate the SKOS collections-related metadata. The second reason is more process-related.
The basic term metadata is somewhat more tightly controlled by the standards process. In particular, the normative properties (for example, the definition) can't be changed without going through the standards maintenance process described in the Vocabulary Maintenance Specification. In contrast, the grouping of terms into collections is not controlled by any official standards process and really is just a suggestion, e.g. "if you are describing fish, you probably should use these terms". If the AC Maintenance Group thinks the groupings are wrong, or that some other part or orientation should be added to a collection, they can just do it without following any process. So it seemed better to maintain those SKOS collections metadata separately.
Having said all of that, the primary purpose of making the machine-readable metadata available to users is to satisfy their use cases, so if the way we have structured the JSON does not make sense or is not very usable, then we should change it to make it more usable. When we did the implementation testing for the subjectPart and subjectOrientation controlled vocabularies, it was primarily to determine whether we had included all of the appropriate values that users needed and whether users could confidently apply the correct value to a test media item. We did not have any testers who did anything with the machine-readable metadata except me, and I was the one who set it up. So that wasn't a very good test. That is why your feedback is extremely valuable. If you can help us know how you want to use the JSON, we can improve it to make it more usable for you.
I have just re-read your comment and wanted to add an additional note. You should be aware that the JSON-LD representations of term metadata when term (and term version) IRIs are dereferenced are auto-generated and the script that generates them does not always create JSON-LD that is valid (although if you run it through a validator, it will pass).
The problem comes when there are multiple values for a property: they are repeated rather than given as an array. So those triples are ignored by triplestore ingestors and end up missing from the graph. This is a known problem, but thus far has not been fixed because there hasn't been clear demand for the metadata in that format. If you are planning to use JSON-LD exclusively, then we should fix the problem. The RDF/XML and Turtle should be good if your use case is loading a triplestore, and their availability is why we haven't really fixed the JSON-LD problem.
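To illustrate the kind of problem described above, here is a minimal Python sketch. It assumes that "repeated" means the same property key appearing more than once in a single JSON object; the fragment, its example.org IRI, and the labels are made up for illustration and are not copied from any TDWG file. It shows that a standard JSON parser silently keeps only the last of the repeated keys, and how the repeats could be detected.

```python
import json

# Hypothetical JSON-LD fragment (not real TDWG output): the same property
# key appears twice instead of once with an array value.
raw = '''
{
  "@id": "http://example.org/term/x",
  "skos:prefLabel": {"@language": "en", "@value": "right"},
  "skos:prefLabel": {"@language": "de", "@value": "rechts"}
}
'''

def find_repeated_keys(text):
    """Return the sorted list of keys that occur more than once in any JSON object."""
    repeated = set()

    def hook(pairs):
        # pairs is the raw list of (key, value) tuples for one JSON object
        seen = set()
        for key, _ in pairs:
            if key in seen:
                repeated.add(key)
            seen.add(key)
        return dict(pairs)

    json.loads(text, object_pairs_hook=hook)
    return sorted(repeated)

# The standard parser accepts the document but keeps only the last duplicate,
# so the English label is lost without any error being raised.
parsed = json.loads(raw)
label_kept = parsed["skos:prefLabel"]["@value"]
repeated_keys = find_repeated_keys(raw)
```

A check like `find_repeated_keys` could be run over generated files to find the properties that would need to be emitted as arrays.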
Hi @baskaufs, Thank you so much for the detailed reply and for providing the historical context. This is really helpful. Yes, I am aware that the challenges are not just AC related. I picked a more recent repository to start the discussion and frame the context.
Based on your suggestions and our FAIR Digital Object implementation of DiSSCo, the following could be one approach for us to consider. Also, perhaps we could leverage the community feedback process via TAG. It is probably OK for GBIF, COL and DiSSCo to have separate vocabulary servers and governance processes, but aligning on some of the shared SKOS and other properties could be beneficial. For instance, agreeing on the implementation of ConceptScheme vs Collections.
For DiSSCo, a critical use case in JSON and JSON-LD serialisation is ensuring we can identify the specific Digital Object and its relationships. We aim to apply the same approach to terms and vocabularies.
Based on your feedback, for DiSSCo we might need to introduce two new aspects to enhance clarity and structure:
Concept Scheme:
Controlled Vocabulary Implementation:
Additionally, each Term should:
I worked out a simple example for ods:topicDiscipline (this requires refinement and further discussion). We also plan to assign a PID to each Term.
And then Botany can be described like this
Here we need to check if we need both -- By structuring our terms in this way, we avoid the need to create a new property like
A couple notes:
@CecSve I did some more research to make a solid case for collections. I was unable to put one together. There are situations where an ordered collection may be preferred because it allows concepts to be sorted in a specific manner, but I don't think that's sufficient grounds to choose ordered collections over concept schemes. Also, concept schemes are a much more common mechanism for grouping concepts.
I was wondering if someone could shed light on how AC terms and controlled vocabularies are linked. We are examining this closely from a DiSSCo/Digital Specimen/FAIR perspective. I am particularly trying to understand the connection between terms and controlled vocabularies, as well as their machine-readable descriptions. To get a better grasp, I explored a few examples.
For instance, in the AC terms documentation, it mentions a "Controlled Vocabulary for Audiovisual Core subjectOrientation". You can select acorient:r, which then provides a JSON file with machine-readable details:
http://rs.tdwg.org/acorient/values/version/r-2023-04-26.json.
However, my first question is: the JSON file does not explicitly state that acorient:r is part of a Controlled Vocabulary scheme. Instead, it describes this as "a SKOS concept scheme for orientation". If I refer to the terms documentation for ac_subjectOrientation and its JSON serialisation (http://rs.tdwg.org/ac/terms/version/subjectOrientation-2023-09-05.json), there is no direct indication that a controlled vocabulary is associated with this term.
On the other hand, the human-readable page here mentions "Controlled value strings from SKOS Collections for ac:subjectOrientationLiteral", and the corresponding JSON-LD SKOS Collection file lists the controlled terms as "members":
https://tdwg.github.io/rs.tdwg.org/cvJson/acorient_collection.json.
I find this structure overly complicated to navigate. I also reviewed the NERC implementation for comparison. They also use skos:Collection and the skos:member concept. However, I feel that a few key elements are missing from several controlled vocabulary implementations.
There is no ontology term like isControlledVocab: {yes/no} or hasControlledVocab to explicitly indicate whether a concept is part of a controlled vocabulary or has one.
A clear relationship, such as skos:inScheme, to show which Term Schema a Controlled Vocabulary belongs to, is also absent.
In my opinion, this information could be incorporated into a single schema to simplify FAIR access and understanding.
So the file http://rs.tdwg.org/ac/terms/version/subjectOrientation-2023-09-05.json defines the term subjectOrientation and includes a scopeNote stating:
“Values SHOULD be selected from the Controlled Vocabulary for Audiovisual Core subjectOrientation.”
However, there is no explicit, machine-readable link pointing to where the vocabulary can be found.
Meanwhile, the file https://tdwg.github.io/rs.tdwg.org/cvJson/acorient_collection.json provides SKOS Collections of controlled vocabulary concepts as members: http://rs.tdwg.org/acorient/values/r0000, r0001, etc.
These members represent the valid controlled terms for subjectOrientation, but this connection is not stated in the subjectOrientation JSON file.
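To make the gap concrete, here is a minimal Python sketch of what a client has to do today. The two dictionaries are hypothetical, heavily simplified stand-ins for the term file and the collection file; their key names and shapes, and the unversioned term IRI, are assumptions for illustration and not the actual structure of the published JSON. Only the scope note text and the two member IRIs come from the files discussed above.

```python
# Hypothetical stand-in for the subjectOrientation term metadata (shape assumed).
term_doc = {
    "@id": "http://rs.tdwg.org/ac/terms/subjectOrientation",  # assumed IRI
    "skos:scopeNote": "Values SHOULD be selected from the Controlled "
                      "Vocabulary for Audiovisual Core subjectOrientation.",
}

# Hypothetical stand-in for one SKOS collection (shape assumed).
collection_doc = {
    "@type": "skos:Collection",
    "skos:member": [
        {"@id": "http://rs.tdwg.org/acorient/values/r0000"},
        {"@id": "http://rs.tdwg.org/acorient/values/r0001"},
    ],
}

def iri_links(doc):
    """Collect every IRI-valued property (values shaped like {"@id": ...}) of a node."""
    links = {}
    for key, value in doc.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @id and @type
        values = value if isinstance(value, list) else [value]
        iris = [v["@id"] for v in values if isinstance(v, dict) and "@id" in v]
        if iris:
            links[key] = iris
    return links

def allowed_values(collection):
    """Return the set of member IRIs listed in a collection node."""
    return {member["@id"] for member in collection.get("skos:member", [])}

# The term node carries only a text note, so there is nothing to follow
# automatically; the client must already know which collection file to load.
term_links = iri_links(term_doc)
valid = allowed_values(collection_doc)
```

In this sketch `term_links` comes back empty, which is the point: the association between the term and its allowed values exists only in the human-readable scope note.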
Could we not add more context into a single JSON schema and link the term and its vocabularies better? Below is an oversimplified example to illustrate the point:
Sorry for the long post. As we are trying to re-use some of these existing ontologies, terms, and vocabularies in DiSSCo and create new terms that are openDS-specific, we want to better understand the current implementation and identify any gaps.
Thanks for your time!