Linking Term and Controlled Vocabs on one JSON schema #273

Open
sharifX opened this issue Dec 17, 2024 · 5 comments

@sharifX

sharifX commented Dec 17, 2024

I was wondering if someone could shed light on how AC terms and controlled vocabularies are linked. We are examining this closely from a DiSSCo/Digital Specimen/FAIR perspective. I am particularly trying to understand the connection between terms and controlled vocabularies, as well as their machine-readable descriptions. To get a better grasp, I explored a few examples.

For instance, in the AC terms documentation, it mentions a "Controlled Vocabulary for Audiovisual Core subjectOrientation". You can select acorient:r, which then provides a JSON file with machine-readable details:
http://rs.tdwg.org/acorient/values/version/r-2023-04-26.json.

However, my first question is: the JSON file does not explicitly state that acorient:r is part of a Controlled Vocabulary scheme. Instead, it describes this as "a SKOS concept scheme for orientation". If I refer to the terms documentation for ac_subjectOrientation and its JSON serialisation (http://rs.tdwg.org/ac/terms/version/subjectOrientation-2023-09-05.json), there is no direct indication that a controlled vocabulary is associated with this term.

On the other hand, the human-readable page here mentions "Controlled value strings from SKOS Collections for ac:subjectOrientationLiteral", and the corresponding JSON-LD SKOS Collection file lists the controlled terms as "members":
https://tdwg.github.io/rs.tdwg.org/cvJson/acorient_collection.json.

I find this structure overly complicated to navigate. I also reviewed the NERC implementation for comparison. They also use skos:Collection and the skos:member concept. However, I feel that a few key elements are missing from several controlled vocabulary implementations.

There is no ontology term like isControlledVocab: {yes/no} or hasControlledVocab to explicitly indicate whether a concept is part of a controlled vocabulary or has one.

A clear relationship, such as skos:inScheme, to show which Term Schema a Controlled Vocabulary belongs to, is also absent.

In my opinion, this information could be incorporated into a single schema to simplify FAIR access and understanding.

So the file http://rs.tdwg.org/ac/terms/version/subjectOrientation-2023-09-05.json defines the term subjectOrientation and includes a scopeNote stating:

“Values SHOULD be selected from the Controlled Vocabulary for Audiovisual Core subjectOrientation.”
However, there is no explicit, machine-readable link pointing to where the vocabulary can be found.

Meanwhile, the file https://tdwg.github.io/rs.tdwg.org/cvJson/acorient_collection.json provides SKOS Collections that list the controlled vocabulary concepts as members (http://rs.tdwg.org/acorient/values/r0000, r0001, etc.).

These members represent the valid controlled terms for subjectOrientation, but this connection is not stated in the subjectOrientation JSON file.

Could we not add more context into a single JSON schema and link the term and its vocabularies better? Below is an oversimplified example to illustrate the point:

{
  "@id": "http://rs.tdwg.org/ac/terms/version/subjectOrientation-2023-09-05",
  "rdfs:label": {
    "@language": "en",
    "@value": "Subject Orientation"
  },
  "skos:prefLabel": {
    "@language": "en",
    "@value": "Subject Orientation"
  },
  "hasControlledVocab": {
    "@id": "http://rs.tdwg.org/acorient/collection/",
    "skos:member": [
      "http://rs.tdwg.org/acorient/values/r0000",
      "http://rs.tdwg.org/acorient/values/r0001",
      "http://rs.tdwg.org/acorient/values/r0002",
      "http://rs.tdwg.org/acorient/values/r0003"
    ]
  }
}
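
To make the idea concrete, here is a minimal Python sketch of how a client might use such a link to check a value. `hasControlledVocab` is the hypothetical property from the example above, not an existing TDWG or SKOS term. The membership property is spelled `skos:member` (singular), as SKOS defines it. The helpers `allowed_values` and `is_allowed_value` and the value `r9999` are invented for illustration.

```python
import json

# Term metadata shaped like the oversimplified example above.
# "hasControlledVocab" is a hypothetical property, not an existing TDWG or SKOS term.
TERM_JSON = """
{
  "@id": "http://rs.tdwg.org/ac/terms/version/subjectOrientation-2023-09-05",
  "hasControlledVocab": {
    "@id": "http://rs.tdwg.org/acorient/collection/",
    "skos:member": [
      "http://rs.tdwg.org/acorient/values/r0000",
      "http://rs.tdwg.org/acorient/values/r0001"
    ]
  }
}
"""

def allowed_values(term: dict) -> set[str]:
    """Return the value IRIs listed under the term's linked vocabulary, if any."""
    vocab = term.get("hasControlledVocab", {})
    return set(vocab.get("skos:member", []))

def is_allowed_value(term: dict, value_iri: str) -> bool:
    """True when value_iri is one of the members linked from the term."""
    return value_iri in allowed_values(term)

term = json.loads(TERM_JSON)
print(is_allowed_value(term, "http://rs.tdwg.org/acorient/values/r0001"))  # True
# r9999 is a made-up IRI that is not in the list above.
print(is_allowed_value(term, "http://rs.tdwg.org/acorient/values/r9999"))  # False
```

With the link embedded in the term metadata, the check needs only one document; without it, a client has to know out of band where the collection file lives.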

Sorry for the long post. As we are trying to re-use some of these existing ontologies, terms, and vocabularies in DiSSCo and create new terms that are openDS-specific, we want to better understand the current implementation and identify any gaps.

Thanks for your time!

@baskaufs
Contributor

@sharifX Thanks for your thoughtful questions. You have identified some important issues and hopefully we can use some of your suggestions to improve the way we make our term metadata available. One thing that I should note is that the issues you raise here also apply to several controlled vocabularies that are in use for Darwin Core terms. So any improvements that we make should be incorporated TDWG-wide, which thus far means AC and DwC, but in the future may include other vocabularies such as Latimer Core and TCS.

I will provide a bit of historical background, which may help clarify how we ended up in the place we are now. Ratifying controlled vocabularies is a relatively new thing in TDWG, with the first ones ratified only 4 years ago. When trying to figure out how to define a controlled vocabulary, we had to recognize that most users were assuming that a controlled vocabulary simply meant a list of controlled value strings. However, in TDWG, we have the convention that all terms in any vocabulary should have core properties like label and definition, with those being distinct from the controlled value string in the specific case of controlled vocabularies. We also provide machine-readable metadata in the form of RDF for all terms, recognizing that many users don't understand what that is or pay attention to it. For controlled value terms, it seemed appropriate to use SKOS in the machine-readable representations, because that seemed consistent with the design of SKOS and because SKOS provided some properties that were useful for describing and linking the controlled vocabulary terms.

However, although there are plenty of examples of how to construct SKOS concept schemes, there weren't any that we could find that were linked to properties whose values should be taken from the concept scheme. So we did not find any property such as the hasControlledVocab that you used in your example. I agree that making such a link would be valuable to those who want to use the terms in a Linked Data context, but thus far you are the first to ask for it. If we can find some examples of a well-known property that others have used to make such a link, we should add it in our property term metadata. If not, we could mint a term in the tdwgutility: namespace as we have for other terms that we needed. I believe that namespace is now managed by the Technical Architecture Group (TAG), so making a formal proposal (either to use some existing external term or to mint our own TDWG term) would be in order. I am pinging the TAG chair @ben-norton here to make him aware of this discussion.

The examples that you give from the subjectOrientation controlled vocabulary are more complex than some of the other controlled vocabularies. We've used SKOS concept schemes and collections in some of the controlled vocabularies to organize the terms in useful ways in vocabularies where we needed them. But neither concept schemes nor collections correspond directly to the vocabularies.

In most cases, there is a single concept scheme within each vocabulary. However, in one case (the format controlled vocabulary http://rs.tdwg.org/ac/doc/format/), there are two concept schemes in the vocabulary: one for organizing concepts by file extensions and one for organizing Internet media (MIME) types. The original specification for Audiovisual Core said that either could be used to describe the format, so we supported both by having two concept schemes and then declaring which concepts in the two schemes were equivalent. So one can't say in all cases that the vocabularies are equivalent to the concept schemes. That's why the vocabularies themselves have an IRI identifier that is distinct from the concept scheme identifiers. As with all TDWG terms, the controlled vocabulary terms are linked to their containing vocabularies using rdfs:isDefinedBy (see http://rs.tdwg.org/format/values/m001.ttl for an example). So if we made a link between the property term and the controlled vocabulary to be used with it, we should probably link to the vocabulary IRI and not the concept scheme. I suppose alternatively one could also make one or more links to the concept scheme(s). If there are examples outside of TDWG, those would be helpful to see.
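
A small Python sketch of why the vocabulary and its concept schemes need separate identifiers. All IRIs below are placeholders under example.org, not the real format vocabulary IRIs, and `group_by` is a helper invented for illustration. Only the property names `rdfs:isDefinedBy` and `skos:inScheme` are taken from the discussion.

```python
from collections import defaultdict

# Placeholder IRIs for illustration only (not the real TDWG format vocabulary IRIs).
VOCAB = "http://example.org/format/"
SCHEME_EXT = "http://example.org/format/extension"
SCHEME_MIME = "http://example.org/format/mediaType"

concepts = [
    {"@id": "http://example.org/format/values/e001",
     "rdfs:isDefinedBy": VOCAB, "skos:inScheme": SCHEME_EXT},
    {"@id": "http://example.org/format/values/m001",
     "rdfs:isDefinedBy": VOCAB, "skos:inScheme": SCHEME_MIME},
]

def group_by(concepts: list[dict], prop: str) -> dict[str, list[str]]:
    """Map each value of `prop` to the IRIs of the concepts that carry it."""
    groups = defaultdict(list)
    for concept in concepts:
        groups[concept[prop]].append(concept["@id"])
    return dict(groups)

by_vocab = group_by(concepts, "rdfs:isDefinedBy")
by_scheme = group_by(concepts, "skos:inScheme")

# One vocabulary holds every concept, while each scheme holds only a subset.
print(len(by_vocab), "vocabulary,", len(by_scheme), "concept schemes")
```

A link from the property term to `VOCAB` would therefore reach every permitted value, whereas a link to a single scheme would reach only part of them.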

The case of the SKOS collections is so far unique to the subjectPart and subjectOrientation controlled vocabularies. There are two reasons why the collections are described separately from the rest of the term metadata. One reason is practical. The term metadata unrelated to collections is managed in CSV files that are similar to those of all other controlled vocabularies, and its JSON-LD is generated by a script that is common to all controlled vocabularies. The metadata related to the SKOS collections are in CSV files that are unique to those vocabularies, and specific software is used to generate the collections-related metadata. The second reason is more process-related. The basic term metadata is somewhat more tightly controlled by the standards process. In particular, the normative properties (for example, the definition) can't be changed without going through the standards maintenance process described in the Vocabulary Maintenance Specification. In contrast, the grouping of terms into collections is not controlled by any official standards process and really is just a suggestion, e.g. "if you are describing fish, you probably should use these terms". If the AC Maintenance Group thinks the groupings are wrong, or that some other part or orientation should be added to a collection, they can just do it without following any process. So it seemed better to maintain the SKOS collections metadata separately.

Having said all of that, the primary purpose of making the machine-readable metadata available to users is to satisfy their use cases, so if the way we have structured the JSON does not make sense or is not very usable, then we should change it to make it more usable. When we did the implementation testing for the subjectPart and subjectOrientation controlled vocabularies, it was primarily to determine whether we had included all of the appropriate values that users needed and whether users could confidently apply the correct value to a test media item. We did not have any testers who did anything with the machine-readable metadata except me, and I was the one who set it up. So that wasn't a very good test. That is why your feedback is extremely valuable. If you can help us know how you want to use the JSON, we can improve it to make it more usable for you.

I have just re-read your comment and wanted to add an additional note. You should be aware that the JSON-LD representations of term metadata returned when term (and term version) IRIs are dereferenced are auto-generated, and the script that generates them does not always create JSON-LD that is correct (although it will pass if you run it through a validator). The problem comes when there are multiple values for a property: the property is repeated rather than given once with an array of values. So those triples are ignored by triplestore ingestors and end up missing from the graph. This is a known problem, but thus far it has not been fixed because there hasn't been clear demand for the metadata in that format. If you are planning to use JSON-LD exclusively, then we should fix the problem. The RDF/XML and Turtle serializations should be good if your use case is loading a triplestore, and their availability is why we haven't really fixed the JSON-LD problem.
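
The repeated-property problem can be reproduced with any standard JSON parser. The sketch below uses Python's `json` module on a made-up term (the example.org IRIs are placeholders, and this is not the script that generates the TDWG files). A plain parse keeps only the last occurrence of a repeated key, while an `object_pairs_hook` can gather the repeats into an array. The function name `merge_duplicate_keys` is an assumption for illustration.

```python
import json

# Made-up term in which one property appears twice instead of once with an array.
RAW = """
{
  "@id": "http://example.org/term",
  "skos:inScheme": {"@id": "http://example.org/schemeA"},
  "skos:inScheme": {"@id": "http://example.org/schemeB"}
}
"""

def merge_duplicate_keys(pairs):
    """object_pairs_hook that turns repeated keys into a list of their values."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # Keys seen once keep their single value; repeated keys become arrays.
    return {k: (v[0] if len(v) == 1 else v) for k, v in grouped.items()}

naive = json.loads(RAW)
repaired = json.loads(RAW, object_pairs_hook=merge_duplicate_keys)

print(naive["skos:inScheme"])          # only the last value survives
print(len(repaired["skos:inScheme"]))  # both values are kept
```

This only illustrates the failure mode on the consumer side; the durable fix would be for the generating script to emit the array itself.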

@sharifX
Author

sharifX commented Jan 3, 2025

Hi @baskaufs,

Thank you so much for the detailed reply and for providing the historical context. This is really helpful. Yes, I am aware that the challenges are not just AC related. I picked a more recent repository to start the discussion and frame the context.

Based on your suggestions and our FAIR Digital Object implementation of DiSSCo, the following could be one approach for us to consider. Perhaps we could also leverage the community feedback process via TAG. It is probably OK for GBIF, COL and DiSSCo to have separate vocabulary servers and governance processes, but aligning on some of the shared SKOS and other properties could be beneficial, for instance by agreeing on the implementation of ConceptScheme vs. Collection.

For DiSSCo, a critical use case in JSON and JSON-LD serialisation is ensuring we can identify the specific Digital Object and its relationships. We aim to apply the same approach to terms and vocabularies.

Based on your feedback, for DiSSCo we might need to introduce two new aspects to enhance clarity and structure:

Concept Scheme:

  • skos:ConceptScheme Contains multiple terms (skos:Concepts).

Controlled Vocabulary Implementation:

  • Contains one or more ConceptSchemes.
  • Directly linked to terms using rdfs:isDefinedBy for semantic clarity.

Additionally, each Term should:

  • be linked to its associated ConceptScheme(s) using skos:inScheme.
  • be linked to the ControlledVocabulary using rdfs:isDefinedBy.

I worked out a simple example for ods:topicDiscipline (this requires refinement and further discussion). We also plan to assign a PID to each Term.

{
  "@id": "https://hdl.handle.net/20.500.1025/ods_topicDiscipline_vocab",
  "@type": "skos:ConceptScheme",
  "rdfs:label": "Topic Discipline Vocabulary",
  "skos:definition": "A controlled vocabulary containing disciplines associated with topics in the DiSSCo schema.",
  "dcterms:created": "2025-01-02T10:00:00.000Z",
  "dcterms:modified": "2025-01-02T10:00:00.000Z",
  "skos:member": [
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Anthropology",
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Botany",
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Astrogeology",
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Geology",
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Microbiology",
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Palaeontology",
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Zoology",
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Ecology",
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/OtherBiodiversity",
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/OtherGeodiversity",
    "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Unclassified"
  ]
}

And then Botany can be described like this:

{
  "@id": "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Botany",
  "@type": "skos:Concept",
  "rdfs:label": "Botany",
  "skos:prefLabel": "Botany",
  "skos:definition": "The scientific study of plants.",
  "rdfs:isDefinedBy": "https://hdl.handle.net/20.500.1025/ods_topicDiscipline_vocab",
  "skos:inScheme": "https://hdl.handle.net/20.500.1025/ods_topicDiscipline_vocab"
}

Here we need to check if we need both -- skos:inScheme is specific to SKOS, and rdfs:isDefinedBy is a general RDF property.

By structuring our terms in this way, we avoid the need to create a new property like hasControlledVocab.
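
The linking rules listed above could be checked mechanically. This is a minimal sketch, assuming each concept is available as a parsed JSON-LD dictionary. `missing_links` is a hypothetical helper, the Botany data is copied from the example, and the incomplete Zoology record is invented to show a failing case.

```python
# Properties every term is expected to carry under the proposal above.
REQUIRED_LINKS = ("skos:inScheme", "rdfs:isDefinedBy")

def missing_links(concept: dict) -> list[str]:
    """Return the required linking properties that are absent or empty."""
    return [prop for prop in REQUIRED_LINKS if not concept.get(prop)]

botany = {
    "@id": "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Botany",
    "@type": "skos:Concept",
    "skos:prefLabel": "Botany",
    "rdfs:isDefinedBy": "https://hdl.handle.net/20.500.1025/ods_topicDiscipline_vocab",
    "skos:inScheme": "https://hdl.handle.net/20.500.1025/ods_topicDiscipline_vocab",
}

# Invented incomplete record: it has a label but no links to a scheme or vocabulary.
zoology_incomplete = {
    "@id": "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/Zoology",
    "@type": "skos:Concept",
    "skos:prefLabel": "Zoology",
}

print(missing_links(botany))              # []
print(missing_links(zoology_incomplete))  # ['skos:inScheme', 'rdfs:isDefinedBy']
```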

@ben-norton
Member

A couple notes:
Here, the phrase "controlled vocabulary" refers to the values assigned to a term, not the terms. I prefer the phrase 'value domain' to avoid any confusion, but that's a separate discussion.

  1. @sharifX I think your JSON-LD code snippet is very good.
  2. I recently finished a draft version of a metadata schema for Controlled Vocabularies.
    https://docs.google.com/spreadsheets/d/1Ei9jpg1y4Prf4jOJEufzA3wbRzR49jKDsXVIiJhZOL0/edit?usp=sharing
    It has a considerably larger set of properties than the JSON-LD above, but there's a good reason for this. I created the draft scheme using published vocabularies in the geosciences domain and the best practices guide listed below. It needs to be simplified, but all of the pieces are there.
  3. In terms of best practices for creating controlled vocabularies, please see https://www.niso.org/publications/ansiniso-z3919-2005-r2010
  4. I'm more inclined to use Concept Collections or Ordered Collections for controlled vocabularies.
  5. In my opinion, there should be a separate process for ratifying controlled vocabularies.
  6. This topic is at the top of the agenda for the TAG in 2025. We plan to include several controlled vocabularies with the Mineralogy Extension in hopes of initiating this process.
  7. As a side note, hierarchical classification systems are a separate type of KOS resource and require a separate metadata scheme. Ideally, the controlled vocabulary scheme would be extended to cover classification systems, but whether or not that's possible remains a work in progress.

@CecSve

CecSve commented Jan 10, 2025

> 4. I'm more inclined to use Concept Collections or Ordered Collections for controlled vocabularies.

Can I ask why? I am only asking because I do not know what the argument would be for either, but I am currently in favour of @sharifX's example.

@ben-norton
Member

@CecSve I did some more research to make a solid case for collections. I was unable to put one together. There are situations where an ordered collection may be preferred because it allows concepts to be sorted in a specific manner, but I don't think that's sufficient grounds to choose ordered collections over concept schemes. Also, concept schemes are a much more common mechanism for grouping concepts.
One issue regarding @sharifX's example is the use of skos:member with a concept scheme. SKOS Collections have members. Concept Schemes have top concepts. The relation between a concept and a concept scheme is skos:inScheme.
Concept inScheme skos:ConceptScheme
Concept isMember skos:Collection
See https://www.sciencedirect.com/science/article/pii/S1570826813000176 and https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
@sharifX How would your scheme handle hierarchies? Have you considered the use of schema:DefinedTermSet and schema:DefinedTerm?
https://schema.org/DefinedTerm
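
The distinction between collection members and concept scheme membership can be sketched in Python by restating part of the topicDiscipline example without skos:member on the concept scheme. Concepts point to their scheme with `skos:inScheme`, and the scheme's contents are derived by following that link in reverse. The IRIs come from the earlier example; `concepts_in_scheme` is a hypothetical helper, and treating every concept as a top concept is an assumption that only holds for a flat list.

```python
SCHEME = "https://hdl.handle.net/20.500.1025/ods_topicDiscipline_vocab"
BASE = "https://hdl.handle.net/20.500.1025/ods_topicDiscipline/"

# Each concept declares the scheme it belongs to.
concepts = [
    {"@id": BASE + name, "@type": "skos:Concept",
     "skos:prefLabel": name, "skos:inScheme": SCHEME}
    for name in ("Botany", "Geology", "Zoology")
]

scheme = {
    "@id": SCHEME,
    "@type": "skos:ConceptScheme",
    # Assumption: in a flat list every concept can be declared a top concept.
    "skos:hasTopConcept": [c["@id"] for c in concepts],
}

def concepts_in_scheme(concepts: list[dict], scheme_iri: str) -> list[str]:
    """Derive a scheme's contents from the skos:inScheme links on the concepts."""
    return sorted(c["@id"] for c in concepts if c.get("skos:inScheme") == scheme_iri)

print(concepts_in_scheme(concepts, SCHEME))
```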
