Update json-ld processing to adhere to SoSo 1.3 #1

Open
nein09 opened this issue Jun 15, 2022 · 4 comments

@nein09
Contributor

nein09 commented Jun 15, 2022

I took a look at the json-ld processor and the queries that it uses, and compared them with the SoSo 1.3 guidelines.

Things that are in SoSo 1.3 that don't seem to be here yet:

  • license
  • sameAs
  • isAccessibleForFree
  • identifier (outside the context of funding / awards)
  • citation
  • publisher / provider
  • prov:wasRevisionOf
  • prov:used
  • dateCreated
  • dateModified
  • expires

Things that are present that aren't in the SoSo 1.3 spec (were they in the spec once and later dropped?):

  • had_derivation
  • has_derivation
  • hasPart

Things that SoSo allows to have more information than the current SPARQL queries are pulling out:

  • Author / creator / contributor supports specifying multiple people playing multiple roles; is there room in the Solr index to include them somehow? For example, should the Solr 'investigator' field map to creator: { "@type": "Role", "roleName": "Principal Investigator" }, or is that too specific?
  • Spatial coverage: Currently supports one bounding box or place name; however, other kinds of GeoShapes and GeoCoordinates (e.g., points) can be specified under SoSo, and lists of each of these are allowed.
  • keywords: Currently only supports text, not defined terms / controlled vocabularies. Also, I have seen keywords specified at the DataCatalog level instead of on the Dataset; I don't know if any of the repositories you consume are doing that, but you might be missing out on them if they do.
  • Temporal coverage: no geologic time support yet.
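
To make the roles question above concrete, here is a minimal Python sketch of how role-wrapped people might be flattened before indexing. It assumes the schema.org Role pattern, where the Role node repeats the property name (e.g. `creator`) to point at the actual person. The helper names (`as_list`, `name_of`, `people_with_roles`) are hypothetical and not part of the existing processor.

```python
from typing import Any, List, Optional, Tuple


def as_list(value: Any) -> list:
    """Normalize a JSON-LD value (missing, single, list, or @list) to a list."""
    if value is None:
        return []
    if isinstance(value, dict) and "@list" in value:
        return list(value["@list"])
    return value if isinstance(value, list) else [value]


def name_of(node: Any) -> Optional[str]:
    """Best-effort display name: a plain string, else 'name', else '@id'."""
    if isinstance(node, str):
        return node
    if isinstance(node, dict):
        return node.get("name") or node.get("@id")
    return None


def people_with_roles(dataset: dict, prop: str = "creator") -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (name, roleName) pairs; roleName is None when no Role wrapper is used."""
    results = []
    for entry in as_list(dataset.get(prop)):
        if isinstance(entry, dict) and entry.get("@type") == "Role":
            # Assumed pattern: the Role node nests the person under the same property name.
            for person in as_list(entry.get(prop)):
                results.append((name_of(person), entry.get("roleName")))
        else:
            results.append((name_of(entry), None))
    return results
```

With this shape, an entry whose role is "Principal Investigator" could feed the 'investigator' field, while people listed without a Role wrapper come back with a role of `None`.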
@mbjones mbjones added the enhancement New feature or request label Jun 15, 2022
@mbjones
Member

mbjones commented Jun 15, 2022

Thanks @nein09, lots of improvements to make and a few things to discuss here. A few things may already be there without being obvious (e.g., prov:wasRevisionOf, prov:used, prov:wasDerivedFrom and their relationship to the provenance section of SoSo). Handling people and their roles has been a big issue, one which we've started addressing in our work on the slinky graph. Let's discuss how to handle that with @amoeba too.

@nein09
Contributor Author

nein09 commented Jun 21, 2022

Per the discussion on 21 June: The easiest place to start is with things that are easily supported by the Solr index.

Priorities

  • license does not yet exist in the Solr index. Adding it isn't super hard, but then they'd have to reindex the corpus, which would take on the order of weeks. See: Add 'license' field to the index #2
  • identifier: the DataONE harvester probably pulled this off of the dataset, especially if it's a DOI or the @id field. That value gets turned into the SID (a series identifier), and the PID is a checksum of the canonicalized document. If more than one identifier is listed besides the PID or SID, the extras could go under 'alternativeIdentifier'.
  • publisher / provider: these should be two separate fields in DataONE. The publisher is the organization that released and distributed the dataset; the provider has a copy of it that is available for access. (Google Dataset Search is the provider, but not the publisher, for many datasets). This should be a name and a ROR / other identifier.
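
A rough Python sketch of the identifier idea above: collect everything listed under `identifier` and keep whatever isn't already the PID or SID. The function name and the 'alternativeIdentifier' target are assumptions for illustration, and identifiers are assumed to be either plain strings or PropertyValue-style objects carrying a `value`, `url`, or `@id`.

```python
from typing import Any, List, Optional


def alternative_identifiers(dataset: dict, pid: str, sid: Optional[str] = None) -> List[str]:
    """Return identifiers on the dataset that are neither the PID nor the SID."""
    raw: Any = dataset.get("identifier")
    if raw is None:
        items = []
    elif isinstance(raw, list):
        items = raw
    else:
        items = [raw]

    seen = {pid, sid}  # values already represented elsewhere in the index
    extras = []
    for item in items:
        if isinstance(item, dict):
            # PropertyValue-style identifier; prefer the literal value.
            value = item.get("value") or item.get("url") or item.get("@id")
        else:
            value = item
        if value and value not in seen:
            seen.add(value)  # also de-duplicates repeated identifiers
            extras.append(value)
    return extras
```

Anything this returns would be a candidate for an 'alternativeIdentifier' field; an empty list means the dataset only listed what is already the PID or SID.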

Provenance problems

  • prov:wasRevisionOf is equivalent to isBasedOn and obsoletes, but both entities have to be in the DataONE index and you have to be the owner of the dataset that you are revising. This is part of the DataONE system metadata.
  • prov:used isn't in SPARQL, but it does get into the index. prov_used is a field in the Solr index, along with prov_*.
  • Derivation relationships are very important to capture.
  • Ideally, prov:hadDerivation and prov:hasDerivation would populate prov:wasDerivedFrom, as an inverse relationship, but not overwriting things in Solr will be tricky.
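
One possible way to fill in the inverse without clobbering existing values is a Solr atomic update using the `add-distinct` modifier, which appends only values that are not already present in a multi-valued field. The sketch below just builds those update documents in Python. The field name `prov_wasDerivedFrom`, the use of `id` as the unique key, and the input shape (pairs of source and derived identifiers) are all assumptions, not a description of what the indexer currently does.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def inverse_derivation_updates(
    had_derivation_pairs: Iterable[Tuple[str, str]],
    field: str = "prov_wasDerivedFrom",  # assumed multi-valued Solr field
) -> List[Dict]:
    """Turn (source, derived) pairs into atomic updates on the derived documents.

    "source hadDerivation derived" is read as "derived wasDerivedFrom source".
    """
    sources_by_derived = defaultdict(list)
    for source_id, derived_id in had_derivation_pairs:
        if source_id not in sources_by_derived[derived_id]:
            sources_by_derived[derived_id].append(source_id)

    # "add-distinct" asks Solr to append only the values that are missing,
    # leaving whatever is already stored in the field untouched.
    return [
        {"id": derived_id, field: {"add-distinct": sources}}
        for derived_id, sources in sorted(sources_by_derived.items())
    ]
```

The resulting list could be posted to the collection's update handler as JSON; whether atomic updates are usable here depends on how the existing fields are stored, which would need checking against the actual schema.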

Not priorities:

  • isAccessibleForFree (everyone puts 'true' even if it isn't true)
  • sameAs: a little like 'alternativeIdentifier', but a potential footgun because 'same' does not necessarily mean 'identical'. 'alternativeIdentifier', by contrast, doesn't make claims about how identical the two datasets are.
  • 'citation' is still being worked on by SoSo, so best not to include it now
  • dateCreated and dateModified: these are not the same construct in DataONE, because datasets are considered immutable; in DataONE, the modification date tracks when the system metadata was last modified. dateModified would become dateCreated for a new version of the document.
  • expires (has anyone used this? If anyone feels like answering this question, we'd love to know)
  • prov:hadDerivation and prov:hasDerivation are recommended against by the W3C, but DataONE indexes both. I am willing to bet that those are there for a reason.

@mbjones mbjones added this to the 2.4.0 milestone Aug 5, 2022
@nein09
Contributor Author

nein09 commented Sep 6, 2022

A discussion about identifiers, which may or may not narrow things down: ESIPFed/science-on-schema.org#128

@nein09
Contributor Author

nein09 commented Sep 6, 2022

It doesn't seem that alternativeIdentifier is a field in Solr, so I'd need to add one.

@mbjones mbjones modified the milestones: 2.4.0, 3.0.0 Sep 6, 2022
@mbjones mbjones removed this from the 3.0.0 milestone Dec 7, 2022