ro-crate #2

Open
kmexter opened this issue Oct 1, 2024 · 12 comments

@kmexter
Contributor

kmexter commented Oct 1, 2024

I went over this file: https://github.com/emo-bon/metaGOflow-data-products-RO-crate-example/blob/main/ro-crate-metadata.json and have the following comments based on a comparison to our ARMS ro-crates
https://github.com/arms-mbon/data_release_001/blob/main/ro-crate-metadata.json and https://github.com/arms-mbon/analysis_release_001/blob/main/ro-crate-metadata.json

  • add a Creator -> that could be EMBRC or the metaGOflow team - done
  • add a publisher -> VLIZ, as we did with ARMS data_release_001 - done
  • add wasAssociatedWith -> people involved in creating the dataset (e.g. the bioinformatician) - done
  • add a specific contactPoint -> is the [email protected] really the best email address? - done
  • the licence should be given as "license" - done
  • the associated sequence (which will eventually be the ENA run accession number) should get a predicate that says that it is an associatedSequence. We will need to find such a predicate: wasDerivedFrom or similar, combined with a property saying that it is a sequence? @laurianvm ? - under discussion in issue 7
  • the other files listed in the ro-crate: where they were produced by metaGOflow, they need "wasInfluencedBy" and should then refer to the GH repo of that file (or that instance of the file; the version is needed too). Also to discuss with @laurianvm, using the links given above as input - under discussion in issue 7
  • need a downloadURL for each file - done
  • should have "keywords" for the repo - done
  • my ARMS ro-crate has a "label" for each one, but here that is "name" - which is better? @cedricdcc ? - ignore
  • at repo level, add some dct:relation to link to other repos that should be considered with this one ... would need to think about what they would be
  • need to add the material sample id (or uuid), I think as being a "sample"; will need Laurian's input here on what the predicate and object would be
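
The root-level items in the checklist above could be verified automatically. The following is a minimal sketch, not part of create-ro-crate.py: the list of expected keys, the helper names, the trimmed example crate and its licence URL are all assumptions made for illustration, and treating these keys as mandatory is a project choice rather than an RO-Crate requirement.

```python
import json

# Root-level properties taken from the checklist above; treating them as
# required is an assumption of this sketch, not a rule from the RO-Crate spec.
EXPECTED_ROOT_KEYS = ["creator", "publisher", "license", "keywords", "contactPoint"]


def find_root_dataset(graph):
    """Return the entity that the metadata descriptor's "about" points to."""
    by_id = {entity["@id"]: entity for entity in graph if "@id" in entity}
    descriptor = by_id.get("ro-crate-metadata.json")
    if descriptor is None:
        raise ValueError("no ro-crate-metadata.json descriptor in @graph")
    return by_id[descriptor["about"]["@id"]]


def missing_root_keys(crate):
    """List the expected root-level properties that are absent or empty."""
    root = find_root_dataset(crate["@graph"])
    return [key for key in EXPECTED_ROOT_KEYS if not root.get(key)]


# Hypothetical, heavily trimmed crate used only to exercise the check.
example_crate = json.loads("""
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset",
     "creator": {"@id": "https://ror.org/0038zss60"},
     "license": "https://creativecommons.org/licenses/by/4.0/"}
  ]
}
""")

print(missing_root_keys(example_crate))
```

Running a check like this over every generated crate would show which of the checklist items are still open for a given manifest.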

If you don't understand all of this @cymon (I have written it in a bit of a hurry) then perhaps it is best that @laurianvm and I make a template and you fill it in?

@laurianvm

laurianvm commented Oct 16, 2024

(@laurianvm to go through and give examples on how to
(including how to describe the institutes that have run the code, relating to codemeta template example))

@cymon
Contributor

cymon commented Oct 16, 2024

I went over this file: https://github.com/emo-bon/metaGOflow-data-products-RO-crate-example/blob/main/ro-crate-metadata.json

OK, that file is an example of the metadata.json file generated by the create-ro-crate.py script; the template that script uses is here

and have the following comments based on a comparison to our ARMS ro-crates https://github.com/arms-mbon/data_release_001/blob/main/ro-crate-metadata.json and https://github.com/arms-mbon/analysis_release_001/blob/main/ro-crate-metadata.json

* add a Creator -> that could be embrc or could be metagoflow team

"creator": {"@id": "https://ror.org/0038zss60"}

I thought that using the ROR for the ID was correct (or at least acceptable); if not it can be changed.

* add a publisher -> vliz, as we did with arms data_release_001

This was already included:
"publisher": {"@id": "https://ror.org/0038zss60"}

Surely the publisher is EMBRC or EMO-BON rather than VLIZ?

* add wasAssociatedWith -> people involved in creating the dataset (e.g. the bioinformatician)

Do individuals have to be identified by name? I think it would be preferable to point to EMO BON and have a web-page where people involved are included. Else we need to keep track of who did what for every ro-crate, or we could just include the same named individuals in all manifests to simplify things.

* add a specific contactPoint -> is the [email protected] really the best email address?

It's "a" contact point - if there are alternatives we could replace it.

* the licence should be given as "license"

Ah, Americans...

* the associated sequence (which will eventually be the ENA run accession number) should get a predicate that says that it is an associatedSequence. We will need to find such a predicate: wasDerivedFrom or similar, combined with a property saying that it is a sequence? @laurianvm ?

* the other files listed in the ro-crate: where they were produced by metaGOflow, they need "wasInfluencedBy" and should then refer to the GH repo of that file (or that instance of the file; the version is needed too). Also to discuss with @laurianvm, using the links given above as input

* need a downloadURL for each file

TODO.

* should have "keywords" for the repo

TODO.

* my ARMS ro-crate has a "label" for each one, but here that is "name" - which is better? @cedricdcc ?

* at repo level, add some dct:relation to link to other repos that should be considered with this one ... would need to think about what they would be

* need to add the material sample id (or uuid), I think as being a "sample"; will need Laurian's input here on what the predicate and object would be

If you don't understand all of this @cymon (I have written it in a bit of a hurry) then perhaps it is best that @laurianvm and I make a template and you fill it in?

@cymon
Contributor

cymon commented Oct 17, 2024

@kmexter @laurianvm @marc-portier

I have a question regarding ro-crate stanza formatting: so this is a stanza from ARMS data_release_001

    {
        "@id": "./ARMS_ITS_Occurrence.csv",
        "@type": "File",
        "label": "./ARMS_ITS_Occurrence.csv",
        "fileFormat": "csv",
        "wasDerivedFrom": [
            "https://github.com/arms-mbon/data_workspace/tree/main/qualitycontrolled_data/combined",
            "https://github.com/arms-mbon/data_workspace/tree/main/analysis_data/from_pema/processing_batch1"
        ],
        "description": "The Occurrence extension for the ITS data",
        "downloadURL": "https://data.arms-mbon.org/data_release_001/latest/#./ARMS_ITS_Occurrence.csv"
    },

Here the "@id" points to a file at the location "./ARMS_ITS_Occurrence.csv" so I assume that the actual data file is included in the payload of the ro-crate. Yet, it also has a downloadURL that points to another copy of the same file that is in the payload. This is all good.

In the metaGOflow data products ro-crate (to which this issue is attached) the entire ro-crate payload will consist of only the ro-crate-metadata.json manifest - no data files will be included. I therefore assumed that the "@id" field would be the URL to the data file and that, as such, there would be no need for a "downloadURL" field. For example:

{
    "@id": "<the URL to the datafile in github/S3 or wherever>",
    "name": "ENA accession for run raw sequence data",
    "description": "FAKE: Raw sequence data and laboratory sequence generation metadata",
    "encodingFormat": "text/xml"
},

Is the "downloadURL" field redundant in this case, so that it can be left out, or should it be included even when the URL is identical to the "@id"?
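
To make the two options concrete, here is a minimal sketch of a helper that builds such a stanza for a file held outside the payload and emits "downloadURL" only when it differs from the "@id". The function name, the "@type": "File" value and the example.org URL are assumptions for illustration; whether an identical "downloadURL" should be dropped at all is exactly the open question here, so this shows one possible convention rather than an agreed one.

```python
def remote_file_entity(url, name, description, encoding_format, download_url=None):
    """Build a data entity for a file that lives outside the crate payload."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("a file outside the payload needs an absolute URL as @id")
    entity = {
        "@id": url,
        "@type": "File",
        "name": name,
        "description": description,
        "encodingFormat": encoding_format,
    }
    # Only emit downloadURL when it adds information, i.e. differs from @id.
    if download_url and download_url != url:
        entity["downloadURL"] = download_url
    return entity


# Hypothetical URL, used only for illustration.
same = remote_file_entity(
    "https://example.org/data/run.xml",
    "ENA accession for run raw sequence data",
    "FAKE: Raw sequence data and laboratory sequence generation metadata",
    "text/xml",
    download_url="https://example.org/data/run.xml",
)
print("downloadURL" in same)  # False: it would only repeat the @id
```

If the decision goes the other way, removing the comparison in the last `if` is enough to always write the field.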

@kmexter
Contributor Author

kmexter commented Oct 17, 2024

  • creator: I would add the full institute name, not only the id
  • publisher: ditto. And no, the publisher is the one responsible for getting the data shared, so if people have questions about that (e.g. "why is this file not accessible?") they know who to contact. In fact, the email address for that should be that of the open science team, so the text to add is

"publisher": {
    "@id": "_:VLIZ"
},
where later
{
    "@id": "_:VLIZ",
    "@type": "Organization",
    "name": "Flanders Marine Institute",
    "url": "https://www.vliz.be/en",
    "label": "_:VLIZ",
    "email": "[email protected]"
},
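
A reference like this only works if the referenced entity is actually defined elsewhere in the "@graph". The sketch below, which assumes the identifier is the blank node "_:VLIZ" (as the "label" value suggests), lists blank-node references that nothing defines. The function name and the "_:EMOBON" identifier are hypothetical and used only for illustration.

```python
def dangling_references(graph):
    """Return blank-node @id references that no entity in the graph defines."""
    defined = {entity["@id"] for entity in graph if "@id" in entity}
    dangling = set()

    def visit(value, is_entity):
        if isinstance(value, dict):
            ref = value.get("@id")
            # A nested dict holding only "@id" is a reference, not a definition.
            if not is_entity and set(value) == {"@id"} and ref.startswith("_:"):
                if ref not in defined:
                    dangling.add(ref)
            for child in value.values():
                visit(child, False)
        elif isinstance(value, list):
            for child in value:
                visit(child, False)

    for entity in graph:
        visit(entity, True)
    return sorted(dangling)


# Trimmed example: the publisher is defined, the (hypothetical) creator is not.
graph = [
    {"@id": "./", "@type": "Dataset",
     "publisher": {"@id": "_:VLIZ"},
     "creator": {"@id": "_:EMOBON"}},
    {"@id": "_:VLIZ", "@type": "Organization",
     "name": "Flanders Marine Institute",
     "url": "https://www.vliz.be/en"},
]
print(dangling_references(graph))  # ['_:EMOBON']
```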

@kmexter
Contributor Author

kmexter commented Oct 17, 2024

  • wasAssociatedWith is to acknowledge people for their efforts (this ro-crate is in lieu of a metadata record), so if you want to acknowledge them, add them here by name; otherwise you don't have to bother (it is up to you).
  • contactPoint: we should discuss this in the next opco. If someone emails help@embrc about mgf, I can imagine it will take ages for that request to eventually get to you. at least it should be the emobon email address, not the embrc one

@kmexter
Contributor Author

kmexter commented Oct 17, 2024

having said that about publisher above, I now change my mind
We agreed via Tosca that it would be https://www.embrc.eu/emo-bon
@laurianvm can you advise on how this should be written - I mean, this is a project (not an organisation) and it has a parent (EMBRC, being an organisation) and it has an email address ([email protected])

@cymon
Contributor

cymon commented Oct 17, 2024

having said that about publisher above, I now change my mind We agreed via Tosca that it would be https://www.embrc.eu/emo-bon @laurianvm can you advise on how this should be written - I mean, this is a project (not an organisation) and it has a parent (EMBRC, being an organisation) and it has an email address ([email protected])

Can we have "publisher": ":EMBRC" and "creator": ":EMO BON" ?

We already have:
{
"@id": ":EMBRC",
"@type": "Organization",
"name": "European Marine Biological Resource Centre",
"url": "https://ror.org/0038zss60",
"contactPoint": {"@id": "mailto:[email protected]"}
},

We'd need a new "@type": for EMO BON.

@kmexter
Contributor Author

kmexter commented Oct 17, 2024

argh, so by the definition of publisher it is not EMBRC, as they are not publishing the data, but one can say that the EMO BON project is publishing the data via its data managers. So the publisher is EMO BON and the creator is EMO BON, but @laurianvm can we have an owner and funder that is EMBRC?
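
One way the arrangement described above might be written down is sketched here, purely as a starting point for that discussion. It assumes that schema.org's Project type and its parentOrganization and funder properties are acceptable in these crates, which has not been confirmed; the helper name and the "_:EMOBON" identifier are hypothetical, while the project URL and the EMBRC ROR ID are the ones already quoted in this thread.

```python
import json


def project_entity(project_id, name, url, parent_id):
    """Describe a project whose parent organisation is also its funder."""
    return {
        "@id": project_id,
        "@type": "Project",  # assumption: schema.org Project, a kind of Organization
        "name": name,
        "url": url,
        "parentOrganization": {"@id": parent_id},
        "funder": {"@id": parent_id},
    }


emo_bon = project_entity(
    "_:EMOBON",  # hypothetical identifier
    "EMO BON",
    "https://www.embrc.eu/emo-bon",
    "https://ror.org/0038zss60",  # EMBRC ROR ID used elsewhere in this thread
)

# The root dataset would then point at the project for both roles.
root_fragment = {
    "publisher": {"@id": emo_bon["@id"]},
    "creator": {"@id": emo_bon["@id"]},
}

print(json.dumps([root_fragment, emo_bon], indent=4))
```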

@cymon
Contributor

cymon commented Oct 17, 2024

  • wasAssociatedWith is to acknowledge people for their efforts (this ro-crate is in lieu of a metadata record), so if you want to acknowledge them, add them here by name; otherwise you don't have to bother (it is up to you).

I think it would be simpler to just acknowledge the EMO BON project, where the persons involved should be detailed. If people feel strongly that each ro-crate should acknowledge the set of individuals involved in the creation of the data, then the various roles that need to be acknowledged would have to be defined, and who was responsible for each role in each ro-crate would have to be recorded. Doable, but a bit of a faff.

Edit: I'm just going to assume no one feels strongly enough about this unless told otherwise.

* contactPoint: we should discuss this in the next opco. If someone emails help@embrc about mgf, I can imagine it will take ages for that request to eventually get to you. at least it should be the emobon email address, not the embrc one

An EMO BON email address would be better.

@kmexter
Contributor Author

kmexter commented Oct 18, 2024

use that then - [email protected]

@kmexter
Contributor Author

kmexter commented Oct 18, 2024

Is the "downloadURL" field redundant in this case, so that it can be left out, or should it be included even when the URL is identical to the "@id"?

This is indeed a question for @laurianvm and @marc-portier

@laurianvm

laurianvm commented Dec 4, 2024

These are notes Laurian took down as we went over https://github.com/emo-bon/metaGOflow-data-products-RO-crate-example/blob/main/emo-bon-ro-crate-repository/EMOBON_BPNS_So_17-ro-crate/ro-crate-metadata.json

When something specific is required of Cymon, we will tag him
