Add BCO format #3
Signed-off-by: Ben Sherman <[email protected]>
This comment was marked as resolved.
I found that running validation against the JSON Schema for BioCompute Objects complains about some missing sections and fields. Some of these fields and sections might not be so important, but adding stubs so that the validation runs through would help to check that there aren't any more subtle differences in the actual data. What I tried:

1. Clone the schema:

    git clone http://opensource.ieee.org/2791-object/ieee-2791-schema.git

2a. Run validation with kwalify:

2b. Run validation with jsonschema:

    conda install jsonschema

Create a file (run as validate.py below):

    import json

    from jsonschema import validate

    # Load BioCompute Object JSON and schema
    with open('ieee-2791-schema/2791object.json') as schema_file:
        schema = json.load(schema_file)
    with open('bco.json') as bco_file:
        bco = json.load(bco_file)

    # Validate (note: this reports only one error per run)
    try:
        validate(instance=bco, schema=schema)
        print("Validation successful.")
    except Exception as e:
        print("Validation failed:", e)

Run it:

    python validate.py |& tee validation.out | head
Validation for RO-Crate seems a bit thinner on the tooling side. But I tried this:

    pip install requests pytest rocrate
    pip install rocrateValidator

Create a file (run as validate-rocrate.py below):

    from rocrateValidator import validate as validate
    v = validate.validate("ro-crate-metadata.json")
    v.validator()

Run it:

    python validate-rocrate.py |& tee validate-rocrate.out | head
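Since dedicated tooling is thin, a few structural basics can also be checked by hand before reaching for a full validator. The sketch below is a minimal, non-authoritative check based on my reading of the RO-Crate 1.1 conventions (a `@graph` containing a metadata descriptor with `@id` `ro-crate-metadata.json` whose `about` points at the root dataset). The function name `basic_crate_problems` and the sample crate are made up for illustration, and an empty result says nothing about full conformance.

```python
import json


def basic_crate_problems(metadata: dict) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    if "@context" not in metadata:
        problems.append("missing @context")
    graph = metadata.get("@graph")
    if not isinstance(graph, list):
        problems.append("missing @graph list")
        return problems

    # Index entities by @id so references can be followed.
    entities = {e.get("@id"): e for e in graph if isinstance(e, dict)}
    descriptor = entities.get("ro-crate-metadata.json")
    if descriptor is None:
        problems.append("missing metadata descriptor 'ro-crate-metadata.json'")
        return problems

    about = descriptor.get("about")
    root_id = about.get("@id") if isinstance(about, dict) else None
    if root_id is None:
        problems.append("metadata descriptor has no 'about' reference")
    elif root_id not in entities:
        problems.append(f"root entity {root_id!r} is not in @graph")
    return problems


# Hypothetical minimal crate, inlined so the sketch runs on its own.
# For a real file: metadata = json.load(open("ro-crate-metadata.json"))
sample = json.loads("""
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset"}
  ]
}
""")
print(basic_crate_problems(sample))
```

A check like this only catches gross structural mistakes; profile-specific requirements still need a proper validator.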
Otherwise, slightly off-topic here, but as a way to verify the filepaths, I managed to parse the filepaths and steps of The thing I noticed, in relation to is that not much info about the steps themselves is included, such as the commands executed. I see the I see in the BCO docs though that there isn't perhaps a great way to include that per step, so I gather it is the schema that is at fault here. It is of course possible to export this info in a separate custom format based on the So I guess it is outside the scope of the BioCompute standard. It still feels a bit weird not to include such crucial info in a declarative provenance report, so I guess one would have to package some other artifacts too, to have a fully reproducible research object.
Sorry about those initial errors; I think I made some last-minute changes without testing. The BCO format produces two JSON files, one for the BCO and one for the RO-Crate, which references the BCO. So the output path needs to be a directory that holds both files. I guess the overwrite check should just be more lenient. I based the PR on this tutorial, which manually creates a BCO for an nf-core/chipseq run. So there might be some things missing against the current standard. But the standard itself is fuzzy in some aspects, so we should figure out how far we want to go to meet it.
Yea, I also felt it is not exactly clear where to draw the line about meeting the standard. It's a bummer that the validation tools do a full stop on the first "error". It would have been useful to use them to spot any divergences in the actual output...
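On the full stop at the first error: for the `jsonschema` route at least, this comes from the `validate()` helper, which raises a single exception. The package also exposes `iter_errors()` on its validator classes, which yields every violation. A minimal sketch, assuming `jsonschema` is installed; the tiny `schema` and `bco` below are made-up stand-ins for `2791object.json` and `bco.json`, and a real run against the IEEE schema may additionally need its `$ref`s to be resolvable.

```python
from jsonschema.validators import validator_for


def collect_errors(instance, schema):
    """Return one message per schema violation instead of raising on the first."""
    validator_cls = validator_for(schema)  # picks the draft from "$schema"
    validator_cls.check_schema(schema)     # fail early if the schema itself is broken
    validator = validator_cls(schema)
    return [
        f"{'/'.join(str(p) for p in error.absolute_path) or '<root>'}: {error.message}"
        for error in validator.iter_errors(instance)
    ]


# Stand-in schema and instance (NOT the real IEEE 2791 schema).
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["object_id", "etag", "provenance_domain"],
    "properties": {
        "provenance_domain": {
            "type": "object",
            "required": ["name", "version"],
        }
    },
}
bco = {
    "object_id": "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858",
    "provenance_domain": {"name": ""},
}

for message in collect_errors(bco, schema):
    print(message)
```

With the stand-in data this reports both the missing top-level `etag` and the missing `version` inside `provenance_domain` in one pass.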
Which tools are they? We could probably ask for help / contribute upstream to change things like that if it'd be helpful for us.. (especially the RO Crate ones). |
Very good point! The ones I have used were:
Signed-off-by: Ben Sherman <[email protected]>
I made a few fixes, so now the config file should work as is. For now I have a catch-all config setting |
Related: nextflow-io/nextflow#4034 |
Glad to see this on both the BCO and the RO-Crate side! The ro-crate-validator may be a bit too opinionated for particular use cases, so don't assume the RO-Crate is invalid even if it says so. Work on a more modular validator is being planned, but so far it has only been tested for the Workflow Run Crate profile. For reference:
Hi @stain, thanks for the feedback. I think Phil is going to meet with you and some others tomorrow. We would love to get more feedback from people who are more familiar with these standards, to see if there is anything we can improve. My main concerns are:
Copying in some notes from the recent WorkflowHub meeting. Full notes are here.
So, I think I read the answer to this:
As "no, report on workdir paths instead, even if they're temporary. Published paths would be a bonus." Maybe @stain can correct me on this if I misinterpreted. I didn't find it super clear. |
It would be great if the RO-Crate generated by the plugin conformed to a Workflow Run RO-Crate profile (https://www.researchobject.org/workflow-run-crate/profiles/, as linked by @stain above). My understanding is that the plugin has access to individual step executions, so the crate could be made to conform to the Provenance Run Crate profile, which is the most detailed. A while ago I manually generated a Provenance Run Crate for an execution of the It was based on the Note that RO-Crate stores the data together with the metadata and thus uses paths relative to the crate root directory, i.e., the directory that hosts |
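To make the point about crate-relative paths concrete, here is a small sketch of turning absolute file paths into paths relative to the crate root, as they would appear as `@id`s in the metadata. The helper name `crate_relative` and the example directories are invented for illustration; it only uses `pathlib` and does not touch the filesystem.

```python
from pathlib import PurePosixPath


def crate_relative(path: str, crate_root: str) -> str:
    """Express `path` relative to `crate_root`, or raise if it lies outside the crate."""
    try:
        return str(PurePosixPath(path).relative_to(PurePosixPath(crate_root)))
    except ValueError:
        raise ValueError(f"{path} is outside the crate root {crate_root}") from None


# Hypothetical directory that holds ro-crate-metadata.json.
crate_root = "/data/run1/results"
print(crate_relative("/data/run1/results/multiqc_report.html", crate_root))
print(crate_relative("/data/run1/results/fastqc_ggal_gut_logs", crate_root))
```

A file that lives under the work directory rather than under the crate root raises here, which is the practical consequence of storing data together with the metadata: such files would have to be copied into the crate or referenced some other way.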
We have a working group for Workflow Run RO-Crate that meets every two weeks. I think it would be great if you guys joined the group, instructions are here: |
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Thanks everyone for the feedback. I see that the tutorial I originally used as a reference was creating some generic RO-Crate, but now there is the Workflow Run Crate standard, which looks similar to BCO in its substance, but is perhaps more extensible because it is an RO-Crate.

I decided to remove the minimal RO-Crate from this PR and just render the BCO manifest. We can add the WRROC as a separate format in a separate PR, and also add the ability to render multiple formats for a single run so that they can be composed as needed.

As for the BCO format, the main thing left to do for this PR is to make the workflow inputs point to a URL instead of a local path (e.g.

Our primary interest is in the BCO format because (as I understand it) the FDA recently adopted it as the standard for research artifacts. Does anyone know if the RO-Crate standard is a part of this in any way? That will help us prioritize our efforts.

@samuell Thanks for your suggestions and all the testing you did! I will try to incorporate your scripts into this project later on.
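On making the workflow inputs point to a URL: one option is to rewrite paths inside the project directory into links pinned to the commit, in the same shape as the `scm_preview` value in the BCO example further down. A rough sketch of that mapping; `to_repo_url`, the `/home/user/rnaseq-nf` checkout location, and the choice to leave outside paths unchanged are assumptions for illustration, not a description of how the plugin does it.

```python
from pathlib import PurePosixPath


def to_repo_url(local_path: str, project_dir: str, repository: str, commit: str) -> str:
    """Map a file inside the project checkout to a commit-pinned repository URL.

    Paths outside `project_dir` are returned unchanged.
    """
    try:
        relative = PurePosixPath(local_path).relative_to(PurePosixPath(project_dir))
    except ValueError:
        return local_path
    return f"{repository.rstrip('/')}/tree/{commit}/{relative}"


url = to_repo_url(
    "/home/user/rnaseq-nf/data/ggal/ggal_gut_1.fq",  # hypothetical checkout location
    "/home/user/rnaseq-nf",
    "https://github.com/nextflow-io/rnaseq-nf",
    "d910312506c6539365ed70aacda5068dea9152dd",
)
print(url)
```

Pinning to the commit rather than a branch keeps the link stable if the repository changes later.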
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Updated BCO example with validation errors fixed:

{
  "object_id": "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858",
  "spec_version": "https://w3id.org/ieee/ieee-2791-schema/2791object.json",
  "etag": "364e510a9602ae31fc0ed6feba5ddd01",
  "provenance_domain": {
    "name": "",
    "version": "",
    "created": "2023-09-27T21:28:13.821355019-05:00",
    "modified": "2023-09-27T21:28:13.821355019-05:00",
    "contributors": [
      {
        "contribution": [
          "authoredBy"
        ],
        "name": "Paolo Di Tommaso"
      }
    ],
    "license": ""
  },
  "usability_domain": [],
  "extension_domain": [
    {
      "extension_schema": "https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json",
      "scm_extension": {
        "scm_repository": "https://github.com/nextflow-io/rnaseq-nf",
        "scm_type": "git",
        "scm_commit": "d910312506c6539365ed70aacda5068dea9152dd",
        "scm_path": "main.nf",
        "scm_preview": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/main.nf"
      }
    }
  ],
  "description_domain": {
    "keywords": [],
    "platform": [
      "Nextflow"
    ],
    "pipeline_steps": [
      {
        "step_number": 1,
        "name": "641b807d0f3fdb87ca247e807f6e013e",
        "description": "RNASEQ:INDEX (ggal_1_48850000_49020000)",
        "input_list": [
          {
            "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
          }
        ],
        "output_list": [
          {
            "uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"
          }
        ]
      },
      {
        "step_number": 2,
        "name": "b0fde0a381b3abf254cba203158d78a5",
        "description": "RNASEQ:FASTQC (FASTQC on ggal_gut)",
        "input_list": [
          {
            "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
          },
          {
            "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
          }
        ],
        "output_list": [
          {
            "uri": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs"
          }
        ]
      },
      {
        "step_number": 3,
        "name": "7a7e087d9ec32fc6b104c072ef42ee14",
        "description": "RNASEQ:QUANT (ggal_gut)",
        "input_list": [
          {
            "uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"
          },
          {
            "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
          },
          {
            "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
          }
        ],
        "output_list": [
          {
            "uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"
          }
        ]
      },
      {
        "step_number": 4,
        "name": "8ec9b607fc6e5620c5437845fcf92fe2",
        "description": "MULTIQC",
        "input_list": [
          {
            "uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"
          },
          {
            "uri": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs"
          },
          {
            "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc"
          }
        ],
        "output_list": [
          {
            "uri": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html"
          }
        ]
      }
    ]
  },
  "execution_domain": {
    "script": [
      "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/main.nf"
    ],
    "script_driver": "nextflow",
    "software_prerequisites": [
      {
        "name": "Nextflow",
        "version": "23.09.2-edge",
        "uri": {
          "uri": "https://github.com/nextflow-io/nextflow/releases/tag/v23.09.2-edge"
        }
      }
    ],
    "external_data_endpoints": [],
    "environment_variables": {}
  },
  "parametric_domain": [
    {
      "param": "outdir",
      "value": "results",
      "step": "0"
    },
    {
      "param": "reads",
      "value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_{1,2}.fq",
      "step": "0"
    },
    {
      "param": "transcriptome",
      "value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa",
      "step": "0"
    },
    {
      "param": "multiqc",
      "value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc",
      "step": "0"
    }
  ],
  "io_domain": {
    "input_subdomain": [
      {
        "uri": {
          "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
        }
      },
      {
        "uri": {
          "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
        }
      },
      {
        "uri": {
          "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
        }
      },
      {
        "uri": {
          "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc"
        }
      }
    ],
    "output_subdomain": [
      {
        "mediatype": "",
        "uri": {
          "filename": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs",
          "uri": "results/fastqc_ggal_gut_logs"
        }
      },
      {
        "mediatype": "text/html",
        "uri": {
          "filename": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html",
          "uri": "results/multiqc_report.html"
        }
      }
    ]
  },
  "error_domain": {
    "empirical_error": {},
    "algorithmic_error": {}
  }
}
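One cheap sanity check on a manifest like this is that every `work/...` input of a step was produced by an earlier step. A minimal sketch over the `pipeline_steps` structure shown above; the function name `dangling_inputs` is made up, and treating anything that starts with `work/` as an intermediate file is my assumption. The inlined steps are a trimmed copy of the example with step 2 deliberately left out, so that the check has something to report.

```python
def dangling_inputs(pipeline_steps: list[dict]) -> list[tuple[int, str]]:
    """Return (step_number, uri) for work-dir inputs that no earlier step produced."""
    produced = set()
    dangling = []
    for step in sorted(pipeline_steps, key=lambda s: s["step_number"]):
        for item in step.get("input_list", []):
            uri = item["uri"]
            if uri.startswith("work/") and uri not in produced:
                dangling.append((step["step_number"], uri))
        # Outputs only become available to later steps.
        produced.update(item["uri"] for item in step.get("output_list", []))
    return dangling


# Trimmed copy of the steps above: external input URLs omitted, step 2 removed.
steps = [
    {"step_number": 1, "input_list": [],
     "output_list": [{"uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"}]},
    {"step_number": 3,
     "input_list": [{"uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"}],
     "output_list": [{"uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"}]},
    {"step_number": 4,
     "input_list": [{"uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"},
                    {"uri": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs"}],
     "output_list": [{"uri": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html"}]},
]
print(dangling_inputs(steps))
```

Run against the full `description_domain.pipeline_steps` list from a real `bco.json`, an empty result would mean every intermediate input can be traced back to a producing step.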
We also have this API endpoint for BCO validation: https://biocomputeobject.org/api/docs/#/BCO%20Management/api_objects_validate_create |
ok nice, thanks @HadleyKing! Doing a quick and dirty test by copying the above example into the Swagger interface I get the following response:

{
  "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858": {
    "number_of_errors": 0,
    "error_detail": [
      "BCO Valid"
    ],
    "https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json": {
      "number_of_errors": 0,
      "error_detail": [
        "Extension Valid"
      ]
    }
  }
}

So - I think that means that we're looking good..!
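To check that kind of response in a script rather than by eye, the nested `number_of_errors` counts can be summed. A small sketch based only on the response shape shown above (I have not checked it against the API's documented response format); `total_errors` is a made-up helper.

```python
import json


def total_errors(node) -> int:
    """Recursively sum every "number_of_errors" value in a nested response."""
    if not isinstance(node, dict):
        return 0
    count = node.get("number_of_errors", 0)
    return count + sum(total_errors(child) for child in node.values())


# The response from above, inlined so the sketch runs on its own.
response = json.loads("""
{
  "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858": {
    "number_of_errors": 0,
    "error_detail": ["BCO Valid"],
    "https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json": {
      "number_of_errors": 0,
      "error_detail": ["Extension Valid"]
    }
  }
}
""")
print("valid" if total_errors(response) == 0 else "invalid")
```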
Close #2
To test the plugin:
The plugin will generate a bco.json and ro-crate-metadata.json in the results directory. Check out BcoRenderer.groovy to see how these files are generated.