Add BCO format #3

Merged
merged 6 commits into from
Sep 28, 2023

Conversation

bentsherman
Member

Close #2

To test the plugin:

git clone git@github.com:nextflow-io/nf-prov.git -b bco
cd nf-prov
make install

# use nextflow.config in the nf-prov directory
nextflow run [...]

The plugin will generate a bco.json and ro-crate-metadata.json in the results directory. Check out BcoRenderer.groovy to see how these files are generated.

Signed-off-by: Ben Sherman <[email protected]>
@bentsherman bentsherman requested a review from ewels August 25, 2023 17:40
@samuell

samuell commented Aug 26, 2023

I found that validating against the JSON Schema for BioCompute Objects complains about some missing sections and fields.

Some of these fields and sections might not be so important, but adding stubs so that the validation runs through could help to check that there aren't any more subtle differences in the actual data.

What I tried:

  1. Get the schema definition:
git clone http://opensource.ieee.org/2791-object/ieee-2791-schema.git

2a. Run validation with kwalify:

sudo apt install kwalify
kwalify -f ieee-2791-schema/2791object.json bco.json

2b. Run validation with jsonschema from PyPI/conda:

conda install jsonschema

Create a file validate.py:

import json

from jsonschema import validate

# Load the BioCompute Object JSON and the schema
with open('ieee-2791-schema/2791object.json') as schema_file:
    schema = json.load(schema_file)

with open('bco.json') as bco_file:
    bco = json.load(bco_file)

# Validate; validate() raises a single exception, so only one problem is reported per run
try:
    validate(instance=bco, schema=schema)
    print("Validation successful.")
except Exception as e:
    print("Validation failed:", e)

Run it:

python validate.py |& tee validation.out | head

@samuell

samuell commented Aug 26, 2023

Validation for RO-Crate seems a bit thinner on the tooling side. But I tried this:

  1. Install some prerequisites not automatically installed with rocrateValidator:
pip install requests pytest rocrate
  2. Install the validator:
pip install rocrateValidator
  3. Create a Python file validate-rocrate.py:
from rocrateValidator import validate as validate

v = validate.validate("ro-crate-metadata.json")
v.validator()
  4. Run it:
python validate-rocrate.py |& tee validate-rocrate.out | head
  5. Get some output:
$ python validate-rocrate.py |& tee validate-rocrate.out | head
This is an INVALID RO-Crate
{
...snip...

@samuell

samuell commented Aug 26, 2023

Otherwise, slightly off-topic here, but as a way to verify the filepaths, I managed to parse the filepaths and steps of bco.json into a DAG (code here), so this seems to work great!

[image: DAG parsed from bco.json]
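
A minimal sketch of that idea, assuming the description_domain.pipeline_steps layout of the BCO example posted later in this thread: one step depends on another when one of its input_list URIs also appears in the other step's output_list. The function name bco_to_edges and the trimmed sample are hypothetical, and this is not the code linked above.

```python
def bco_to_edges(bco):
    """Return (producer, consumer) pairs of step descriptions linked by a shared URI."""
    steps = bco["description_domain"]["pipeline_steps"]

    # Map every output URI to the step that produced it.
    producer_of = {}
    for step in steps:
        for out in step.get("output_list", []):
            producer_of[out["uri"]] = step["description"]

    # A step depends on whichever step produced one of its inputs.
    edges = []
    for step in steps:
        for inp in step.get("input_list", []):
            producer = producer_of.get(inp["uri"])
            if producer is not None:
                edges.append((producer, step["description"]))
    return edges


# Trimmed sample with three of the steps from the example BCO.
sample = {
    "description_domain": {
        "pipeline_steps": [
            {
                "description": "RNASEQ:INDEX",
                "input_list": [],
                "output_list": [{"uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"}],
            },
            {
                "description": "RNASEQ:QUANT",
                "input_list": [{"uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"}],
                "output_list": [{"uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"}],
            },
            {
                "description": "MULTIQC",
                "input_list": [{"uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"}],
                "output_list": [{"uri": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html"}],
            },
        ]
    }
}

for src, dst in bco_to_edges(sample):
    print(f"{src} -> {dst}")
```

Inputs that no step produced (for example the GitHub URLs) simply yield no edge, so they show up as roots of the graph.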

One thing I noticed is that not much info about the steps themselves is included, such as the commands executed.

I see the execution_domain lists the main Nextflow script, so indeed, all this info will be referenced from there of course, but not included in the report.

I see in the BCO docs, though, that there perhaps isn't a great way to include that per step, so I gather it is the schema that is at fault here.

It is of course possible to export this info in a separate custom format based on the .nextflow cache, as discussed earlier, or to just parse the .command.sh files in the work folders before they are cleaned.

So I guess it is outside the scope of the BioCompute standard. Still feels a bit weird not to include such crucial info in a declarative provenance report, so I guess one would have to package in some other artifacts too, to have a fully reproducible research object.
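
As a rough sketch of the second option, assuming the usual work/<2 characters>/<rest of hash>/ task directory layout seen in the BCO paths in this thread, the .command.sh files could be collected like this. The function name collect_task_scripts is hypothetical, and the demo script content is a placeholder rather than a real pipeline command.

```python
import tempfile
from pathlib import Path


def collect_task_scripts(work_dir):
    """Map each task directory (relative to work_dir) to the text of its .command.sh."""
    work_dir = Path(work_dir)
    scripts = {}
    # Task directories sit two levels below the work directory.
    for script in sorted(work_dir.glob("*/*/.command.sh")):
        task_dir = script.parent.relative_to(work_dir).as_posix()
        scripts[task_dir] = script.read_text()
    return scripts


# Demo on a throwaway directory that mimics one task folder.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "64" / "1b807d0f3fdb87ca247e807f6e013e"
    task.mkdir(parents=True)
    (task / ".command.sh").write_text("#!/bin/bash\necho placeholder\n")
    demo = collect_task_scripts(tmp)

print(demo)
```

This has to run before the work directory is cleaned, since the scripts only exist there.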

@bentsherman
Member Author

Sorry about those initial errors, think I made some last minute changes without testing. The BCO format produces two JSON files, one for the BCO and one for the RO-Crate which references the BCO. So the output path needs to be a directory that holds both files. I guess the overwrite check should just be more lenient.

I based the PR on this tutorial which manually creates a BCO for a nf-core/chipseq run. So there might be some things missing against the current standard. But, the standard itself is fuzzy in some aspects, so we should figure out how far we want to go to meet the standard.

@samuell

samuell commented Aug 29, 2023

> Sorry about those initial errors, think I made some last minute changes without testing. The BCO format produces two JSON files, one for the BCO and one for the RO-Crate which references the BCO. So the output path needs to be a directory that holds both files. I guess the overwrite check should just be more lenient.
>
> I based the PR on this tutorial which manually creates a BCO for a nf-core/chipseq run. So there might be some things missing against the current standard. But, the standard itself is fuzzy in some aspects, so we should figure out how far we want to go to meet the standard.

Yea, I also felt it is not exactly clear where to draw the line about meeting the standard.

It's a bummer that the validation tools do a full stop on the first "error". It would have been useful to use them to spot any divergences in the actual output...
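
One way around the full stop is to collect every problem before reporting. The jsonschema package's validator classes offer iter_errors for this purpose; the sketch below stays with the standard library and only checks top-level keys, so it is a stand-in for full schema validation, not a replacement. The REQUIRED_TOP_LEVEL list is an assumption about what the IEEE 2791 schema requires and should be checked against 2791object.json; the function name is hypothetical.

```python
# Assumption: top-level keys that the IEEE 2791 schema marks as required.
# Check this list against 2791object.json before relying on it.
REQUIRED_TOP_LEVEL = (
    "object_id",
    "spec_version",
    "etag",
    "provenance_domain",
    "usability_domain",
    "description_domain",
    "execution_domain",
    "io_domain",
)


def missing_top_level(bco, required=REQUIRED_TOP_LEVEL):
    """Return every required key absent from the BCO, not just the first one."""
    return [key for key in required if key not in bco]


# A deliberately incomplete object to show that all gaps are listed at once.
partial = {
    "object_id": "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858",
    "spec_version": "https://w3id.org/ieee/ieee-2791-schema/2791object.json",
    "provenance_domain": {},
    "description_domain": {},
}

for key in missing_top_level(partial):
    print("missing:", key)
```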

@ewels
Member

ewels commented Aug 29, 2023

Which tools are they? We could probably ask for help / contribute upstream to change things like that if it'd be helpful for us.. (especially the RO Crate ones).

@samuell

samuell commented Aug 29, 2023

> Which tools are they? We could probably ask for help / contribute upstream to change things like that if it'd be helpful for us.. (especially the RO Crate ones).

Very good point! The ones I have used were:

@bentsherman
Member Author

I made a few fixes, so now the config file should work as is.

For now I have a catch-all config setting prov.metadata which we can use to insert any metadata not already covered by the manifest scope. We can build out this scope with whatever extra settings we think are important, and ideally even incorporate it into the manifest scope.

@ewels
Member

ewels commented Aug 29, 2023

Related: nextflow-io/nextflow#4034

@stain

stain commented Sep 5, 2023

Glad to see this both for BCO and RO-Crate side!

The ro-crate-validator may be a bit too opinionated for particular use cases, so don't assume the RO-Crate is invalid even if it says so. A more modular validator is planned, but it has only been tested for the workflow run crate profile.

For reference:

@bentsherman
Member Author

Hi @stain, thanks for the feedback. I think Phil is going to meet with you and some others tomorrow. We would love to get more feedback from people who are more familiar with these standards, to see if there is anything we can improve. My main concerns are:

  1. Listing the tasks (steps) and input/output files. Nextflow tasks produce these files in a work directory during execution, then "publishes" the outputs to their final location at the end. So should the provenance report only use the publish paths? The work directory paths are temporary, but on the other hand, they also define the links between tasks.

  2. There seem to be a bunch of optional metadata fields about the pipeline, contributors, etc., much of which is not known to Nextflow. So I'm wondering how far we should go to provide this extra metadata, which parts are more important or more commonly used than others, etc.

@ewels
Member

ewels commented Sep 11, 2023

Copying in some notes from the recent WorkflowHub meeting. Full notes are here.

  • Should we put intermediate files that don’t exist into the RO-Crate, or should it only list the published files?
    • Could put intermediates
    • Semantic details are important: for example, imagine a workflow which consumes URLs (remote resource) - URL here is key
    • Depends on workflow system how much filename matters
  • Specific use case for nextflow?
  • There has to be evidence of an intermediate file, but not necessarily the file itself - if there is a guarantee that a file can be accessed, then a different approach is needed (e.g. health)
  • https://www.researchobject.org/workflow-run-crate/
  • https://www.researchobject.org/ro-crate/

So, I think I read the answer to this:

> So should the provenance report only use the publish paths?

As "no, report on workdir paths instead, even if they're temporary. Published paths would be a bonus."

Maybe @stain can correct me on this if I misinterpreted. I didn't find it super clear.

@simleo

simleo commented Sep 20, 2023

It would be great if the RO-Crate generated by the plugin conformed to a Workflow Run RO-Crate profile (https://www.researchobject.org/workflow-run-crate/profiles/, as linked by @stain above). My understanding is that the plugin has access to individual step executions, so the crate could be made to conform to the Provenance Run Crate profile, which is the most detailed.

A while ago I manually generated a Provenance Run Crate for an execution of the test.nf workflow:

https://github.com/ResearchObject/workflow-run-crate/tree/86e5d481a4857b997b7b019b92e354c99c957135/docs/examples/draft/nf-prov-test-run-1

It was based on the manifest.json generated by the plugin, which is also included in the crate. Some things are a bit forced, e.g. the FormalParameter @ids (I used a pattern based on CWLProv), but hopefully it can serve as an example.

Note that RO-Crate stores the data together with the metadata and thus uses paths relative to the crate root directory, i.e., the directory that hosts ro-crate-metadata.json. Intermediate files, when present, should also be included in the crate, but what matters from the RO-Crate metadata perspective is the path relative to the RO-Crate root (the plugin could copy intermediate files to the RO-Crate directory).
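
The relative-path rule could be sketched as follows: a file is referenced by its path relative to the directory that hosts ro-crate-metadata.json, and anything outside that directory has no such path. The function name crate_relative and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def crate_relative(path, crate_root):
    """Return `path` relative to `crate_root`, or None if it lies outside the crate."""
    try:
        return PurePosixPath(path).relative_to(PurePosixPath(crate_root)).as_posix()
    except ValueError:
        # relative_to() raises ValueError when path is not under crate_root.
        return None


crate_root = "/data/run1/crate"  # hypothetical directory holding ro-crate-metadata.json

print(crate_relative("/data/run1/crate/results/multiqc_report.html", crate_root))
print(crate_relative("/data/run1/work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html", crate_root))
```

A None result marks a file that would first have to be copied into the crate directory, as suggested above for intermediate files.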

@simleo

simleo commented Sep 20, 2023

We have a working group for Workflow Run RO-Crate that meets every two weeks. I think it would be great if you guys joined the group; instructions are here:

ResearchObject/workflow-run-crate#1

@bentsherman
Member Author

Thanks everyone for the feedback. I see that the tutorial I originally used as a reference was creating some generic RO crate, but now there is the Workflow Run Crate standard, which looks similar to BCO in its substance, but perhaps more extensible because it is an RO crate.

I decided to remove the minimal RO crate from this PR and just render the BCO manifest. We can add the WRROC as a separate format in a separate PR, and also add the ability to render multiple formats for a single run so that they can be composed as needed.

As for the BCO format, the main thing left to do for this PR is to make the workflow inputs point to a URL instead of a local path (e.g. ${NXF_HOME}/.assets/nextflow-io/rnaseq-nf/multiqc -> https://github.com/nextflow-io/rnaseq-nf/tree/master/multiqc). There are several improvements that can still be made, but I will create separate issues for them instead.
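
The path-to-URL mapping described above might look roughly like this in Python. It is a sketch of the idea only, not the plugin's Groovy implementation: the function name asset_to_url and the assets directory are hypothetical, and it assumes the asset path has the form <assets dir>/<owner>/<repo>/<path in repo>.

```python
from pathlib import PurePosixPath


def asset_to_url(local_path, assets_dir, revision):
    """Rewrite <assets_dir>/<owner>/<repo>/<path> as a GitHub tree URL at `revision`.

    Raises ValueError if local_path is not under assets_dir or lacks owner/repo.
    """
    rel = PurePosixPath(local_path).relative_to(PurePosixPath(assets_dir))
    owner, repo, *rest = rel.parts
    url = f"https://github.com/{owner}/{repo}/tree/{revision}"
    if rest:
        url += "/" + "/".join(rest)
    return url


assets_dir = "/tmp/nxf-assets"  # hypothetical location of the pulled pipelines
print(asset_to_url(assets_dir + "/nextflow-io/rnaseq-nf/multiqc", assets_dir, "master"))
```

Passing a commit hash instead of a branch name as the revision pins the link to the exact version that was run.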

Our primary interest is in the BCO format because (as I understand it) the FDA recently adopted it as the standard for research artifacts. Does anyone know if the RO crate standard is a part of this in any way? That will help us prioritize our efforts.

@samuell Thanks for your suggestions and all the testing you did! I will try to incorporate your scripts into this project later on.

@bentsherman
Member Author

Updated BCO example with validation errors fixed:

{
    "object_id": "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858",
    "spec_version": "https://w3id.org/ieee/ieee-2791-schema/2791object.json",
    "etag": "364e510a9602ae31fc0ed6feba5ddd01",
    "provenance_domain": {
        "name": "",
        "version": "",
        "created": "2023-09-27T21:28:13.821355019-05:00",
        "modified": "2023-09-27T21:28:13.821355019-05:00",
        "contributors": [
            {
                "contribution": [
                    "authoredBy"
                ],
                "name": "Paolo Di Tommaso"
            }
        ],
        "license": ""
    },
    "usability_domain": [
        
    ],
    "extension_domain": [
        {
            "extension_schema": "https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json",
            "scm_extension": {
                "scm_repository": "https://github.com/nextflow-io/rnaseq-nf",
                "scm_type": "git",
                "scm_commit": "d910312506c6539365ed70aacda5068dea9152dd",
                "scm_path": "main.nf",
                "scm_preview": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/main.nf"
            }
        }
    ],
    "description_domain": {
        "keywords": [
            
        ],
        "platform": [
            "Nextflow"
        ],
        "pipeline_steps": [
            {
                "step_number": 1,
                "name": "641b807d0f3fdb87ca247e807f6e013e",
                "description": "RNASEQ:INDEX (ggal_1_48850000_49020000)",
                "input_list": [
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
                    }
                ],
                "output_list": [
                    {
                        "uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"
                    }
                ]
            },
            {
                "step_number": 2,
                "name": "b0fde0a381b3abf254cba203158d78a5",
                "description": "RNASEQ:FASTQC (FASTQC on ggal_gut)",
                "input_list": [
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
                    },
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
                    }
                ],
                "output_list": [
                    {
                        "uri": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs"
                    }
                ]
            },
            {
                "step_number": 3,
                "name": "7a7e087d9ec32fc6b104c072ef42ee14",
                "description": "RNASEQ:QUANT (ggal_gut)",
                "input_list": [
                    {
                        "uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"
                    },
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
                    },
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
                    }
                ],
                "output_list": [
                    {
                        "uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"
                    }
                ]
            },
            {
                "step_number": 4,
                "name": "8ec9b607fc6e5620c5437845fcf92fe2",
                "description": "MULTIQC",
                "input_list": [
                    {
                        "uri": "work/7a/7e087d9ec32fc6b104c072ef42ee14/ggal_gut"
                    },
                    {
                        "uri": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs"
                    },
                    {
                        "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc"
                    }
                ],
                "output_list": [
                    {
                        "uri": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html"
                    }
                ]
            }
        ]
    },
    "execution_domain": {
        "script": [
            "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/main.nf"
        ],
        "script_driver": "nextflow",
        "software_prerequisites": [
            {
                "name": "Nextflow",
                "version": "23.09.2-edge",
                "uri": {
                    "uri": "https://github.com/nextflow-io/nextflow/releases/tag/v23.09.2-edge"
                }
            }
        ],
        "external_data_endpoints": [
            
        ],
        "environment_variables": {
            
        }
    },
    "parametric_domain": [
        {
            "param": "outdir",
            "value": "results",
            "step": "0"
        },
        {
            "param": "reads",
            "value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_{1,2}.fq",
            "step": "0"
        },
        {
            "param": "transcriptome",
            "value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa",
            "step": "0"
        },
        {
            "param": "multiqc",
            "value": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc",
            "step": "0"
        }
    ],
    "io_domain": {
        "input_subdomain": [
            {
                "uri": {
                    "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_1.fq"
                }
            },
            {
                "uri": {
                    "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_gut_2.fq"
                }
            },
            {
                "uri": {
                    "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
                }
            },
            {
                "uri": {
                    "uri": "https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc"
                }
            }
        ],
        "output_subdomain": [
            {
                "mediatype": "",
                "uri": {
                    "filename": "work/b0/fde0a381b3abf254cba203158d78a5/fastqc_ggal_gut_logs",
                    "uri": "results/fastqc_ggal_gut_logs"
                }
            },
            {
                "mediatype": "text/html",
                "uri": {
                    "filename": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html",
                    "uri": "results/multiqc_report.html"
                }
            }
        ]
    },
    "error_domain": {
        "empirical_error": {
            
        },
        "algorithmic_error": {
            
        }
    }
}
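
In this example the step outputs use work directory paths, while io_domain.output_subdomain pairs each work path (filename) with its published path (uri). A small sketch, with hypothetical function names and a trimmed sample, of how a reader of the file could tell published outputs from intermediates:

```python
def published_paths(bco):
    """Map work-directory paths to published paths, taken from output_subdomain."""
    mapping = {}
    for entry in bco["io_domain"]["output_subdomain"]:
        uri = entry["uri"]
        mapping[uri["filename"]] = uri["uri"]
    return mapping


def intermediate_outputs(bco):
    """List step outputs that never appear in output_subdomain."""
    published = published_paths(bco)
    found = []
    for step in bco["description_domain"]["pipeline_steps"]:
        for out in step.get("output_list", []):
            if out["uri"] not in published:
                found.append(out["uri"])
    return found


# Trimmed sample using two of the outputs from the example above.
sample = {
    "description_domain": {
        "pipeline_steps": [
            {"output_list": [{"uri": "work/64/1b807d0f3fdb87ca247e807f6e013e/index"}]},
            {"output_list": [{"uri": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html"}]},
        ]
    },
    "io_domain": {
        "output_subdomain": [
            {
                "mediatype": "text/html",
                "uri": {
                    "filename": "work/8e/c9b607fc6e5620c5437845fcf92fe2/multiqc_report.html",
                    "uri": "results/multiqc_report.html",
                },
            }
        ]
    },
}

print(published_paths(sample))
print(intermediate_outputs(sample))
```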

@HadleyKing

> Which tools are they? We could probably ask for help / contribute upstream to change things like that if it'd be helpful for us.. (especially the RO Crate ones).
>
> Very good point! The ones I have used were:

We also have this API endpoint for BCO validation: https://biocomputeobject.org/api/docs/#/BCO%20Management/api_objects_validate_create

@ewels
Member

ewels commented Jan 9, 2024

ok nice, thanks @HadleyKing!

Doing a quick and dirty test by copying the above example into the Swagger interface I get the following response:

{
  "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858": {
    "number_of_errors": 0,
    "error_detail": [
      "BCO Valid"
    ],
    "https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json": {
      "number_of_errors": 0,
      "error_detail": [
        "Extension Valid"
      ]
    }
  }
}

So - I think that means that we're looking good..!
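
For an automated check, a response of this shape could be reduced to one number. The sketch below assumes only the structure shown above (nested objects carrying number_of_errors) and says nothing about how to call the endpoint; the function name total_errors is hypothetical.

```python
def total_errors(report):
    """Sum every `number_of_errors` value found anywhere in the response."""
    total = 0
    if isinstance(report, dict):
        for key, value in report.items():
            if key == "number_of_errors" and isinstance(value, int):
                total += value
            else:
                total += total_errors(value)
    elif isinstance(report, list):
        for item in report:
            total += total_errors(item)
    return total


# The response quoted above, as a Python dict.
response = {
    "urn:uuid:196ffc0a-7fb2-4c5c-8752-d9081ac40858": {
        "number_of_errors": 0,
        "error_detail": ["BCO Valid"],
        "https://w3id.org/biocompute/extension_domain/1.1.0/scm/scm_extension.json": {
            "number_of_errors": 0,
            "error_detail": ["Extension Valid"],
        },
    }
}

print("valid" if total_errors(response) == 0 else "invalid")
```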

Successfully merging this pull request may close these issues.

Biocompute Objects / RO Crates
6 participants