Standardized bioinformatic pipeline #6

Open
SSuominen1 opened this issue Mar 18, 2021 · 5 comments

Comments

@SSuominen1
Contributor

What is the best way to register the bioinformatic tools/pipelines that were used?

I understand there are some developments on this in Ocean Best Practices; we should look into that.

Through the PacMAN project, OBIS will also be developing a pipeline, or researching how output from existing pipelines should be formatted for DwC-A. Is there a need for this from other users?

@cpavloud

Could we use the term "identificationRemarks" to specify the pipeline used (along with all its relevant, user-selected parameters, separated by space, vertical bar, space ( | )) and the "identificationReferences" term for the reference/citation/URL of the pipeline?
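To make the proposal concrete, here is a minimal sketch of how such a value could be assembled. The pipeline name, version, and parameter names are hypothetical placeholders, and the `name=value` layout is only an assumption; nothing here is prescribed by Darwin Core beyond the ` | ` separator mentioned above.

```python
def build_identification_remarks(pipeline, version, params):
    """Join the pipeline name/version and its parameters with ' | '."""
    parts = [f"{pipeline} {version}"]
    # One "name=value" entry per user-selected parameter (assumed layout).
    parts.extend(f"{name}={value}" for name, value in params.items())
    return " | ".join(parts)


# Hypothetical pipeline and parameters, for illustration only.
remarks = build_identification_remarks(
    "ExamplePipeline", "1.0", {"min_length": 100, "max_expected_errors": 2}
)
print(remarks)
```

The resulting string would go into identificationRemarks, while the citation or URL of the pipeline would go into identificationReferences.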

@dschigel

Looks like our DNA guide recommends identificationReferences, see https://docs.gbif.org/publishing-dna-derived-data/1.0/en/#mapping-metabarcoding-edna-and-barcoding-data @thomasstjerne please take a look: I think the issue is that we have remarks and references, but no clear place to put the pipeline name. One may claim that the reference includes the name and number, but perhaps this is not good enough for @cpavloud?

@pieterprovoost
Member

Just thinking out loud here, but for many pipelines a run with a specific set of parameters will be defined by a custom configuration file or makefile. Perhaps the recommendation should be that this file is committed to source control (GitHub or other) and included as one of the identificationReferences. I think that would benefit reproducibility.
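A rough sketch of what that could look like, assuming the configuration file lives on GitHub and is pinned to a specific commit so the link keeps pointing at the exact version used for the run. The organisation, repository, commit hash, file path, and citation URL below are all hypothetical.

```python
def config_permalink(owner, repo, commit, path):
    """Build a GitHub link to a file pinned at a specific commit."""
    return f"https://github.com/{owner}/{repo}/blob/{commit}/{path}"


def identification_references(*refs):
    """Concatenate several references into one ' | '-separated value."""
    return " | ".join(refs)


# Hypothetical repository and citation, for illustration only.
config_url = config_permalink("example-org", "example-run", "0123abc", "config.yaml")
references = identification_references(
    "https://example.org/pipeline-citation", config_url
)
print(references)
```

Pinning to a commit rather than a branch means later edits to the configuration do not silently change what the record refers to.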

@cpavloud

cpavloud commented Nov 26, 2021

@dschigel
My issue is that
a) in the case that a pipeline is used (e.g. QIIME2), providing just the name is not enough. The parameters that were selected by the user for each step of the bioinformatic analysis should be documented, so that the analysis is replicable.
b) in the case that different individual tools are used (one for each step of the analysis, e.g. sickle for the quality filtering, pandaseq for the merging, UCHIME for the chimera removal etc.), the identificationReferences should contain more than the name, and (again) the parameters that were selected by the user for each tool should be documented.

@pieterprovoost yes, this is a good idea and it can be used for certain pipelines. Also, maybe the sop term can be used for full documentation of the analysis instead of the identificationReferences? In this case (again), the user/data provider should have deposited the SOP in a (GitHub or other) repository.

@thomasstjerne

thomasstjerne commented Nov 26, 2021

@cpavloud in the DNA derived data extension there are dedicated fields for (at least some) individual pipeline steps.
For example the field chimera_check is supposed to have a value like uchime;v4.1;default parameters.
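As a small illustration of that `tool;version;parameters` layout, the sketch below formats and parses such a value. The helper names are hypothetical, and treating the value as exactly three semicolon-separated parts is an assumption based only on the example above.

```python
def format_step(tool, version, parameters="default parameters"):
    """Format a pipeline step as 'tool;version;parameters'."""
    return f"{tool};{version};{parameters}"


def parse_step(value):
    """Split a 'tool;version;parameters' value into its three parts."""
    # maxsplit=2 keeps any further semicolons inside the parameters part.
    tool, version, parameters = value.split(";", 2)
    return {"tool": tool, "version": version, "parameters": parameters}


chimera_check = format_step("uchime", "v4.1")
print(chimera_check)
print(parse_step(chimera_check))
```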

These fields originate from the MIxS standard, and I think it would be fair to ask whether e.g. the seq_quality_check field is appropriate for information about quality filtering, and also whether there is a field intended for the merging.

But I think that it would always be desirable to have a link in the sop field to a structured pipeline description.
