Feature/bruker data #275

jspaezp · 2023-08-06T17:08:17Z

TODO

Add support for non-archived .d files
Docs
inline docs
Make input naming consistent (right now a lot of places are called mzml and should be ms_file)

bin/diann_convert.py

ypriverol · 2023-08-07T05:59:18Z

@jspaezp this PR from the nf-core templates, nf-core#103, would be good to merge into your branch. While it creates some unnecessary work, it is better to integrate it at an early stage than later, to avoid having to change more files when we go for release.

jspaezp · 2023-09-10T05:16:29Z

Most of the datasets in ProteomeXchange have the extension .d.zip, which means that we should also support that format in addition to .d.gz

That sounds good! The only issue here is that the base docker image I am currently using for it does not have zip built in. Solutions: A: build and host a new image with zip, B: add a note to the documentation saying that if zip is desired, the container needs to be overridden (pass the burden of zip to the user).

This would be pretty much the dockerfile for reference.

FROM continuumio/miniconda3:23.5.2-0-alpine
RUN apk add --update zip

I recommend, as we did with the other DIA experiment, creating an SDRF with only one file for the following dataset https://www.ebi.ac.uk/pride/archive/projects/PXD037164 and testing the pipeline with that SDRF.

EDIT: https://github.com/jspaezp/miniconda-alpine-zip/pkgs/container/miniconda-alpine-zip I created the image here and will use it for now; we can figure out any alternatives later.

This is great, thanks a lot!

subworkflows/local/file_preparation.nf

bin/diann_convert.py

* changed files to paths in final diann analysis
* updated decompression to support zip
* added exception when the decompressed file already matches the required pattern
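For reference, the decompression behaviour described in this commit message could be sketched as follows. This is not the code in the PR (the module shells out to the container's tools); it is a minimal stdlib-only Python sketch, and `extract_dotd` and its arguments are hypothetical names chosen for illustration.

```python
import shutil
import tarfile
import zipfile
from pathlib import Path


def extract_dotd(archive: Path, out_dir: Path, expected_name: str) -> Path:
    """Unpack a zipped or tarred .d folder and return it as out_dir/expected_name.

    Hypothetical helper: the name and signature are assumptions, not the PR's API.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out_dir)
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            tf.extractall(out_dir)
    else:
        raise ValueError(f"Unsupported archive format: {archive}")

    target = out_dir / expected_name
    if target.is_dir():
        # The extracted folder already matches the required name: nothing to rename.
        return target

    # Otherwise expect exactly one extracted *.d folder and rename it.
    candidates = [p for p in out_dir.iterdir() if p.is_dir() and p.suffix == ".d"]
    if len(candidates) != 1:
        raise RuntimeError(f"Expected one .d folder in {out_dir}, found {len(candidates)}")
    shutil.move(str(candidates[0]), str(target))
    return target
```

Calling `extract_dotd(Path("run.d.zip"), Path("out"), "run.d")` would leave the data in `out/run.d` whether or not the folder inside the archive was originally named that way.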

jspaezp · 2023-09-19T17:01:43Z

I am happy to let you know that I think this is ready to have a final review/merge.

Since the last update I:

Added support for .zip files. I still do not like them, and they do not offer any significant compression advantage, so PRIDE should suggest that people use an uncompressed archival method instead.
Added correct handling of file paths when decompressing the files.
Made sure pmultiqc worked (thank you for the heavy lifting on that end!)

I am testing it using the following data:

Input sdrf:

source name	characteristics[organism]	characteristics[organism part]	characteristics[cell type]	characteristics[disease]	characteristics[cell line]	characteristics[biological replicate]	assay name	comment[technical replicate]	comment[data file]	comment[fraction identifier]	comment[label]	comment[cleavage agent details]	comment[instrument]	comment[proteomics data acquisition method]	comment[modification parameters]	comment[modification parameters]	comment[precursor mass tolerance]	comment[fragment mass tolerance]	comment[file uri]	factor value[phenotype]
Sample 1	homo sapiens	cancer	cells	not applicable	not applicable	1	Run 1	1	3817_TIMS2_2col-80m_37_1_Slot1-46_1_4768.d.zip	1	AC=MS:1002038;NT=label free sample	AC=MS:1001251;NT=Trypsin	AC=MS:1003231;NT=TimsTOF SCP	NT=Data-Independent Acquisition;AC=NCIT:C161786	NT=Oxidation; MT=Variable; TA=M; AC=Unimod:35	NT=Carbamidomethyl; MT=Fixed; TA=C; AC=Unimod:4	15 ppm	15 ppm	https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD037164/3817_TIMS2_2col-80m_37_1_Slot1-46_1_4768.d.zip	A
Sample 2	homo sapiens	cancer	cells	not applicable	not applicable	2	Run 2	1	3817_TIMS2_2col-80m_38_2_Slot1-47_1_4816.d.zip	1	AC=MS:1002038;NT=label free sample	AC=MS:1001251;NT=Trypsin	AC=MS:1003231;NT=TimsTOF SCP	NT=Data-Independent Acquisition;AC=NCIT:C161786	NT=Oxidation; MT=Variable; TA=M; AC=Unimod:35	NT=Carbamidomethyl; MT=Fixed; TA=C; AC=Unimod:4	15 ppm	15 ppm	https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD037164/3817_TIMS2_2col-80m_38_2_Slot1-47_1_4816.d.zip	A
Sample 3	homo sapiens	cancer	cells	not applicable	not applicable	1	Run 1	1	3817_TIMS2_2col-80m_13_1_Slot1-22_1_4772.d.zip	1	AC=MS:1002038;NT=label free sample	AC=MS:1001251;NT=Trypsin	AC=MS:1003231;NT=TimsTOF SCP	NT=Data-Independent Acquisition;AC=NCIT:C161786	NT=Oxidation; MT=Variable; TA=M; AC=Unimod:35	NT=Carbamidomethyl; MT=Fixed; TA=C; AC=Unimod:4	15 ppm	15 ppm	https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD037164/3817_TIMS2_2col-80m_13_1_Slot1-22_1_4772.d.zip	B
Sample 4	homo sapiens	cancer	cells	not applicable	not applicable	2	Run 2	1	3817_TIMS2_2col-80m_14_1_Slot1-23_1_4690.d.zip	1	AC=MS:1002038;NT=label free sample	AC=MS:1001251;NT=Trypsin	AC=MS:1003231;NT=TimsTOF SCP	NT=Data-Independent Acquisition;AC=NCIT:C161786	NT=Oxidation; MT=Variable; TA=M; AC=Unimod:35	NT=Carbamidomethyl; MT=Fixed; TA=C; AC=Unimod:4	15 ppm	15 ppm	https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD037164/3817_TIMS2_2col-80m_14_1_Slot1-23_1_4690.d.zip

Nextflow config

// Pipeline Parameters
params {
    // Input options
    input = 'THAT_SDRF.SDRF.TSV'
    database = 'SOMEHUMANFASTA.fasta'

    skip_post_msstats = true
    add_decoys = true
    acquisition_method = 'dia'

    // DIA-NN related
    mass_acc_automatic = false
    diann_normalize = false
    diann_speclib = ''

}

modules/local/diannsummary/main.nf

modules/local/dotd_to_mqc/main.nf

modules/local/pmultiqc/main.nf

modules/local/tdf2mzml/main.nf

jpfeuffer · 2023-09-23T10:28:50Z

nextflow.config

@@ -156,6 +156,9 @@ params {
    add_triqler_output       = false
    quantify_decoys          = false

+    // Bruker data
+    convert_dotd            = false


Ah nice, so with this we could still force conversion to mzml if ever needed?

Indeed, that is how it would work! The current implementation uses tdf2mzml to generate the mzML from .d files, and that gets passed to DIA-NN. I have not really tested that it gives good results (in my experience it doesn't ...) but it would be a possibility.
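To make the fallback concrete, the routing decision could be sketched like this. The pipeline itself does this with Nextflow channel branching; the Python function below is only an illustration, and `route_ms_file` and the returned labels are hypothetical names.

```python
def route_ms_file(ms_file: str, convert_dotd: bool = False) -> str:
    """Return which (hypothetical) branch a file would take."""
    name = ms_file.lower()
    # Ignore an archive suffix such as .zip or .gz when classifying the file.
    for ext in (".zip", ".tar.gz", ".tar", ".gz"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    if name.endswith(".d"):
        # Bruker data: pass through natively unless conversion is forced.
        return "tdf2mzml" if convert_dotd else "direct_dotd"
    if name.endswith(".mzml"):
        return "mzml"
    raise ValueError(f"Unrecognised input type: {ms_file}")


print(route_ms_file("run1.d.zip"))                     # direct_dotd
print(route_ms_file("run1.d.zip", convert_dotd=True))  # tdf2mzml
```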

Interesting! Yes it is perfectly fine how it is, as a last resort/fallback.

Out of curiosity: According to your experience, what goes wrong when using converted mzmls? Is it just runtime/storage or also quality of the results?

Well ... I have not tried in a while, but if my memory is not failing me it was all of the above, to different degrees depending on how the mzML was generated.

If the mzML was generated with arguments/software/modes that collapsed the mobility dimension ("centroided over it"), results tend to be poor. I believe this happens because the noise that can be easily identified as noise by resolving on the mobility dimension gets over-represented, which leads to "virtually poor" scan quality.
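A toy illustration of that effect, with made-up numbers: a peak concentrated in a few mobility scans and diffuse background spread over many scans are easy to tell apart while mobility is resolved, but become indistinguishable once intensities are summed over mobility. Nothing here comes from a real converter; the peak tuples are invented for the example.

```python
from collections import defaultdict


def collapse_mobility(peaks):
    """Sum intensity per m/z over all mobility scans."""
    summed = defaultdict(float)
    for mz, _scan, intensity in peaks:
        summed[mz] += intensity
    return dict(summed)


def max_per_scan(peaks):
    """Highest single-scan intensity per m/z (mobility still resolved)."""
    best = defaultdict(float)
    for mz, _scan, intensity in peaks:
        best[mz] = max(best[mz], intensity)
    return dict(best)


# (m/z, mobility scan index, intensity); all values are invented.
signal = [(500.0, scan, 300.0) for scan in (100, 101, 102)]
noise = [(700.0, scan, 10.0) for scan in range(90)]
frame = signal + noise

collapsed = collapse_mobility(frame)
resolved = max_per_scan(frame)
print(collapsed)  # {500.0: 900.0, 700.0: 900.0}
print(resolved)   # {500.0: 300.0, 700.0: 10.0}
```

After collapsing, the background at m/z 700 looks exactly as intense as the real peak at m/z 500, even though no single mobility scan ever saw more than 10 counts of it.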

In the case of software that does not collapse the mobility dimension, the first problem is that files end up being absolutely massive. Since each "scan" along the mobility dimension ends up being hundreds of scans in the mzML, the resulting file is the equivalent of having an instrument scanning at 1,000+ Hz, which makes any software reading it very slow and disk usage absolutely horrendous.
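As a back-of-the-envelope check of that scan rate, assuming purely illustrative acquisition settings (10 frames per second, 500 mobility scans per frame, a 60 minute run; none of these numbers come from this PR):

```python
def effective_scan_rate(frames_per_second: float, mobility_scans_per_frame: int) -> float:
    """Spectra written per second if every mobility scan becomes its own spectrum."""
    return frames_per_second * mobility_scans_per_frame


def spectra_in_run(frames_per_second: float, mobility_scans_per_frame: int, minutes: float) -> int:
    """Total spectra in the converted file for a run of the given length."""
    rate = effective_scan_rate(frames_per_second, mobility_scans_per_frame)
    return int(rate * 60 * minutes)


# Assumed, illustrative settings.
print(effective_scan_rate(10.0, 500))  # 5000.0
print(spectra_in_run(10.0, 500, 60))   # 18000000
```

Under those assumptions a one-hour run turns into roughly 18 million spectra, which is why readers and disks struggle.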

I have not tried in a while and we have been exploring different ways to have a good intermediate file format ... the search remains :P

pyproject.toml

jpfeuffer · 2023-09-23T10:45:02Z

bin/dotd_2_mqc.py

+logger = getLogger(__name__)
+
+SECOND_RESOLUTION = 5
+MQC_YML = """


Why is this needed?
What is the difference to the multiqc yml file in the repo?

This adds, as part of the QC, information on the chromatogram and total ions that is extracted directly from the .d files in a distributed way.

This specific yml "file" is not used unless multiqc is run in a standalone way (and it lets me test it in a much faster way than having it in a sharded way as part of the pipeline)

Ok I guess that's fine. Might just be awkward if the two versions get out-of-sync at some point. Not sure how likely that is.

I didn't check, but is this documented? Maybe add a comment about its "standalone usage" to that variable or the function that writes out its contents.

I am adding a couple of comments and more information on the module docstring for the standalone usage (outside of the quantms pipeline).
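For readers outside the thread, a rough idea of what "extracted directly from the .d files" can look like is sketched below. This is not the code from bin/dotd_2_mqc.py. It assumes the .d folder exposes an SQLite table named `Frames` with `Time` (seconds) and `SummedIntensities` columns, and it assumes `SECOND_RESOLUTION` is used as a retention-time bin width; both are assumptions made for this example, and `binned_tic` is a hypothetical name.

```python
import sqlite3

SECOND_RESOLUTION = 5  # name taken from the diff; its use as a bin width is assumed


def binned_tic(conn: sqlite3.Connection, resolution: int = SECOND_RESOLUTION):
    """Sum frame intensities into retention-time bins of `resolution` seconds."""
    query = """
        SELECT CAST(Time / ? AS INTEGER) * ? AS rt_bin,
               SUM(SummedIntensities) AS tic
        FROM Frames
        GROUP BY rt_bin
        ORDER BY rt_bin
    """
    return conn.execute(query, (resolution, resolution)).fetchall()


# Stand-in for a real file: an in-memory table with the assumed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Frames (Id INTEGER, Time REAL, SummedIntensities INTEGER)")
conn.executemany(
    "INSERT INTO Frames VALUES (?, ?, ?)",
    [(1, 0.5, 100), (2, 2.0, 200), (3, 6.1, 50), (4, 9.9, 70)],
)
rows = binned_tic(conn)
print(rows)  # [(0, 300), (5, 120)]
```

Binning at a few seconds keeps the per-file QC table small enough to merge across many runs.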

on f1b0bf7

ypriverol

@jspaezp please check some of my comments; they include suggestions for which containers must be used to be able to deploy with conda/docker/singularity.

Let me know what you think.

modules/local/decompress_dotd/main.nf

modules/local/dotd_to_mqc/main.nf

Co-authored-by: Yasset Perez-Riverol <[email protected]>

jspaezp · 2023-09-28T16:24:49Z

Hello @ypriverol, @jpfeuffer, @benpullman and @WangHong007 Thank you very much for the review and suggestions!

I believe I have incorporated all the suggestions and notes. I am happy to take into account any other suggestions you might have and to work together in the future!

Kindest wishes,
Sebastian

jspaezp added 16 commits August 3, 2023 14:31

added tdf2mzml

c4b17e4

added path to conversion

0b55503

changed comment character

4e9a931

added tuple of meta to tdf2mzml outs

51a6e66

added debug prints to diann conversion

9cc93fa

added renaming of dotd files after extraction

e8752f6

yet more debug printing info

b559b21

added not to branching

0793038

refactoring of diann convert

224e340

fixed bug where mzml AND raw files were passed

9e4c872

added speclib to schema

1a86393

returned report in abstracted diannconvert

92ee452

refactor and speedup of diann summary

33b3579

added debug info to versions

6ae3565

moved tar version in the workflow from tracking to logging

4b58fba

fixed dumb error

288415a