Single cell analysis performance with worse results #287

daichengxin · 2023-09-19T09:35:38Z

Description of the bug

Analyzing the single-cell dataset PXD016921. The results folder can be found here. MBR is enabled. The results look quite different from the original paper: fewer proteins are quantified in the single-cell samples than in the original results, and more proteins are misquantified in the blank sample. The original paper used MaxQuant with the following parameters:

All raw files were processed using MaxQuant (version 1.6.3.3) for feature detection, database searching, and protein/peptide quantification. MS/MS spectra were searched against the UniProtKB/Swiss-Prot human database (downloaded on October 26, 2018, containing 20 397 reviewed sequences). N-Terminal protein acetylation and methionine oxidation were selected as variable modifications. Carbamidomethylation of cysteine residues was set as a fixed modification. The peptide mass tolerances of the first search and main search (recalibrated) were <30 and 5 ppm, respectively. The minimum peptide length was six amino acids, and the maximum peptide mass was 4600 Da. Only two missed cleavages were allowed for each peptide. The second peptide search was activated to identify coeluting and cofragmented peptides from one MS/MS spectrum. Both peptides and proteins were filtered with a maximum false discovery rate (FDR) of 0.01. The MBR feature with a matching window of 0.7 min and an alignment window of 20 min, was activated. Label-free quantitation (LFQ) calculations were performed separately in each parameter group containing similar cell loadings. Both unique and razor peptides were selected for protein quantification. Other unmentioned parameters were the MaxQuant default settings. Potential contaminants and reverse sequences were filtered out.

Command used and terminal output

$ nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.01 --psm_level_fdr_cutoff 0.01 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume

Relevant files

Original MaxQuant results
quantms results

System information

No response

ypriverol · 2023-09-19T09:38:42Z

@daichengxin can you add a description of how the data was analyzed with MaxQuant in the original paper?

jpfeuffer · 2023-09-19T10:43:56Z

Thank you for the analysis. Very interesting. Can we maybe run the analysis with looser FDR cutoffs to see if it's more an identification problem or a quantification problem?

ypriverol · 2023-09-19T12:06:58Z

@jpfeuffer a couple of ideas:

In single cell, it looks to me like MBR does most of the work for the single-cell runs. As you can see, we identify most of the peptides in the 20- and 100-cell samples (the complex samples), more than MQ does. Those samples are used as "channels" to look for weak signals in the single-cell files. My point is the following:

The MBR feature with a matching window of 0.7 min and an alignment window of 20 min, was activated.

Do you think the 20 min window they use for MBR in MQ could explain the difference from our results?

Do you think we need to change something in the experimental design? For MBR we are annotating every sample as a biological replicate to enable it.

jpfeuffer · 2023-09-19T12:21:02Z

Yes, if we find out that it's not the identification performance we will check quantification.
But if you have only 100 proteins that pass FDR, you cannot quantify more than that and it is useless to check quantification.

ypriverol · 2023-09-19T13:03:12Z

@daichengxin can we check in the MQ output of the original project how many peptides in the single-cell samples are quantified/identified using MBR? I think MQ uses a notation like "No MS/MS" or something like that.

daichengxin · 2023-09-19T13:55:26Z

More than half of the unique peptides originate from MBR.
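A count like this can be derived from MaxQuant's peptides.txt by tallying the per-sample `Identification type ...` column, which MaxQuant fills with `By MS/MS` or `By matching`. The sketch below uses a small invented table in place of the real file; the sequences and the helper name `mbr_share` are illustrative only and not taken from the analysis above.

```python
import pandas as pd

# Toy stand-in for MaxQuant's peptides.txt (all values are invented).
peptides = pd.DataFrame({
    "Sequence": ["AAAK", "CCCR", "DDDK", "EEEK", "FFFR"],
    "Identification type Single cell 1": [
        "By matching", "By MS/MS", "By matching", None, "By matching",
    ],
})

def mbr_share(df: pd.DataFrame, sample: str) -> float:
    """Fraction of peptides observed in `sample` that came from match-between-runs."""
    col = df[f"Identification type {sample}"].dropna()  # NaN = not observed in this sample
    return float((col == "By matching").mean())

share = mbr_share(peptides, "Single cell 1")
print(f"{share:.0%} of peptides in Single cell 1 come from MBR")
```

On the real file one would load it with `pd.read_csv(path, sep="\t")` and drop reverse and contaminant rows first.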

ypriverol · 2023-09-19T14:09:05Z

This is what I thought, see @jpfeuffer. The majority of peptides come not from IDs but from MBR, so we should look at how we fail to do proper MBR on our side; see my previous two comments.

jpfeuffer · 2023-09-19T14:31:06Z

Still.. if they don't pass FDR, there is nothing to quantify. So we have to wait for the ID analysis

daichengxin · 2023-09-21T02:55:40Z

Almost all of the peptides quantified only in MQ come from MBR. In quantms, however, these peptides are quantified only in other files, not in the same file. This suggests we fail to do proper MBR, right?

jpfeuffer · 2023-09-21T06:27:47Z

Can you please add the command you used for quantms?

daichengxin · 2023-09-21T06:43:12Z

nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.01 --psm_level_fdr_cutoff 0.01 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume @jpfeuffer

jpfeuffer · 2023-09-21T06:48:59Z

I don't know. I can't see any signs of transfer in the Proteomicslfq logs.
Do you have the call to proteomicslfq?

ypriverol · 2023-09-21T07:22:09Z

The command with the relaxed FDR is:

nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.15 --psm_level_fdr_cutoff 0.15 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume

Here the results: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR/

jpfeuffer · 2023-09-21T07:29:35Z

I found the plfq command. Looks correct.
One thing is that we only transfer IDs (transfer_ids) if they occur in more than 50% of the samples. I don't know how MQ does it.
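To make the effect of such a proportion rule concrete, here is a minimal sketch. The function `should_transfer` and its `min_proportion` argument are hypothetical and are not the ProteomicsLFQ implementation; the 0.5 default and the strict "more than" comparison are read off the comment above. With seven runs, a peptide identified only in the two larger samples would not clear a 50% requirement.

```python
def should_transfer(n_samples_with_id: int, n_samples: int,
                    min_proportion: float = 0.5) -> bool:
    """True if the peptide was identified in more than `min_proportion` of the samples."""
    return n_samples_with_id / n_samples > min_proportion

# Seven runs: blank, four single cells, 20 and 100 HeLa cells.
id_in_two_runs = should_transfer(2, 7)            # 2/7 is below 0.5
id_in_four_runs = should_transfer(4, 7)           # 4/7 is above 0.5
id_in_two_runs_loose = should_transfer(2, 7, min_proportion=0.2)

print(id_in_two_runs, id_in_four_runs, id_in_two_runs_loose)
```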

daichengxin · 2023-09-21T07:31:00Z

@jpfeuffer @ypriverol the data was run with a more relaxed FDR of 0.15 at the protein and PSM level. However, the number of quantified peptides did not increase significantly. Similarly, almost all of the peptides quantified only in MQ come from MBR. In quantms, these peptides are again quantified only in other files, not in the same file.

jpfeuffer · 2023-09-21T07:36:01Z

I think if we loosen the percentage of aligned samples required to trigger requantification, we should think about an FDR approach for transferred quantification first.
E.g. shifted decoy EICs
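As a rough illustration of what an FDR for transferred quantifications could look like, the sketch below applies a plain target-decoy estimate to feature scores. It assumes that decoy features (for example from shifted EICs) have already been extracted and scored the same way as the targets; the scores and the helper `estimate_fdr` are invented for illustration and do not describe an existing OpenMS feature.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR among targets scoring >= threshold as #decoys / #targets."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_target == 0:
        return 0.0  # nothing accepted, so nothing can be false
    return min(1.0, n_decoy / n_target)

# Invented scores for transferred target features and their decoy counterparts.
targets = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40]
decoys = [0.65, 0.45, 0.30, 0.20, 0.10, 0.05]

fdr_080 = estimate_fdr(targets, decoys, 0.80)
fdr_050 = estimate_fdr(targets, decoys, 0.50)
print(fdr_080, fdr_050)
```

With such an estimate, the score cutoff could be chosen per dataset to hit a target FDR instead of being fixed in advance.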

jpfeuffer · 2023-09-21T08:19:34Z

You can use the new container tomorrow to play around with a parameter that controls this proportion.
You need to expose it quickly in the pipeline.

ypriverol · 2023-09-21T08:26:12Z

Can you let me know which parameter we will need to use?

jpfeuffer · 2023-09-21T08:29:42Z

id_transfer_threshold

ypriverol · 2023-09-21T08:30:49Z

What is the default value?

timosachsenberg · 2023-10-06T15:29:42Z

like

cut -d',' -f 2,11 msstats.csv |  sed 's/([^)]*)//g' | sort | uniq | grep Singlecell4 | wc -l

?
Do you have a command?

These are my values with filtering out modified ones:

Blank 97
Singlecell1 3827
Singlecell2 3172
Singlecell3 3061
Singlecell4 4533
HeLa 8848
100HeLa 11357
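For comparison with the pandas script, the shell pipeline above can be sketched in Python as follows. This assumes that fields 2 and 11 of msstats.csv are `PeptideSequence` and `Reference`; the rows are invented. Note that the pipeline drops the protein column before counting, so it cannot filter decoys or contaminants, which may be one reason two counting approaches disagree.

```python
import io
import re
import pandas as pd

# Invented stand-in for msstats.csv, reduced to the two selected columns.
csv_text = """PeptideSequence,Reference
AMPEPK,Singlecell4.mzML
AM(Oxidation)PEPK,Singlecell4.mzML
AMPEPK,Singlecell4.mzML
LLSEEK,Singlecell4.mzML
LLSEEK,Blank.mzML
"""

df = pd.read_csv(io.StringIO(csv_text))

# Same substitution as the sed call: remove parenthesized modification names.
df["stripped"] = df["PeptideSequence"].map(lambda s: re.sub(r"\([^)]*\)", "", s))

# Equivalent of sort | uniq | grep <run> | wc -l, for every run at once.
counts = df.groupby("Reference")["stripped"].nunique()
print(counts)
```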

jpfeuffer · 2023-10-07T12:48:44Z

I think he means unique as in "unique for a protein". But IIRC we tried to only export unique-to-group peptides to MSstats so Timo's approach might be correct.

jpfeuffer · 2023-10-07T13:11:21Z

You need to check what MQ exports and how it defines uniqueness if you want to compare by #identifications. You can check by picking some of the group-unique peptides and see how MQ defines them.
Ideally one would just compare the number of features with an ID though. I have no idea if you can do that with MQ.

daichengxin · 2023-10-07T13:16:22Z

Here is my script. I also tried to run @timosachsenberg's command, but got different values from his.

import re
import pandas as pd

def sub_mod(peptide):
    # Strip "." characters and parenthesized modification annotations.
    peptide = peptide.replace(".", "")
    peptide = re.sub(r"\(.*?\)", "", peptide)
    return peptide

quantms = pd.read_csv("./PXD016921MBRnewSVM/PXD016921-MBR.sdrf_openms_design_msstats_in.csv", sep=',', header=0)
# Drop decoy, contaminant and bovine entries ("~" negates the boolean mask).
quantms = quantms[~quantms['ProteinName'].str.contains("DECOY_|CONTAMINANT_|REV_|BOVIN")]
quantms["sequence"] = quantms["PeptideSequence"].apply(sub_mod)
# Sequences listed under exactly one ProteinName entry (not used by the prints below).
msstats_in_data = quantms.groupby('sequence').filter(lambda x: len(set(x["ProteinName"])) == 1)
print(quantms.groupby("Reference")["sequence"].nunique())
print(quantms.groupby("Reference")["ProteinName"].nunique())

jpfeuffer · 2023-10-07T13:19:37Z

What does your MQ script look like?
And can we put the MQ data next to the quantms results for both UPS and this one?
Edit: Ah I see the MQ files in the beginning of this thread.

daichengxin · 2023-10-07T14:28:15Z

import pandas as pd

# Peptide level: unique sequences with an identification type in each single-cell run.
MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/peptides.txt", sep="\t")
# MQ = MQ[(MQ["Reverse"] != "+") & (MQ["Potential contaminant"] != "+") & (MQ["Unique (Groups)"] == "yes")]
MQ = MQ[(MQ["Reverse"] != "+") & (MQ["Potential contaminant"] != "+")]
print(MQ[MQ["Identification type Single cell 1"].notna()]["Sequence"].nunique())
print(MQ[MQ["Identification type Single cell 2"].notna()]["Sequence"].nunique())
print(MQ[MQ["Identification type Single cell 3"].notna()]["Sequence"].nunique())
print(MQ[MQ["Identification type Single cell 4"].notna()]["Sequence"].nunique())

# Protein-group level: same count for every sample.
MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/proteinGroups.txt", sep="\t")
MQ = MQ[(MQ["Reverse"] != "+") & (MQ["Potential contaminant"] != "+")]
# MQ = MQ[MQ["Unique peptides"] > 0]
print(MQ[MQ["Identification type Single cell 1"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type Single cell 2"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type Single cell 3"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type Single cell 4"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type Blank"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type 20 HeLa cells"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type 100 HeLa cells"].notna()]["Protein IDs"].nunique())

daichengxin · 2023-10-07T15:30:25Z

Results with thresholds 0.1 and 0.9. Results folder: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR-newSVM/

jpfeuffer · 2023-10-07T16:23:36Z

I think if you wanted more IDs, one should have used 0.1 and something that is less than 0.75, e.g. 0.5

jpfeuffer · 2023-10-07T16:26:34Z

import pandas as pd

# Peptide level: unique sequences with an identification type in each single-cell run.
MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/peptides.txt", sep="\t")
# MQ = MQ[(MQ["Reverse"] != "+") & (MQ["Potential contaminant"] != "+") & (MQ["Unique (Groups)"] == "yes")]
MQ = MQ[(MQ["Reverse"] != "+") & (MQ["Potential contaminant"] != "+")]
print(MQ[MQ["Identification type Single cell 1"].notna()]["Sequence"].nunique())
print(MQ[MQ["Identification type Single cell 2"].notna()]["Sequence"].nunique())
print(MQ[MQ["Identification type Single cell 3"].notna()]["Sequence"].nunique())
print(MQ[MQ["Identification type Single cell 4"].notna()]["Sequence"].nunique())

# Protein-group level: same count for every sample.
MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/proteinGroups.txt", sep="\t")
MQ = MQ[(MQ["Reverse"] != "+") & (MQ["Potential contaminant"] != "+")]
# MQ = MQ[MQ["Unique peptides"] > 0]
print(MQ[MQ["Identification type Single cell 1"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type Single cell 2"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type Single cell 3"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type Single cell 4"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type Blank"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type 20 HeLa cells"].notna()]["Protein IDs"].nunique())
print(MQ[MQ["Identification type 100 HeLa cells"].notna()]["Protein IDs"].nunique())

Why did you comment out the filter for Unique (Groups)?
This sounded like the right way to compare to our (unfiltered) results.

daichengxin · 2023-10-08T07:48:53Z

filtered results (unique for protein groups) @jpfeuffer

jpfeuffer · 2023-10-08T09:40:31Z

Ok can we compare it with different quantms cutoffs now? Especially something lower than 0.9 for unidentified.

And without .filter(lambda x: len(set(x["ProteinName"])) == 1) for the quantms results. This should be the same as "Group (unique)" then.
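To see what that `.filter(...)` step removes, here is a small sketch on invented rows. It assumes that a protein group is written as a single joined `ProteinName` string (shown here as `P2;P3`), so a group-unique peptide survives the filter while a sequence listed under two different `ProteinName` entries is dropped.

```python
import pandas as pd

# Invented rows in the layout used by the script above.
quantms = pd.DataFrame({
    "ProteinName": ["P1", "P1", "P2;P3", "P2", "P3"],
    "sequence":    ["AAAK", "CCCR", "DDDK", "EEEK", "EEEK"],
    "Reference":   ["run1", "run1", "run1", "run1", "run1"],
})

# Keep only sequences that appear under exactly one ProteinName entry.
single_entry = quantms.groupby("sequence").filter(
    lambda x: len(set(x["ProteinName"])) == 1
)

n_all = quantms["sequence"].nunique()            # AAAK, CCCR, DDDK, EEEK
n_filtered = single_entry["sequence"].nunique()  # EEEK is dropped
print(n_all, n_filtered)
```

Reporting both numbers side by side shows how much of a gap to MQ is due to the filter rather than to quantification.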

jpfeuffer · 2023-10-08T09:41:21Z

I don't think we need to compare protein groups for now. There will (almost) always be fewer of them if there are fewer peptides. I'd rather see multiple cutoffs in the table.

timosachsenberg · 2023-10-08T10:31:36Z

Here is my script. I also tried to run @timosachsenberg's command, but got different values from his.

This is interesting. I used the old id files that Yasset provided in the original analysis and ran them through the ProteomicsLFQ tool. I got numbers much closer to MQ. I am currently traveling but would like to dig into the exact parameters used a bit more.

daichengxin · 2023-10-08T11:09:17Z

jpfeuffer · 2023-10-08T12:24:15Z

Thank you very much for the quick table. Surprisingly very few changes. This must mean that all unidentified Features have scores either below 0.6 or above 0.9.
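The reasoning can be checked with a toy example: if no score falls between two cutoffs, moving the cutoff within that range cannot change the count. The scores below are invented, and the `>=` comparison is an assumption about how the minimum score is applied.

```python
# Invented scores for features without an ID; none lies between 0.6 and 0.9.
scores = [0.10, 0.35, 0.55, 0.92, 0.95, 0.99]

def n_passing(values, min_score):
    """Number of features whose score reaches the minimum score."""
    return sum(v >= min_score for v in values)

counts = {cut: n_passing(scores, cut) for cut in (0.6, 0.75, 0.9)}
print(counts)
```

A histogram of the actual scores of unidentified features would show directly whether they are this bimodal.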
Maybe an FDR approach is really really needed

jpfeuffer · 2023-10-08T12:36:27Z

@timosachsenberg exact parameters are always visible in the pipeline_info folder

timosachsenberg · 2023-10-09T06:44:59Z

That's the command I used. FDR settings seem to be the same. Do you see anything I missed?

/ceph/ibmi/abi/projects/sachsenb/OpenMS/openms-build/bin/ProteomicsLFQ \
-in Blank.mzML Singlecell1.mzML Singlecell2.mzML Singlecell3.mzML Singlecell4.mzML 20HeLacells.mzML 100HeLacells.mzML \
-ids Blank_consensus_fdr_filter.idXML Singlecell1_consensus_fdr_filter.idXML Singlecell2_consensus_fdr_filter.idXML Singlecell3_consensus_fdr_filter.idXML Singlecell4_consensus_fdr_filter.idXML 20HeLacells_consensus_fdr_filter.idXML 100HeLacells_consensus_fdr_filter.idXML \
    -design PXD016921-MBR.sdrf_openms_design.tsv \
    -out_cxml notransfer_1.0_1000_star.consensusXML \
    -out_msstats notransfer_1.0_1000_star.csv \
    -out notransfer_1.0_1000_star.mzTab \
    -fasta Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta \
    -threads 40 \
    -Seeding:intThreshold 1000.0 \
    -protein_inference aggregation \
    -quantification_method feature_intensity \
    -feature_with_id_min_score 0.25 \
    -feature_without_id_min_score 0.75 \
    -targeted_only false \
    -mass_recalibration false \
    -protein_quantification unique_peptides \
    -alignment_order star \
    -PeptideQuantification:quantify_decoys \
    -psmFDR 0.01 \
    -proteinFDR 0.01 \
    -picked_proteinFDR true \
    2>&1 | tee notransfer_1.0_1000_star.txt

jpfeuffer · 2023-10-09T06:54:00Z

No, looks the same as the 0.25_0.75 directory.
@timosachsenberg can you maybe download the Msstats file and run your command on your computer?

https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR-newSVM-0.25-0.75/proteomicslfq/

jpfeuffer · 2023-10-09T06:55:11Z

Maybe the idxmls changed meanwhile? Changes in Sage or percolator adapter? No idea.

timosachsenberg · 2023-10-09T07:57:55Z

No, looks the same as the 0.25_0.75 directory. @timosachsenberg can you maybe download the Msstats file and run your command on your computer?

https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR-newSVM-0.25-0.75/proteomicslfq/

I can confirm that these outputs match the ones Dai reported. Will now check if -Seeding:intThreshold 1000.0 makes such a difference.

ypriverol · 2023-10-09T08:02:15Z

It looks like we are using the default parameter for -Seeding:intThreshold. Our command now for proteomicsLFQ is:

    ProteomicsLFQ \\
        -threads ${task.cpus} \\
        -in ${mzml_sorted.join(' ')} \\
        -ids ${id_sorted.join(' ')} \\
        -design ${expdes} \\
        -fasta ${fasta} \\
        -protein_inference ${params.protein_inference_method} \\
        -quantification_method ${params.quantification_method} \\
        -targeted_only ${params.targeted_only} \\
        ${feature_with_id_min_score} \\
        ${feature_without_id_min_score} \\
        -mass_recalibration ${params.mass_recalibration} \\
        -protein_quantification ${params.protein_quant} \\
        -alignment_order ${params.alignment_order} \\
        ${decoys_present} \\
        -psmFDR ${params.psm_level_fdr_cutoff} \\
        -proteinFDR ${params.protein_level_fdr_cutoff} \\
        -picked_proteinFDR ${params.picked_fdr} \\
        -out_cxml ${expdes.baseName}_openms.consensusXML \\
        -out ${expdes.baseName}_openms.mzTab \\
        ${msstats_present} \\
        ${triqler_present} \\
        $args \\
        2>&1 | tee proteomicslfq.log

timosachsenberg · 2023-10-09T08:54:18Z

Yep, main reason seems to be the intensity threshold.
I ran it with 0.1 and 0.9 and get:

intensity > 1000
8
2964
2068
3114
3343
7671
10669

intensity > 10000
7
1563
1408
2358
3334
7619
10503
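The two lists can be compared position by position to see where the threshold matters. The sketch below uses the counts exactly as posted and keeps them in the posted order; no sample labels are assumed.

```python
# Counts as posted above, in the posted order.
at_1000 = [8, 2964, 2068, 3114, 3343, 7671, 10669]
at_10000 = [7, 1563, 1408, 2358, 3334, 7619, 10503]

# Absolute loss per row when raising the threshold from 1000 to 10000.
lost = [low - high for low, high in zip(at_1000, at_10000)]

for i, (low, high, diff) in enumerate(zip(at_1000, at_10000, lost), start=1):
    print(f"row {i}: {low} -> {high} (-{diff}, {diff / low:.1%})")
```

Rows 2 to 4 lose roughly 24% to 47% of their counts, while rows 5 to 7 change by less than 2%.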

Dai would it be possible to generate the results for different thresholds using the new parameter?

jpfeuffer · 2023-10-10T06:44:38Z

Numbers look really good. I wonder if MQ has a min. feature intensity cutoff.

I guess we need to see the effect on UPS

daichengxin · 2023-10-10T08:10:46Z

1000 intensity threshold:

ypriverol · 2023-10-10T08:18:30Z

Looks like the right parameters are close to 0.20 - 0.80. @jpfeuffer @timosachsenberg

ypriverol · 2023-10-13T08:55:39Z

I will close the current issue in favor of #303. All discussions about MBR LFQ should be moved to that issue.

daichengxin added the bug label Sep 19, 2023

ypriverol added the enhancement, help wanted, and question labels and removed the bug label Sep 19, 2023

daichengxin assigned ypriverol Sep 19, 2023

ypriverol added the high-priority label Sep 19, 2023

daichengxin assigned timosachsenberg and jpfeuffer Sep 19, 2023

jpfeuffer mentioned this issue Sep 21, 2023

[ProteomicsLFQ] Parameterize min. proportion for requant. OpenMS/OpenMS#7095

Merged

5 tasks

ypriverol mentioned this issue Oct 6, 2023

proteomicsLFQ with new SVM results in UPS1 dataset #301

Closed

ypriverol mentioned this issue Oct 13, 2023

LFQ MBR FDR algorithm needed. #303

Open

3 tasks

ypriverol closed this as completed Oct 13, 2023

timosachsenberg mentioned this issue Oct 16, 2023

[FFID][tweak] thoughts on settings/scores OpenMS/OpenMS#7130

Draft

5 tasks
