Merge pull request #65 from zprobot/master

update: benchmark
bigbio · Jun 4, 2024 · 946c7fe
2 parents 9a10435 + 9789a34

Showing 3 changed files with 90 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -76,7 +76,7 @@ A peptidoform is a combination of a `PeptideSequence(Modifications) + Charge + B
 > Note: At the moment, ibaqpy computes the ibaq values only based on unique peptides. Shared peptides are discarded. However, if a group of proteins share the same unique peptides (e.g., Pep1 -> Prot1;Prot2 and Pep2 -> Prot1;Prot2), the intensity of the proteins is summed and divided by the number of proteins in the group.
 
 #### Calculate the IBAQ Value
-First, peptide intensity dataframe was grouped according to protein name, sample name and condition. The protein intensity of each group was summed. Due to the experimental type, the same protein may exhibit missing peptides in different samples, resulting in variations in the number of peptides detected for the protein across different samples. To handle this difference, normalization within the same group can be achieved by using the formula `sum(peptides) / n`(n represents the number of detected peptide segments). Finally, the sum of the intensity of the protein is divided by the number of theoretical peptides.See details in `peptides2proteins`.
+First, the peptide intensity dataframe is grouped by protein name, sample name and condition, and the peptide intensities in each group are summed. Depending on the experiment type, a protein may be missing peptides in some samples, so the number of peptides detected for the same protein varies across samples. To account for this, the summed intensity is normalized within each group using `sum(peptides) / n`, where `n` is the number of detected peptides. Finally, the normalized intensity of the protein is divided by the number of theoretical peptides. See details in `peptides2proteins`.
 
 > Note: In all scripts and result files, *uniprot accession* is used as the protein identifier.
 

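The calculation described in the updated README paragraph can be sketched with pandas as follows. This is only an illustration, not the code in `peptides2proteins`: the column names (`ProteinName`, `SampleID`, `Condition`, `PeptideSequence`, `Intensity`), the function name `ibaq_sketch`, and the `theoretical` lookup are assumptions made for the example.

```python
import pandas as pd


def ibaq_sketch(peptides: pd.DataFrame, theoretical: dict) -> pd.DataFrame:
    """Illustrative IBAQ calculation; all column names are hypothetical."""
    # Group by protein, sample and condition; sum the intensities and
    # count the distinct peptides detected in each group.
    grouped = peptides.groupby(
        ["ProteinName", "SampleID", "Condition"], as_index=False
    ).agg(
        total=("Intensity", "sum"),
        n_detected=("PeptideSequence", "nunique"),
    )
    # Normalize within the group: sum(peptides) / n.
    grouped["normalized"] = grouped["total"] / grouped["n_detected"]
    # Divide by the number of theoretical peptides of the protein.
    grouped["ibaq"] = grouped["normalized"] / grouped["ProteinName"].map(theoretical)
    return grouped
```

Proteins missing from `theoretical` get `NaN` in the `ibaq` column rather than raising an error, which makes gaps in the lookup easy to spot.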
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -56,7 +56,12 @@ In summary, both datasets were searched with three search engines _SAGE_, _COMET
 
 #### Coefficient of Variation (CV)
 
-Coefficient of variation for all samples in both experiments using `quantile`, `median`, `median-cov`. We extracted human proteins common to 11 samples from IBAQ data. The mean of the coefficient of variation of all proteins in 11 samples was then calculated.
+Coefficient of variation for all samples in both experiments using `quantile`, `median` and `median-cov`:
+- `quantile`: During preprocessing, the samples are adjusted so that the mean and variance of all samples are equal. Finally, the summed intensity of each protein is divided by the number of theoretical peptides.
+- `median`: During preprocessing, the samples are adjusted so that the medians of all samples are equal. Finally, the summed intensity of each protein is divided by the number of theoretical peptides.
+- `median-cov`: During preprocessing, the samples are adjusted so that the medians of all samples are equal. Depending on the experiment type, a protein may be missing peptides in some samples, so the number of peptides detected for the same protein varies across samples. To account for this, the summed intensity is normalized within each group using `sum(peptides) / n`, where `n` is the number of detected peptides. Finally, the normalized intensity of each protein is divided by the number of theoretical peptides.
+
+We extracted human proteins common to 11 samples from IBAQ data. The mean of the coefficient of variation of all proteins in 11 samples was then calculated.
 
 Compared to `quantile`, `median` and `median-cov` have a smaller coefficient of variation. `median-cov` has the smallest CV in the lfq experiment.
 
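The CV summary described in this hunk (keep the proteins present in every sample, compute the CV per protein, then average) could look roughly like the sketch below. The column names (`ProteinName`, `SampleID`, `Ibaq`) and the function name `mean_cv` are assumptions, the human-protein filter is left out, and the sample standard deviation (pandas default, `ddof=1`) is used because the README does not say which estimator was applied.

```python
import pandas as pd


def mean_cv(ibaq: pd.DataFrame) -> float:
    """Mean coefficient of variation over proteins seen in all samples."""
    # One row per protein, one column per sample (hypothetical column names).
    wide = ibaq.pivot_table(index="ProteinName", columns="SampleID", values="Ibaq")
    # Keep only proteins quantified in every sample.
    common = wide.dropna()
    # CV per protein = standard deviation / mean across samples.
    cv = common.std(axis=1) / common.mean(axis=1)
    return float(cv.mean())
```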
@@ -179,8 +184,82 @@ We will normalize the MaxLFQ values of the proteins in the DIANN report by divid
 </center>
 
 ### Performance testing
-The [PXD030304](https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/absolute-expression/PXD030304/)  project collected mass spectrometry data from 949 cancer cell lines and reanalyzed it using the DIANN analysis pipeline within the quantms platform.The size of the `diann_report.tsv` file is 167GB, after being converted to a parquet file using quantmsio, the size is 15.8GB.We conducted performance testing in a 128GB memory environment.
 
-| Project | Samples | Size(diann report) | Size(parquet file) | Runn time |
-|--------|---------|----------|----------|----------|
-| PXD030304 |  2013 |  167G  | 15.8G    | 2.75h  |
+We ran performance tests on the three methods. Since `median` and `median-cov` differ only in the ibaq calculation, both are reported as `median` below. `median` works at the sample level: unlike `quantile`, it does not read all the data at once but reads it in batches (20 samples at a time by default), which greatly reduces memory consumption.
+
+<table align="center">
+    <thead>
+        <tr>
+            <th>Project</th>
+            <th>File size (original)</th>
+            <th>File size (transformed)</th>
+            <th>MS runs</th>
+            <th>Samples</th>
+            <th>Method</th>
+            <th>Memory</th>
+            <th>Run time</th>
+        </tr>
+    </thead>
+    <tbody>
+        <tr>
+            <td rowspan=2>PXD016999.1</td>
+            <td rowspan=2>5.7 G</td>
+            <td rowspan=2>292 M</td>
+            <td rowspan=2>336</td>
+            <td rowspan=2>280</td>
+            <td>quantile</td>
+            <td>36.4 G</td>
+            <td>14 min</td>
+        </tr>
+        <tr>
+            <td>median</td>
+            <td>8.4 G</td>
+            <td>20 min</td>
+        </tr>
+        <tr>
+            <td rowspan=2>PXD019909</td>
+            <td rowspan=2>1.9 G</td>
+            <td rowspan=2>171 M</td>
+            <td rowspan=2>43</td>
+            <td rowspan=2>43</td>
+            <td>quantile</td>
+            <td>7.9 G</td>
+            <td>30 s</td>
+        </tr>
+        <tr>
+            <td>median</td>
+            <td>4.0 G</td>
+            <td>1.4 min</td>
+        </tr>
+        <tr>
+            <td rowspan=2>PXD010154</td>
+            <td rowspan=2>1.9 G</td>
+            <td rowspan=2>287 M</td>
+            <td rowspan=2>1367</td>
+            <td rowspan=2>38</td>
+            <td>quantile</td>
+            <td>32.1 G</td>
+            <td>8 min</td>
+        </tr>
+        <tr>
+            <td>median</td>
+            <td>16.2 G</td>
+            <td>12 min</td>
+        </tr>
+        <tr>
+            <td rowspan=2>PXD030304</td>
+            <td rowspan=2>167 G</td>
+            <td rowspan=2>15.8 G</td>
+            <td rowspan=2>6862</td>
+            <td rowspan=2>2013</td>
+            <td>quantile</td>
+            <td>> 128 G</td>
+            <td>> 2 days</td>
+        </tr>
+        <tr>
+            <td>median</td>
+            <td>13.1 G</td>
+            <td>2.75 h</td>
+        </tr>
+    </tbody>
+</table>
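The memory difference in the table follows from the batching strategy described above. The sketch below shows one way a sample-level median normalization can run in batches of 20 samples. It is not ibaqpy's implementation: `load_batch` is a hypothetical callable that returns the rows for a list of samples, the column names are assumed, and scaling every sample to the median of the per-sample medians is an assumption about how the medians are made equal.

```python
from typing import Callable, Dict, Iterator, List

import pandas as pd


def sample_batches(samples: List[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` sample names."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def median_normalize_in_batches(
    load_batch: Callable[[List[str]], pd.DataFrame],  # hypothetical loader
    samples: List[str],
    batch_size: int = 20,
) -> Iterator[pd.DataFrame]:
    """Two-pass median normalization that never holds all samples at once."""
    # Pass 1: collect one median per sample, batch by batch.
    medians: Dict[str, float] = {}
    for batch in sample_batches(samples, batch_size):
        df = load_batch(batch)
        medians.update(df.groupby("SampleID")["Intensity"].median().to_dict())
    # Assumed target: the median of the per-sample medians.
    target = pd.Series(medians).median()
    # Pass 2: rescale each batch so every sample median equals the target.
    for batch in sample_batches(samples, batch_size):
        df = load_batch(batch).copy()
        df["Intensity"] = df["Intensity"] / df["SampleID"].map(medians) * target
        yield df
```

Only one batch is in memory at a time, at the cost of reading the data twice, which is consistent with the longer run times and lower memory use reported for `median` on the smaller projects.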
diff --git a/ibaqpy/ibaq/peptide_normalization.py b/ibaqpy/ibaq/peptide_normalization.py
@@ -194,9 +194,9 @@ def data_common_process(data_df: pd.DataFrame, min_aa: int) -> pd.DataFrame:
     data_df = data_df[data_df["Condition"] != "Empty"]
 
     # Filter peptides with less amino acids than min_aa (default: 7)
-    data_df = data_df[
-        data_df.apply(lambda x: len(x[PEPTIDE_CANONICAL]) >= min_aa, axis=1)
-    ]
+    data_df.loc[:,'len'] = data_df[PEPTIDE_CANONICAL].apply(len)
+    data_df = data_df[data_df['len']>=min_aa]
+    data_df.drop(['len'],inplace=True,axis=1)
     data_df[PROTEIN_NAME] = data_df[PROTEIN_NAME].apply(parse_uniprot_accession)
     if FRACTION not in data_df.columns:
         data_df[FRACTION] = 1
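As a possible follow-up to this hunk, the same length filter can be written without the temporary `len` column by using the vectorized string accessor. This is a sketch rather than the committed code: the value of `PEPTIDE_CANONICAL` is assumed here, and `filter_short_peptides` is a hypothetical helper name.

```python
import pandas as pd

PEPTIDE_CANONICAL = "PeptideCanonical"  # assumed value of the constant


def filter_short_peptides(data_df: pd.DataFrame, min_aa: int = 7) -> pd.DataFrame:
    """Keep peptides whose canonical sequence has at least `min_aa` residues."""
    # Vectorized length check; rows with a missing sequence compare as False.
    keep = data_df[PEPTIDE_CANONICAL].str.len() >= min_aa
    # Return a copy so later column assignments do not write into a view.
    return data_df.loc[keep].copy()
```

Returning an explicit copy also avoids the `SettingWithCopyWarning` that pandas can raise when columns are later assigned on a filtered frame.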
@@ -561,7 +561,8 @@ def peptide_normalization(
         technical_repetitions, label, sample_names, choice = analyse_sdrf(sdrf)
     else:
         technical_repetitions, label, sample_names, choice = feature.experimental_inference
-    low_frequency_peptides = feature.low_frequency_peptides
+    if remove_low_frequency_peptides:
+        low_frequency_peptides = feature.low_frequency_peptides
     header = False
     if not skip_normalization and pnmethod == "globalMedian":
         med_map = feature.get_median_map()