Add start and end positions in output matrix #135

acesnik · 2024-05-04T17:01:03Z

I'm drafting a PR for #110

lazear

Looks good overall, left a few specific comments.

Additional considerations not addressed:
When identical peptide sequences are deduplicated after modifications are applied, the protein positions need to be merged along with the protein identifiers:

sage/crates/sage/src/database.rs

Line 197 in a969ca7

keep.proteins.extend(remove.proteins.iter().cloned());

Ensure that positions and proteins identified are co-sorted:

sage/crates/sage/src/database.rs

Line 206 in a969ca7

.for_each(|peptide| peptide.proteins.sort_unstable());

Might be best to define a new struct for assigning proteins that includes both the identifier and positions, to preclude any potential bugs from the above.

struct ProteinAssignment {
   identifier: String,
   // do these need to be usize (8 bytes)? u32 (4 bytes) is probably sufficient...
   start: usize,
   end: usize,
}
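
A minimal sketch of how such a struct could keep identifiers and positions together through both the merge and the sort. The `Peptide` here is a simplified stand-in, `merge_assignments` is a hypothetical helper (not an existing Sage function), and the `Arc<String>` identifier and `u32` positions are assumptions rather than decisions made in this PR:

```rust
use std::sync::Arc;

/// Pairs a protein identifier with the peptide's location in that protein.
/// Deriving `Ord` compares fields in declaration order: identifier, start, end.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ProteinAssignment {
    // Assumption: reuse the shared identifier already stored as Arc<String>.
    identifier: Arc<String>,
    // Assumption: u32 is wide enough for residue indices.
    start: u32,
    end: u32,
}

/// Simplified stand-in for the real `Peptide` struct.
#[derive(Debug)]
struct Peptide {
    sequence: String,
    proteins: Vec<ProteinAssignment>,
}

/// Hypothetical helper: fold the assignments of `remove` into `keep`,
/// then sort and drop exact duplicates. Because each element carries its
/// own positions, sorting cannot separate an identifier from its positions.
fn merge_assignments(keep: &mut Peptide, remove: &Peptide) {
    keep.proteins.extend(remove.proteins.iter().cloned());
    keep.proteins.sort_unstable();
    keep.proteins.dedup();
}
```

With a single `Vec<ProteinAssignment>`, the two call sites above (the `extend` during deduplication and the `sort_unstable` afterwards) would no longer need to be kept in sync by hand.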

lazear · 2024-05-06T17:59:07Z

crates/sage/src/peptide.rs

@@ -28,6 +28,10 @@ pub struct Peptide {
    pub position: Position,

    pub proteins: Vec<Arc<String>>,
+    /// What residue does this peptide start at in the protein (1-based inclusive)?
+    pub start_position: Vec<Arc<usize>>,


No need for Arc - this is a smart pointer (atomic reference counted) allocated on the heap. It's used for proteins: String (which is already heap allocated) to prevent repeated clones of protein identifiers. Doesn't make sense to use an Arc for a simple usize in this case!
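
To illustrate the difference, a small sketch using only the standard library. The function names are made up for this example. Cloning an `Arc<String>` bumps a reference count and shares one heap allocation, while a `usize` is `Copy` and can be stored directly:

```rust
use std::sync::Arc;

/// Hypothetical helper: attach one protein identifier to `n_peptides` peptides.
/// Each handle is a cheap pointer copy; the String itself is allocated once.
fn share_identifier(identifier: &str, n_peptides: usize) -> (Arc<String>, Vec<Arc<String>>) {
    let shared = Arc::new(identifier.to_string());
    let handles = (0..n_peptides).map(|_| Arc::clone(&shared)).collect();
    (shared, handles)
}

/// Hypothetical helper: a plain usize is copied by value, so no smart
/// pointer is needed to store the same start position many times.
fn copy_positions(start: usize, n_peptides: usize) -> Vec<usize> {
    vec![start; n_peptides]
}
```

An `Arc<usize>` is itself pointer-sized, so wrapping the position would add a heap allocation and an atomic counter without saving any memory.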

Ah, that makes sense!

lazear · 2024-05-06T18:00:59Z

crates/sage/src/enzyme.rs

@@ -76,6 +82,7 @@ impl std::hash::Hash for Digest {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        self.sequence.hash(state);
        self.position.hash(state);
+        self.start_position.hash(state);


I don't think this is correct behavior (won't necessarily cause a bug though). We use hash to deduplicate digests prior to creating peptides.

Only sequence and position are considered to ensure that we don't accidentally deduplicate peptide sequences that occur on protein termini, which might be assigned termini-specific modifications. Including start_position will lead to extra (duplicated) digests being created, which means more duplicated peptides that must be generated (expensive) and then discarded during deduplication.
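
A simplified sketch of that behavior, assuming a stripped-down `Digest` and `Position` (the real types have more fields and variants, and `unique_digests` is a hypothetical helper). Equality and hashing look only at `sequence` and `position`, so the same internal sequence found at two different offsets collapses to one digest, while an N-terminal occurrence stays separate:

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Simplified stand-in for the real `Position` enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Position {
    Nterm,
    Cterm,
    Internal,
}

/// Simplified stand-in for the real `Digest` struct.
#[derive(Clone, Debug)]
struct Digest {
    sequence: String,
    position: Position,
    start_position: usize,
}

// Equality must agree with hashing: start_position is ignored in both.
impl PartialEq for Digest {
    fn eq(&self, other: &Self) -> bool {
        self.sequence == other.sequence && self.position == other.position
    }
}
impl Eq for Digest {}

impl Hash for Digest {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.sequence.hash(state);
        self.position.hash(state);
    }
}

/// Hypothetical helper: count digests remaining after set-based deduplication.
fn unique_digests(digests: Vec<Digest>) -> usize {
    digests.into_iter().collect::<HashSet<_>>().len()
}
```

If `start_position` were added to `hash` (and `eq`), the two internal occurrences would both survive, and both would be expanded into peptides before being merged again later.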

Okay, that makes sense. I was thinking about maybe including a list of start indices for each protein in the proteins list, but that probably doesn't make sense from a peptide-centric search perspective, and other search engines just list the index for the leading protein.

lazear · 2024-05-06T18:02:38Z

crates/sage-cli/src/output.rs

            "rank",
            "label",
-            "expmass",
-            "calcmass",
+            "measured_mass",


What's the rationale for changing column names?

Oh, the rationale here, although I'm not wedded to the new names, is to 1) keep the underscore delimiting used in the rest of the columns, 2) spell out calculated and experimental/measured, and 3) drop the number after scan, which doesn't feel necessary since every entry has "scan=[scan number]".

I'm not opposed to it, but the column names have been stable for ~2.5 years so I don't see a huge need to rename them (and thus force people to retool pipelines downstream of Sage). They were originally named like this because Sage results used to be passed directly to mokapot for rescoring.

lazear · 2024-05-06T18:09:01Z

crates/sage/src/peptide.rs

@@ -373,6 +380,8 @@ impl TryFrom<Digest> for Peptide {
            missed_cleavages: value.missed_cleavages,
            semi_enzymatic: value.semi_enzymatic,
            proteins: vec![value.protein],
+            start_position: value.start_position,


Must be a Vec of positions.
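
A rough sketch of what that could look like, with simplified stand-ins for `Digest` and `Peptide` and a plain `From` impl in place of the real `TryFrom` (whose error type isn't shown here). The single start position from the digest is wrapped in a one-element `Vec`, parallel to `proteins`:

```rust
use std::sync::Arc;

/// Simplified stand-in for the real `Digest`: one protein, one start position.
struct Digest {
    sequence: String,
    protein: Arc<String>,
    start_position: usize,
}

/// Simplified stand-in for the real `Peptide`: one entry per assigned protein.
struct Peptide {
    sequence: String,
    proteins: Vec<Arc<String>>,
    start_position: Vec<usize>,
}

// Assumption: `From` is used here only to keep the sketch short.
impl From<Digest> for Peptide {
    fn from(value: Digest) -> Self {
        Peptide {
            sequence: value.sequence,
            proteins: vec![value.protein],
            // Wrap the single position so it lines up index-for-index with `proteins`.
            start_position: vec![value.start_position],
        }
    }
}
```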

acesnik · 2024-05-07T21:50:22Z

Thanks for the feedback on this! I'll continue working on the PR and either mark it ready for review when it's done or ask follow-up questions along the way.

starting on recording start and end positions in output matrix

0e39333

lazear reviewed May 6, 2024

incorporating feedback, still in progress

745e5af
