Skip to content

Conversation

@Kevin-Brockers
Copy link

@Kevin-Brockers Kevin-Brockers commented Nov 3, 2025

User description

Hi all,

here are some proposed changes to parse UNIMOD combined modifications.

  • Rational: Comet in the bigbio/quantms pipeline breaks when Met-loss+Acetyl is set as a variable modification. The current version of parse_sdrf convert-openms converts it to Met-loss+acetyl

  • By UNIMOD convention, combined modifications should be delimited with '+' (see: https://www.unimod.org/names.html, Rule - 11)

  • I propose a minimal implementation. It is likely not solving all UNIMOD combined modifications, since some the it.capitalize() might still break some, but I believe the proposed changes should cover the majority.


PR Type

Bug fix


Description

  • Fix parsing of combined UNIMOD modifications delimited with '+'

  • Preserve capitalization of individual components in combined modifications

  • Prevent incorrect lowercasing that broke Comet pipeline compatibility


Diagram Walkthrough

flowchart LR
  A["Combined Modification<br/>e.g. Met-loss+Acetyl"] -->|Extract Name| B["Split by '+'"]
  B -->|Capitalize Each| C["Met-loss+Acetyl"]
  C -->|Lookup PTM| D["Valid UNIMOD Entry"]
Loading

File Walkthrough

Relevant files
Bug fix
openms.py
Handle combined UNIMOD modifications with proper capitalization

sdrf_pipelines/openms/openms.py

  • Added detection for combined modifications containing '+' delimiter
  • Split combined modifications and capitalize each component
    individually
  • Preserve '+' as separator between modification components
  • Maintain backward compatibility for single modifications
+6/-1     

Summary by CodeRabbit

  • Bug Fixes
    • Improved normalization of combined protein modification names in OpenMS converter to preserve proper formatting (e.g., "Phospho+Oxidation"), ensuring more accurate downstream identification and name resolution of post-translational modifications from the database.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 3, 2025

Walkthrough

Modified UNIMOD modification name normalization in the OpenMS converter to handle combined PTM names. When a modification name contains '+' delimiters, each segment is capitalized individually before joining; otherwise, the name is capitalized as a whole.

Changes

Cohort / File(s) Summary
OpenMS modification normalization
sdrf_pipelines/openms/openms.py
Updated NT name capitalization logic to segment-wise capitalization for combined modification names (containing '+') while preserving existing capitalization behavior for single names

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Focus area: The conditional logic for modification name segmentation and capitalization to ensure it handles both combined and single modification names correctly
  • Verify edge cases: modification names with multiple '+' delimiters, empty segments, and interaction with existing unimod database lookup

Poem

🐰 A modification split in twain,
With '+' marks and careful strain,
Each segment gleams with caps so bright,
Combined names now normalize right!

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title 'Updated parsing of combined modifications' directly aligns with the main change described in the PR objectives. The core modification involves updating how UNIMOD combined modifications (delimited by '+') are parsed and normalized in the OpenMS converter, specifically to preserve proper capitalization of each segment in combined modifications like 'Met-loss+Acetyl'. The title accurately reflects this primary objective.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-merge-pro
Copy link

qodo-merge-pro bot commented Nov 3, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
No audit logs: The new parsing logic for UNIMOD names adds decision paths without any logging of actions
or outcomes, making it unclear how modifications are transformed during processing.

Referred Code
# Check for combined modifications like "Phospho+Oxidation"
if '+' in name:
    name = [it.capitalize() for it in name.split('+')]
    name = '+'.join(name)
else:
    name = name.capitalize()

accession = re.search("AC=(.+?)(;|$)", mod_string).group(1)
ptm = self._unimod_database.get_by_accession(accession)
if ptm is not None:
    name = ptm.get_name()
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Edge cases unhandled: The split-and-capitalize path for combined modifications does not handle empty segments or
malformed names and provides no contextual error/logging if parsing fails.

Referred Code
# Check for combined modifications like "Phospho+Oxidation"
if '+' in name:
    name = [it.capitalize() for it in name.split('+')]
    name = '+'.join(name)
else:
    name = name.capitalize()
Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Input not validated: The new '+'-based parsing trusts the extracted name without additional
validation or sanitization of combined components, which may allow unexpected tokens to
pass through.

Referred Code
# Check for combined modifications like "Phospho+Oxidation"
if '+' in name:
    name = [it.capitalize() for it in name.split('+')]
    name = '+'.join(name)
else:
    name = name.capitalize()
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-merge-pro
Copy link

qodo-merge-pro bot commented Nov 3, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix incorrect combined modification parsing

Move the UniMod accession lookup to only execute for single modifications,
preventing the overwriting of combined modification names.

sdrf_pipelines/openms/openms.py [245-258]

 name = re.search("NT=(.+?)(;|$)", mod_string).group(1)
 # Check for combined modifications like "Phospho+Oxidation"
 if '+' in name:
-    name = [it.capitalize() for it in name.split('+')]
-    name = '+'.join(name)
+    name = '+'.join([it.capitalize() for it in name.split('+')])
 else:
     name = name.capitalize()
-
-accession = re.search("AC=(.+?)(;|$)", mod_string).group(1)
-ptm = self._unimod_database.get_by_accession(accession)
-if ptm is not None:
-    name = ptm.get_name()
+    accession = re.search("AC=(.+?)(;|$)", mod_string).group(1)
+    ptm = self._unimod_database.get_by_accession(accession)
+    if ptm is not None:
+        name = ptm.get_name()
 
 return name
  • Apply / Chat
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a bug in the PR's logic where combined modification names are overwritten, and the proposed fix correctly resolves this issue.

High
  • More

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
sdrf_pipelines/openms/openms.py (1)

246-251: Now let me examine the test files to understand what's currently being tested and check the actual implementation context:

Based on my verification analysis, I found:

  1. No existing test cases for combined modifications in the test files
  2. SDRF format validation includes whitespace validators, supporting the need for defensive .strip() handling
  3. Single method call site at line 312, meaning impact is localized
  4. UNIMOD database lookup provides a safety net for canonical modification names
  5. No SDRF test data with combined modifications found in the repository

The original review comment makes valid suggestions about edge cases, but lacks context about test coverage and UNIMOD lookup behavior. The suggested improvements are defensive programming measures that are reasonable but may not be strictly necessary for well-formed SDRF input.


Enhance robustness for combined modifications with whitespace and empty segment handling.

The logic correctly addresses combined modifications, with UNIMOD database lookup (lines 253-256) providing a fallback for canonical names. However, two defensive measures improve robustness:

  1. Whitespace handling: SDRF validation includes whitespace validators, so user input might have spaces around delimiters. Stripping each segment ensures proper matching.

  2. Empty segments: While unlikely with valid UNIMOD format, filtering prevents issues from consecutive '+' characters.

Apply this enhancement:

 # Check for combined modifications like "Phospho+Oxidation"
 if '+' in name:
-    name = [it.capitalize() for it in name.split('+')]
+    name = [it.strip().capitalize() for it in name.split('+') if it.strip()]
     name = '+'.join(name)
 else:
     name = name.capitalize()

Note: The capitalize() limitation (e.g., "DiMethyl" → "Dimethyl") acknowledged in the PR is mitigated by UNIMOD database lookup, which retrieves canonical names when available.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b7dd36 and d0a8f13.

📒 Files selected for processing (1)
  • sdrf_pipelines/openms/openms.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant