New unimod developments #7

lgatto · 2019-02-08T11:14:51Z

Hi @sgibb - following up with our conversation about unimod at the EuroBioc2018 conference, I would suggest the following class (see also #6).

setClass("Modification",
         slots = c(modification = "character",
                   position = "integer",
                   n = "integer"))

where

modification refers to a specific modification in unimod::modifications.
position is an optional integer defining where in a peptide sequence a modification is expected. Default would be NA, for a modification to be present on any of the amino acids of that peptide
n is an integer(0 - n) where 0:1 defines a variable modification, 1 a fixed modification, and 1 - n the case where 1 to n modifications can be present (@pavel-shliaha's example of methyl, dimethyl, trimethyl)

From there on we could build a `ModificationList. What do you think?

@tfkillian, who's starting to work with me, will have a go once we agree that this is the way forward.

The text was updated successfully, but these errors were encountered:

pavel-shliaha · 2019-02-08T15:19:19Z

A quick input from my collegues (not sure how this can be implemented):

There are instances where modification is fixed but only on one residue, i.e. when you know a location of a PTM
there are many instances whereby the exact position of a modification is unknown, but it is known there is a certain amount of these modifications is present, e.g. you might know there are 2 Met oxidations on some of the 5 Met

sgibb · 2019-02-08T16:37:48Z

While I didn't work on this since the EuroBioc 2018 I now pushed my draft with the classes into the ModificationClass branch.

There I have a complex class hierarchy implemented:

Modification (VIRTUAL)
|- ModificationFixed # slot: @site, fixed modification, that is always applied on each occurence
|   |- ModificationFixedLocalized # slot: @globalIndex: apply mod if on the `globalIndex`th amino acid in the sequence, @siteIndex: apply mod if on the `siteIndex` occurence of the specific amino acid/pattern
|   |    `- ModificationVariable # slot: @globalMaxN, @siteMaxN: maximal occurrence in the whole peptide sequence or at a specific amino acid
|   |
|    `- ModificationFixedRegExp # slot @replacement: apply mod and replace amino acid
|
`- ModificationTerm (VIRTUAL)
     |- ModificationNterm # apply N-term modification (e.g. neutral loss)
     `- ModificationCterm # apply C-term modification

The user just needs to call the universal constructor UnimodModification("UnimodDescription") and the corresponding class is generated. Some modifications, e.g. *Met-loss+Acetyl:P-M" are actually combined modifications. That's why UnimodModification always returns a list.
A simple example:

x <- "GKHNICHHGFBEOGMFHHLBRRQPQARHKSKFRMIJAGCSDGDQOHGRSKAMTSMBLDAIMGFAGENRGSPB"
m <- c(
    UnimodModification("Met-loss+Acetyl:P-M"),
    UnimodModification("Acetyl:K"),
    UnimodModification("Acetyl:N-term"),
    UnimodModification("Acetyl:S", fixed=FALSE, globalIndex=30, siteIndex=3:5) 
    # 30 40 49 54 70
)
m

sapply(m, class)
# [1] "ModificationFixedRegExp" "ModificationFixedRegExp"
# [3] "ModificationFixed"       "ModificationNterm"      
# [5] "ModificationVariable"   

lapply(m , .countSites, pepseq=x)
# [[1]]
# [1] 0
# 
# [[2]]
# [1] 0
# 
# [[3]]
# [1] 4
# 
# [[4]]
# [1] 1
# 
# [[5]]
# [1] 0 1 2 3 4 5

.pepseq(m[[1]], x)
# [1] "GKHNICHHGFBEOGMFHHLBRRQPQARHKSKFRMIJAGCSDGDQOHGRSKAMTSMBLDAIMGFAGENRGSPB"

deltaMass(m, x)
# [[1]]
# [1] 0
# 
# [[2]]
# [1] 0
# 
# [[3]]
# [1] 168.0423
# 
# [[4]]
# [1] 42.01056
# 
# [[5]]
# [1]   0.00000  42.01056  84.02113 126.03169 168.04226 210.05282

@tfkillian: Feel free to completely remove this draft or build upon it. Your are welcome to ask any question. I am happy to help and review your code if you wish.

(I wrote this at the train journey back home from the EuroBioc2018. I didn't touch it since then. Unfortunately there is no documentation and no unit tests in the ModificationClass branch.)

Some notes I had taken at the EuroBioc 2018 (some may be outdated or do not correspond to the draft in the ModificationClass branch.

2 use cases: modify mass and modify sequence (somehow coupled)

modify mass

Fixed modifications on all residues of a specific type (already
possible, could replace the current functionality in MSnbase):

unimod:::.mass("MACE",
               fixedModifications="Carbamidomethyl:C")

## [1] 491.1508
## attr(,"sequence")
## [1] "MACE"

unimod:::.mass(c("ACE", "MACE", "CDE"),
               fixedModifications=c("Acetyl:N-term",
                                    "Carbamidomethyl:C"))

## [1] 402.1209 533.1614 446.1107
## attr(,"sequence")
## [1] "ACE"  "MACE" "CDE"

Fixed modification on a specific residual (e.g. just modify the 2nd
K).
Variable modification (modification on the 1st and/or 2nd, and/or
3rd … K or none at all, becomes more complicated for multiple
modifications, e.g. up to 3 modifications like methylation and/or
acetylation on 1st/2nd/3rd/… K ).

K methyl 14.015650
K dimethyl 28.031300
K trimethyl 42.046950

modify sequence

Some modifications like Met-loss modify the sequence, e.g. remove M:

unimod:::.mass("MACE", fixedModifications="Met-loss:P-M")

## [1] 303.0889
## attr(,"sequence")
## [1] "ACE"

Problems

Allow custom modifications (not part of unimod project), that
could be fixed all/fixed specific/variable.
UI for input.
UI for output.

User Interface for Modification Input

Currently all unimod modifications are in a data.frame:

head(unimod::modifications)

##                              Id UnimodId   Name Description Composition
## Acetyl:K               Acetyl:K        1 Acetyl Acetylation H(2) C(2) O
## Acetyl:N-term     Acetyl:N-term        1 Acetyl Acetylation H(2) C(2) O
## Acetyl:C               Acetyl:C        1 Acetyl Acetylation H(2) C(2) O
## Acetyl:S               Acetyl:S        1 Acetyl Acetylation H(2) C(2) O
## Acetyl:P-N-term Acetyl:P-N-term        1 Acetyl Acetylation H(2) C(2) O
## Acetyl:T               Acetyl:T        1 Acetyl Acetylation H(2) C(2) O
##                 AvgMass MonoMass   Site       Position     Classification
## Acetyl:K        42.0367 42.01056      K       Anywhere           Multiple
## Acetyl:N-term   42.0367 42.01056 N-term     Any N-term           Multiple
## Acetyl:C        42.0367 42.01056      C       Anywhere Post-translational
## Acetyl:S        42.0367 42.01056      S       Anywhere Post-translational
## Acetyl:P-N-term 42.0367 42.01056 N-term Protein N-term Post-translational
## Acetyl:T        42.0367 42.01056      T       Anywhere Post-translational
##                 SpecGroup NeutralLoss        LastModified Approved Hidden
## Acetyl:K                1       FALSE 2017-11-08 16:08:56     TRUE  FALSE
## Acetyl:N-term           2       FALSE 2017-11-08 16:08:56     TRUE  FALSE
## Acetyl:C                3       FALSE 2017-11-08 16:08:56     TRUE   TRUE
## Acetyl:S                4       FALSE 2017-11-08 16:08:56     TRUE   TRUE
## Acetyl:P-N-term         5       FALSE 2017-11-08 16:08:56     TRUE  FALSE
## Acetyl:T                6       FALSE 2017-11-08 16:08:56     TRUE   TRUE

… and the user just needs to specify the Id:

unimod:::.mass("KKK", fixedModifications="Acetyl:K")

## [1] 510.3166
## attr(,"sequence")
## [1] "KKK"

To add custom modifications the user just need to supply a site (which
aminoacid should be modified) and the average and/or monoisotopic
mass of his modification (currently not supported but would easily fit
in the data.frame structure):

For fixed modification on a specific site we would need an additional
column in that data.frame, e.g. location that stores the index in the
peptide sequence that should be modified.

For variable modifications more columns would be useful: maximal number
of allowed modifications (e.g. 3), maximal number of same modifications
on one specific site (e.g. trimethylated K)

User Interface for output

While fixed all/fixed specific modifications produce just one
sequence and one mass result the output of variable modifications could
have multiple sequences/mass.

While for the mass calculation it doesn’t matter if the 1st, 2nd, 3rd, …
K is modified (if there is just one modification) it matters for the
sequence (especially if you are interested in fragments like in
MSnbase or topdownr).

lgatto · 2019-02-08T17:49:33Z

@pavel-shliaha these are use cases we are indeed considering. Thanks for the input.

sgibb · 2019-02-08T18:30:04Z

@pavel-shliaha sorry for ignoring you input in my first comment:

There are instances where modification is fixed but only on one residue, i.e. when you know a location of a PTM

That's what I called a fixed and localized modification and is supported by the ModificationFixedLocalized class in the current draft implementation.

there are many instances whereby the exact position of a modification is unknown, but it is known there is a certain amount of these modifications is present, e.g. you might know there are 2 Met oxidations on some of the 5 Met

That's what I called a variable modification and should be supported by the ModificationVariable class.

lgatto · 2019-02-08T23:03:32Z

Let me start by providing a few more implementation details:

setClassUnion("CharOrInt", c("character", "integer"))

.Modifiction <- setClass("Modification",
                         slots = c(modification = "character",
                                   n = "integer",
                                   position = "CharOrInt"))

## Constructor, setting defaults and checking inputs
Modification <- function(mod, n = NA_integer_, pos = NA_integer_) {
    if (missing(mod))
        stop("Please defined your modification")
    ## check that mod is a valid modification
    if (grepl("N-term", mod)) pos <- "N-term"
    if (grepl("C-term", mod)) pos <- "C-term"
    if (anyNA(n)) {
        n <- NA_integer_
    } else {
        if (length(n) == 1 && n <= 0)
            stop("Can't define an absent")
        if (any(n < 0))
            stop("Can't define a negative number of modifications")
    }
    .Modifiction(modification = mod, n = n, position = pos)
}

And now some definitions:

A variable modification is a modification can be present. n must contain a 0.
A fixed modification must be present a defined number of time. n must be of length 1.
I suggest we call a modification that can appear a variable number of times (including 0 time) a multiple modification. A variable modification is essentially a special case of a multiple modification, where the multiple is 0 or 1.
Note that a fixed modification can't be multiple, but it could carry a fixed number > 1 modification (i.e. n >= 1 below but must length(n) == 1).
A modification can be anywhere or at a defined position, and I suggest we call the latter positional modification. A positional modification can be either fixed or variable or multiple. If defined, the position is either "N-term", "C-term" or i, where 1 <= i <= nchar(pepseq), when the peptide sequence is available.

modType <- function(object) {
    type <- rep(FALSE, 4)
    names(type) <- c("variable", "fixed", "multiple", "positional")
    if (!is.na(object@position)) type["positional"] <- TRUE
    if (!anyNA(object@n)) {
        if (length(object@n) == 2 && identical(object@n, 0:1)) type["variable"] <- TRUE
        if (length(object@n) == 1 && object@n > 0) type["fixed"] <- TRUE
        if (length(object@n) > 1 && all(object@n > 0)) type["multiple"] <- TRUE
    }
    type
}

setMethod("show", "Modification",
          function(object) {
              type <- modType(object)
              cat(object@modification, " is ",
                  paste(names(type[type]), collapse = ", "),
                  ".\n", sep = "")
          })

Demonstration:

> Modification("Acetyl:N-term")
Acetyl:N-term is positional.
> Modification("Acetyl:C-term", n = 2L)
Acetyl:C-term is fixed, positional.
> Modification("Acetyl:K", n = 0:1)
Acetyl:K is variable.
> Modification("Acetyl:C-term", n = 0:1)
Acetyl:C-term is variable, positional.
> Modification("Methyl:S", pos = 2L, n = 0:3)
Methyl:S is positional.
> Modification("Methyl:S", pos = 2L, n = 1L)
Methyl:S is fixed, positional.
> Modification("Methyl:S", pos = 2L, n = 0:1)
Methyl:S is variable, positional.

The examples I consider here are a simplification compared to what is defined in unimod, hence I don't have all the if/else cases as in the ModificationClass branch classes. What I like is that we define the different types more formally, and we avoid a class hierarchy that I'm not convinced is needed (possibly a bit over-engineered). But again, I don't consider all cases, so may be this won't scale.

@sgibb what do you think?

sgibb · 2019-02-10T20:06:49Z

a class hierarchy that I'm not convinced is needed (possibly a bit over-engineered).

That's possible true.

While your current approach looks very simple it would work for most of the cases I guess. And I really like your fixed, variable, multiple, positional-approach.

But what if the user wants a modification of the second K? (your @position is my @globalIndex but what about the @siteIndex?)

I don't really like the CharOrInt definition for position because you will always have to use if (is.numeric(...)) or if (is.character(x@position) && x@position == "N-term") in all functions.
(OK you could argue that it doesn't matter if we frequently call if/else or dispatching methods).

lgatto · 2019-02-10T22:43:25Z

Indeed, I don't considered a siteIndex type of input. I imagine it could be added.

But is this a real use case/requested feature?

Re the int/char slot, we could use 0 or 1 and Inf. What do you think of that?

lgatto · 2019-02-11T05:33:43Z

Thinking back at this, I think we could drop the siteIndex slot and replace it with a findIndex(pepseq, aa, i) function (also findFirstIndex and findLastIndex) that returns that global/absolute position.

sgibb · 2019-02-11T20:40:36Z

But is this a real use case/requested feature?

I more or less assumed it because of this comment: lgatto/MSnbase#167 (comment)

Re the int/char slot, we could use 0 or 1 and Inf.

Mh, that would just change if(is.character(x@position) && x@position == "C-term") into if(!is.finite(x@position)). So your first approach (char/int) is much easier to read and to interpret. So keep it like that.

I agree with your:

setClass("Modification",
         slots = c(modification = "character",
                   position = "integer",
                   n = "integer"))

It is much simpler than my class hierarchy and currently I can't think of an example where it wouldn't work.

lgatto mentioned this issue Aug 16, 2019

Encoding PTMs parameters into one-line Experimental Design bigbio/proteomics-sample-metadata#13

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New unimod developments #7

New unimod developments #7

lgatto commented Feb 8, 2019

pavel-shliaha commented Feb 8, 2019 •

edited

Loading

sgibb commented Feb 8, 2019 •

edited

Loading

lgatto commented Feb 8, 2019

sgibb commented Feb 8, 2019

lgatto commented Feb 8, 2019

sgibb commented Feb 10, 2019

lgatto commented Feb 10, 2019 •

edited

Loading

lgatto commented Feb 11, 2019 •

edited

Loading

sgibb commented Feb 11, 2019

New unimod developments #7

New unimod developments #7

Comments

lgatto commented Feb 8, 2019

pavel-shliaha commented Feb 8, 2019 • edited Loading

sgibb commented Feb 8, 2019 • edited Loading

modify mass

modify sequence

Problems

User Interface for Modification Input

User Interface for output

lgatto commented Feb 8, 2019

sgibb commented Feb 8, 2019

lgatto commented Feb 8, 2019

sgibb commented Feb 10, 2019

lgatto commented Feb 10, 2019 • edited Loading

lgatto commented Feb 11, 2019 • edited Loading

sgibb commented Feb 11, 2019

pavel-shliaha commented Feb 8, 2019 •

edited

Loading

sgibb commented Feb 8, 2019 •

edited

Loading

lgatto commented Feb 10, 2019 •

edited

Loading

lgatto commented Feb 11, 2019 •

edited

Loading