Incorporating binary labels into kernel distance #133

Open
helenhe96 opened this issue Feb 15, 2018 · 9 comments

Comments

@helenhe96
Collaborator

No description provided.

@ArtPoon
Contributor

ArtPoon commented Feb 16, 2018

  1. Make a new branch
  2. Move tree-processing level functions from tree-kernel.R to a new file
  3. tree.kernel should take regular expressions as label arguments instead of expecting character vectors
  4. The regexes should be applied within tree.kernel() to classify the tip labels from each tree into a finite number of categories that can be represented by an integer-valued vector. These two integer vectors will be passed to the C-level kernel computation.
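
A minimal sketch of step 4, assuming tip labels carry the state as a suffix after an underscore. The helper name `classify.tips`, the example labels, and the pattern are hypothetical and not part of Kaphi; only base R `gsub()` and `match()` are used.

```r
# Hypothetical helper (not part of Kaphi): classify tip labels with one regex.
classify.tips <- function(tip.labels, pattern, replacement = "\\1", states = NULL) {
  # extract the state substring from every tip label
  extracted <- gsub(pattern, replacement, tip.labels)
  # by default, the categories are the sorted unique substrings
  if (is.null(states)) states <- sort(unique(extracted))
  # integer code = position of the substring in `states` (NA if not found)
  match(extracted, states)
}

# made-up tip labels for two trees
tips1 <- c("t1_blood", "t2_brain", "t3_blood")
tips2 <- c("s1_brain", "s2_brain", "s3_blood")

# sharing one pattern and one set of states keeps the coding identical
# for both trees, whichever is passed first
pattern <- "^.+_([a-z]+)$"
states <- c("blood", "brain")
codes1 <- classify.tips(tips1, pattern, states = states)
codes2 <- classify.tips(tips2, pattern, states = states)
```

Passing the same `states` vector for both trees is a design choice in this sketch: if each tree derived its own categories, the same integer could mean different states in the two vectors.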

@ArtPoon
Contributor

ArtPoon commented Feb 20, 2018

@gtng92 pointed out that the kernel distance can be called on trees x and y as k(x,y) or k(y,x), and that if we define two regular expressions then these trees could potentially be processed differently. After discussion we decided to use just one regex for kernel distances.

helenhe96 added a commit that referenced this issue Feb 20, 2018
ArtPoon added a commit that referenced this issue Feb 23, 2018
Deleted deprecated config parsing code from smcConfig.R
Eliminated caching of "self" kernel scores to trees in treekernel.R
@ArtPoon
Contributor

ArtPoon commented Feb 26, 2018

Please write unit tests to check whether the labeled kernel function is behaving properly before closing.

@ArtPoon
Contributor

ArtPoon commented Feb 27, 2018

On branch issue133, we presently have this in treekernel.R (dropping commented lines):

tree.kernel <- function(tree1, tree2,
                        lambda,               # decay factor
                        sigma,                # RBF variance parameter
                        rho=1.0,              # SST control parameter; 0 = subtree kernel, 1 = subset tree kernel
                        normalize=0,          # normalize kernel score by sqrt(k(t1,t1) * k(t2,t2))
                        regexPattern="",      # arguments for labeled tree kernel
                        regexReplacement="",
                        gamma=0               # label factor
                        ) {
  # make labels
  # NOTE: label1 and label2 are neither arguments of this function nor
  # defined in its body, so this check depends on objects outside the function
  use.label <- if (any(is.na(label1)) || any(is.na(label2)) || is.null(label1) || is.null(label2)) {
    FALSE
  } else {
    # rewrite tip labels by regex substitution
    # NOTE: new_label1 and new_label2 are not referenced again in the lines shown
    new_label1 <- gsub(regexPattern, regexReplacement, tree1$tip.label)
    new_label2 <- gsub(regexPattern, regexReplacement, tree2$tip.label)
    TRUE
  }

  # serialize both trees to Newick strings for the C level
  nwk1 <- .to.newick(tree1)
  nwk2 <- .to.newick(tree2)

  res <- .Call("R_Kaphi_kernel",
               nwk1, nwk2, lambda, sigma, as.double(rho), use.label, gamma, normalize,
               PACKAGE="Kaphi")
  return(res)
}

We want to make these changes:

  1. The user provides regular expressions that determine how the substrings that define states are extracted from tip labels. Tip labels have to be unique, but must also share some substring that tells us whether two tips are in the same state, e.g., were sampled from the same compartment.
  2. Instead of gamma, the user should pass a matrix of weights that includes row and column names. These names should correspond to the substrings that are extracted from tip labels by the regular expression.
  3. This function should use both arguments to convert the tip labels of each tree into integer-valued vectors, where the integers are indices into the weight matrix. The two integer vectors and the weight matrix (without row/column names) are passed to the C function as vectors (for the matrix, the number of rows and columns is given by the maximum integer values in the respective integer vectors).
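
The three changes above could be sketched as follows. This is only an illustration under stated assumptions: the helper `labels.to.index`, the tip labels, the pattern, and the weight values are all made up, and rows are assumed to index states in the first tree and columns states in the second.

```r
# Hypothetical weight matrix; the values are illustrative only.
# Row names apply to tree 1 states, column names to tree 2 states.
weights <- matrix(c(1.0, 0.2,
                    0.2, 1.0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("blood", "brain"), c("blood", "brain")))

# Hypothetical helper: map tip labels to indices into one margin of the matrix.
labels.to.index <- function(tip.labels, pattern, state.names, replacement = "\\1") {
  states <- gsub(pattern, replacement, tip.labels)
  idx <- match(states, state.names)
  if (anyNA(idx)) {
    stop("state not found in weight matrix: ",
         paste(unique(states[is.na(idx)]), collapse = ", "))
  }
  idx
}

pattern <- "^.+_([a-z]+)$"
idx1 <- labels.to.index(c("t1_blood", "t2_brain"), pattern, rownames(weights))
idx2 <- labels.to.index(c("s1_brain", "s2_brain"), pattern, colnames(weights))

# drop the names and flatten (column-major) before handing off to C
w.vec <- as.vector(weights)

# the weight for tip i of tree 1 and tip j of tree 2 can be recovered from
# the flat vector, which is what a C routine would have to do
i <- idx1[1]; j <- idx2[1]
w.ij <- w.vec[(j - 1) * nrow(weights) + i]
```

Failing loudly on an unknown state is a choice made for this sketch; silently passing `NA` indices down to compiled code would be much harder to debug.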

@ArtPoon
Contributor

ArtPoon commented Feb 27, 2018

regexReplacement should be "\\1" by default (capture a single group). There may be situations where we want to concatenate two or more groups, so I guess we can let the user define a more complex replacement like "\\1\\2".
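
For illustration, here is how the two replacement strings behave with base R `gsub()`. The tip labels and the pattern are invented for this example; note that the backslash has to be doubled inside an R string literal.

```r
# made-up tip labels: patient ID, compartment, sampling year
tips <- c("patient1_blood_2001", "patient2_brain_2003")

# two capture groups: the compartment and the year
pattern <- "^[^_]+_([a-z]+)_([0-9]+)$"

# default: keep only the first captured group
single <- gsub(pattern, "\\1", tips)

# concatenate both captured groups into one label
combined <- gsub(pattern, "\\1\\2", tips)
```

Here `single` holds the compartment names alone, while `combined` glues compartment and year together into a more specific state.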

@gtng92
Collaborator

gtng92 commented Mar 2, 2018

On branch issue133, implementation changed so that the weight matrix is no longer necessary.

  1. user provides regex to extract the substrings from the tip labels
  2. user provides character vector of all possible states
  3. each tip label is assigned a binary-encoded integer value reflecting the state(s) found in the tip label (5fe35e1)
  4. integer vectors are passed down into C level (24aed96), where the integer value is then decoded and the different states are matched or mismatched
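
A rough sketch of what such a binary encoding might look like, written in R rather than C for readability. It assumes one bit per possible state; the function names `encode.states` and `decode.states` and the example states are hypothetical, and the actual scheme in the referenced commits may differ.

```r
# all possible states, as supplied by the user (made-up values)
states <- c("blood", "brain", "lymph")

# Hypothetical encoder: state k sets bit k-1, so a tip with several
# states gets the sum of the corresponding powers of two.
encode.states <- function(tip.states, states) {
  vapply(tip.states, function(s) {
    idx <- match(s, states)
    if (anyNA(idx)) stop("unknown state")
    as.integer(sum(2^(unique(idx) - 1)))
  }, integer(1))
}

# Hypothetical decoder: test each bit of the code against its mask.
decode.states <- function(code, states) {
  masks <- as.integer(2^(seq_along(states) - 1))
  states[bitwAnd(rep(code, length(masks)), masks) != 0L]
}

# one character vector of extracted states per tip
codes <- encode.states(list("blood", "brain", c("blood", "lymph")), states)

# two tips match on at least one state when their codes share a set bit
shared.1.3 <- bitwAnd(codes[1], codes[3]) != 0L   # blood vs blood+lymph
shared.2.3 <- bitwAnd(codes[2], codes[3]) != 0L   # brain vs blood+lymph
```

With integer bit masks this approach is limited to about 31 distinct states, which is one trade-off against passing a weight matrix.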

@ArtPoon
Contributor

ArtPoon commented Mar 12, 2018

We want to refactor the kernel to encode labels in each node's production. Whereas before productions could only take one of four values (0 for a terminal node, 1 for a node with two non-terminal descendants, etc.), we now want each internal node to have a tuple (pair) of integers for its production, and to reserve the integer value -1 for when the descendant is an internal node.
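
One way to picture the proposed productions, as a sketch only. It assumes a binary tree stored as a (parent, child) edge matrix in the style of ape's `phylo$edge`, with tips numbered 1..n; the function `get.productions` and the sorting of each pair are assumptions of this example, not the implemented design.

```r
# toy tree ((1,2),3): node 4 is the root, node 5 is the parent of tips 1 and 2
edge <- matrix(c(4, 5,
                 4, 3,
                 5, 1,
                 5, 2), ncol = 2, byrow = TRUE)
tip.codes <- c(1L, 2L, 1L)   # integer label of each tip (made up)

# Hypothetical helper: one pair of integers per internal node.
get.productions <- function(edge, tip.codes) {
  n.tips <- length(tip.codes)
  parents <- sort(unique(edge[, 1]))
  prods <- t(vapply(parents, function(p) {
    children <- edge[edge[, 1] == p, 2]
    stopifnot(length(children) == 2)   # binary trees only
    # a tip contributes its label code, an internal node contributes -1
    codes <- ifelse(children <= n.tips, tip.codes[children], -1L)
    # sort so the pair does not depend on the order of the children
    sort(as.integer(codes))
  }, integer(2)))
  rownames(prods) <- parents
  prods
}

prods <- get.productions(edge, tip.codes)
```

In this toy tree the root (node 4) gets the pair (-1, 1) because one child is internal and the other is a tip in state 1, while node 5 gets (1, 2).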

@ArtPoon ArtPoon added this to the version 0.3 milestone Mar 19, 2018
@ArtPoon
Contributor

ArtPoon commented Mar 26, 2018

New labeled kernel is being prototyped in Python, see PoonLab/coevolution phyloK3.py

@ArtPoon
Contributor

ArtPoon commented Jun 12, 2019

Need to port Python implementation into R


3 participants