-
Notifications
You must be signed in to change notification settings - Fork 1
time dependent constraints in gfpop
There are many R packages for changepoint detection, which is an important problem in many application domains (genomics, finance, etc). A new changepoint model with up-down constraints has shown state-of-the-art peak detection accuracy in genomic data, and more general models like this can be implemented using the new gfpop package (Runge et al, arXiv:2002.03646).
Currently the gfpop package only supports constraints which are valid for all time points. The goal of this project is to extend gfpop to allow constraints which depend on time. This will allow implementing fast optimal solvers for several new constrained changepoint models such as Labeled Optimal Partitioning.
Existing R packages fall into two categories:
- generic frameworks such as gfpop, Segmentor3IsBack, changepoint which allow computation of several different types of changepoint models. (but do not allow time dependent constraints)
- domain-specific packages such as LOPART, which computes one model with time dependent constraints (Gaussian loss, label constraints).
The goal of the coding project would be to extend gfpop to support time dependent constraints, so a wider variety of models could be specified in this framework.
I think the easiest way to implement this from a R/C++ interface perspective would be to
- 1. Allow user to pass to gfpop a new “rule” argument, a vector of length N (number of data) with entries that are integers which specify an update rule ID to use for each data point.
- 2. Add a “rule” argument in the Edge function. This is an integer which should correspond to the IDs provided in 1.
In practice this could be specified via R code like:
LOPART.graph <- gfpop::graph(
gfpop::Edge("normal", "normal", type="null", rule=1),
gfpop::Edge("normal", "normal", type="std", rule=1, penalty=5.5),
gfpop::Edge("normal", "normal", type="null", rule=2),
gfpop::Edge("noChange", "noChange", type="null", rule=2),
gfpop::Edge("normal", "noChange", type="std", rule=2),
gfpop::Edge("normal", "normal", type="std", rule=3),
gfpop::Edge("noChange", "normal", type="null", rule=3),
gfpop::Edge("normal", "normal", type="null", rule=4))
gfpop::gfpop(data.vec, LOPART.graph, rule=data.table::fcase(
unlabeled, 1,
in.positive.label, 2,
end.positive.label, 3,
in.negative.label, 4))
The idea in the code above is that the data.table::fcase call is like a switch statement that returns a vector with elements from 1 to 4 which specify what update rule/graph to use at each data point (variables unlabeled/in.positive.label etc above are logical vectors that depend on the provided label data in the LOPART model).
Then when you compute the DP updates you can, instead of using all of the edges, you can instead just use the subset of edges which correspond to the ruleID of the current data point.
The only tricky part is that we sometimes need to set a cost function to be infinity (if there are no edges coming to it at this data point) and the operators (min-less, min-more, min-env) need to be updated to handle the case of infinite inputs.
The goal for GSOC will be to implement at least
- up-down constrained model with label constraints (6 rule IDs).
- usual changepoint model with 0/1 label constraints (4 rule IDs, shown in R code above).
The gfpop package is already very useful, and incorporating time-dependent constraints will make it more useful.
Please get in touch after completing at least one of the tests below.
- EVALUATING MENTOR: Vincent Runge <[email protected]>
- Co-mentor: Toby Dylan Hocking <[email protected]>
Do one or several — doing more hard tests makes you more likely to be selected.
NOTE: this is a very difficult GSOC project, and so the ideal contributor really should do all tests below. If we can’t find a contributor that can do all these tests, we should probably not fund this GSOC project.
Easy: write an R function plotModel
which takes a gfpop::graph
as above and draws a matrix of nodes (one row for each state, one
column for each rule) and edges.
Medium: the Segmentor3IsBack package implements the segment neighborhood model (best models in
1,…,K segments) for the
Poisson loss and no constraints, but there is no implementation available for the
optimal partitioning model (best K-segment model for a single penalty
parameter, without computing the models with 1,…,K-1 segments). The
goal of this test is to modify the code in the PeakSegOptimal
package, in
order to implement a solver for the optimal partitioning problem with
Poisson loss and no constraints. Begin by studying PeakSegFPOPLog.cpp
which implements the optimal partitioning model for the Poisson loss
and the up-down constraints. There are two states in this model, up
and down. Since the up-down constrained solver has two states, there
are N x 2 optimal cost functions to compute (cost_model_mat
is of
dimension data_count*2
). The cost of being in the up state at
data_i
is cost_model_mat[data_i]
and the cost of being in the down
state is cost_model_mat[data_i+data_count]
. The
min_prev_cost.set_to_min_less_of(down_cost_prev)
method enforces a
non-decreasing constraint between adjacent segment means, for the
state change down->up. Analogously, the
PiecewisePoissonLossLog::set_to_min_more_of
method enforces a
non-increasing constraint for the state change up->down. To implement
the unconstrained solver, you just need to implement a new
PiecewisePoissonLossLog::set_to_unconstrained_min_of
method that
computes the minimum constant cost function (one PoissonLossPieceLog
object), and then uses that to compute the N x 1 array of optimal cost
functions (cost_model_mat
). Read about the FPOP algorithm in
Maidstone et al 2016 for more info. When you are done with your
implementation, check your work by comparing with the output of
Segmentor3IsBack::Segmentor(model=1)
. Perform model selection
yourself for a range of penalty parameters. Using testthat
, write
some test cases which make sure your function gives the same exact
model as the corresponding selected Segmentor model.
Hard: there is not yet an regularized isotonic regression solver for
the normal loss (issue), and your goal in this test is to implement
one. Like the unconstrained model described the Medium test, the
regularized isotonic regression model also has only one state. So you
can start by modifying the Medium test code, which should have a
cost_model_mat
which is N x 1. However the isotonic regression
constraint means that all changes are non-decreasing, so you should
use set_to_min_less_of
instead of set_to_unconstrained_min_of
. Now
the difficult part: the existing code in the coseg
package
implements the Poisson loss via class PoissonLossPieceLog
, but you
need to implement another class for the Normal loss,
NormalLossPiece
. This class will need to declare different
coefficients Constant
, Linear
, Quadratic
for a function
f(mean)=Constant + Linear*mean + Quadratic*mean^2. You will need to
provide implementations for get_smaller_root
and get_larger_root
by using the sqrt
function in #include <math.h>
. It will be judged
even better if you can get PoissonLossPieceLog
and NormalLossPiece
to inherit from the same base class with shared methods (that is the
approach that the Segmentor package uses to implement several loss
functions, and that is the approach that will be recommended for this
GSOC project). Check your work by writing a testthat
unit test to
make sure that the model returned by your function with penalty=0 is
the same as the model returned by the isoreg
function (PAVA
algorithm). Write another test that checks that the output model
is the same as Fpop
(when all changes are non-decreasing).
Contributors, please post a link to your test results here.
- EXAMPLE CONTRIBUTOR 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.