CONTRIBUTING.md

Frictionlesser being a small project, contributions are welcome in any form. Just be bold and either post an issue, open a pull request or send a message to [email protected].

Documentation

If your build chain works, you can build the HTML documentation by enabling the CMake BUILD_DOCS option:

cmake -DBUILD_DOCS=YES

You can then open <build_dir>/doc/html/index.html in your Web browser to browse the detailed documentation of all the classes and functions.

Debugging tools

The build chain lets you select CMAKE_BUILD_TYPE=Debug to enable:

  • the use of debug symbols in the binary (for using a debugger),
  • the display of more logs (those below the "Progress" log level),
  • internal checks with the assert function.
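As a reminder of how such internal checks behave, here is a minimal sketch. The mean function is a hypothetical helper, not part of Frictionlesser. With the usual CMake defaults for GCC and Clang, the Release build type defines NDEBUG, which compiles the assert away, while the Debug build type keeps it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Frictionlesser): mean of a non-empty vector.
double mean(const std::vector<double>& values)
{
    // Internal check: evaluated only when NDEBUG is not defined,
    // which is typically the case with CMAKE_BUILD_TYPE=Debug.
    assert(!values.empty());

    double sum = 0;
    for (double v : values) {
        sum += v;
    }
    return sum / static_cast<double>(values.size());
}
```

Calling mean on an empty vector aborts with a message in a Debug build, but silently divides by zero in a build where NDEBUG is defined, which is why such checks are worth running at least once in Debug mode.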

The application lets you finely configure which log messages are displayed; see the "Logging" section of frictionlesser --help. For example, to enable all possible logs:

frictionlesser --verbose=XDebug --log-file=".*" --log-func=".*" --log-depth=9999 --max-errors=9999

Architecture

Frictionlesser implements:

  • a local search algorithm,
  • manipulating swaps over a binary partition of a subset of human genes,
  • evaluated by an objective function, using RNA single-cell sequencing data.

The software comes in two main parts: the application and its library. The application implements a command line executable, running the search algorithm. Its entry point is the app/frictionlesser.cpp file.

The datatester.cpp binary basically re-implements the data checks and can be used to double-check that input data files are correct, without running any algorithm.

The library holds the data structures, the objective function and some common functions. Its entry point is the include/frictionless headers directory, along with the src/ implementation directory.

Terms

Paradiseo and Frictionlesser sometimes use different names for the same concept. To help modularize the code, these different terms may be used for different objects.

Nonetheless, they point to more or less similar concepts:

  • solution, individual ≈ signature,
  • objective function ≈ score, quality,
  • search algorithm ≈ [meta]-heuristic, optimization algorithm, evolutionary algorithm.

Paradiseo

The source code relies heavily on the Paradiseo framework for everything related to search algorithmics:

  • The local search is implemented with the Paradiseo-MO module, which makes it easy to modify and extend the algorithm by just combining operators.
  • The binary partition data structure and the corresponding "swap" neighborhood follow the (very light) Paradiseo conventions. (They are actually designed to ultimately become a part of Paradiseo.)
  • The objective function inherits from the (quite light) Paradiseo interface, which allows it to be easily plugged into other software, thanks to Paradiseo's tooling.

The idea behind Paradiseo is to modularize optimization/search algorithms. As such, it may be difficult to follow, as each module has its own set of interfaces ("operators"), and various implementations are available.

The main design pattern may be hard to grasp at first if you are not fluent in object-oriented programming. See the "20 years" preprint for a high-level view of it. You can also look at the algopattern project, which is a gentle introduction to this kind of design pattern (albeit for another kind of algorithm).

Search Algorithm

The algorithm is only an assembly of Paradiseo-MO components. It is thus completely implemented in a few lines, near the end of the app/frictionlesser.cpp file. Most of the code there actually manages various ways to log its execution.

If you want to have a look at the algorithm itself, you need to browse Paradiseo's code.

  • The code of the moRandomBestHC class is just a wrapper, actually pre-assembling an "explorer" for you.
  • The moLocalSearch class contains some actual code from which you can follow the important operators.
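To give an idea of the overall loop before diving into Paradiseo's code, here is a minimal standalone sketch. It assumes that the algorithm behaves as a best-improvement hill climber that picks at random among equally good neighbors, and it does not use Paradiseo at all. ToySwap, toy_score and hill_climb are hypothetical names, and the objective is a toy sum of per-gene weights, not the actual score.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

// Hypothetical move: one gene to select, one gene to reject.
struct ToySwap {
    std::size_t to_select;
    std::size_t to_reject;
};

// Toy objective (an assumption, NOT the Friedman score): sum of selected weights.
double toy_score(const std::set<std::size_t>& selected, const std::vector<double>& weights)
{
    double score = 0;
    for (std::size_t gene : selected) {
        score += weights.at(gene);
    }
    return score;
}

// Repeatedly apply one of the best improving swaps, chosen uniformly at
// random among ties, until no swap improves the score.
double hill_climb(std::set<std::size_t>& selected, std::set<std::size_t>& rejected,
                  const std::vector<double>& weights, std::mt19937& rng)
{
    assert(weights.size() == selected.size() + rejected.size());
    double current = toy_score(selected, weights);
    while (true) {
        double best = current;
        std::vector<ToySwap> best_moves;
        for (std::size_t in : rejected) {
            for (std::size_t out : selected) {
                // Incremental score of the neighbor reached by this swap.
                const double candidate = current - weights.at(out) + weights.at(in);
                if (candidate > best) {
                    best = candidate;
                    best_moves.assign(1, ToySwap{in, out});
                } else if (candidate == best && !best_moves.empty()) {
                    best_moves.push_back(ToySwap{in, out});
                }
            }
        }
        if (best_moves.empty()) {
            return current; // local optimum: no improving neighbor
        }
        std::uniform_int_distribution<std::size_t> pick(0, best_moves.size() - 1);
        const ToySwap move = best_moves[pick(rng)];
        rejected.erase(move.to_select);
        selected.insert(move.to_select);
        selected.erase(move.to_reject);
        rejected.insert(move.to_reject);
        current = best;
    }
}
```

In Paradiseo-MO, the neighborhood enumeration, the neighbor comparison and the stopping criterion are separate operators, whereas this sketch inlines them all in a single function for readability.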

Objective Function

The objective function is the high-level interface that computes a signature's quality.

The entry point for the objective function is the file include/frictionless/eval.h.

The objective function follows Paradiseo-MO's architecture for partial evaluations. This drastically reduces the amount of computation needed to evaluate a solution that is just one gene swap away from an already evaluated one.

The entry points are:

  • The frictionless::EvalFull implements the full evaluation of a completely new solution.
  • The frictionless::EvalSwap implements a partial evaluation for swap neighborhoods.
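The difference between the two kinds of evaluation can be sketched on a toy problem. The objective below (a sum of per-gene weights) is an assumption chosen for illustration only; it is not the Friedman score, and eval_full and eval_swap are hypothetical free functions, not the actual classes.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Full evaluation: walks over the whole set of selected genes.
// Toy objective (an assumption, NOT the Friedman score): sum of their weights.
double eval_full(const std::set<std::size_t>& selected, const std::vector<double>& weights)
{
    double score = 0;
    for (std::size_t gene : selected) {
        score += weights.at(gene);
    }
    return score;
}

// Partial evaluation: score of the solution reached by selecting `in` and
// rejecting `out`, computed in constant time from the current score.
double eval_swap(double current_score, std::size_t in, std::size_t out,
                 const std::vector<double>& weights)
{
    assert(in < weights.size() && out < weights.size());
    return current_score - weights[out] + weights[in];
}
```

For this toy objective the "cache" is just the current score. For a more involved statistic, more intermediate results have to be kept around to make the update cheap.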

These two classes heavily rely on the FriedmanScore class (score.h), which computes the main statistic, and computes the data cache that allows the partial evaluation (see src/eval.cpp).

The FriedmanScore itself relies on a Transcriptome (transcriptome.h), which holds the input RNA expression data, along with various accessors onto it.

The score is computed for a given binary partition of the genes space, which is held by the Signature class (see below).

Note that the names of the FriedmanScore members follow the notation used in the Frictionlesser technical report.

Signature

A solution to the problem is called a Signature (signature.h), which is actually a moBinaryPartition (moBinaryPartition.h). The binary partition is just a set of "selected" genes (and its counterpart, a set of "rejected" genes).
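As a rough picture of what such a binary partition looks like, here is a minimal sketch. ToyPartition is a hypothetical stand-in written for this guide; the actual moBinaryPartition interface may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical stand-in for a binary partition of gene indices.
struct ToyPartition {
    std::set<std::size_t> selected;
    std::set<std::size_t> rejected;

    // Start with every gene in [0, nb_genes) rejected.
    explicit ToyPartition(std::size_t nb_genes)
    {
        for (std::size_t gene = 0; gene < nb_genes; ++gene) {
            rejected.insert(gene);
        }
    }

    // Move a gene from the rejected set to the selected set.
    void select(std::size_t gene)
    {
        const bool was_rejected = (rejected.erase(gene) == 1);
        assert(was_rejected);
        (void)was_rejected; // silence the unused warning when NDEBUG is defined
        selected.insert(gene);
    }

    // Move a gene from the selected set to the rejected set.
    void reject(std::size_t gene)
    {
        const bool was_selected = (selected.erase(gene) == 1);
        assert(was_selected);
        (void)was_selected;
        rejected.insert(gene);
    }
};
```

Every gene is always in exactly one of the two sets, so the sizes of the two sets always add up to the total number of genes.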

It is coupled with a "Fitness", which is the slang term in Paradiseo for "objective function value of a solution to the problem".

In Frictionlesser, the Fitness of a Signature is a Score (signature.h). This Score essentially holds the cache allowing the partial evaluation. It also holds the score value (a scalar) and the atomic score values by sample; see the ScoreDetails class (signature.h).

Cache

The cache system is the set of low-level data structures that must be updated when the score of a signature is recomputed after some change. It is structured in three layers, depending on what changes when encountering a new signature.

In the current setup, only the swap cache is supposed to be used during the search. The two other caches are involved during data load, and are managed by the high-level application (see frictionlesser.cpp).

All the details related to the cache system are in cache.h:

  • CacheTranscriptome, for Friedman score's intermediate results that are tied to a given transcriptome,
  • CacheSize, for results that are tied to a given signature size,
  • CacheSwap, for results involved in swapping two genes.

The current design is to attach the swap cache to the neighbor operator (see the next section). This avoids having a cache attached to the signatures themselves, which saves some space and copy time. However, this requires swapping caches when moving from one signature to another. The current design also maintains a swap cache attached to the FriedmanScore data structure, which may not be optimal.

Neighborhood

The neighborhood describes how to "move" from one signature to another.

In Paradiseo-MO, this concept is at the core of the modularization, and may be difficult to fully grasp at first. You may first read the Paradiseo-MO preprint to get an introduction.

A Paradiseo-MO "Neighbor" is not just another Signature: it implements how to move from one signature to another. In moBinaryPartitionSwapNeighbor (moBinaryPartitionSwapNeighbor.h), it stores a pair of genes: one to be selected, the other to be rejected, hence modelling a swap that can be applied to a Signature.

The moBinaryPartitionSwapNeighborhood class (moBinaryPartitionSwapNeighborhood.h) implements a way to enumerate all the possible neighbors of a given signature. It actually generates neighbors and not solutions.
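The distinction between a neighbor (a move) and a solution can be sketched as follows. ToySwap, enumerate_swaps and apply_swap are hypothetical names used for illustration; they are not the Paradiseo classes, which expose the enumeration through their own interface.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// A neighbor is a move, not a solution: one gene to select, one to reject.
struct ToySwap {
    std::size_t to_select;
    std::size_t to_reject;
};

// Enumerate every possible swap: each rejected gene paired with each selected gene.
std::vector<ToySwap> enumerate_swaps(const std::set<std::size_t>& selected,
                                     const std::set<std::size_t>& rejected)
{
    std::vector<ToySwap> neighbors;
    neighbors.reserve(selected.size() * rejected.size());
    for (std::size_t in : rejected) {
        for (std::size_t out : selected) {
            neighbors.push_back(ToySwap{in, out});
        }
    }
    return neighbors;
}

// Apply a move to a solution, in place.
void apply_swap(const ToySwap& move, std::set<std::size_t>& selected,
                std::set<std::size_t>& rejected)
{
    const bool was_rejected = (rejected.erase(move.to_select) == 1);
    const bool was_selected = (selected.erase(move.to_reject) == 1);
    assert(was_rejected && was_selected);
    (void)was_rejected; // silence unused warnings when NDEBUG is defined
    (void)was_selected;
    selected.insert(move.to_select);
    rejected.insert(move.to_reject);
}
```

Storing only the pair of genes keeps a neighbor small: a signature with s selected and r rejected genes has s × r swap neighbors, and none of them requires copying the signature until a move is actually applied.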

Other

The parser.h file declares several classes that load different representations of the expression tables (i.e. Neftel's convention or Zakiev's convention).

Frictionlesser uses the clutchlog project to get nice, colored logs that show where each message was emitted. Its configuration is set in log.cpp.

Frictionlesser also uses the exceptions project to get clean exception class declarations that hold the location of errors.

The frictionless.h file holds some convenience functions.

The src/pgamma.cpp file is borrowed from the R project.

Licensing

TL;DR: Frictionlesser is available under the AGPL license.

Frictionlesser itself is distributed under the GNU Affero General Public License v3.0 (AGPL). Its source code is (so far) fully copyrighted to the Institut Pasteur, except for the code of the pgamma function, which is borrowed from the R project (under GPL).

Frictionlesser compiles against the Paradiseo project code, which is distributed under the LGPL v2.0 (for its core) and the CeCILL license v2.1 (for the MO module).

The CeCILL license is fully compatible with the GPL, and the AGPL is basically a GPL with added clauses on using the software as a service over a network. Hence, the most restrictive license applies, which is the AGPL v3.

This means that any derivative work should be licensed under the same terms, which basically guarantees that you will always be able to get access to the source code, whatever the setting in which you use this software.