This repository has been archived by the owner on May 23, 2022. It is now read-only.

New features : Normalisation functions and enhanced criteria. #2

Open: wants to merge 63 commits into base: main


gmagannaDevelop
Member

I think this should be a light review. There are no modifications to existing functionality. I just added some functions to the main API to perform the log normalisation of raw counts. I also added the extra columns needed to sample from the inferred distributions.

@pauleve I am asking you to review this PR because I think this could be a new release v0.1.3 ? So that you can use your workflow to publish to PyPI and colomoto.

Let me know what you think.

These parameters will be used to sample the inferred distributions in order to simulate sequencing data.
The main class now contains all the elements needed to sample from the inferred distributions.
These can be used to process raw counts and obtain the log-normalised data that is used by most analyses.
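As a rough illustration of the log normalisation mentioned above, here is a minimal sketch. The PR text does not say which normalisation scheme the package uses, so this assumes the common approach of scaling each cell by its library size and then applying log(1 + x). The function name `log_normalise` and the `scale` parameter are hypothetical and are not taken from the package's API.

```python
import numpy as np
import pandas as pd


def log_normalise(counts: pd.DataFrame, scale: float = 1e4) -> pd.DataFrame:
    """Scale each cell by its library size, then apply log(1 + x).

    Hypothetical helper (not the package's API): rows are cells,
    columns are genes.
    """
    library_size = counts.sum(axis=1)
    # Avoid division by zero for empty cells; their rows stay all-zero.
    safe_size = library_size.replace(0, 1)
    scaled = counts.div(safe_size, axis=0) * scale
    return np.log1p(scaled)


raw = pd.DataFrame(
    {"geneA": [10, 0, 5], "geneB": [90, 0, 5]},
    index=["c1", "c2", "c3"],
)
normalised = log_normalise(raw)
```

Cells with no counts at all are left as zeros rather than producing NaN, which is one possible design choice among several.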
@gmagannaDevelop gmagannaDevelop added the enhancement New feature or request label Jun 12, 2021
@gmagannaDevelop gmagannaDevelop self-assigned this Jun 12, 2021
Member

@pauleve pauleve left a comment


I would have put normalization.py directly in profile_bin directory, but that's just a matter of taste ;-)

@pauleve
Member

pauleve commented Jun 12, 2021

👍
There is no hurry for the release, we can do it later.

gmagannaDevelop and others added 21 commits June 14, 2021 10:11
ProfileBin.clear_r_envir allows the user to remove all objects related to the binarisation instance that are found on the main environment of the embedded R process. This might be needed or practical when using multiple instances of the class as part of an analysis pipeline.
This will prevent the user from thinking the backing files come from an untrusted source.
The ProfileBin class can now automatically detect and try to install missing R dependencies.
local parameters for modelling zero-inf genes. This approach will be re-evaluated.
An optional parameter has been added in order to mask zero entries before computing most of the relevant parameters. Only the DropOutRate and DenPeak are calculated before dropping all zeros, if the parameter mask_zero_entries is set to True. These changes are consistent across the R and Python interfaces.
this will be further tested, to see if it can explain the two apparent classes of zero-inf genes.
ProfileBin.simulation_fit() recalculates the criteria to allow the simulation of zero-inflated genes.
in order to facilitate the analysis of simulated zero-inf genes.
This allows doing a full-scale simulation from the estimated parameters.
Removed unnecessary else sections on if clauses that returned directly.
from 0.75 to 0.95, i.e. back to the default value used by Curie in the original paper. The use of this parameter should be discussed at the next meeting as well as whilst writing the paper.
removed unused typing.Dict import, renamed variable p to pool (multiprocessing.Pool context manager).
commented out part of the criteria that were useless as they were part of the exponential model for ZeroInf genes. This dropped the execution time of compute_criteria by ~18 seconds on Nestorowa dataset.
when calling simulation_fit(), there was an R-side exception if there were no Zero-Inflated genes. This happens because the bigmemory matrix descriptor file needs to contain at least one entry. It was corrected by skipping the R-side recalculation of compute_criteria for Zero-Inflated genes (binarisation criteria are used for simulation).
when an exception was raised whilst calling compute_criteria, all the R processes spawned to perform the parallel computation were not killed. Now, on the Python-side, if an RRuntimeError is raised, an explicit call to snow::stopCluster() is performed in order to kill the orphan R processes.
The performance gains are incredible! Using a sequential, Python-side apply on the dataframe results in a 3x speedup! And using the parallel version results in a 10x speedup.
return self in order to be able to chain the calls.
Minimal examples will be added.
args.publish_dir.mkdir(parents=True, exist_ok=True) was in the wrong place, so the directory was not created when calling rna2bool with a config file.
The random number generator in biased_simulation_from_binary_state() is now fully reproducible. That means that given a fixed seed, the results of calling the function will always be the same. We would like to ensure that across executions all random number generation runs are fully reproducible, given a fixed seed, for all functions.
to enable faster debug and development
There seems to be no significant difference between the time needed to train and return an unordered dataframe and the time needed to train and sort it just after training. For some applications it might be desirable to have this property. Once it's done, I'll merge it into synthesis.
PROFILE_source_dev.R replaces PROFILE_source.R
Synthesis and sampling will be merged into a single module. I will refactor the whole package to simplify it
modules synthesis.sampling and synthesis.simulation will be merged into a single one.
The sampling functions are now implementing the module-wide unique random number generator by default. This will allow for the global seeding mechanism to function, enabling reproducible results.
All the modules contained there had already been merged into the profile_binr.simulation module.
This submodule is to be used to generate trajectories and reconstruct them using STREAM. It might be removed, renamed, or replaced for the second release of profile_binr.
The models that will be used for demonstrating the effectiveness of the synthesis method.
With some minor changes to existing ones. This version facilitates the creation of simpler, cleaner notebooks.
The function profile_binr.simulation.random_nan_binariser contained a critical typo, making it have a tendency to set more values to zero than it was supposed to.
…ajectory
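The commits above mention a module-wide random number generator that sampling functions use by default, so that a single global seed makes results reproducible. A minimal sketch of that pattern follows, assuming NumPy's `Generator` API. The names `_GLOBAL_RNG`, `set_seed` and `sample_gaussian` are hypothetical and only illustrate the idea; they are not the package's actual functions.

```python
import numpy as np

# Hypothetical module-level generator shared by all sampling functions.
_GLOBAL_RNG = np.random.default_rng()


def set_seed(seed: int) -> None:
    """Replace the shared generator with a freshly seeded one."""
    global _GLOBAL_RNG
    _GLOBAL_RNG = np.random.default_rng(seed)


def sample_gaussian(mean, std, size, rng=None):
    """Draw Gaussian samples, using the shared generator unless one is given."""
    # Resolve the generator at call time so that set_seed() takes effect.
    rng = _GLOBAL_RNG if rng is None else rng
    return rng.normal(loc=mean, scale=std, size=size)


set_seed(42)
first = sample_gaussian(0.0, 1.0, size=5)
set_seed(42)
second = sample_gaussian(0.0, 1.0, size=5)
```

Looking up the shared generator inside the function body, rather than binding it as a default argument, is what lets a later call to `set_seed` affect every sampling function.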

enhanced n_samples_per_state parameter behaviour
core regulation model
useful to benchmark the time required to binarise, simulate, etc.
These include changes in the binarisation scheme, docstring, retaining discarded genes (or not), etc.
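One of the commits above describes an optional `mask_zero_entries` parameter: the dropout rate is computed on the full data, and the remaining parameters are computed after zeros have been dropped. The sketch below illustrates that ordering for a single gene. The function `gene_summary` and the `Mean` and `Variance` keys are hypothetical stand-ins for the real criteria, and DenPeak is omitted because its definition is not given in this conversation.

```python
import pandas as pd


def gene_summary(expr: pd.Series, mask_zero_entries: bool = False) -> dict:
    """Summarise one gene's expression vector (hypothetical helper)."""
    # The dropout rate is always computed before any zeros are dropped.
    dropout_rate = float((expr == 0).mean())
    # Optionally drop zeros before computing the remaining statistics.
    values = expr[expr != 0] if mask_zero_entries else expr
    return {
        "DropOutRate": dropout_rate,
        "Mean": float(values.mean()) if len(values) else float("nan"),
        "Variance": float(values.var()) if len(values) > 1 else float("nan"),
    }


gene = pd.Series([0.0, 0.0, 2.0, 4.0])
unmasked = gene_summary(gene)
masked = gene_summary(gene, mask_zero_entries=True)
```

With masking enabled the dropout rate is unchanged, while the mean and variance reflect only the non-zero observations.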
Labels
enhancement New feature or request

2 participants