New features : Normalisation functions and enhanced criteria. #2

gmagannaDevelop · 2021-06-12T09:07:20Z

I think this should be a light review. There are no modifications to the existing functionalities. I just added some functions to the main API, to perform the log normalisation of raw counts. I also added the extra columns needed to sample from the inferred distributions.

@pauleve I am asking you to review this PR because I think this could become a new release, v0.1.3, so that you can use your workflow to publish to PyPI and colomoto.

Let me know what you think.

pauleve

I would have put normalization.py directly in the profile_bin directory, but that's just a matter of taste ;-)

pauleve · 2021-06-12T14:23:11Z

👍
There is no hurry for the release, we can do it later.

ProfileBin.clear_r_envir lets the user remove all objects related to the binarisation instance from the main environment of the embedded R process. This can be useful when several instances of the class are used as part of an analysis pipeline.

An optional parameter has been added to mask zero entries before computing most of the relevant parameters. If mask_zero_entries is set to True, only DropOutRate and DenPeak are calculated before all zeros are dropped. These changes are consistent across the R and Python interfaces.
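The ordering described above (dropout rate first, then masking) can be illustrated with a small sketch. This is not the package's implementation: `summarise_gene` and the `Mean` and `Variance` keys are hypothetical names chosen for the example, DenPeak is left out, and only the standard library is used.

```python
import statistics


def summarise_gene(values, mask_zero_entries=False):
    """Hypothetical per-gene summary; assumes `values` is non-empty."""
    # The dropout rate is always computed on the full vector, zeros included.
    dropout_rate = sum(1 for v in values if v == 0) / len(values)

    # Optionally drop the zeros before computing the remaining statistics.
    kept = [v for v in values if v != 0] if mask_zero_entries else list(values)

    if not kept:
        # Nothing left after masking: the remaining statistics are undefined.
        return {"DropOutRate": dropout_rate, "Mean": None, "Variance": None}

    return {
        "DropOutRate": dropout_rate,
        "Mean": statistics.fmean(kept),
        "Variance": statistics.pvariance(kept),
    }


expression = [0.0, 0.0, 2.0, 4.0]
unmasked = summarise_gene(expression)
masked = summarise_gene(expression, mask_zero_entries=True)
```

In this toy case the dropout rate is the same in both calls, while the mean and variance change once the zeros are masked.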

When calling simulation_fit(), an R-side exception was raised if there were no Zero-Inflated genes, because the bigmemory matrix descriptor file needs to contain at least one entry. It was fixed by skipping the R-side recalculation of compute_criteria for Zero-Inflated genes (the binarisation criteria are used for simulation).

When an exception was raised whilst calling compute_criteria, the R processes spawned for the parallel computation were left running. Now, on the Python side, if an RRuntimeError is raised, snow::stopCluster() is called explicitly to kill the orphan R processes.
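The clean-up-then-re-raise pattern can be sketched in plain Python. Everything here is a stand-in: `FakeCluster`, `WorkerError`, and `run_parallel` are hypothetical names for illustration and do not reproduce the rpy2 or snow calls used in the package.

```python
class FakeCluster:
    """Stand-in for a pool of worker processes (hypothetical)."""

    def __init__(self):
        self.running = True

    def stop(self):
        self.running = False


class WorkerError(RuntimeError):
    """Stand-in for the error raised by the embedded runtime."""


def run_parallel(cluster, task):
    try:
        return task()
    except WorkerError:
        # Shut the workers down explicitly before propagating the error,
        # so that no orphan processes are left behind.
        cluster.stop()
        raise


def failing_task():
    raise WorkerError("criteria computation failed")


cluster = FakeCluster()
try:
    run_parallel(cluster, failing_task)
except WorkerError as err:
    outcome = f"cleaned up after: {err}"
```

Re-raising after the clean-up keeps the original error visible to the caller instead of silently swallowing it.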

The random number generator in biased_simulation_from_binary_state() is now fully reproducible: given a fixed seed, the results of calling the function will always be the same. We would like to ensure that, given a fixed seed, random number generation is fully reproducible across executions for all functions.

There seems to be no significant difference between the time needed to train and return an unordered dataframe and the time needed to train and sort it just afterwards. For some applications it might be desirable to have this property. Once it's done, I'll merge it into synthesis.

gmagannaDevelop added 4 commits June 9, 2021 00:04

🚧 Working on extra criteria for synthesis

9f80291

These parameters will be used to sample the inferred distributions in order to simulate sequencing data.

✨ Improved criteria to model the three possible distributions.

67f8ca7

The main class now contains all the elements needed to sample from the inferred distributions.

🙈 ignore data files

a6b8cef

.

✨ Added normalisation functions.

4efe90c

These can be used to process raw counts and obtain the log-normalised data that is used by most analyses.
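The PR does not spell out the normalisation formula. A common scheme, assumed here, is to rescale each cell to a fixed library size and then apply log(1 + x). The sketch below uses only the standard library; `log_normalise` and `target_sum` are hypothetical names, not the functions added by this commit.

```python
import math


def log_normalise(counts, target_sum=10_000.0):
    """Log-normalise a cells x genes table of raw counts (hypothetical helper).

    Each cell (row) is rescaled so that its counts sum to `target_sum`,
    then log(1 + x) is applied to every entry.
    """
    normalised = []
    for row in counts:
        total = sum(row)
        # Cells with no counts at all stay at zero (avoids dividing by zero).
        scale = target_sum / total if total > 0 else 0.0
        normalised.append([math.log1p(value * scale) for value in row])
    return normalised


raw = [
    [0, 5, 15],   # cell 1, library size 20
    [10, 0, 30],  # cell 2, library size 40
    [0, 0, 0],    # empty cell
]
logged = log_normalise(raw)
```

Zero counts stay at zero after the transform, which matters for any later step that treats zeros specially.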

gmagannaDevelop added the enhancement New feature or request label Jun 12, 2021

gmagannaDevelop requested a review from pauleve June 12, 2021 09:07

gmagannaDevelop self-assigned this Jun 12, 2021

pauleve approved these changes Jun 12, 2021

gmagannaDevelop and others added 21 commits June 14, 2021 10:11

✨ added relative probabilities for Gaussian mixtures

f7e99bc

to allow simulation via numpy.
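To illustrate why relative probabilities are needed, here is a minimal sketch of sampling from a Gaussian mixture: a component is chosen according to its weight, then a value is drawn from that component. It uses the standard library `random` module to stay dependency-free; with numpy, `Generator.choice` (with its `p` argument) and `Generator.normal` would play the same roles. `sample_gaussian_mixture` and its parameters are hypothetical names.

```python
import random


def sample_gaussian_mixture(weights, means, stds, n_samples, seed=None):
    """Draw `n_samples` values from a 1-D Gaussian mixture (hypothetical helper)."""
    rng = random.Random(seed)
    components = range(len(weights))
    samples = []
    for _ in range(n_samples):
        # Pick a component with probability proportional to its weight...
        k = rng.choices(components, weights=weights, k=1)[0]
        # ...then draw from that component's normal distribution.
        samples.append(rng.gauss(means[k], stds[k]))
    return samples


draws = sample_gaussian_mixture(
    weights=[0.3, 0.7], means=[0.0, 10.0], stds=[0.1, 0.1], n_samples=1000, seed=0
)
share_high = sum(1 for x in draws if x > 5.0) / len(draws)
```

With well-separated components, `share_high` should land close to the weight of the second component.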

⚡ more explicative backing file names

f302afe

This will prevent the user from thinking the backing files come from an untrusted source.

✨ added extra criteria

659b3b0

.

✨ automatic dependency installation

ef1260d

The ProfileBin class can now automatically detect and try to install missing R dependencies.

⚗️ added simulation suite draft script

5420cfc

Simulate distributions.

🙈 update gitignore

1eb4abb

.

⚗️ add extra params to criteria

0baf155

local parameters for modelling zero-inf genes. This approach will be re-evaluated.

✨ added optional parameter 'dor_threshold'

3cbdde8

this will be further tested, to see if it can explain the two apparent classes of zero-inf genes.

💥 Added methods for simulation.

9d68783

ProfileBin.simulation_fit() recalculates the criteria so that zero-inflated genes can be simulated.

✨ added diagnostic plot module

f356c1a

in order to facilitate the analysis of simulated zero-inf genes.

⚡ created simulation_criteria

778d06f

This allows doing a full-scale simulation from the estimated parameters.

🎨 Followed pylint suggested improvements

22ebfd4

Removed unnecessary else sections on if clauses that returned directly.
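For reference, this is the shape that pylint's `no-else-return` check flags, next to the equivalent form without the `else`. The functions are generic illustrations, not code from the package.

```python
def sign_with_else(x):
    # Shape flagged by pylint's no-else-return check.
    if x < 0:
        return -1
    else:
        return 1 if x > 0 else 0


def sign_without_else(x):
    # Same behaviour: the early return already ends the function,
    # so the else branch can simply be dedented.
    if x < 0:
        return -1
    return 1 if x > 0 else 0


checks = [(x, sign_with_else(x), sign_without_else(x)) for x in (-3, 0, 7)]
```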

💡 Changed default dor_threshold for ProfileBin.simulation_fit()

50b4193

from 0.75 to 0.95, i.e. back to the default value used by Curie in the original paper. The use of this parameter should be discussed at the next meeting as well as whilst writing the paper.

🎨 Followed pylint recommendations for code style

98d9ccb

removed unused typing.Dict import, renamed variable p to pool (multiprocessing.Pool context manager).

🗑️ deprecate exponential simulation function

680bc6c

for zero-inf genes.

🗑️ Deprecate ZeroInf as exponential extra criteria (R side)

5795c51

Commented out the parts of the criteria that were useless because they belonged to the exponential model for ZeroInf genes. This reduced the execution time of compute_criteria by ~18 seconds on the Nestorowa dataset.

♻️ added static method to reduce boilerplate

7aec602

.

gmagannaDevelop added 30 commits February 1, 2022 23:00

🚧 First python-side parallel binarisation functional

9badf2f

The performance gains are incredible! Using a sequential, Python-side apply on the dataframe results in a 3x speedup, and using the parallel version results in a 10x speedup.

🚧 developing CLI for reproducible experiments

a69e6b2

.

🚧 Parser advancing, need to add actions

f3c6ec4

.

👽 Change ProfileBin.fit and ProfileBin.simulation_fit

ecd8b44

return self in order to be able to chain the calls.
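A toy illustration of what returning `self` enables. `Binariser` is a hypothetical stand-in and does not reproduce the real ProfileBin signatures or behaviour.

```python
class Binariser:
    """Toy stand-in for a fit-style class; not the real ProfileBin."""

    def __init__(self):
        self.fitted = False
        self.simulation_ready = False

    def fit(self):
        self.fitted = True
        # Returning self lets the caller keep calling methods on the result.
        return self

    def simulation_fit(self):
        self.simulation_ready = True
        return self


# Both steps in one expression instead of three separate statements.
model = Binariser().fit().simulation_fit()
```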

🚨 Fixed pylint and Rlint warnings

2cec2ec

.

💥 First functional version of CLI

60d8ca8

Minimal examples will be added.

💥 First fully functional CLI

1b5e557

.

🙈 update .gitignore

3334311

.

🐛 Fixed experiment directory creation bug

13f083e

args.publish_dir.mkdir(parents=True, exist_ok=True) was in the wrong place, so the directory was not created when calling rna2bool with a config file.
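A minimal sketch of the fix, assuming a much simplified command line: the output directory is created right after argument parsing, before any branching on how the run was configured. The `--publish-dir` and `--config` flags and the `main` function are hypothetical and are not rna2bool's actual interface.

```python
import argparse
import tempfile
from pathlib import Path


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="rna2bool-sketch")
    parser.add_argument("--publish-dir", dest="publish_dir", type=Path, required=True)
    parser.add_argument("--config", type=Path, default=None)
    return parser.parse_args(argv)


def main(argv):
    args = parse_args(argv)
    # Create the output directory up front, so that every code path
    # (with or without a config file) can rely on it existing.
    args.publish_dir.mkdir(parents=True, exist_ok=True)
    if args.config is not None:
        return f"config run -> {args.publish_dir}"
    return f"plain run -> {args.publish_dir}"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "experiments" / "run_01"
    message = main(["--publish-dir", str(target), "--config", "experiment.toml"])
    created = target.is_dir()
```

`parents=True` creates missing intermediate directories, and `exist_ok=True` makes repeated runs against the same directory harmless.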

✨ Added small utility module

99b822b

to enable faster debug and development

🙈 update .gitignore

73c3b4d

.

🚨 fix linter warnings, enhance docstrings

81a4e57

.

🔥 removing criteria ordering experiment code

e6250bd

.

🔥 remove PROFILE_source_dev.R

6aef9fc

PROFILE_source_dev.R replaces PROFILE_source.R

✏️ Fixed spelling mistakes

7528b90

.

🚚 Rename functions within synthesis/sampling

e47a552

Synthesis and sampling will be merged into a single module. I will refactor the whole package to simplify it.

🚧 Working on reproducible runs.

378ba08

modules synthesis.sampling and synthesis.simulation will be merged into a single one.

🚧 Preparing synthesis.sampling for merge

4bc0a77

The sampling functions now use the module-wide random number generator by default. This allows the global seeding mechanism to work, enabling reproducible results.
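The idea of a single module-wide generator with global seeding can be sketched as follows. The names `_RNG`, `set_seed`, `sample_uniform`, and `sample_normal` are hypothetical, and the standard library `random.Random` stands in for whichever generator the package uses (with numpy, `numpy.random.default_rng(seed)` would be the natural counterpart).

```python
import random

# A single generator shared by every sampling function in the module.
_RNG = random.Random()


def set_seed(seed):
    """Seed the shared generator so that later draws are reproducible."""
    _RNG.seed(seed)


def sample_uniform(n, rng=None):
    # Fall back to the shared generator unless the caller passes their own.
    rng = _RNG if rng is None else rng
    return [rng.random() for _ in range(n)]


def sample_normal(n, mean=0.0, std=1.0, rng=None):
    rng = _RNG if rng is None else rng
    return [rng.gauss(mean, std) for _ in range(n)]


set_seed(42)
first = (sample_uniform(3), sample_normal(3))
set_seed(42)
second = (sample_uniform(3), sample_normal(3))
```

Because both functions default to the same generator, one call to `set_seed` is enough to make a whole sequence of sampling calls repeatable.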

💥 Migrate synthesis/* to simulation.py

770fa3c

.

🔥 Remove profile_binr.synthesis

88a1692

All the modules it contained had already been merged into the profile_binr.simulation module.

✨ Added stream_helpers submodule

f624b1f

This submodule is to be used to generate trajectories and reconstruct them using STREAM. It might be removed, renamed, or replaced for the second release of profile_binr.

✨ Added models submodule

4aafdfc

The models that will be used for demonstrating the effectiveness of the synthesis method.

✨ Added extra utility functions

04cdb00

With some minor changes to existing ones. This version facilitates the creation of simpler, cleaner notebooks.

✏️ Fixed critical typo in simulation.

fe3acd1

The function profile_binr.simulation.random_nan_binariser contained a critical typo, which made it tend to set more values to zero than it was supposed to.

⚡ Improved profile_binr.utils.stream_helpers.simulate_from_boolean_trajectory enhanced n_samples_per_state parameter behaviour

03150d8

✨ added new model

82b61e6

core regulation model

✨ added a new timer context manager

4643bf5

useful to benchmark the time required to binarise, simulate, etc.
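A timer of this kind can be written in a few lines with `contextlib.contextmanager`. This is a generic sketch, not the package's implementation; `timer` and its `sink` parameter are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label, sink=print):
    """Report the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed steps are still timed.
        elapsed = time.perf_counter() - start
        sink(f"{label}: {elapsed:.3f} s")


messages = []
with timer("toy workload", sink=messages.append):
    total = sum(i * i for i in range(100_000))
```

Passing a different `sink` (here a list's `append`) makes it easy to collect timings instead of printing them.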

💥 Major changes to profile_binr.wrappers.probinr

dd91264

These include changes in the binarisation scheme, docstring, retaining discarded genes (or not), etc.
