TODOs for the project in the future:
====================================
On the table:
-------------
- TEST: [FEATURE]: Translation tables
- TEST: raise an exception if the translation base is too small
- TEST: fold it into the translation into internal ids
- TEST: In the mapping, preserve the weights column
- TEST: [BUG]: forward the edge dropping into the construction routines
- TODO: [PAPER]:
- Replicability analysis:
- ASK RONG for data (published):
DONE: found in archives
- Linhao paper for aggregates RNA-seq
- NOPE: (too much already) Akshay p53 screens
- ASK EWALD/BADER for data (published):
DONE: found in archives
- TWIST-1
'/home/andrei/Dropbox/workspaces/JHU/Ewald Lab/TWIST1_ECAD/Hits.csv',
- K14 (Veena?):
'/home/andrei/Dropbox/workspaces/JHU/Ewald Lab/Veena data/both_HUM.csv',
'/home/andrei/Dropbox/workspaces/JHU/Ewald Lab/TWIST1_ECAD/All_genes.csv'
- Kp/Km
'/home/andrei/Dropbox/workspaces/JHU/Ewald Lab/Kp_Km data/top_100_hum.csv',
- Collagen vs Matrigel
'/home/andrei/Dropbox/workspaces/JHU/Ewald Lab/Matrigel vs Collagen/Matrigel_vs_collagen-tumor.tsv'
- OTHER VALIDATIONS:
- Breast Cancer cell lines aneuploidy
- Replicate the COVID19 patient fluids diff expression
- TODO: [PAPER]: generate the plot justifying the choice of the Gumbel distribution as the fit for the maximum value
- TODO: [PAPER]:
-
- INTEST: run the chr11 re-analysis
- TODO: replicate the COVID19 patient fluids diff expression
- NOPE: p53 in case of Akshay
Data will be hard or impossible to find
- INTEST: Veena networks
- TODO: [PAPER]: Ablation study
- DONE: Code to perform the ablation study comparison:
- DONE: compare calls
- DONE: compare call groups
- DONE: generate ablations file to be compared
- TEST: Hits degradation:
- randomly remove 5%, 10%, 20% and 50% hits
- randomly remove 5%, 10%, 20% and 50% of lowest hits
- TEST: Random noise in hits:
- replace 5%, 10%, 20% and 50% hits with random node sets
- TODO: Size of background samples
- perform a sampling with 5, 10, 20, 25, 50 and 100 background reads
- TEST: Weighting:
- Weighted vs unweighted
- TODO: Network degradation
- interactome: randomly remove 5%, 10%, 20% and 50% of edges
- annotome: randomly remove 5%, 10% and 20% of annotation attachments on proteins
- TODO: Resistance to poisoning (Baggerly-robustness)
- take a random set of nodes
- show absence of calls
- sprinkle a test dataset (glycogen biosynthesis)
- show that only that cluster pops up
Current refactoring:
--------------------
- TODO: [KNOWN BUG] [CRITICAL]: pool fails to restart on cholmod usage
There seems to be an interference between the "multiprocessing" pool method and a third-party
library (specifically cholmod). It looks like after a first pool spawn, cholmod fails to
load a new solution again.
- DONE: try re-loading the problematic module every time => Fails
- DONE: try to explicitly terminate the pool. It didn't work in the end
- TODO: extract a minimal example and post it to StackOverflow
- DONE: add an implicit switch between explicitly multi-threaded and implicitly single-threaded
execution, depending on whether a single element or multiple elements are being analyzed
(see the sketch below)
- DONE: temporary patch and flag it as a known bug
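A minimal sketch of the single-vs-multiple dispatch mentioned above; the function and
argument names (run_analyses, analyze_sample, sample_payloads) are hypothetical::

    from multiprocessing import Pool

    def run_analyses(sample_payloads, analyze_sample, processes=4):
        """Use a pool only when there are several payloads; otherwise stay
        single-threaded so cholmod is initialized exactly once per process."""
        if len(sample_payloads) == 1:
            # single payload: no pool spawn, so the cholmod re-load issue is never hit
            return [analyze_sample(sample_payloads[0])]
        with Pool(processes=processes) as pool:
            return pool.map(analyze_sample, sample_payloads)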
- TODO: rebuild and upload the project to the PyPI
Delayed until after the review
- TODO: rebuild and upload to the testPyPI
- TODO: test it inside a docker instance
- TODO: rebuild and upload to the PyPI
- TODO: test it inside a docker instance
- TODO: [REFACTOR]: factor out the process spawning logic shared between knowledge and
interactome analysis to a "utility" domain of BioFlow
- TODO: [SANITY] [REFACTOR]: sanify the database management
- TODO: create the `data_stores` package and move everything relating to mongoDB and neo4j there
- TODO: remove the `GraphDeclarator.py`, fold the logic directly into the `cypher_drivers`
- TODO: rename `db_io_routines` to `bionet_routines`
- TODO: import `cypher_drivers` as `bionet_store`
- TODO: rename `cypher_drivers` to `_neo4j_backend`
- TODO: import the `mongodb.py` as an alias with `samples_storage`
- TODO: fold the laplacians .dump object storage in dumps as `auxilary_data_storage`
- TODO: put a type straight-jacket
- TODO: move the `internet_io` to the `data_stores` package
- TODO: [TESTING]: write integration test suite
- TODO: implement docker testing
- TODO: check computation speed
- TODO: check integration test coverage
- TODO: add the model_assumptions filter to the auto_analyze of the interactome_analysis as well
as to the annotome_analysis
- TODO: [SANIFY][REFACTOR] Add a typing module with shared types
- TODO: [FEATURE]: Factor out the structural analysis of the network properties to a module
- TODO: basically eigenvalues + eigenvector for the largest one
- TODO: tools used with Mehdi for the analysis of the network
- TODO: create a bioflow.var folder and put scripts there
- TODO: gene essentiality analysis => it's a different project and needs to be moved out,
with bioflow as a dependency
- TODO:
<Environment registration>
- TODO: build status.yaml in the $BIOFLOWHOME/.internal (a sketch follows this block)
- TODO: it gets written to upon:
- database downloads ['DOWNLOAD section']: name + date of download + hash
- organism definition
- neo4j filling: upon a neo4j "build"
- Laplacians constructions
- Translation of a dataset
- TODO: on each addition to the stack, everything that is above a certain layer gets
removed
- TODO: on an addition to the stack, if a next level is to be added without the previous one
existing, the level gets nixed
- TODO: gets copied into the run folder upon each run:
- upon a run, the status.yaml gets copied into the base folder
- and a commit # gets added to it
- plus if any untracked changes are present in the tracked files inside .bioflow
- TODO: [USABILITY] store a header of what was analyzed and where it was pulled from + env
parameters in a text file in the beginning of a run.
- TODO: define a persistent "environment_state" file in the $BIOFLOWHOME/.internal/
- TODO: log the organism currently operating
=> check_organism() -> base, neo4j, laplacians
=> update_organism(base, neo4j, laplacians) -> None
- TODO: log the organism loaded in the neo4j
- TODO: log the organism loaded in laplacians
- TODO: define a "check_org_state" function
- TODO: define an "update_org_state" function
- TODO: make sure that the organisms in operating/neo4j/laplacian are all synced
=> "check_sync()" (calls check_organism, raises if inconsistency)
- TODO: make sure that the neo4j is erased before a new organism is loaded into it.
=> "check_neo4j_empty" (calls check_organism, checks that neo4j is "None")
- TODO: make sure that the retrieved background set is still valid
<END Environment registration>
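A hedged sketch of the layered status.yaml registration described in this block; the layer
names and the helper function are illustrative, not existing BioFlow code::

    import os
    import yaml  # PyYAML

    LAYERS = ['download', 'organism', 'neo4j', 'laplacians', 'translation']
    STATUS_PATH = os.path.join(
        os.environ.get('BIOFLOWHOME', os.path.expanduser('~/bioflow')),
        '.internal', 'status.yaml')

    def update_layer(layer, payload):
        """Record a payload for a layer and drop every layer above it."""
        status = {}
        if os.path.exists(STATUS_PATH):
            with open(STATUS_PATH) as status_file:
                status = yaml.safe_load(status_file) or {}
        cutoff = LAYERS.index(layer)
        # everything above the layer being written gets removed
        status = {name: value for name, value in status.items()
                  if name in LAYERS[:cutoff]}
        status[layer] = payload
        os.makedirs(os.path.dirname(STATUS_PATH), exist_ok=True)
        with open(STATUS_PATH, 'w') as status_file:
            yaml.safe_dump(status, status_file)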
<Sanify BioKnowledge>
- TODO: Develop a pluggable Informativity weighting function for the matrix assembly
- TODO: Allow for a score for a physical entity term attachment to the ontology system
- e.g. GO attachment comes from UNIPROT
- Reactome attachment comes from Reactome and can be assigned a linkage score.
- DONE: inline the reach computation to remove excessively complex function
<Type hinting, typing and imports>
- TODO: [SANITY][REFACTOR]: put all the imports under one umbrella, making it clear where they come from
- TODO: move the models and types into a top-level file "typing", containing all the class models
- TODO: [MAINTAINABILITY][REFACTOR]: put a straitjacket on the types of the tuples passed
around and on function type signatures => Partially done, long-term project
- TODO: [SANITY] convert the dicts to Type Aliases / NewType and perform proper type hinting
=> Partially done, long-term project
- TODO: [SANITY][REFACTOR]: define appropriate types: => Partially done, long-term project
- neo4j IDs
- laplacian Matrix
- current
- potential
<Pretty progress>
- TODO: [USABILITY] Improve the progress reporting
Move the INFO to a progress bar. The problem is that we are working with multiple threads in
an async environment. This can be mitigated by using the `aptbar` library
- TODO: single sample loop to aptbar progress monitoring
- TODO: outer loop (X samples) to aptbar progress monitoring
- TODO: move parameters that are currently being printed in the main loop in INFO channel to
DEBUG channel
- TODO: provide progress bar binding for the importers as well
- TODO: [USABILITY]: fold the current verbose state into a `-v/--verbose` argument
DONE SEPARATOR:
- DONE: sort clusters by p-value
- DONE: there seems to be a bug where most clusters don't get output correctly anymore
=> Nope, the behavior is correct, just no calls were actually made
- DONE: there is a problem with the sparse_sampling toggle being stuck on -1 even in the cases
where it should not be.
- DONE: there is a problem with trimming the length of sampled sets
- current hypothesis is that it's due to duplicate neo4j ids that get eliminated during the
translation to the matrix_ids
- hypothesis is confirmed by the sampling engine not having `replace` set to False in
np.random.choice
- DONE: re-enable the env_skip flags in InteractomeInterface
- NOFX: [FEATURE]: Bayesian re-weighting
A possible implementation of this feature is to provide a mechanism that would sample the
flow through the network based on provided pairs/groups and correct the resistances to make sure
the generated flow is non significant (aka set things to 1 by dividing by the information flow).
- TODO: sample a large set of nodes, non-normalized
- TODO: calculate the resulting flows
- NOFX: We are solving the same problem through statistics due to the difficulty of defining a
proper prior and the amount of calculation needed to get there.
DONE: document the need to increase the Java heap of neo4j when operating the database on the human
interactome knowledge.
DONE: [FEATURE]: Factor out the clustering analysis of the network to a different function in
the knowledge/interactome analyses
- DONE: write a significance analysis function,
- taking in the UP2UP tension + UP2UP background tension
- if the analysis was dense
- hierarchically clustering the matrix
- sorting the clusters by size and average flow
- for each size, compare the flow intensity
- use Gumbel to determine significance
- DONE: replicate it for the knowledge analysis
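A minimal sketch of the Gumbel-based significance call outlined above, using scipy's
right-skewed Gumbel; variable names are illustrative::

    import numpy as np
    from scipy.stats import gumbel_r

    def gumbel_p_value(observed_max_flow, background_max_flows):
        """p-value of an observed maximal flow under a Gumbel fit of background maxima."""
        location, scale = gumbel_r.fit(np.asarray(background_max_flows, dtype=float))
        return float(gumbel_r.sf(observed_max_flow, loc=location, scale=scale))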
DONE: [SANITY][REFACTOR]: Integration test grid
- Flat prim, weighted prim, flat prim/sec, weighted prim/sec
- Knowledge & Interactome
- No background, flat background, weighted background
DONE: write the expected environmental variables:
- NEOPASS
- BIOFLOWHOME
DONE: test docker deployment
DONE: rebuild and test for human deployment
DONE: it seems that the current architecture of neo4j is having trouble with very large transactions
- DONE: implement the autobatching
- we are adding a new parameter to the nodes being processed in a batch: n.processing
- it is cleared after a request goes through
- we are limiting calls with the WITH n LIMIT XXX statement
- where XXX is the autobatching parameter
- and performing a batching loop in self._driver.session, but around the
session.write_transaction call
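A hedged sketch of the autobatching loop described above; the processing marker and the
WITH n LIMIT pattern follow the notes, but the exact query and helper names are illustrative::

    BATCH_SIZE = 1000  # the "XXX" autobatching parameter

    def _clear_batch(tx, batch_size):
        result = tx.run(
            "MATCH (n) WHERE n.processing = true "
            "WITH n LIMIT $batch_size "
            "SET n.processing = false "
            "RETURN count(n) AS touched",
            batch_size=batch_size)
        return result.single()["touched"]

    def clear_processing_flags(driver, batch_size=BATCH_SIZE):
        """Loop around write_transaction so every write stays under the batch limit."""
        with driver.session() as session:
            while True:
                touched = session.write_transaction(_clear_batch, batch_size)
                if touched == 0:
                    break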
TODO: document new features
DONE: CLI:
- DONE: add support for secondary sets
- DONE: add support for mail report of completion (import log as well and patch the log)
- DONE: switch pure boolean options to flags; propagate to readme examples
DONE: todoc pass
DONE: Three levels of usage:
- DONE: Basic >>> readme
- DONE: rewrite an example for basic usage, from CLI and example line
- DONE: mention the integration tests and the samples shipped for unittests
- DONE: Advanced
- DONE: secondary set
- DONE: weighted set
- DONE: weighted background
- DONE: reweighting/specific nodes exclusion
- DONE: Deep dive
- DONE: adding to the main database
- DONE: Weighting schema modification
- DONE: Pair generation method
- DONE: Statistical significance
- DONE: Sampling method
- DONE: GDF
- NOFX : change the active organism by simple list and then read from the main_configs
(NOFX: major refactor, interferes)
DONE: clean up dead code
- DONE: Delete old code path
- DONE: Clear deprecation markers
- DONE: Follow up with the propagate from main configs markers
- DONE: Follow up with the renaming of nodes
- TODO: Clear the dangling currentpass and tracing/intest/todoc todos
- DONE: check for doc inconsistency
- DONE: switch pool spin-up to internal function (_)
- NOFX: switch the calculation of the sparsity into the active samples loading function, drop
elsewhere: calculated only in the auto_analyze so far
- DONE: move debug prints/logs to the log.debug
- DONE: modify so that the line is printed only into the debug log and not into the info.log
(aka console log)
- DONE: figure out wtf is wrong with the exception reporting through the SMTP
- DONE: disable all sys.excepthooks in logging and smtp logging and insert explicit wrappers
- DONE: correct all the sparse_sampling in the docs
- NOFX: in knowledge interface, find instances of the coupled LegacyID and name and rename them
(NOFX: they are uniprot-specific and are used in iterations. That would be a major refactor)
<Memoization of actual analysis runs>
- NOFX: [FEATURE] currently, re-doing an analysis with an already analyzed set requires a complete
re-computation of flow generated by the set. However, if we start saving the results into
the mongoDB, we can just retrieve them, if the environment and the starting set are
identical. (NOFX: interferes with exclusion-based re-analysis)
- NOFX: [USABILITY] move the dumps into a mongo database instance to allow swaps between builds
- wrt backgrounds and the neo4j states (NOFX: implemented otherwise)
- DONE: [USABILITY] since 4.0 neo4j allows multi-database support that can be used in order to
build organism-specific databases and then switch between them, without a need to rebuild
- NOFX: [USABILITY]: allow a fast analysis re-run by storing actual UP groups analysis in a
mongo database - aka a true memoization. (NOFX: interferes with exclusion-based re-analysis)
- ????: [USABILITY] add the Laplacian nonzero elements to the shape one (????)
<Specific nodes/links exclusion>
- DONE: provide a list of ids of the nodes to be excluded from the analysis
- DONE: map the nodes to the concrete annotation/physical entity nodes
- DONE: after loading the laplacian interface, find the affected nodes/node pairs
- for nodes, null the corresponding row & column
- for pairs of nodes, null the specific cell pairs indicating the connections
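A minimal sketch of the row/column/cell nulling described above, assuming the matrix
indexes have already been resolved from the neo4j ids::

    from scipy.sparse import lil_matrix, csc_matrix

    def exclude_from_laplacian(laplacian, excluded_nodes=(), excluded_pairs=()):
        """Return a copy of the Laplacian with excluded rows/columns/cells zeroed."""
        working = lil_matrix(laplacian)      # lil is cheap to edit element-wise
        for idx in excluded_nodes:
            working[idx, :] = 0              # null the row ...
            working[:, idx] = 0              # ... and the column of the node
        for i, j in excluded_pairs:
            working[i, j] = 0                # null only the specific connection
            working[j, i] = 0                # the Laplacian is symmetric
        return csc_matrix(working)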
<>
- DONE: [USABILITY]: adjust the sampling spin-up according to how many "good" samples are
already in the mongodb
- NOFX: inline the mapping of the foreground/background IDs inside the auto-analyze set
(NOFX: not an essential feature)
- improves run state registration
- removes an additional layer of logic of saving/retrieving
- is not necessary now that background is used only for the sampling
- NOFX: fold in the different policy functions into the internal properties of the Interface
object and carry them through to avoid excessive arguments forwarding (NOFX: not an essential
feature)
- DONE: transplant those functions into the hash calculation
<new flow and sampling routines>
<Split sets>
- DONE: provide infrastructure for the loading for split hits sets
- The easiest thing to do would be to add a separation character to the loading dumps
- From the UI/UX perspective, however, it is a pure nightmare
- From the user logic, the first-class usage of the secondary set would be a nightmare as
well - it is not a happy path, but rather an additional feature. To enable the secondary set
analysis, we will then be using the
- NOPE: the final decision is to add and document the secondary set start in hits with a
special entry "TARGET SET"
- PROBLEM: there is heavy interference with the parsing of weighted vs unweighted sets.
- DONE: we are splitting the hits_list into two in order to supply it to the downstream tools
- TEST:
- DONE: Discovered that now there was an issue with unmapped values sticking around after
the translation
- DONE: Discovered an issue with floats not being properly parsed anymore
<Weighting of the nodes>:
- DONE: Define pairs in the sampling with a "charge" parameter if the parameters supplied by the
- The problem is that there is no good rule for performing a weight sampling, given that there
are now two distributions in interplay
- We however cannot ignore the problem, because we discretize a continuous distribution -
something that is a VERY BAD PRACTICE (TM)
- Basically, the problem is how to perform statistical tests to make sure not to make
overconfident calls.
=> degree- vs weight-based sampling?
- DONE: allow for weighted sets and biparty sets to be computed:
- DONE: modify the flow computation functions to allow for a current to be set
- DONE: allow the current computation to happen on biparty sets and weighted sets
- DONE: perform automated switch between current computation policies based on what is
supplied to the method
- DONE: allow for different background sampling processes
- DONE: add two additional parameters to the mongoDB:
- DONE: parameter of the sampling type (set, weighted set, biparty, weighted biparty)
- DONE: parameter specific to the set type: (set size, set size + weight distribution,
pairs, pairs + weights)
- DONE: add a sampling policy transformer that takes in the arguments and returns a proper
policy:
- DONE: set sampling
- DONE: weighted set sampling
- DONE: biparty sampling (~ set sampling)
- DONE: weighted biparty sampling (~ weighted set sampling)
- DONE: check that, if the background set is weighted, we can perform a sampling according to the
weights indicated there (a sketch follows this block)
- As of now, it is not used in sampling
- DONE: check if it is parsed in the weighted version
- DONE: check if it is propagated in the weighted version
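A small sketch of the weighted background sampling discussed above (note replace=False,
which avoids the duplicate-id trimming issue found earlier); names are illustrative::

    import numpy as np

    def sample_weighted_background(background_ids, background_weights, sample_size):
        """Draw a weighted sample without replacement, so duplicate ids cannot
        shrink the sample after translation to matrix ids."""
        weights = np.asarray(background_weights, dtype=float)
        probabilities = weights / weights.sum()   # choice() needs p summing to 1
        rng = np.random.default_rng()
        return rng.choice(np.asarray(background_ids), size=sample_size,
                          replace=False, p=probabilities)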
<knowledge interface mirror>
- DONE: Mirror the weight sampling modifications from InteractomeInterface/interactome_analysis to
AnnotomeInterface/knowledge_analysis.
- DONE: conduction routines (does not apply - the loop is accessed directly from the
Knowledge loop due to filtering)
- DONE: flow calculation methods
- DONE: change the calculation of ops to the included method
- DONE: change the decision to go sparse with
- DONE: change the generation of the pairs to the one included in the flow calculation
method
- DONE: add support for split and weighted sets in knowledge_access_analysis
- DONE: forwarding of the secondary set and hits set in the knowledge_access_analysis
- DONE: forwarding of the weights in the knowledge_access_analysis
- DONE: add support for split and weighted sets in the BioKnowledge interface:
- DONE: - flow calculation/evaluation/reduction methods
- DONE: - separation of active up_sample from the weighted sample
- DONE: - switch active samples to private
- NOFX: - explicit weight functions to be supplied upon a full rebuild
- Add an explicit weight function transfer to allow the rebuild
- NOFX: - fast_load: background logic needs to account for whether it is a list of ids or
ids+weights (NOFX: currently does not work)
- DONE: add active sample md5 hash that takes in account the flow and weights as well as
sampling/flow calculation methods
- DONE: add self.set_flow sources, evaluate ops and reduce ops
- DONE: sparse samples standardized to -1
- DONE: random sampling now forwards to the random sampling method in the "policy" folder
- DONE: parse modification - propagate the options now
- DONE: switch to sampling policy in KnowledgeInterface
- DONE: revert the change to sparse_sampling in the text documentation of methods
- DONE: pass the arguments down the pipeline
- DONE: move self.entity_2_terms_neo4j_ids.keys() into self.known_UP_ids upon construction, then
update all the references
- DONE: upon debugging, discovered that the UNIPROT/GO parse is currently broken:
The borked edges seem to be coming from the BioGRID database
- DONE: a lot of uniprot node connections and "weak interactions" parse as GO terms
- DONE: the proper GO terms are not loading - dangling legacy code
- DONE: Remove the current defaults in the policies and allow the user to provide them explicitly
upon module call
- DONE: perform the explicit background pass for BioKnowledge as for the Interactome
- DONE: rename the 'meta_objects' to 'Reactome_base_object' in reactome_inserter.py
<Documentation>
- DONE: [DOC] pass and APIdoc all the functions and modules
- DONE: [DOC] document all the possible exceptions that can be raised
- what will raise an exception
- DONE: [SANITY] remove all old dangling variables and code (deprecated X)
- DONE: [DOC] Document the proper boot cycle of the application
- $BIOFLOWHOME check, use the default location (~/bioflow)
- in case the user configs .yaml is not found, copy it from its own registry to $BIOFLOWHOME
- use the $BIOFLOWHOME/config/main_config.yaml to populate variables in main_configs
- use the information there to load the databases from the internet and set the parsing
locations
- be careful with edits - configs are read safely but are not checked, so you can get random
deep python errors that will need to be debugged.
- everything is logged to the run directory and the $BIOFLOWHOME/.internal/logs
- DONE: [DOC] put an explanation of overall workflow of the library
- DONE: [DOC] check that all the functions and modules are properly documented
- DONE: [SANITY] move the additional from the "annotation_network" to somewhere saner =>
Separate application importing BioFlow as a library
- DONE: [DOC] document where the user-mapped folders live from Docker
- DONE: [DOC] document for the user how to install and map to a local neo4j database
- DONE: [REFACTOR] re-align the command line interface onto the example of an analysis pipeline
- DONE: [SANITY] Docker:
- Add outputs folder map to the host filesystem to the docker-compose
- Remove ank as point of storage for miniconda in Docker
<node weights/context forwarding>
- DONE: eliminate InitSet saving in KnowledgeInterface and Interactome interface
- DONE: they become _background
- DONE: _background can be set on init and is intersected with accessible nodes (that are saved)
- on _init
- on _set_sampling_background
- DONE: perform the modification of the background selection and registration logic.
DONE: First, it doesn't have to be integrated between the AnnotationAccess interface and
ReactomeInterface
DONE: Second, we can project the background into what can be sampled instead of
re-defining the root of the sampling altogether (a small sketch follows this block).
DONE: Background is no longer a parameter supplied upon construction, but only for the
sampling, where it still gets saved with the sampling code.
DONE: the transformation within the sampling is done
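A tiny sketch of projecting a user-supplied background onto the sampleable nodes rather
than re-rooting the sampling; names are illustrative::

    def set_sampling_background(accessible_node_ids, user_background=None):
        """Intersect the user background with the nodes that can actually be sampled."""
        if user_background is None:
            return list(accessible_node_ids)
        return list(set(accessible_node_ids) & set(user_background))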
- DONE: deal with the parse_type inconsistencies (likely remainders of previous insertions
that were off)
- DONE: (physical_entity)-refines-(annotation)
- DONE: (annotation)-refines-(annotation)
- DONE: (annotation)-annotates-(annotation)
- DONE: is_next_in_pathway still has custom_from and custom_to
- DONE: change the way the connections between GOs and UPs are loaded into the KnowledgeInterface
- DONE: [REFACTOR] The policy for the building of a laplacian relies on neo4j crawl (2 steps)
and the matrix build:
- neo4j crawl
- A rule/routine to retrieve the seeds of the expansion
- A rule/routine to expand from those seeds and insert nodes into the network
- matrix build
- creates the maps for the names, ids, legacy IDs and matrix indexes for the physical
entities that will be in the interactome
- connect the nodes with the links according to a weighting scheme
- normalize the weights for the laplacian
- DONE: STAGE 2/3:
- DONE: parse the entire physical entity graph
- DONE: convert the graph into a laplacian and an adjacency matrix
- DONE: check for the giant component
- DONE: write the giant component
- DONE: re-parse the giant component only
- DONE: re-convert the graph into a laplacian
- TODO: ax the deprecated code and class variables in InteractomeInterface
- TODO: ax the deprecated variables in the internal_configs
- DONE: [REFACTOR] inline the neo4j classes deletion (the same way as self_diag)
- DONE: [REFACTOR] On writing into the neo4j DB we need to separate the node types and edges:
- node: physical entity nodes
- edge: physical entity molecular interaction
- edge: identity
- node: annotation (GO + Reactome pathway + Reactome reaction)
- edge: annotates
- node: x-ref (currently the 'annotation' node)
- edge: reference (currently the 'annotates' edge type)
For compatibility with live code, those will initially be referred to as parse_types as a
property
- DONE: [REFACTOR] add universal properties:
- N+E: parse_type:
- N:
- physical_entity
- annotation
- xref
- E:
- physical_entity_molecular_interaction
- annotates
- annotation_relation
- identity
- reference
- refines
- N+E: source
- N+E: source_<property> (optional)
- N: legacyID
- N: displayName
- DONE: [REFACTOR] check that the universal properties were added with an exception in
- DB.link if `parse_type` not defined or `source` not defined
- DB.create if `parse_type` not defined, `source` not defined, `legacyID` not defined or
`displayName` not defined
- DONE: [REFACTOR]
- NOPE: Either add a routine that performs weight assignment to the nodes
- DONE: Or crawl the nodes according to the parse_type tags, return a dict of nodes and a
dict of relationships of the types:
- NodeID > neo4j.Node
- NodeID > [(NodeID, OtherNodeID), ] + {(NodeID, OtherNodeID): properties}
- DONE: [DEBUGGING] write a tool that checks the nodes and edges for properties and numbers and
then finds patterns
- DONE: nodes
- DONE: edges
- DONE: patterns
- DONE: formatting
- DONE: PROBLEM:
- 'Collections' are implicated in reactions, not necessarily proteins themselves.
- Patch:
- Either: link the 'part of collection' to all the 'molecular entity nodes'
- Or: create 'abstract_interface'
- Or: same
- DONE: Due to a number of inclusions (Collection part of Collection, ....), we are going to
introduce a "parse_type: refines"
Current rewriting logic would involve:
- DONE: Upon external insertion, insert as well the properties that might influence the
weight computation for the laplacian construction
- cross-link the Reactome nodes linked with a "reaction" so that it's a direct linking in
the database
- DONE: Changing the neo4j crawl so that it uses edge properties rather than node types
(a sketch follows this block)
- For now we will be proceeding with the "class" node properties as a filter
- Crawl allowed to pass through edges with a set of qualitative properties
- Crawl allowed to pass through nodes with a set of qualitative properties
- Record the link properties {(node_id, node_id): link (neo4j object)}
- Record the node properties {node_id: node (neo4j object)}
- Let the crawl run along the edges until:
- either the allowed number of steps to crawl is exhausted
- there are no more nodes to use as a seed
- DONE: change the weight calculation that will be using the link properties that were recorded
- use the properties of the link and the node pair to calculate the weights for both
matrices
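A hedged sketch of the property-filtered crawl described above: expand from seed ids along
edges whose parse_type is allowed, recording node and link properties. The query shape and
property names are assumptions, not the actual BioFlow crawl::

    def crawl_from_seeds(driver, seed_ids, allowed_edge_types, max_steps=3):
        """BFS-style expansion that stops when steps run out or the frontier empties."""
        node_properties, link_properties = {}, {}
        frontier = set(seed_ids)
        with driver.session() as session:
            for _ in range(max_steps):
                if not frontier:
                    break
                records = session.run(
                    "MATCH (n)-[r]-(m) WHERE id(n) IN $frontier "
                    "AND r.parse_type IN $edge_types "
                    "RETURN id(n) AS src, id(m) AS dst, "
                    "properties(r) AS rel, properties(m) AS node",
                    frontier=list(frontier), edge_types=list(allowed_edge_types))
                next_frontier = set()
                for record in records:
                    link_properties[(record["src"], record["dst"])] = record["rel"]
                    if record["dst"] not in node_properties:
                        node_properties[record["dst"]] = record["node"]
                        next_frontier.add(record["dst"])
                frontier = next_frontier
        return node_properties, link_properties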
- NOPE: [FEATURE] [USABILITY] upon organism insertion and retrieval, use the 'organism' flag on the
proteins and relationships to allow for simultaneous loading of several organisms.
- Superseded by the better way of doing it through multiple databases in a single neo4j
instance
- DONE: record the origin of the nodes and relationships:
- Reactome
- UNIPROT
- DONE: define trust in the names of the different databases and use it as a mask when pulling
relationships
- NOPE: [REFACTOR] remove the GraphDeclarator.py and re-point it directly into the cypher_drivers
- It's already an abstract interface that can be easily re-implemented
- NOPE: [REFACTOR] wrap the cypher_drivers into the db_io_routines class
- Nope, it's already an abstract interface
- DONE: [FUTUREPROOFING] [CODESMELL] get away from using `_properties` of the neo4j database
objects.
=> Basically, now this uses a Node[`property_name`] convention
- DONE: [PLANNED] implement the neo4j edge weight transfer into the Laplacian
- DONE: trace the weights injection
- DONE: define the weighting rules for neo4j
- DONE: enable neo4j remote debugging on the remote lpdpc
- DONE: change the neo4j password on remote lpdpc
- DONE: add the meta-information for loading (eg organ, context, ...)
- doable through a policy function injection
The other next step will be to register the context in the neo4j network in order to be able to
perform loads of networks conditioned on things such as the protein abundance in an organ or the
trust we have in the existence of a link.
- neo4j database:
- REQUIRE: add context data - basically determining the degree of confidence we want to
have in the node. This has to be a property, because the annotations will be used as
weights for edge matrix calculations and hence edges need to have them as well.
- database parsing/insertion functions:
- REQUIRE: add a parser to read the relevant information from the source files
- REQUIRE: add an inserter to add the additional information from the parse files into
the neo4j database
- REQUIRE: an intermediate dict mapping refs to property lists that will be attached to
the nodes or edges.
- laplacian construction:
- REQUIRE: a "strategy" for calculating weights from the data, that can reason on the in
and out nodes and the edge. It can take in properties returned by a retrieval pass.
- REQUIRE: the weighting strategy should be a function that can be plugged in by the end
user, so of the form (neo4j_node, neo4j_node, neo4j_edge) -> weight (a sketch follows this list).
- REQUIRE: the weighting strategy function should always return a positive float and be
able to account for the missing data, even if it is raising an error as a response to
it.
- REQUIRE: the current weighting strategy will be encoded as a function using node types
(or rather sources).
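An illustrative sketch of a pluggable weighting strategy of the form
(node, node, edge) -> positive float; the property names ('source', 'confidence') and the
trust table are assumptions, not the actual schema::

    def default_weighting_strategy(start_node, end_node, edge):
        """Compute an edge weight from edge properties, raising on missing data."""
        source = edge.get('source')
        if source is None:
            raise ValueError("edge without a 'source' property cannot be weighted")
        base_trust = {'Reactome': 1.0, 'UNIPROT': 0.8, 'BioGRID': 0.5}.get(source, 0.25)
        confidence = float(edge.get('confidence', 1.0))  # optional refinement property
        weight = base_trust * confidence
        if weight <= 0:
            raise ValueError("weighting strategy must return a positive float")
        return weight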
<DONE: CONFIGS sanity>
Current architecture:
- `user_configs.py`, containing the defaults that can be overridden
- `XXX.yaml` in `$BIOFLOWHOME/configs` that allow them to be overridden
- the override is performed in the `user_configs.py`
- `main_configs` imports them and performs the calculation of the relative import paths
- all other modules import `main_configs as configs` and use variables as `configs.var`
- the injection is performed by an explicit if-else loop after having read the needed yamls
in a sanitized way.
- an example_configs.yaml is deployed during the installation, which the user can modify.
- the import will look for user_configs.yaml, that the user would have modified
- the yaml file would contain a version of configs, se
- DONE: how do we find where $BIOFLOWHOME is to read the configs from?
- Look up the environment variable. If none is found, proceed with the default location
(~/bioflow). Ask the user to register it explicitly upon installation.
- DONE: define the copy to the user directory without overwrite
- DONE: test that the edits did not break anything by defining a new $BIOFLOWHOME and pulling
all online dbs again
- DONE: move user_configs.py to the .yaml
A possible approach is to use the `cfg_load` library and the recommended good application practices:
- define the configurations dictionary ()
- load the defaults (stored in the configs file within the library)
- load the $BIOFLOWHOME/configs/<> - potential overrides for the defaults
- update the configurations with user's overrides
- update the configurations with the environment variables (or alternatively read from the
command line and inject into the environment)
- PROBLEM: we use typing/naming assist from the IDE
We can work around this issue by still assigning all the values retrieved from the
config files to the variable names defined in the main_configs file.
In this way, our default variable definitions are the "default", whereas the user's yaml
file supplies potential overrides (a sketch follows this list)
- PROBLEM: storing the build/background parameters for the Interactome_Analysis and the
Annotome_Analysis classes
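A minimal sketch of the defaults-plus-overrides loading described above, using plain PyYAML
instead of cfg_load; the paths and key names are illustrative::

    import os
    import yaml

    DEFAULTS_PATH = os.path.join(os.path.dirname(__file__), 'configs', 'defaults.yaml')
    USER_PATH = os.path.join(
        os.environ.get('BIOFLOWHOME', os.path.expanduser('~/bioflow')),
        'configs', 'user_configs.yaml')

    def load_configs():
        """Read the shipped defaults, then let the user's yaml override them."""
        with open(DEFAULTS_PATH) as defaults_file:
            configs = yaml.safe_load(defaults_file) or {}
        if os.path.exists(USER_PATH):
            with open(USER_PATH) as user_file:
                configs.update(yaml.safe_load(user_file) or {})
        return configs

    # assigning to module-level names keeps the IDE typing/naming assist working
    _configs = load_configs()
    # e.g. SMTP_LOGGING = _configs.get('smtp_logging', False)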
Intermediate problem: there is a loading problem for the `BioKnowledgeInterface`, due to the `InitSet`
used on construction (~6721 nodes) being significantly bigger than the `InitSet` used to
generate the version that is being put into storage.
- `InitSet` is loaded as `InteractomeInterface.all_uniprots_neo4j_id_list`. In case
`reduced_set_uniprot_node_ids` is defined in the parameters, the source set from the
`InteractomeInterface` is trimmed to only nodes that are present in the limiter list.
- `InteractomeInterface` is the one currently stored in a way that can be retrieved by a
`.fast_load()` and is not changed since the last load. The `.fast_load()` call
is performed
- It looks like the problem is in the fact that the rebuild uses a background limiter (all
detectable genes), whereas the fast load doesn't.
=> Temporary fix:
- DONE: define a paramset with background in the `analysis_pipeline_example`.
=> In the end, it is a problem of organization of the parameters and of the context.
- TODO: Set a user flag to know if we are currently using a background.
- TODO: Set checkers on the background loads to make sure we
- TODO: Version the builds of the MolecularInterface and BioKnowledge interface
- TODO: Provide a fast fail in case if the environment parameters differ between the
build and the fastload. Environment parameters are in the `user_configs.py` file.
- TODO: this is all wrapped in the environment variables
- TODO: for each run, this is saved as a .env text file render.
- DONE: [SANITY] Configs management:
- DONE: move the organism to the '~/bioflow'
- DONE: all the string `+` concatenations need to be `os.path.join`.
- DONE: active organism is now the only thing that is saved. It is stored in a "shelve" file
inside the ".internal" directory
- NOIMP: fold in the sources for the databases into a single location, with a selector from
"shelve" indicating which organisms to load.
- NOIMP: create a user interface command in order to set up the environment and a saving file
that allows the configs to be saved between users.
- DONE: move the `online_dbs.ini`, `mouse.ini`, `yeast.ini` to the `~/bioflow/configs` and add
`user_configs.ini` to it to replace `user_configs.py`.
- DONE: [SANITY]: move the location from which the base folder is read for it to be computed
(for relative bioflow home insertion) (basically the servers.ini override)
The next step will be to register user configurations in a more sane way. Basically, it can
either be a persistent dump that is loaded every time the user spools up the program or an
.ini file in addition to the ones already existing
- Persistent dump:
- PLUS: Removes the need to perform reading in and out of .ini files
- PLUS: Guarantees that the parameters will always be well formed
- MINUS: is non-trivial to modify for the users
- .ini files: => selected, but in the .yaml incarnation
- PLUS: Works in a way that is familiar to most people
- PLUS: Allows
- MINUS: in case configurations are not properly defined, everything crashes
- Both:
- NOPE: a command line process to define the variables
- NOPE: a command line to show all the active flags
- DONE: the configs folder to be gutted of active configs and those moved to the
~/bioflow directory
- DONE: a refactor to show transparently the override between the default parameters
and the user-supplied parameters
- DONE: a refactor to remove the conflicting definitions (such as deployment vs test
server parameters)
- DONE: [USABILITY] add a proper tabulation and limit float length in the final results print-out
(tabulate: https://pypi.python.org/pypi/tabulate)
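A small illustration of the tabulate-based print-out with a limited float length; the
column names and values are placeholders::

    from tabulate import tabulate

    rows = [("HGNC_SYMBOL", 12.3456789, 0.000123456)]
    print(tabulate(rows,
                   headers=("node", "flow", "p-value"),
                   floatfmt=".3g",      # limits the float length in the print-out
                   tablefmt="simple"))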
- DONE [USABILITY] add limiters on the p_value that is printed out for elements
- DONE: [USABILITY] change colors of significant elements to red; all others to black (with alpha)
- This modification is to be performed in the `samples scatter and hist` function in the
`interactome_analysis` module
- DONE: [FEATURE] [REFACTOR]:
- add the selection of the degree width window for stat. significance calculation.
- DONE: [FEATURE]:
- add p-value and pp-value to the GO annotation export
- DONE: Currently, performing an output re-piping. The output destinations are piped around thanks
to a NewOutput class in main_configs, which can be initialized with a local output directory
(and will be initialized in the auto-analyze function for both the interactome and the knowledge
analysis network)
- DONE: [DEBUG] add the interactome_network_stats.png to the run folder
- DONE: [USABILITY]: fold the p-values into the GO_GDF export in the same way we do it for the
interactome
- NOIMP: [USABILITY] Add an option for the user to add the location for the output in the
auto-analyse
- DONE: [USABILITY] align the rendering of the conditions in annotations analysis with the
interactome analysis
- DONE: [DEBUG]: align BioKnowledgeInterface analysis on the InteractomeAnalysis:
- Take the background list into account
- Take in account the analytics UP list in the hashing (once background is taken into account)
We are dealing with a problem on the annotation analysis network not loading the proper
background (probably due to the wrong computation of the laplacian). At this point we need to
align the Annotation analysis on the molecular analysis.
- DONE: run git blame on the Molecular network interface, copy new modifications
- DONE: run git blame on molecular network analysis, copy the new modifications
- DONE: [SHOW-STOPPER]: debug why GO terms load as the same term
- Not an issue - just similar GO terms of different types (eg entities)
- TODO: [SANITY]: Feed the location of the output folders for logs with the main parameters
- DONE: create a function to generate paths from a root location
- DONE: define new "TODO"s : (TRACING, OPTIMIZE and CURRENT)
- DONE: Move the "info" log outputs to the parameters
- NOIMP: Allow the user to provide the names for the locations where the information will be
stored
- DONE: trace the pipings of the output / log locations
- DONE: [USABILITY] add a general error log into the info files
- DONE: [USABILITY] save the final table as a tsv into the run directory
- DONE: [USABILITY] format the run folders with the list sent to the different methods
- DONE: [USABILITY] add a catch-it-all for the logs
- DONE: [OPTIMIZATION]: Profile the runtime in the main loop:
- DONE: check for consistency of the sparse matrix types in the main execution loop
- DONE: run a profiler to figure out the number of calls and the time spent in each call. Common
profilers include `cProfile` (packaged in the base python) and `pycallgraph` (although no
updates since 2016). Alternatively, `cProfile` can be piped into `gprof2dot` to generate
a call graph (a sketch follows this block)
- DONE: biggest time sinks are:
- csr_matrix.__binopt (likely binaries for csr matrices) (22056 calls, 81404 ms)
- lil_matrix.__sub__ (7350/79 968)
- lil_matrix.tocsr (7371/76 837)
- sparse_abs (7350/58 700)
- lil_matrix._sub_sparse (3675/48 192)
- csr_matrix.dot/__mul__/_mul_sparse_matrix (~7350 / ~48 000)
- triu (3675/47 592)
- csr_matrix.tocsc (7353 / 47 253)
- csc_matrix.__init__ (121332 / 47 253) (probably in-place multiplication is better)
- DONE: first correction:
- baseline: edge_current_iteration: 3675 calls, 362 662 ms, 40238 own time.
- DONE: uniformize the matrix types towards csc
- eliminated lil_matrix: performance dropped, lil_matrix still there
(3675 calls, 366 321ms, 41 525 own time)
- eliminated a debug branch in get_current_through_nodes => No change
(3675 / 367 066 / 40 238)
- corrected all spmat.diag/triu calls to return csc matrices + all to csc => worse
(3675 / 408 557 / 40 657)
- tracked and imposed formats to all matrix calls inside the fast loop => 50% faster
(3675 / 262 627 / 40 420)
=> csr_matrix still gets initialized a lot and a coo_matrix is somewhere.
lil_matrix is gone now though
- replaced all mat.csc() conversions by tocsc() calls
=> was more or less already done
- DONE: profiled line per line execution.
- sparsity changes are the slowest part, but seem unavoidable
- followed by triu
- followed by additions/multiplications
- followed by cholesky
- DONE: delay the triu until after the current accumulator is filled
- baseline: 261 983
- after: 197 888 => huge improvement
- NOPE: perform in-place multiplication => impossible (no in-place dot/add/subtract
versions)
- TODO: clean-up:
- DONE: remove the debug filters connectors
- DONE: deal with the confusing logic of enabling the splu solver
- problem => We run into the optimization of using a shared solver
- DONE: test the splu and the non-shared solver branch
- Non-sharing works, is slow AF
- Splu switch works, but is slow AF
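A sketch of the profiling pass mentioned above: dump cProfile stats, print the heaviest
cumulative-time entries, and optionally render a call graph from the shell with gprof2dot
(e.g. ``gprof2dot -f pstats main_loop.pstats | dot -Tpng -o main_loop.png``); the callable
name is illustrative::

    import cProfile
    import pstats

    def profile_main_loop(run_main_loop):
        """Profile a callable and report where the time is spent."""
        profiler = cProfile.Profile()
        profiler.enable()
        run_main_loop()
        profiler.disable()
        profiler.dump_stats('main_loop.pstats')   # input for gprof2dot
        stats = pstats.Stats(profiler).sort_stats('cumulative')
        stats.print_stats(20)                     # top 20 time sinks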
- DONE: resolve the problem with the "memoization" naming convention. In our case it means remembering
potential diffs. It appears that it also interacts with "fast load" in "build extended
conduction system". Technically, by performing a memoization into a database, we could
have a searchable DB of past runs, so that the comparison is more immediate. So far the
usage is restricted to `InteractomeInterface.compute_current_and_potentials`, to enable a
`fast load` behavior
- DONE: [DEBUG] [SHOW-STOPPER]: connections between the nodes seem to have disappeared
- DONE: check if this could have been related to memoization. Unlikely. The
only place where the memoized results were accessed was for voltages => it is.
- DONE: run test on the glycogen set. No problem detected there
- DONE: [SANITY] Logs:
=> DONE: Pull the logs and internal dumps into the ~/bioflow directory
=> IGNORE: Hide away the overly verbose info logs into debug logs.
- PTCH: [SANITY] allow user to configure where to store intermediates and inputs/outputs
- DONE: [SANITY] move configs somewhere saner: ~/bioflow/ directory seems to be a good start
- PTCH: [CRITICAL] MATPLOTLIB DOES NOT WORK WITH CURRENT DOCKERFILE IF FIGURE IS CREATED =>
figures are not created.
- DONE: [CRITICAL] ascii in gdf export crashes (should be solved with Py3's utf8)
- DONE: [DEBUG]/[SANITY]: MongoDB:
- DONE: Create a mongoDB connection inside the fork for the pool
- DONE: Move MongoDB interface from configs into a proper location and create DB type-agnostic
bindings
- DONE: [SHOW-STOPPER] Memory leak debugging:
- DONE: apply muppy. Muppy did not detect any object bloating => most likely it comes from the matrix
domain
- DONE: apply psutil-based object tracing. The bloat appears around summation + signature
change operations in the main loop > sparse matrix summation and type change seem to be the origin.
- DONE: try to have consistent matrix classes and avoid implicit conversions. Did not help
with memory, but accelerated the main loop by 10x. Further optimization of the main loop might be
desirable
- DONE: try to explicitly destroy objects with _del and calls of gc. Did not mitigate the
problem. At this point, the memory leak seems to be localized to C code for the
summation/differentiation between csc_matrices.
- DONE: disable multithreading to see if there is any interference there. Did not help
- DONE: extract the summation to create a minimal example: did not help
- DONE: build a flowchart to see all the steps in matrices to try to extract a minimal
example. Noticed that memoization was capturing a complete sparse matrix. That's where the bloat
was happening.
- DONE: correct the memoization to remember the currents only (a sketch follows this block).
- DONE: Threading seems to be failing as well.
The additional threads execute the first sampling, but never commit. Given that they
freeze somewhere in the middle, the most likely hypothesis is that they run out of
RAM and only one thread - the one that keeps a lock on it - continues going forward.
In the end it was due to the memory leak.
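An illustrative sketch of the corrected memoization: key on a hash of the active sample and
store only the per-node currents (a dense 1-D array), never the full sparse matrices that
caused the bloat. The names are hypothetical::

    import hashlib
    import numpy as np

    _current_cache = {}

    def memoized_currents(sample_ids, compute_currents):
        """Return cached node currents for a sample, computing them on a miss."""
        key = hashlib.md5(','.join(map(str, sorted(sample_ids))).encode()).hexdigest()
        if key not in _current_cache:
            _current_cache[key] = np.asarray(compute_currents(sample_ids), dtype=float)
        return _current_cache[key]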
- DONE: [DEBUG]: sampling pools seem to be sharing the random ID now and not to be parallel. CPU
usage however indicates spawned processes running properly
- DONE: Random ID assignments to the threads do not seem to be working either
- DONE: rename pool to thread
- DONE: add the ID to the threads
- DONE: debug why objects all share the same ID across threads (random seed behavior? - Nope).
The final reason was that thread ID was called in Interactome_instance initialization and not
- DONE: [USABILITY] change ops/sec from a constant to the average of the last run (was already
the case)
- DONE: debug the issue where the `all_uniprots_id_list` intersection with `background` leads
to error-prone behavior. Errors:
a) injection of non-uniprot IDs into the connected uniprots
b) change of the signature of the Interface instance by changing:
- `analytics_uniprot_list` in `InteractomeInterface`
- `analytics_up_list` in `BioKnowledgeInterface`
The issue seem to be stemming from the following variables:
in `InteractomeInterface`:
- `self.all_uniprots_neo4j_id_list` (which is a pointer to `self.reached_uniprots_neo4j_id_list`)
- `self.connected_uniprots`
- `self.background`
- `self.connected_uniprots` and `self.background` are directly modified from the `auto_analyze`
routine and then
- The operation above is cancelled by random_sample specifically
Which is probably the source of our problems. Now the issue is how to get rid of the
problem with nodes that failed
- the issue only emerges upon sparse sampling branch firing
- `self.entry_point_uniprots_neo4j_ids` is used by `auto_analyze` to determine sampling depth
and is set by the `set_uniprot_source()` method and is checked by the
`get_interactome_interface()` method
in `BioKnowledgeInterface`:
- `self.InitSet` (which is `all_uniprots_neo4j_id_list` from the `InteractomeInstance` from
which the conduction system is built)
- `self.UPs_without_GO`
The intermediate solution does not seem to be working that well for now: the sampling mechanism
tends to also pull nodes that are not connected to the giant component in the neo4j graph.
- Tentatively patched by making the pull from which the IDs are sampled stricter. Seems to
work well
- DONE: [SHOW-STOPPER]: ReactomeParser does not work anymore, likely a node issue.
The issue was with an automated renaming during a refactoring to extract some additional
data
- DONE: [FEATURE]: (done by defining a function that can be plugged to process any tags in neo4j)
- In Reactome, parse the "Evidence" and "Source" tags in order to refine the laplacian weighting