Update README

algbio · Nov 4, 2024 · 149cca6 · 149cca6
1 parent bea077f
commit 149cca6
Show file tree

Hide file tree

Showing 8 changed files with 1,814 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -1 +1,75 @@
-# Safe-sequences
+# Safe-sequences
+
+**04/11/2024**
+
+## About safe sequences
+
+The driving motivation for this project is to extend the well known notion of safe paths to safe sequences and provide implementations to efficiently compute these objects. [Safety](https://link.springer.com/chapter/10.1007/978-3-319-31957-5_11) has found applications to numerous bioinformatic problems, namely the genome-guided RNA transcript assembly problem, where this current implementation focuses on. More specifically, given a set of RNA-seq reads, a directed acyclic graph (DAG) is constructed from their alignments to a reference genome. The graph nodes correspond to e.g exons, the arcs correspond to reads overlapping two consecutive exons, and the node or arc weights corresponding their read coverage. The RNA transcripts then correspond to a set of source-to-sink weighted paths in the DAG that best explain the nodes, arcs and their weights, under various definitions of optimality. For example, under the assumption of perfect data, the weights on the arcs of the graph induce a flow, and thus the Minimum Flow Decomposition problem (MFD) becomes a natural and suitable abstraction.
+
+Integer Linear Programming (ILP) is an ubiquous framework to tackle NP-hard problems. As such, one can rely on ILP solvers to solve MFD and many of its variants in an attempt to find optimal sets of paths that explain the graph's structure and flow. Due to the inherent complexity of MFD and its variants, even optimized solvers (e.g., Gurobi) cannot compute solutions within a practical timeframe. Therefore, incorporating safety information into the Linear Program may help accelerate the solver's execution. The current implementation is tailored to experimentally assess these speed ups.
+
+We remark that safe paths/sequences can be applied to any path-finding ILP model. In other words, safe paths/sequences are correct whenever the space of feasible solutions consists of a set of source to sink paths.
+
+## Requirements
+
+To solve linear programs we use [Gurobi](https://www.gurobi.com/). We recommend checking [this](https://www.gurobi.com/academia/academic-program-and-licenses/).
+To compute maximum weight edge antichains we use a classical reduction to the Minimum Flow problem and then use [NetworkX](https://networkx.org/) to find these antichains for us. All of the above can be easily installed in Python.
+
+## Usage
+
+### Input format
+
+Every graph has a header line of the form "#Graph" followed by a space and an identifier of the graph, i.e., a string. Followed by the header, in a new line, the number of nodes of the graph is written. Then, every next line should be of the form "u v f(u,v)", where u,v are nodes of the graph represented by natural numbers and f(u,v) is the weight on the arc (u,v) (so there should be as many of these lines as arcs in the graph). The values f(u,v) should be integral but not necessarily need to induce a flow.
+
+```text
+#Graph id1
+n_1
+u v f(u,v)
+v w f(v,w)
+...
+u w f(u,w)
+#Graph id2
+n_2
+a b f(a,b)
+b c f(b,c)
+...
+g l f(g,l)
+...
+#Graph id8
+n_100
+...
+```
+
+Every input file should contain at least one graph. The graph must be a DAG without parallel arcs.
+We do not give any correctness guarantees when different graphs carry the same identifiers.
+Check the example/test.graph file in the repository for a concrete example.
+
+### Example
+```bash
+    python3 main.py -i example/test.graph -m 0 -c
+```
+
+### Options:
+```text
+  -h, --help            show this help message and exit
+  -i INPUT, --input INPUT
+                        Input file path
+  -t THREADS, --threads THREADS
+                        Number of threads for Gurobi (default: 4)
+  -g TIMEOUT, --timeout TIMEOUT
+                        Timeout for Gurobi in seconds (default: 300)
+  -e EPSILON, --epsilon EPSILON
+                        Relative optima improvement for Gurobi in two consecutive iterations; must be between 0 and 1 (default: 0.25)
+  -c, --clear           Clears log file before exiting
+  -m {0,1,2,3}, --mode {0,1,2,3}
+                        Mode to run. 0: demo used in the paper for MinPathError; 1: demo used in the paper for LeastSquares; 2:
+                        same as 1 but actually optimizes on the solution size and the cumulative errors; 3: skeleton function for
+                        safety (utilize as you see fit).
+```
+
+### Output and results analysis'
+When running with modes 0, 1, and 2, some metrics of interest are written to a file (e.g., Gurobi's running time equipped with and without safety information, maximum edge antichain of the graph, and more). To produce the LateX tables use the `stats.py` module. In the stats module please note that the `width-ranges` parameter may need adjustment.
+
+## Contact
+
+Please contact the authors for any problem related to the code, namely errors and suggestions to improve.
diff --git a/example/test.graph b/example/test.graph
@@ -0,0 +1,26 @@
+#Graph 0
+3
+0 1 1
+1 2 1
+#Graph 1
+4
+0 1 1
+0 2 2
+2 3 2
+#Graph 2
+2
+#Graph 3
+12
+0 1 1
+1 2 1
+2 3 1
+3 4 1
+4 5 1
+3 5 1
+5 6 1
+6 7 1
+7 8 1
+6 9 1
+9 10 1
+8 10 1
+10 11 1
diff --git a/graph.py b/graph.py
@@ -0,0 +1,108 @@
+class st_DAG:
+
+    def __init__(self, n:int, source:int, target:int, id:str):
+        self.id        = id
+        self.n         = n
+        self.m         = 0
+        self.w         = 0
+        self.graph     = [[] for _ in range(self.n)]
+        self.graph_R   = [[] for _ in range(self.n)]
+        self.edge_list = []
+        self.flow      = dict()
+
+        self.source = source
+        self.sink   = target
+
+    def get_adj_list(self):
+        return self.graph
+
+    def get_adj_list_R(self):
+        return self.graph_R
+
+    #maybe make weight argument optional
+    def add_edge(self,u,v,w):
+        self.graph[u].append(v)
+        self.graph_R[v].append(u)
+        self.edge_list.append((u,v))
+        self.flow[(u,v)] = w
+        self.m += 1
+
+    def is_source(self,u):
+        return len(self.graph_R[u]) == 0 and u!=self.source and u!=self.sink
+
+    def is_sink(self,u):
+        return len(self.graph[u]) == 0   and u!=self.source and u!=self.sink
+
+    def get_sources(self):
+        return list( filter ( self.is_source, [i for i in range(self.n)] ) )
+
+    def get_sinks(self):
+        return list( filter ( self.is_sink,   [i for i in range(self.n)] ) )
+
+    def out_neighbors(self,u):
+        return self.graph[u]
+
+    def in_neighbors(self,u):
+        return self.graph_R[u]
+
+    def out_degree(self,u) -> int:
+        return len(self.out_neighbors(u))
+
+    def in_degree(self,u) -> int:
+        return len(self.in_neighbors(u))
+
+    def unique_out_neighbor(self,u) -> bool:
+        return self.out_degree(u) == 1
+
+    def unique_in_neighbor(self,u) -> bool:
+        return self.in_degree(u) == 1
+
+    def outflow(self,u) -> int:
+        out_neighbors = self.out_neighbors(u)
+        return sum(map(lambda v : self.flow[(u,v)], out_neighbors))
+
+    def inflow(self,v) -> int:
+        in_neighbors = self.in_neighbors(v)
+        return sum(map(lambda u : self.flow[(u,v)], in_neighbors))
+
+    def excess(self,u) -> int:
+        return self.inflow(u) - self.outflow(u)
+
+    def get_nodes(self) -> list:
+        return list(range(self.n))
+
+    def get_nodes_but_st(self) -> list:
+        return list(range(1,self.n-1))
+
+    def print(self):
+        print(">>>Graph {} n={} m={}".format(self.id, self.n, self.m))
+        for u in range(self.n):
+            l = "{} -> ".format(u)
+            for v in self.out_neighbors(u):
+                l += "({},{}); ".format(v,self.flow[(u,v)])
+            print(l)
+
+    def __str__(self):
+        G = ">>>Graph {} n={} m={}\n".format(self.id, self.n, self.m)
+        for u in range(self.n):
+            l = "{} -> ".format(u)
+            for v in self.out_neighbors(u):
+                l += "({},{}); ".format(v,self.flow[(u,v)])
+            G += l + "\n"
+        return G
+
+class EdgeList:
+    def __init__(self, n:int):
+        self.n = n
+        self.m = 0
+        self.edge_list = []
+
+    def add_edge(self,u,v):
+        self.edge_list.append((u,v))
+        self.m += 1    
+
+    def append_edges(self, edges):
+        self.edge_list = edges + self.edge_list
+
+    def prepend_edges(self, edges):
+        self.edge_list = self.edge_list + edges