Skip to content

Commit

Permalink
Update README
Browse files Browse the repository at this point in the history
  • Loading branch information
fmesena committed Nov 4, 2024
1 parent bea077f commit 149cca6
Show file tree
Hide file tree
Showing 8 changed files with 1,814 additions and 1 deletion.
76 changes: 75 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1 +1,75 @@
# Safe-sequences
# Safe-sequences

**04/11/2024**

## About safe sequences

The driving motivation for this project is to extend the well known notion of safe paths to safe sequences and provide implementations to efficiently compute these objects. [Safety](https://link.springer.com/chapter/10.1007/978-3-319-31957-5_11) has found applications to numerous bioinformatic problems, namely the genome-guided RNA transcript assembly problem, where this current implementation focuses on. More specifically, given a set of RNA-seq reads, a directed acyclic graph (DAG) is constructed from their alignments to a reference genome. The graph nodes correspond to e.g exons, the arcs correspond to reads overlapping two consecutive exons, and the node or arc weights corresponding their read coverage. The RNA transcripts then correspond to a set of source-to-sink weighted paths in the DAG that best explain the nodes, arcs and their weights, under various definitions of optimality. For example, under the assumption of perfect data, the weights on the arcs of the graph induce a flow, and thus the Minimum Flow Decomposition problem (MFD) becomes a natural and suitable abstraction.

Integer Linear Programming (ILP) is an ubiquous framework to tackle NP-hard problems. As such, one can rely on ILP solvers to solve MFD and many of its variants in an attempt to find optimal sets of paths that explain the graph's structure and flow. Due to the inherent complexity of MFD and its variants, even optimized solvers (e.g., Gurobi) cannot compute solutions within a practical timeframe. Therefore, incorporating safety information into the Linear Program may help accelerate the solver's execution. The current implementation is tailored to experimentally assess these speed ups.

We remark that safe paths/sequences can be applied to any path-finding ILP model. In other words, safe paths/sequences are correct whenever the space of feasible solutions consists of a set of source to sink paths.

## Requirements

To solve linear programs we use [Gurobi](https://www.gurobi.com/). We recommend checking [this](https://www.gurobi.com/academia/academic-program-and-licenses/).
To compute maximum weight edge antichains we use a classical reduction to the Minimum Flow problem and then use [NetworkX](https://networkx.org/) to find these antichains for us. All of the above can be easily installed in Python.

## Usage

### Input format

Every graph has a header line of the form "#Graph" followed by a space and an identifier of the graph, i.e., a string. Followed by the header, in a new line, the number of nodes of the graph is written. Then, every next line should be of the form "u v f(u,v)", where u,v are nodes of the graph represented by natural numbers and f(u,v) is the weight on the arc (u,v) (so there should be as many of these lines as arcs in the graph). The values f(u,v) should be integral but not necessarily need to induce a flow.

```text
#Graph id1
n_1
u v f(u,v)
v w f(v,w)
...
u w f(u,w)
#Graph id2
n_2
a b f(a,b)
b c f(b,c)
...
g l f(g,l)
...
#Graph id8
n_100
...
```

Every input file should contain at least one graph. The graph must be a DAG without parallel arcs.
We do not give any correctness guarantees when different graphs carry the same identifiers.
Check the example/test.graph file in the repository for a concrete example.

### Example
```bash
python3 main.py -i example/test.graph -m 0 -c
```

### Options:
```text
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input file path
-t THREADS, --threads THREADS
Number of threads for Gurobi (default: 4)
-g TIMEOUT, --timeout TIMEOUT
Timeout for Gurobi in seconds (default: 300)
-e EPSILON, --epsilon EPSILON
Relative optima improvement for Gurobi in two consecutive iterations; must be between 0 and 1 (default: 0.25)
-c, --clear Clears log file before exiting
-m {0,1,2,3}, --mode {0,1,2,3}
Mode to run. 0: demo used in the paper for MinPathError; 1: demo used in the paper for LeastSquares; 2:
same as 1 but actually optimizes on the solution size and the cumulative errors; 3: skeleton function for
safety (utilize as you see fit).
```

### Output and results analysis'
When running with modes 0, 1, and 2, some metrics of interest are written to a file (e.g., Gurobi's running time equipped with and without safety information, maximum edge antichain of the graph, and more). To produce the LateX tables use the `stats.py` module. In the stats module please note that the `width-ranges` parameter may need adjustment.

## Contact

Please contact the authors for any problem related to the code, namely errors and suggestions to improve.
26 changes: 26 additions & 0 deletions example/test.graph
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
#Graph 0
3
0 1 1
1 2 1
#Graph 1
4
0 1 1
0 2 2
2 3 2
#Graph 2
2
#Graph 3
12
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
3 5 1
5 6 1
6 7 1
7 8 1
6 9 1
9 10 1
8 10 1
10 11 1
108 changes: 108 additions & 0 deletions graph.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
class st_DAG:

def __init__(self, n:int, source:int, target:int, id:str):
self.id = id
self.n = n
self.m = 0
self.w = 0
self.graph = [[] for _ in range(self.n)]
self.graph_R = [[] for _ in range(self.n)]
self.edge_list = []
self.flow = dict()

self.source = source
self.sink = target

def get_adj_list(self):
return self.graph

def get_adj_list_R(self):
return self.graph_R

#maybe make weight argument optional
def add_edge(self,u,v,w):
self.graph[u].append(v)
self.graph_R[v].append(u)
self.edge_list.append((u,v))
self.flow[(u,v)] = w
self.m += 1

def is_source(self,u):
return len(self.graph_R[u]) == 0 and u!=self.source and u!=self.sink

def is_sink(self,u):
return len(self.graph[u]) == 0 and u!=self.source and u!=self.sink

def get_sources(self):
return list( filter ( self.is_source, [i for i in range(self.n)] ) )

def get_sinks(self):
return list( filter ( self.is_sink, [i for i in range(self.n)] ) )

def out_neighbors(self,u):
return self.graph[u]

def in_neighbors(self,u):
return self.graph_R[u]

def out_degree(self,u) -> int:
return len(self.out_neighbors(u))

def in_degree(self,u) -> int:
return len(self.in_neighbors(u))

def unique_out_neighbor(self,u) -> bool:
return self.out_degree(u) == 1

def unique_in_neighbor(self,u) -> bool:
return self.in_degree(u) == 1

def outflow(self,u) -> int:
out_neighbors = self.out_neighbors(u)
return sum(map(lambda v : self.flow[(u,v)], out_neighbors))

def inflow(self,v) -> int:
in_neighbors = self.in_neighbors(v)
return sum(map(lambda u : self.flow[(u,v)], in_neighbors))

def excess(self,u) -> int:
return self.inflow(u) - self.outflow(u)

def get_nodes(self) -> list:
return list(range(self.n))

def get_nodes_but_st(self) -> list:
return list(range(1,self.n-1))

def print(self):
print(">>>Graph {} n={} m={}".format(self.id, self.n, self.m))
for u in range(self.n):
l = "{} -> ".format(u)
for v in self.out_neighbors(u):
l += "({},{}); ".format(v,self.flow[(u,v)])
print(l)

def __str__(self):
G = ">>>Graph {} n={} m={}\n".format(self.id, self.n, self.m)
for u in range(self.n):
l = "{} -> ".format(u)
for v in self.out_neighbors(u):
l += "({},{}); ".format(v,self.flow[(u,v)])
G += l + "\n"
return G

class EdgeList:
def __init__(self, n:int):
self.n = n
self.m = 0
self.edge_list = []

def add_edge(self,u,v):
self.edge_list.append((u,v))
self.m += 1

def append_edges(self, edges):
self.edge_list = edges + self.edge_list

def prepend_edges(self, edges):
self.edge_list = self.edge_list + edges
Loading

0 comments on commit 149cca6

Please sign in to comment.