Commit f932a94

Feature/translation gtdb ncbi (#12)
* first attemp translation * translate for all tax * README * infra fixes * test translation minimal * travis
1 parent 649024a commit f932a94

19 files changed, +2260 -1602 lines changed

README.md

Lines changed: 68 additions & 27 deletions
@@ -4,24 +4,33 @@ Python package to obtain, parse and explore biological taxonomies
44

55
## Description
66

7-
MultiTax is a Python package that provides a common and generalized set of functions to download, parse, filter and explore multiple biological taxonomies (**GTDB, NCBI, Silva, Greengenes, Open Tree taxonomy**) and custom formatted taxonomies. Main goals are:
7+
MultiTax is a Python package that provides a common and generalized set of functions to download, parse, filter, explore, translate, convert and write multiple biological taxonomies (**GTDB, NCBI, Silva, Greengenes, Open Tree taxonomy**) and custom formatted taxonomies. Main goals are:
88

99
- Be fast, intuitive, generalized and easy to use
1010
- Explore different taxonomies with same set of commands
1111
- Enable integration and compatibility with multiple taxonomies
12-
- *Translate and convert taxonomies (not yet implemented)*
12+
- Translate taxonomies (partially implemented)
13+
- Convert taxonomies (not yet implemented)
1314

14-
MultiTax does not link sequence identifiers to taxonomic nodes, it just handles the taxonomy alone. Some kind of integration to work with sequence ids is planned, but not yet implemented.
15+
MultiTax does not link sequence identifiers to taxonomic nodes, it just handles the taxonomy alone. Some kind of integration to work with sequence or external identifiers is planned, but not yet implemented.
16+
17+
## API Documentation
18+
19+
https://pirovc.github.io/multitax/
1520

1621
## Installation
17-
22+
1823
### pip
1924

20-
pip install multitax
25+
```bash
26+
pip install multitax
27+
```
2128

2229
### conda
2330

24-
conda install -c bioconda multitax
31+
```bash
32+
conda install -c bioconda multitax
33+
```
2534

2635
### local
2736

@@ -31,11 +40,7 @@ cd multitax
3140
python setup.py install --record files.txt
3241
```
3342

34-
## Documentation
35-
36-
https://pirovc.github.io/multitax/
37-
38-
## Basic Example with GTDB
43+
## Basic usage with GTDB
3944

4045
```python
4146
from multitax import GtdbTx
@@ -48,11 +53,11 @@ tax.lineage("g__Escherichia")
4853
# ['1', 'd__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Enterobacterales', 'f__Enterobacteriaceae', 'g__Escherichia']
4954
```
5055

51-
## Further Examples
56+
## Examples
5257

53-
[List of all functions](https://pirovc.github.io/multitax/multitax/multitax.html)
58+
- [List of functions](https://pirovc.github.io/multitax/multitax/multitax.html)
5459

55-
### Obtain/load/parse taxonomy
60+
### Load
5661

5762
```python
5863
from multitax import GtdbTx # or NcbiTx, SilvaTx, ...
@@ -131,7 +136,11 @@ tax.stats()
131136
# 'domain': 2,
132137
# 'root': 1}),
133138
# 'ranks': 45503}
139+
```
134140

141+
### Filter
142+
143+
```python
135144
# Filter ancestors (desc=True for descendants)
136145
tax.filter(['g__Escherichia', 's__Pseudomonas aeruginosa'])
137146
tax.stats()
@@ -148,7 +157,23 @@ tax.stats()
148157
# 'species': 1,
149158
# 'root': 1}),
150159
# 'ranks': 11}
160+
```
151161

162+
### Translate
163+
164+
```python
165+
# GTDB to NCBI
166+
from multitax import GtdbTx, NcbiTx
167+
ncbi_tax = NcbiTx()
168+
gtdb_tax = GtdbTx()
169+
gtdb_tax.build_translation(ncbi_tax)
170+
gtdb_tax.translate("g__Escherichia")
171+
# {'1301', '547', '561', '570', '590', '620'}
172+
```
173+
174+
### Write
175+
176+
```python
152177
# Write tax to file
153178
tax.write("custom_tax.tsv", cols=["node", "rank", "name_lineage"])
154179

@@ -159,7 +184,7 @@ tax.write("custom_tax.tsv", cols=["node", "rank", "name_lineage"])
159184
#...
160185
```
161186

162-
### The same goes for the other taxonomies
187+
### The same applies to other taxonomies
163188

164189
```python
165190
# NCBI
@@ -191,7 +216,9 @@ tax.lineage("f__Enterobacteriaceae")
191216

192217
Using pylca: https://github.com/pirovc/pylca
193218

194-
conda install -c bioconda pylca
219+
```bash
220+
conda install -c bioconda pylca
221+
```
195222

196223
```python
197224
from pylca.pylca import LCA
@@ -208,30 +235,44 @@ L("s__Escherichia dysenteriae", "s__Pseudomonas aeruginosa")
208235
# 'c__Gammaproteobacteria'
209236
```
210237

211-
## General information
238+
## Details
212239

213240
- Taxonomies are parsed into `nodes`. Each node is annotated with a `name` and a `rank`.
214241
- Some taxonomies have a numeric taxonomic identifier (e.g. NCBI) and others use the rank + name as an identifier (e.g. GTDB). In MultiTax all identifiers are treated as strings.
215242
- A single root node is defined by default for each taxonomy (or `1` when not defined). This can be changed with `root_node` when loading the taxonomy (as well as annotations `root_parent`, `root_name`, `root_rank`). If the `root_node` already exists, the tree will be filtered.
216243
- Standard values for unknown/undefined nodes can be configured with `undefined_node`,`undefined_name` and `undefined_rank`. Those are default values returned when nodes/names/ranks are not found.
217-
- Taxonomy files are automatically download or can be loaded from disk (`files` parameter). Alternative `urls` can be provided. When downloaded, files are handled in memory. It is possible to save the downloaded file to disk with `output_prefix`.
244+
- Taxonomy files are automatically downloaded or can be loaded from disk (`files` parameter). Alternative `urls` can be provided. When downloaded, files are handled in memory. It is possible to save the downloaded file to disk with `output_prefix`.
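
The node model described in the list above can be illustrated with a small, self-contained sketch. It is not MultiTax's implementation: the helper names `lineage` and `name_of`, the dictionaries and the fallback value `"undefined"` are assumptions made for illustration only. It shows string identifiers, a single root node (`1`) and a configurable default returned for unknown nodes.

```python
# Illustrative sketch only; all names below are hypothetical, not MultiTax API.
ROOT_NODE = "1"               # single root node, identifiers are strings
UNDEFINED_NAME = "undefined"  # assumed fallback for unknown nodes

# Each node points to its parent.
parents = {
    "d__Bacteria": ROOT_NODE,
    "p__Proteobacteria": "d__Bacteria",
    "c__Gammaproteobacteria": "p__Proteobacteria",
}
names = {
    ROOT_NODE: "root",
    "d__Bacteria": "Bacteria",
    "p__Proteobacteria": "Proteobacteria",
    "c__Gammaproteobacteria": "Gammaproteobacteria",
}


def lineage(node, root_node=ROOT_NODE):
    """Follow parent pointers up to the root and return the path root-first.

    Returns an empty list when the node (or one of its ancestors) is unknown.
    """
    path = []
    while node is not None:
        path.append(node)
        if node == root_node:
            return path[::-1]
        node = parents.get(node)
    return []


def name_of(node, undefined_name=UNDEFINED_NAME):
    """Return the node name, or the configured default when it is not found."""
    return names.get(node, undefined_name)


print(lineage("c__Gammaproteobacteria"))
# ['1', 'd__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria']
print(name_of("g__Unknown"))
# undefined
```

Keeping every identifier as a string lets numeric ids (NCBI style) and rank + name ids (GTDB style) share the same dictionaries without special cases.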
218245

219246
## Translation between taxonomies
220247

221-
Not yet implemented. The goal here is to map different taxonomies if the linkage data is available. That's what I think will be possible.
248+
Partially implemented. The goal is to map between taxonomies when linkage data is available. The table below shows what is currently available.
249+
250+
251+
|from/to |NCBI |GTDB |SILVA |OTT |GG |
252+
|--------|---------|-------|----------|--------|------|
253+
|NCBI |- |PART |[part] |[part] |no |
254+
|GTDB |FULL |- |[part] |no |[part]|
255+
|SILVA |[full] |[part] |- |[part] |no |
256+
|OTT |[part] |no |[part] |- |no |
257+
|GG |no |[part] |no |no |- |
258+
259+
Legend:
260+
261+
- full: complete translation available
262+
- part: partial translation available
263+
- no: no translation possible
264+
- []: not yet implemented
265+
266+
### Files and information about specific translations
222267

223-
|from/to |NCBI |GTDB |SILVA |OTT |GG |
224-
|--------|-------|-------|--------|------|----|
225-
|NCBI |- |part |part |part |no |
226-
|GTDB |full |- |no |no |no |
227-
|SILVA |full |no |- |part |no |
228-
|OTT |part |no |part |- |no |
229-
|GG |no |no |no |no |- |
268+
- NCBI <-> GTDB
269+
- GTDB is a subset of the NCBI repository, so the translation from NCBI to GTDB can only be partial
270+
- Translation in both ways is based on: https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tar.gz and https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tar.gz
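
Why one direction can be complete while the other is only partial can be sketched with a plain mapping built from linkage pairs. This is a minimal illustration, not the MultiTax implementation: the pair list, the placeholder identifiers (`G_a`, `N_1`, ...) and the helper names `build_mapping` and `lookup` are hypothetical, and parsing of the metadata files is left out.

```python
# Illustrative sketch only; helper names and identifiers are hypothetical.
from collections import defaultdict


def build_mapping(links):
    """Group (source_node, target_node) pairs into source -> set of targets."""
    mapping = defaultdict(set)
    for source, target in links:
        mapping[source].add(target)
    return dict(mapping)


def lookup(mapping, node):
    """Return all linked target nodes, or an empty set when none is known."""
    return mapping.get(node, set())


# Placeholder linkage data: every source node "G_*" is linked to a target.
links = [("G_a", "N_1"), ("G_a", "N_2"), ("G_b", "N_2")]

forward = build_mapping(links)
# Reversing each pair yields the mapping for the opposite direction.
backward = build_mapping((target, source) for source, target in links)

print(sorted(lookup(forward, "G_a")))   # ['N_1', 'N_2']
print(sorted(lookup(backward, "N_2")))  # ['G_a', 'G_b']
print(lookup(backward, "N_3"))          # set()
```

In this toy data every `G_*` node has at least one link, so the forward direction is complete, while a target node that never appears in the linkage data (`N_3`) has no translation back.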
230271

231272
## Further ideas
232273

233274
- Add/remove/update nodes
234-
- Conversion between taxonomies (write on specific files/format)
275+
- Conversion between taxonomies (write in a specific format)
235276

236277
## Similar projects
237278
