README.md: 68 additions & 27 deletions
@@ -4,24 +4,33 @@ Python package to obtain, parse and explore biological taxonomies
 
 ## Description
 
-MultiTax is a Python package that provides a common and generalized set of functions to download, parse, filterand explore multiple biological taxonomies (**GTDB, NCBI, Silva, Greengenes, Open Tree taxonomy**) and custom formatted taxonomies. Main goals are:
+MultiTax is a Python package that provides a common and generalized set of functions to download, parse, filter, explore, translate, convert and write multiple biological taxonomies (**GTDB, NCBI, Silva, Greengenes, Open Tree taxonomy**) and custom formatted taxonomies. Main goals are:
 
 - Be fast, intuitive, generalized and easy to use
 - Explore different taxonomies with same set of commands
 - Enable integration and compatibility with multiple taxonomies
--*Translate and convert taxonomies (not yet implemented)*
+- Translate taxonomies (partially implemented)
+- Convert taxonomies (not yet implemented)
 
-MultiTax does not link sequence identifiers to taxonomic nodes, it just handles the taxonomy alone. Some kind of integration to work with sequence ids is planned, but not yet implemented.
+MultiTax does not link sequence identifiers to taxonomic nodes, it just handles the taxonomy alone. Some kind of integration to work with sequence or external identifiers is planned, but not yet implemented.
 - Taxonomies are parsed into `nodes`. Each node is annotated with a `name` and a `rank`.
 - Some taxonomies have a numeric taxonomic identifier (e.g. NCBI) and other use the rank + name as an identifier (e.g. GTDB). In MultiTax all identifiers are treated as strings.
 - A single root node is defined by default for each taxonomy (or `1` when not defined). This can be changed with `root_node` when loading the taxonomy (as well as annotations `root_parent`, `root_name`, `root_rank`). If the `root_node` already exists, the tree will be filtered.
 - Standard values for unknown/undefined nodes can be configured with `undefined_node`,`undefined_name` and `undefined_rank`. Those are default values returned when nodes/names/ranks are not found.
-- Taxonomy files are automatically download or can be loaded from disk (`files` parameter). Alternative `urls` can be provided. When downloaded, files are handled in memory. It is possible to save the downloaded file to disk with `output_prefix`.
+- Taxonomy files are automatically downloaded or can be loaded from disk (`files` parameter). Alternative `urls` can be provided. When downloaded, files are handled in memory. It is possible to save the downloaded file to disk with `output_prefix`.
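
The node model described in the list above (string identifiers, a single root, configurable defaults for missing nodes) can be illustrated with a small self-contained sketch. This is not MultiTax's implementation: the class `TinyTaxonomy` and its methods are hypothetical, and only the parameter names `undefined_node`, `undefined_name` and `undefined_rank` are taken from the README text.

```python
class TinyTaxonomy:
    """Toy model for illustration only; not MultiTax's actual class."""

    def __init__(self, undefined_node="", undefined_name="", undefined_rank=""):
        # Defaults returned when a node, name or rank is not found.
        self.undefined_node = undefined_node
        self.undefined_name = undefined_name
        self.undefined_rank = undefined_rank
        # node id -> (parent id, name, rank); every identifier is a string.
        self._nodes = {}

    def add(self, node, parent, name, rank):
        # Coerce identifiers to str so numeric (NCBI-like) and
        # rank + name (GTDB-like) identifiers are handled the same way.
        self._nodes[str(node)] = (str(parent), name, rank)

    def parent(self, node):
        entry = self._nodes.get(str(node))
        return entry[0] if entry else self.undefined_node

    def name(self, node):
        entry = self._nodes.get(str(node))
        return entry[1] if entry else self.undefined_name

    def rank(self, node):
        entry = self._nodes.get(str(node))
        return entry[2] if entry else self.undefined_rank

    def lineage(self, node):
        """Return node identifiers from the root down to `node`."""
        path = []
        current = str(node)
        # Walk up until the parent is unknown; the second check guards against cycles.
        while current in self._nodes and current not in path:
            path.append(current)
            current = self._nodes[current][0]
        return path[::-1]


tax = TinyTaxonomy(undefined_name="unknown")
tax.add("1", "0", "root", "root")
tax.add(561, 1, "Escherichia", "genus")
tax.add(562, 561, "Escherichia coli", "species")

print(tax.lineage(562))   # identifiers come back as strings
print(tax.name("999"))    # missing node: falls back to undefined_name
```

Storing every identifier as a string keeps lookups uniform across taxonomies, at the cost of having to coerce numeric input on every call.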
 
 ## Translation between taxonomies
 
-Not yet implemented. The goal here is to map different taxonomies if the linkage data is available. That's what I think will be possible.
+Partially implemented. The goal is to map different taxonomies if the linkage data is available. That's what is currently available.
+### Files and information about specific translations
 
-|from/to |NCBI |GTDB |SILVA |OTT |GG |
-|--------|-------|-------|--------|------|----|
-|NCBI |- |part |part |part |no |
-|GTDB |full |- |no |no |no |
-|SILVA |full |no |- |part |no |
-|OTT |part |no |part |- |no |
-|GG |no |no |no |no |- |
+- NCBI <-> GTDB
+- GTDB is a subset of the NCBI repository, so the translation from NCBI to GTDB can be only partial
+- Translation in both ways is based on: https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tar.gz and https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tar.gz
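
Why the NCBI to GTDB direction can only be partial is easiest to see with a small sketch of a linkage-based translation. The example below is an assumption-laden illustration, not MultiTax's code: the rows are made-up identifier pairs standing in for per-genome linkage data, and the helper names `build_maps` and `translate` are hypothetical.

```python
from collections import defaultdict

# Made-up linkage rows: (NCBI-style id, GTDB-style id), one row per genome.
# Real linkage data would be read from the metadata files; these values are placeholders.
linkage = [
    ("111", "s__GenusA speciesX"),
    ("111", "s__GenusA speciesY"),  # one NCBI node may map to several GTDB nodes
    ("222", "s__GenusA speciesY"),  # and several NCBI nodes to one GTDB node
]


def build_maps(rows):
    """Build both translation directions from (ncbi_id, gtdb_id) pairs."""
    ncbi_to_gtdb = defaultdict(set)
    gtdb_to_ncbi = defaultdict(set)
    for ncbi_id, gtdb_id in rows:
        ncbi_to_gtdb[str(ncbi_id)].add(str(gtdb_id))
        gtdb_to_ncbi[str(gtdb_id)].add(str(ncbi_id))
    return dict(ncbi_to_gtdb), dict(gtdb_to_ncbi)


def translate(mapping, node):
    """Return the sorted translated nodes, or an empty list if no linkage exists."""
    return sorted(mapping.get(str(node), ()))


ncbi_to_gtdb, gtdb_to_ncbi = build_maps(linkage)

print(translate(ncbi_to_gtdb, "111"))
print(translate(gtdb_to_ncbi, "s__GenusA speciesY"))
# An NCBI node with no genome in the linkage data has no translation.
print(translate(ncbi_to_gtdb, "333"))
```

Every GTDB node in this sketch comes from a linkage row, so it always has at least one NCBI counterpart, while NCBI nodes outside the linkage data translate to nothing.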
 
 ## Further ideas
 
 - Add/remove/update nodes
-- Conversion between taxonomies (write on specific files/format)
+- Conversion between taxonomies (write on specific format)