RNA-Seq registry

API to store RNA-Seq datasets.

The RNA-Seq registry is used to keep track of all the RNA-Seq datasets loaded for production. It stores the datasets and their samples with some metadata, and keeps a record of the history.

Requirements

Clone the rnaseq-registry repo and install it in your environment (or better yet, in a virtual environment like penv). For example:

cd $repo_dir
git clone git@github.com:Ensembl/rnaseq-registry.git
cd rnaseq-registry/
pip install .

Make sure you have a build version set in your environment. It is used to distinguish different production releases, e.g.:

export BUILD_VERSION=70
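Since most of the commands below rely on BUILD_VERSION, it can help to check that it is set before running them. The following is a minimal sketch; require_build_version is a hypothetical helper name, not part of rnaseq_registry.

```shell
# Hypothetical helper (not part of rnaseq_registry): succeed only if
# BUILD_VERSION is set and non-empty in the current environment.
require_build_version() {
    if [ -z "${BUILD_VERSION:-}" ]; then
        echo "BUILD_VERSION is not set; run e.g. 'export BUILD_VERSION=70'" >&2
        return 1
    fi
    echo "Using BUILD_VERSION=${BUILD_VERSION}"
}
```

Calling `require_build_version || exit 1` at the top of a script stops it early when the variable is missing.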

Working with the registry

The registry loads a JSON file in the following format. Each dataset has a unique dataset name (name), the organism abbreviation (species), its component, and a list of samples (runs) with their SRA accessions:

 [{
  "component": "Fungi",
  "name": "dataset_name",
  "runs": [
   {
    "accessions": [
     "SRR"
    ],
    "name": "sample1"
   },
   {
    "accessions": [
     "SRR"
    ],
    "name": "sample2"
   }
  ],
  "species": "organism_abbrv"
 }]
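For a single dataset with one sample, a file in this format can be generated from the shell. This is only a sketch: make_dataset_json is a hypothetical helper, it does no escaping (so values must not contain double quotes or backslashes), and the values in the usage line are the same placeholders as in the format above.

```shell
# Hypothetical helper: print a one-dataset, one-sample JSON document in the
# format shown above.
# Arguments: component, organism abbreviation, dataset name, sample name,
# SRA run accession.
make_dataset_json() {
    cat <<EOF
[{
  "component": "$1",
  "name": "$3",
  "runs": [
    {
      "accessions": ["$5"],
      "name": "$4"
    }
  ],
  "species": "$2"
}]
EOF
}

# Usage (placeholder values):
#   make_dataset_json Fungi organism_abbrv dataset_name sample1 SRR > all.json
```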

New datasets loading

To add a new dataset to the registry, you need to create a new JSON file describing the dataset. For example, if you put your data in a file named all.json:

rnaseq_registry dataset $DB_FILE --release $BUILD_VERSION --load all.json

If you get output like either of the following:

SKIP organism 'organism_name' not in the registry x/x datasets can not be loaded (use --replace or --ignore)

SKIP dataset organism_name/dataset_name already in release xx x/x datasets can not be loaded (use --replace or --ignore) to update.

You can set the flag --replace to automatically retire the previous version, if there is one, and replace it with the new dataset.

Note: the old version will still be stored in the registry but will have its latest flag set to False, and its retired field set to the release version provided.
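The load command can be wrapped so that extra flags such as --replace are passed through when needed. This is a sketch under assumptions: load_datasets is a hypothetical wrapper, and it assumes --replace is simply appended to the load command shown above.

```shell
# Hypothetical wrapper: load a JSON file into the registry for the current
# build, passing any extra flags (e.g. --replace) straight through.
# Relies on DB_FILE and BUILD_VERSION being set in the environment.
load_datasets() {
    json_file="$1"
    shift
    if [ ! -f "$json_file" ]; then
        echo "No such file: $json_file" >&2
        return 1
    fi
    rnaseq_registry dataset "$DB_FILE" --release "$BUILD_VERSION" --load "$json_file" "$@"
}

# Usage:
#   load_datasets all.json             # first attempt
#   load_datasets all.json --replace   # retire and replace existing versions
```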

Remapping

If you have RNA-Seq datasets to remap from one organism to another, you first need to make sure the new organism is registered (assuming OLD_ORG and NEW_ORG are set in your environment):

rnaseq_registry organism $DB_FILE --get $NEW_ORG
rnaseq_registry dataset $DB_FILE --remap $OLD_ORG,$NEW_ORG

If you get an error "No organism named NEW_ORG", add the new organism abbreviation yourself (make sure to provide the component database too):

rnaseq_registry organism $DB_FILE --add $NEW_ORG --component $COMPONENT
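The remapping steps above can be chained in one function. This is a sketch, not the registry's own tooling: remap_datasets is a hypothetical name, and it assumes that --get exits with a non-zero status when the organism is not registered, which should be checked before relying on it.

```shell
# Hypothetical helper: register the new organism if needed, then remap.
# Arguments: old organism, new organism, component of the new organism.
# ASSUMPTION: `organism --get` returns a non-zero exit status when the
# organism is not in the registry.
remap_datasets() {
    old_org="$1"
    new_org="$2"
    component="$3"
    if ! rnaseq_registry organism "$DB_FILE" --get "$new_org" >/dev/null 2>&1; then
        rnaseq_registry organism "$DB_FILE" --add "$new_org" --component "$component"
    fi
    rnaseq_registry dataset "$DB_FILE" --remap "$old_org,$new_org"
}

# Usage:
#   remap_datasets "$OLD_ORG" "$NEW_ORG" "$COMPONENT"
```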

To remove a dataset:

rnaseq_registry dataset $DB_FILE --organism $NEW_ORG --dataset $DATASET_NAME --remove

Dump to JSON for the RNA-Seq pipeline

Once you have loaded all the new data, you can dump all the datasets for the build into a JSON file:

rnaseq_registry dataset $DB_FILE --release $BUILD_VERSION --dump_file ./dump_${BUILD_VERSION}.json

To dump the datasets of a single organism instead, use the --organism filter:

rnaseq_registry dataset $DB_FILE --organism $ORGANISM --dump_file ./dump_${ORGANISM}.json

All the datasets for that organism will be dumped into a JSON file to be used in the RNA-Seq pipeline.
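When several organisms need their own dump file, the per-organism command can be run in a loop. A minimal sketch, where dump_organisms is a hypothetical helper and the file naming follows the example above:

```shell
# Hypothetical helper: write one dump file per organism given as argument.
# Relies on DB_FILE being set in the environment.
dump_organisms() {
    for organism in "$@"; do
        rnaseq_registry dataset "$DB_FILE" --organism "$organism" --dump_file "./dump_${organism}.json"
    done
}

# Usage:
#   dump_organisms "$ORGANISM" "$NEW_ORG"
```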

NB: You can have a look at what is in the registry with the 3 main submenus (use --help in any submenu for more details):

rnaseq_registry component $DB_FILE --list
rnaseq_registry organism $DB_FILE --list --with_datasets --component TrichDB
rnaseq_registry dataset $DB_FILE --list --organism tvagG32022

Note:

The organism and dataset lists can get very long, so you should use the filters (depending on the submenu): --release, --component, --organism, --dataset
By default, only the current datasets are shown. To see the ones that have been retired, add the flag --not_latest
The organism list includes all registered organisms, even those without datasets.
Add the flag --with_datasets to only see the ones with datasets.
