Skip to content
This repository has been archived by the owner on Aug 12, 2023. It is now read-only.

genenetwork/dump-genenetwork-database

Repository files navigation

dump-genenetwork-database-tests CI badge dump-genenetwork-database CI badge

This repository contains code to dump the metadata in the GeneNetwork relational database to RDF. It requires a connection to a SQL server.

Using

Drop into a development environment with

$ guix shell -m manifest.scm

If the path is not picked up add

export PATH=$GUIX_ENVIRONMENT/bin:$PATH

Build the sources.

$ make

or for a container

mkdir test
guix shell -C --network --share=/run/mysqld/ --manifest=manifest.scm
export GUILE_LOAD_PATH=.:$GUILE_LOAD_PATH
guile json-dump.scm conn.scm test/

Set up connection parameters

Describe the database connection parameters in a file conn.scm file as shown below. Take care to replace the placeholders within angle brackets with the appropriate values.

((generif-data-file . "/path/to/generifs_basic.gz")
 (sql-username . "<sql-username-here>")
 (sql-password . "<sql-password-here>")
 (sql-database . "<sql-database-name-here>")
 (sql-host . "<sql-hostname-here>")
 (sql-port . <sql-port-here>)
 (virtuoso-port . <virtuoso-port-here>)
 (virtuoso-username . "<virtuoso-username-here>")
 (virtuoso-password . "<virtuoso-password-here>")
 (sparql-scheme . <sparql-endpoint-scheme-here>)
 (sparql-host . "<sparql-endpoint-hostname-here>")
 (sparql-port . <sparql-endpoint-port-here>))

Download the GeneRIF data file from https://ftp.ncbi.nih.gov/gene/GeneRIF/generifs_basic.gz and specify its path in the generif-data-file parameter.

Here's a sample conn.scm.

((generif-data-file . "/home/gn/generifs_basic.gz")
 (sql-username . "webqtlout")
 (sql-password . "my-secret-password")
 (sql-database . "db_webqtl")
 (sql-host . "localhost")
 (sql-port . 3306)
 (virtuoso-port . 9081)
 (virtuoso-username . "dba")
 (virtuoso-password . "my-secret-virtuoso-password")
 (sparql-scheme . http)
 (sparql-host . "localhost")
 (sparql-port . 9082))

Dump the database

Then, to dump the database to ~/data/dump, run inside shell

./pre-inst-env ./examples/dump-species-metadata.scm ../conn.scm ~/tmp
$ guix shell -m manifest.scm -- ./pre-inst-env ./examples/dump-dataset-metadata.scm ../conn.scm ~/tmp

Validate and load dump

Then, validate the dumped RDF using rapper and load it into virtuoso. This will load the dumped RDF into the http://genenetwork.org graph, and will delete all pre-existing data in that graph (FIXME)

$ guix shell -m manifest.scm -- rapper --input turtle --count ~/data/dump/dump.ttl
$ guix shell -m manifest.scm -- ./pre-inst-env ./load-rdf.scm conn.scm ~/data/dump/dump.ttl

Upload data to virtuoso

See https://issues.genenetwork.org/topics/systems/virtuoso

Visualize schema

Now, you may query virtuoso to visualize the SQL and RDF schema.

$ guix shell -m manifest.scm -- ./pre-inst-env ./visualize-schema.scm conn.scm

This will output graphviz dot files sql.dot and rdf.dot describing the schema. Render them into SVG images like so.

$ dot -Tsvg -osql.svg sql.dot
$ dot -Tsvg -ordf.svg rdf.dot

Or, peruse them interactively with xdot.

$ xdot sql.dot
$ xdot rdf.dot

The dump-genenetwork-database continuous integration job runs these steps on every commit and publishes its version of sql.svg and rdf.svg.

LICENSE

AGPLv3. See LICENSE.txt

Contributing

The main code tree is hosted at https://git.genenetwork.org/.

git remote add gn git.genenetwork.org:/home/git/public/gn-transform-databases

See bugs and tasks in BUGS.org.