Reference genome graph

JervenBolleman edited this page Sep 18, 2015 · 14 revisions

This page assumes VG as the input source.

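A test RDF/JSON file can be generated with the following bash script, submitted as a pull request to VG. It is saved as vgtordf.sh and takes the JSON view of a VG graph as its only argument.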

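#!/bin/bash
# Usage: ./vgtordf.sh x.json
# Each jq call prints one JSON object per record; the sed stages strip the
# per-record braces so that everything ends up inside one enclosing object.
( (
# Paths: keep the opening brace of the object, drop its closing brace.
jq " .path[]|{(\"\(.name)\"):{\"rdf:type\":[{\"value\":\"Path\",\"type\":\"uri\"}]}} " "$1"|sed '$d';
echo ',';
# Edges: one "before" statement per edge.
jq ".edge[]|{(\"\(.from | tostring)\"):{\"before\":[ {\"value\":(\"\(.to |tostring)\"), \"type\" : \"uri\"}]}}" "$1" | sed -e "s_^\}_,_"| sed -e 's_^[\{]__'|sed '1d;$d';
echo ',';
# Nodes: the sequence of each node as an rdf:value literal.
jq ".node[]|{(\"\(.id | tostring)\"): {\"rdf:value\":[ {\"value\":.sequence, \"type\" : \"literal\"} ] }}" "$1"|sed -e "s_^\}_,_"| sed -e 's_^[\{]__'|sed '1d;$d';
echo ',';
# Path steps: the node visited, the position in the path, and the owning path.
jq  "  .path[] as \$p|\$p.mapping
| keys[] as \$k
| { (\$k | tostring):{\"node\":[ {\"value\":(.[\$k].position.node_id | tostring)
, \"type\":\"uri\"} ]
, \"step\" :
        [{\"value\":(\$k | tostring)
        , \"type\":\"literal\"}]
,\"path\":
        [{\"value\":(\$p.name),
        \"type\":\"uri\" } ]}}
" "$1" |sed -e "s_^\}_,_"| sed -e 's_^[\{]__';)|sed '$d';echo '}')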

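To see what the sed stages in the script do, here is a minimal sketch that runs the same three steps on two hand-written objects standing in for jq output. The node ids and sequences are invented for illustration, and the sed expressions are written with / delimiters rather than copied verbatim from the script.

```shell
# Two pretty-printed objects, shaped like the per-record output of jq.
# The node ids and sequences are made up for illustration.
records='{
  "1": {"rdf:value": [{"value": "ACGT", "type": "literal"}]}
}
{
  "2": {"rdf:value": [{"value": "TT", "type": "literal"}]}
}'

# Turn each closing brace at the start of a line into a comma, blank out
# each opening brace at the start of a line, then drop the first line (now
# empty) and the last line (a trailing comma).
members=$(printf '%s\n' "$records" | sed -e 's/^}/,/' -e 's/^{//' | sed '1d;$d')

# Wrap the comma-separated members in a single enclosing object.
merged=$(printf '{\n%s\n}\n' "$members")
printf '%s\n' "$merged"
```

The result is one JSON object with two members separated by a single comma, which is the overall shape RDF/JSON expects.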
Then convert the output to N-Triples with Jena's riot:

./vgtordf.sh x.json |
riot --syntax="RDF/JSON" --output="N-Triples" > x.nt

The result can be queried with the command-line sparql tool from Jena. This example rebuilds the linear sequence of each path in the data.

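sparql --data x.nt \
"SELECT ?path (group_concat(?sequence; separator='') as ?pathSeq)
 WHERE {
   # Sort the steps in a subquery: ORDER BY after GROUP BY cannot see ?order.
   # Concatenation order is not guaranteed by SPARQL, so check the result.
   { SELECT ?path ?sequence
     WHERE {?step <http://base/path> ?path;
                  <http://base/node> ?node ;
                  <http://base/step> ?order.
            ?node <rdf:value> ?sequence}
     ORDER BY ?path (<http://www.w3.org/2001/XMLSchema#integer>(?order)) }
 }
 GROUP BY ?path"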

Longest sequence fragment

sparql --data x.nt "SELECT (max(?len) as ?maxlen) 
                    WHERE  { ?x  <rdf:value> ?s . 
                             BIND (strlen(?s) as ?len)}"
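Because N-Triples is line-based, the same figure can be cross-checked without a SPARQL engine. The following is a minimal sketch with awk over a hand-written sample; the triples are invented for illustration and only assume the <rdf:value> predicate and http://base/ IRIs used in the queries above.

```shell
# Hand-written sample in the assumed shape of x.nt (not real riot output).
sample='<http://base/1> <rdf:value> "ACGT" .
<http://base/2> <rdf:value> "T" .
<http://base/1> <http://base/before> <http://base/2> .
<http://base/3> <rdf:value> "GATTACA" .'

# Keep only rdf:value triples, strip the quotes around the literal,
# and track the longest sequence seen.
max_len=$(printf '%s\n' "$sample" | awk '
  $2 == "<rdf:value>" {
    s = $3
    gsub(/"/, "", s)
    if (length(s) > max) max = length(s)
  }
  END { print max + 0 }')

echo "$max_len"   # prints 7 for the sample above
```

On a real file, replacing the printf with a redirect from x.nt should give the same number as the SPARQL query, provided the sequences contain no whitespace.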

Work still to be done

  • Generate stable(ish) IRIs for the nodes in the graphs.
  • Determine which terms/predicates we should use to describe the relations captured in the graph.
  • Write out an RDF serialisation directly from the C++ code instead of using the multi-pass sed/jq bash script.

Draft stuff while working on the idea

In your git checkout of VG

./vg construct -r ./test/small/x.fa -v ./test/small/x.vcf.gz > ./test/x.vg
./vg view -V ./test/x.vg > ./test/x.json
# then pretty-print the JSON for testing
jq . ./test/x.json > ./test/xn.json

Idea1: Use jq to format output as RDF/JSON

Convert the "Nodes" into simple triples where the node id is related to the sequence

jq ".node[]| \
   {(\"<\"+(.id | tostring)+\">\"):{\"rdf:value\":.sequence, \"type\" : \"literal\"}}" \
   xn.json > x.rdfjson

Convert the existing edges into more triples, each just stating that one node is before another node.

jq ".edge[]| \
  {(\"<\"+(.from | tostring)+\">\"):{\":before\":(\"<\"+(.to |tostring)+\">\"), \"type\" : \"uri\"}}" \
  xn.json >> x.rdfjson

Then we should be able to use Jena's RIOT to translate this into any other RDF serialisation, or pipe it into a database.

The next step is the path. This is slightly more annoying because RDF natively offers only linked lists for ordered data. For basic reconstruction of a FASTA sequence from such a path we should store the order in the path explicitly (this also allows for loops/repeats in the future without introducing potential infinite loops).

jq ".path[]|{(\"<\"+.name + \">\"):{\"rdf:type\":{\"type\":\"uri\"}}}" \
   xn.json  >> x.rdfjson
jq ".path[].mapping | keys[] as \$k | {\"<\(\$k | tostring)>\":{\":node\":{\"value\":\"<\(.[\$k].position.node_id | tostring)>\", \"type\":\"uri\"}, \":step\" : {\"value\":\$k, \"type\":\"literal\"} }}"   xn.json >> x.rdfjson
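To illustrate why the explicit step index matters, here is a minimal sketch in plain shell, independent of any RDF tooling. The node table and the path are made up; the path visits node 2 twice, which a chain of "before" statements could not express without a cycle.

```shell
# Hypothetical node table: node id, sequence.
nodes='1 ACG
2 TT
3 GA'

# Hypothetical path steps: step index, node id.
# Deliberately unsorted, and node 2 is visited twice.
steps='2 2
0 1
3 3
1 2'

# Feed awk the node table, a separator line, then the steps sorted
# numerically by index. awk looks up each visited node and concatenates
# the sequences in step order.
path_seq=$(
  { printf '%s\n' "$nodes"; echo '--'; printf '%s\n' "$steps" | sort -n -k1,1; } |
  awk '
    $1 == "--" { in_steps = 1; next }
    !in_steps  { seq[$1] = $2; next }
               { out = out seq[$2] }
    END        { print out }'
)

echo "$path_seq"   # prints ACGTTTTGA
```

The numeric sort plays the role of ordering by the step literal; a repeated node is simply two steps with different indices.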

Reconstructing a "FASTA" from a PATH using SPARQL

Rough idea of what the query would look like.

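PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX : <http://base/>   # placeholder namespace
SELECT ?pathId (GROUP_CONCAT(?sequence; separator='') AS ?pathSeq)
WHERE {
  { SELECT ?pathId ?sequence
    WHERE {
      ?pathId a :Path .
      ?pathId :step ?step .
      ?step :node ?node .
      ?step :step ?stepOrder .
      ?node rdf:value ?sequence .
    }
    ORDER BY ?pathId ?stepOrder }
}
GROUP BY ?pathId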

Starting to put it together

#!/bin/bash
#jq ".node[]|{(\"\(.id | tostring)\"): {\"rdf:value\":[ {\"value\":.sequence, \"type\" : \"literal\"} ] }}" $1
#jq ".edge[]|{(\"\(.from | tostring)\"):{\"before\":[ {\"value\":(\"\(.to |tostring)\"), \"type\" : \"uri\"}]}}" $1
#jq "[ .path[]|{(\"\(.name)\"):{\"rdf:type\":[{\"value\":\"Path\",\"type\":\"uri\"}]}} ]" $1
jq " [ .path[].mapping | keys[] as \$k | { (\$k | tostring):{\"node\":[ {\"value\":(.[\$k].position.node_id | tostring) , \"type\":\"uri\"} ], \"step\" : [{\"value\":(\$k | tostring), \"type\":\"literal\"} ]} }]" $1

This is not valid RDF/JSON yet: commas are missing between the generated objects, and the output is an array instead of a single JSON object.