Reference genome graph
This assumes VG as the input source.
One can generate a test JSON file using the bash script from this pull request at VG:
#!/bin/bash
((
jq " .path[]|{(\"\(.name)\"):{\"rdf:type\":[{\"value\":\"Path\",\"type\":\"uri\"}]}} " $1|sed '$d';
jq ".edge[]|{(\"\(.from | tostring)\"):{\"before\":[ {\"value\":(\"\(.to |tostring)\"), \"type\" : \"uri\"}]}}" $1 | sed -e "s_^\}_,_"| sed -e 's_^[\{]__'|sed '1d;$d';
jq ".node[]|{(\"\(.id | tostring)\"): {\"rdf:value\":[ {\"value\":.sequence, \"type\" : \"literal\"} ] }}" $1|sed -e "s_^\}_,_"| sed -e 's_^[\{]__'|sed '1d;$d';
echo ',';
jq " .path[] as \$p|\$p.mapping
| keys[] as \$k
| { (\$k | tostring):{\"node\":[ {\"value\":(.[\$k].position.node_id | tostring)
, \"type\":\"uri\"} ]
, \"step\" :
[{\"value\":(\$k | tostring)
, \"type\":\"literal\"}]
,\"path\":
[{\"value\":(\$p.name),
\"type\":\"uri\" } ]}}
" $1 |sed -e "s_^\}_,_"| sed -e 's_^[\{]__';)|sed '$d';echo '}')
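The script above assembles one RDF/JSON object by patching together several jq outputs with sed. As a rough illustration of the target structure only, here is a minimal Python sketch that builds the same kind of object in a single pass. It is not part of VG: the function name `vg_to_rdfjson` and the `<path name>#<index>` step subjects are assumptions made for this example (the script itself uses the bare index as the step subject), and the input shape is assumed to match the JSON fields the jq filters read (`node`, `edge`, `path[].mapping[].position.node_id`).

```python
import json

def vg_to_rdfjson(graph):
    """Return one RDF/JSON object: subject -> predicate -> list of value objects."""
    out = {}

    def add(subject, predicate, value, kind):
        # RDF/JSON maps every predicate to an array of {"value", "type"} objects.
        out.setdefault(subject, {}).setdefault(predicate, []).append(
            {"value": value, "type": kind})

    for node in graph.get("node", []):
        add(str(node["id"]), "rdf:value", node["sequence"], "literal")
    for edge in graph.get("edge", []):
        add(str(edge["from"]), "before", str(edge["to"]), "uri")
    for path in graph.get("path", []):
        add(path["name"], "rdf:type", "Path", "uri")
        for index, mapping in enumerate(path.get("mapping", [])):
            # Assumption: "<path name>#<index>" as the step subject, so steps
            # cannot collide with node ids or with steps of another path.
            step = f"{path['name']}#{index}"
            add(step, "node", str(mapping["position"]["node_id"]), "uri")
            add(step, "step", str(index), "literal")
            add(step, "path", path["name"], "uri")
    return out

# Tiny hand-made graph in the shape the jq filters above expect.
example = {
    "node": [{"id": 1, "sequence": "AC"}, {"id": 2, "sequence": "GT"}],
    "edge": [{"from": 1, "to": 2}],
    "path": [{"name": "x", "mapping": [{"position": {"node_id": 1}},
                                       {"position": {"node_id": 2}}]}],
}
print(json.dumps(vg_to_rdfjson(example), indent=1))
```

Because the whole object is built in memory and serialised once, there are no commas or braces to repair afterwards.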
Then convert this to N-Triples:
./vgtordf.sh x.json |
riot --syntax="RDF/JSON" --output="N-Triples" > x.nt
Then one can query this with the command-line sparql tool from Jena. In the example we rebuild the linear sequence from the paths in the data.
sparql --data x.nt \
"SELECT ?path (group_concat(?sequence; separator='') as ?pathSeq)
WHERE {?step <http://base/path> ?path;
<http://base/node> ?node ;
<http://base/step> ?order.
?node <rdf:value> ?sequence}
GROUP BY ?path
ORDER BY ?order"
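To make explicit what the query above is meant to compute, here is a small Python sketch that does the same grouping over a list of (subject, predicate, object) tuples. It is an illustration, not a replacement for the sparql tool: the function name `path_sequences` is hypothetical, and the predicate IRIs are simply the ones that appear in the query. The sketch sorts the steps numerically itself, since the SPARQL 1.1 specification does not define the order in which GROUP_CONCAT joins its values, and the step numbers are stored as string literals.

```python
from collections import defaultdict

BASE = "http://base/"  # base IRI as it appears in the query above

def path_sequences(triples):
    """Rebuild one linear sequence per path from (subject, predicate, object) tuples."""
    sequence_of = {}           # node IRI -> sequence
    steps = defaultdict(dict)  # step IRI -> {"path": ..., "node": ..., "step": ...}
    for subject, predicate, obj in triples:
        if predicate == "rdf:value":
            sequence_of[subject] = obj
        elif predicate in (BASE + "path", BASE + "node", BASE + "step"):
            steps[subject][predicate[len(BASE):]] = obj

    by_path = defaultdict(list)
    for step in steps.values():
        # Only use steps that carry all three properties.
        if step.keys() >= {"path", "node", "step"}:
            by_path[step["path"]].append((int(step["step"]), step["node"]))

    # Sort by the numeric step order, then concatenate the node sequences.
    return {path: "".join(sequence_of[node] for _, node in sorted(ordered))
            for path, ordered in by_path.items()}
```

Converting the step literal with `int` matters once a path has ten or more steps, because the strings "10" and "2" sort the other way round.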
Longest sequence fragment
sparql --data x.nt "SELECT (max(?len) as ?maxlen)
WHERE { ?x <rdf:value> ?s .
BIND (strlen(?s) as ?len)}"
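As a quick cross-check for the query result, the same number can be read straight from the vg JSON. This is only a sketch: `longest_fragment_length` is a hypothetical helper, and it assumes the `node[].sequence` field used by the jq filters.

```python
def longest_fragment_length(graph):
    """Length of the longest node sequence, or 0 for a graph without nodes."""
    return max((len(node["sequence"]) for node in graph.get("node", [])), default=0)

# Hand-made example data; real input would come from the vg JSON file.
example = {"node": [{"id": 1, "sequence": "ACGT"}, {"id": 2, "sequence": "G"}]}
print(longest_fragment_length(example))  # prints 4
```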
- Generate stable-ish IRIs for nodes in the graphs.
- Determine what terms/predicates we should use to describe the relations captured in the graph.
- Instead of using a multi-pass sed/jq bash script, write out an RDF serialisation directly from the C++ code.
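The items above could eventually look something like the following sketch, which writes N-Triples lines directly in one pass. It is written in Python purely for illustration (the list above targets the C++ code), and every naming choice in it is an assumption: the base IRI `http://example.org/vg/`, the `node/`, `path/` and `step/` IRI layout, and the `before`, `path`, `node` and `step` predicates are placeholders, not terms anyone has agreed on.

```python
from urllib.parse import quote

BASE = "http://example.org/vg/"  # hypothetical base IRI, not a VG convention
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

def literal(text):
    """Quote a string as an N-Triples literal, escaping the characters that need it."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'

def vg_to_ntriples(graph, base=BASE):
    """Yield one N-Triples line per statement, in a single pass over the graph."""
    def node_iri(node_id):
        return f"<{base}node/{node_id}>"

    for node in graph.get("node", []):
        yield f"{node_iri(node['id'])} <{RDF}value> {literal(node['sequence'])} ."
    for edge in graph.get("edge", []):
        yield f"{node_iri(edge['from'])} <{base}before> {node_iri(edge['to'])} ."
    for path in graph.get("path", []):
        # Percent-encode the path name so it is safe inside an IRI.
        path_iri = f"<{base}path/{quote(path['name'], safe='')}>"
        yield f"{path_iri} <{RDF}type> <{base}Path> ."
        for rank, mapping in enumerate(path.get("mapping", [])):
            step_iri = f"{path_iri[:-1]}/step/{rank}>"
            yield f"{step_iri} <{base}path> {path_iri} ."
            yield f"{step_iri} <{base}node> {node_iri(mapping['position']['node_id'])} ."
            yield f'{step_iri} <{base}step> "{rank}"^^<{XSD_INTEGER}> .'
```

Typing the step rank as an integer lets a query sort on it numerically. Whether IRIs derived from node ids are stable enough across graph rebuilds is exactly the open question in the first item.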
In your git checkout of VG:
./vg construct -r ./test/small/x.fa -v ./test/small/x.vcf.gz > ./test/x.vg
./vg view -V ./test/x.vg > ./test/x.json
# then format the JSON for testing
jq . ./test/x.json > ./test/xn.json
Idea 1: Use jq to format the output as RDF/JSON
Convert the "Nodes" into simple triples where the node id is related to the sequence
jq ".node[]| \
{(\"<\"+(.id | tostring)+\">\"):{\"rdf:value\":.sequence, \"type\" : \"literal\"}}" \
xn.json > x.rdfjson
Convert the existing edges into more triples, each just stating that one node is before another node.
jq ".edge[]| \
{(\"<\"+(.from | tostring)+\">\"):{\":before\":(\"<\"+(.to |tostring)+\">\"), \"type\" : \"uri\"}}" \
xn.json >> x.rdfjson
Then we should be able to use Jena RIOT to translate this into any other RDF serialisation or pipe it into a database.
The next step is the path. This is slightly more annoying because RDF natively has only linked lists. For basic reconstruction of a FASTA sequence from such a path we should store the order in the path explicitly (this also allows for loops/repeats in the future without introducing potential infinite loops).
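The point about storing the order explicitly can be illustrated with a short Python sketch. Each step is kept as a (rank, node id) pair, so a path may visit the same node more than once and reconstruction is a plain sort rather than a walk along "before" links that could cycle. The helper names `path_steps` and `rebuild` are made up for this example, and the input shape assumes the `mapping[].position.node_id` fields read by the jq filters.

```python
def path_steps(path):
    """Return explicit (rank, node id) pairs for one path from the vg JSON."""
    return [(rank, str(mapping["position"]["node_id"]))
            for rank, mapping in enumerate(path.get("mapping", []))]

def rebuild(path, sequences):
    """Concatenate node sequences in step order; repeated nodes are fine."""
    return "".join(sequences[node] for _, node in sorted(path_steps(path)))

# Hand-made example: the path loops back to node 1 after visiting node 2.
sequences = {"1": "AC", "2": "GT"}
looping_path = {"name": "x", "mapping": [{"position": {"node_id": 1}},
                                         {"position": {"node_id": 2}},
                                         {"position": {"node_id": 1}}]}
print(rebuild(looping_path, sequences))  # prints ACGTAC
```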
jq ".path[]|{(\"<\"+.name + \">\"):{\"rdf:type\":{\"type\":\"uri\"}}}" \
xn.json >> x.rdfjson
jq ".path[].mapping | keys[] as \$k | {\"<\(\$k | tostring)>\":{\":node\":{\"value\":\"<\(.[\$k].position.node_id | tostring)>\", \"type\":\"uri\"}, \":step\" : {\"value\":\$k, \"type\":\"literal\"} }}" xn.json >> x.rdfjson
Rough idea of what the query would look like.
SELECT ?pathId (GROUP_CONCAT(?sequence; separator='') AS ?pathSeq)
WHERE {
?pathId a :Path .
?pathId :step ?step .
?step :node ?node .
?step :step ?stepOrder .
?node rdf:value ?sequence .
}
GROUP BY ?pathId
ORDER BY ?stepOrder
#!/bin/bash
#jq ".node[]|{(\"\(.id | tostring)\"): {\"rdf:value\":[ {\"value\":.sequence, \"type\" : \"literal\"} ] }}" $1
#jq ".edge[]|{(\"\(.from | tostring)\"):{\"before\":[ {\"value\":(\"\(.to |tostring)\"), \"type\" : \"uri\"}]}}" $1
#jq "[ .path[]|{(\"\(.name)\"):{\"rdf:type\":[{\"value\":\"Path\",\"type\":\"uri\"}]}} ]" $1
jq " [ .path[].mapping | keys[] as \$k | { (\$k | tostring):{\"node\":[ {\"value\":(.[\$k].position.node_id | tostring) , \"type\":\"uri\"} ], \"step\" : [{\"value\":(\$k | tostring), \"type\":\"literal\"} ]} }]" $1
This is not valid RDF/JSON yet. It is missing commas in the right places and produces a JSON array, where RDF/JSON expects a single object keyed by subject.
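One possible way around this, sketched in Python rather than jq, is to merge the array of single-subject fragments into one object after the fact. The function name `merge_fragments` is hypothetical, and the sketch assumes each fragment already has the subject -> predicate -> list-of-values shape that the jq filter above emits.

```python
import json

def merge_fragments(fragments):
    """Merge a list of {subject: {predicate: [values]}} objects into one object."""
    merged = {}
    for fragment in fragments:
        for subject, predicates in fragment.items():
            target = merged.setdefault(subject, {})
            for predicate, values in predicates.items():
                # Append rather than overwrite, so repeated subjects keep all values.
                target.setdefault(predicate, []).extend(values)
    return merged

# Hand-made fragments in the shape of the array printed by the jq filter.
fragments = [
    {"0": {"node": [{"value": "1", "type": "uri"}],
           "step": [{"value": "0", "type": "literal"}]}},
    {"1": {"node": [{"value": "2", "type": "uri"}],
           "step": [{"value": "1", "type": "literal"}]}},
]
print(json.dumps(merge_fragments(fragments), indent=1))
```

Serialising the merged object once also takes care of the commas, since the JSON encoder places them.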