-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PathZero
generates a _lot_ of unnecessary computation
#49
Comments
PathZero
generates a *lot* of unnecessary computationPathZero
generates a _lot_ of unnecessary computation
I’m sure you’re right. Emphasis was on getting the functionality in rather than optimization. A PR would be welcome. |
I can try when I have some time but I think I will have to deal with the workaround in the interim. Do you know where |
The class/method comment references the definition in the algebra section of the grammar spec. https://www.w3.org/TR/sparql11-query/#defn_evalPP_ZeroOrOnePath. It would be the result of parsing a path expression to SSE; there may be a test for it specifically. I can spend some time on it in the next day or so. |
This is worth its own investigation. Literal comparison has a number of rules regarding exceptions (Type Error). PathZero is part of PathOpt, thus the reference to the ZeroOrOne path. The problem lies in the bottom-up query execution, where Path* is Path+ union Path0 and Path0 evaluates to the set of nodes in queriable (thus the Your query parses to the following expression: (prefix
(
(skos: <http://www.w3.org/2004/02/skos/core#>)
(rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>))
(distinct
(project
(?o ?c)
(join
(filter (&& (isIRI ?o) (isIRI ?c)) (bgp (triple ?o a ?c)))
(path <urn:uuid:00ae5130-bb3f-478b-b97a-e868c8bb17d8>
(reverse (seq (seq skos:memberList (path* rdf:rest)) rdf:first)) ?o )) )) ) The SSE emitted is consistent with that generated by Jena ARQ, but I suspect they have an execution path that does not result in this explosion. I think the solution will be to rewrite path_opt and path_star to factor around the path0 case. It seems to me that spec/algabra/query_spec.rb distills the important bits from the SPARQL test suite relating to this, so a correct fix will pass the existing tests. To efficiently run against the entire SPARQL test suite requires cloning the rdf-tests and rdf-star repositories so that the symlinks in /spec will resolve the URLs of the test manifests. I'll continue to spend some time on this. |
I actually opened a branch on my fork to try rearranging the various comparison functions, particularly in It makes some sense, when both subject and object are variables and there is no additional context, for I agree that I am currently trying to get something out the door but will circle back once I have some spare cycles. |
I think this basically comes down to the theory of execution of the SPARQL operators, which is inherently bottom-up, meaning that this bottom level (base <http://example/base/>
(prefix ((rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
(skos: <http://www.w3.org/2004/02/skos/core#>))
(distinct
(project (?o ?c)
(sequence
(filter (exprlist (isIRI ?o) (isIRI ?c))
(bgp (triple ?o rdf:type ?c)))
(sequence
(bgp
(triple ??P244 rdf:first <urn:uuid:00ae5130-bb3f-478b-b97a-e868c8bb17d8>)
(triple ?o skos:memberList ??P245)
)
(path ??P245 (path* rdf:rest) ??P244))))))) While this could improve matters at higher levels, it leaves the same basic solution explosion at the lowest level. Changing the execution model to take solution bindings from the sequence to limit the space that the lowest-level query is run over is probably the way to go, but this looks like a pretty fundamental change, and it's not clear to me how we get there. Ultimately, there is a reason that there are commercial SPARQL engines that have spent a lot on query optimization, and we simply haven't had the resources to explore this more fully. |
Some more thoughts on what could eventually happen:
|
Hey, trying to execute a SPARQL query with the path
?s ^(skos:memberList/rdf:rest*/rdf:first) ?o
where?s
is bound, on a graph of about 35,000 statements. Presumably the query will eventually complete, but only after what looks like comparing the entire graph against itself (cartesian product). Here is a flame graph fromrbspy
that shows the computation is heavily concentrated underSPARQL::Algebra::Operator::PathZero#execute
:I have isolated the behaviour in the following query:
Note that this semantically equivalent query:
…completes in no time at all.
I suspect there may potentially be a strategic issue with the way
PathZero#execute
works. When bothsubject
andobject
are variables, the behaviour appears to be to scan the entire graph for every subject and object and map both subject and object variables to the same values. It also executessolutions.include?
for each solution (once for the subject query and once for the objects), amounting to a sequential scan (O(n²)
) every time, and invocation ofRDF::Query::Solution#==
, each time spurring multiple invocations ofRDF::URI#==
andRDF::Literal#==
, the latter of those two being quite expensive. One additional remark about this situation is that as it stands, literals can't be subjects, so the vast majority of these comparisons—which also happen to be the most expensive to perform—are unnecessary.It seems to me that
(path ?x (path0 :p) ?y)
is like saying "anything that is a solution to?x
is also a solution to?y
(and almost the same the other way around, modulo literals), so without any additional context that is indeed like saying "every subject in the graph which is also an object". The thing is, there should be additional context before this operator is executed, if its default behaviour is to scan the entire graph and compute its cartesian product, only to throw away the results.The text was updated successfully, but these errors were encountered: