Computation of document order needs to be cached #121

shepmaster · 2018-03-01T14:29:54Z

This program takes an excessive amount of time to run. Ultimately, it's because Value::into_string is being called repeatedly, and every call triggers a fresh computation of the nodes in document order. "Thankfully", I had already recognized this problem when I added that code:

sxd-xpath/src/nodeset.rs

Line 349 in 350c51e

// Rebuilding this multiple times cannot possibly be performant,

extern crate sxd_document;
extern crate sxd_xpath;

use std::fs::File;
use std::io::Read;
use std::collections::HashMap;
use std::borrow::Cow;
use sxd_document::dom::{Document, Element};
use sxd_xpath::{Context, Factory, Value};
use sxd_xpath::nodeset::Node;
use sxd_document::parser;

type DynResult<T> = Result<T, Box<dyn (::std::error::Error)>>;

fn main() {
    let filename = "radlex.owl";
    println!("Reading file");
    let mut f = File::open(filename).unwrap();
    let mut data = String::new();
    f.read_to_string(&mut data).unwrap();
    let package = parser::parse(&data).unwrap();
    build_rid_index(&package.as_document()).unwrap();
}

/// Build a dictionary of an RID to its respective XML element
fn build_rid_index<'d>(
    radlex: &'d Document<'d>,
) -> DynResult<HashMap<Cow<'d, str>, Element<'d>>> {
    let root = radlex.root();

    let mut ctx = Context::new();
    ctx.set_namespace("xsp", "http://www.owl-ontologies.com/2005/08/07/xsp.owl#");
    ctx.set_namespace("xsd", "http://www.w3.org/2001/XMLSchema#");
    ctx.set_namespace("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
    ctx.set_namespace("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
    ctx.set_namespace("owl", "http://www.w3.org/2002/07/owl#");
    ctx.set_namespace("swrl", "http://www.w3.org/2003/11/swrl#");
    ctx.set_namespace("swrlb", "http://www.w3.org/2003/11/swrlb#");

    println!("Building query");
    let factory = Factory::new();
    let xpath = factory.build("/rdf:RDF/*[starts-with(@rdf:ID, 'RID')]")?;
    let xpath = xpath.expect("No XPath was compiled");

    println!("Evaluating query");
    let value = xpath.evaluate(&ctx, root)?;

    println!("Building dictionary");
    if let Value::Nodeset(nodeset) = value {
        let dict: HashMap<_, _> = nodeset
            .into_iter()
            .filter_map(|x| match x {
                Node::Element(e) => Some(e),
                _ => None,
            })
            .map(|e| {
                let rid = e.attributes()
                    .into_iter()
                    .find(|x| x.name().local_part() == "ID")
                    .unwrap()
                    .value();
                (rid.into(), e)
            })
            .collect();

        Ok(dict)
    } else {
        panic!()
    }
}

Source file

/cc @Enet4


shepmaster · 2018-03-01T14:31:14Z

It was also pointed out that we can have a "short cut" if the nodeset only has a single node. In that case, the computation of document order can be avoided entirely.
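A minimal sketch of that short cut, for illustration only. `NodeId`, `compute_document_order`, and `first_in_document_order` are hypothetical stand-ins, not the sxd_xpath API; the `calls` counter exists only to show when the expensive traversal runs.

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Hypothetical node handle; not the sxd_xpath `Node` type.
type NodeId = u32;

/// Hypothetical stand-in for the expensive whole-document traversal.
/// `calls` counts how often the traversal runs.
fn compute_document_order(all_nodes: &[NodeId], calls: &Cell<usize>) -> HashMap<NodeId, usize> {
    calls.set(calls.get() + 1);
    all_nodes.iter().enumerate().map(|(i, &n)| (n, i)).collect()
}

/// Returns the node of `nodeset` that comes first in document order.
fn first_in_document_order(
    nodeset: &[NodeId],
    all_nodes: &[NodeId],
    calls: &Cell<usize>,
) -> Option<NodeId> {
    match nodeset {
        [] => None,
        // Short cut: a single node is trivially first, so no traversal is needed.
        [only] => Some(*only),
        _ => {
            let order = compute_document_order(all_nodes, calls);
            nodeset
                .iter()
                .copied()
                .min_by_key(|n| order.get(n).copied().unwrap_or(usize::MAX))
        }
    }
}
```

With a one-element nodeset the counter stays at zero; only nodesets with two or more nodes pay for the traversal.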

shepmaster · 2018-03-01T14:42:57Z

Using the short cut does fix this case 👍

real	0m2.409s
user	0m2.168s
sys	0m0.221s

I also cleaned up the original code a bit:

    let root = radlex.root();

    let mut ctx = Context::new();
    ctx.set_namespace("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");

    let factory = Factory::new();
    let xpath = factory.build("/rdf:RDF/*/@rdf:ID[starts-with(., 'RID')]")?;
    let xpath = xpath.expect("No XPath was compiled");

    let value = xpath.evaluate(&ctx, root)?;

    if let Value::Nodeset(nodeset) = value {
        let dict = nodeset
            .into_iter()
            .filter_map(|node| node.attribute())
            .map(|a| (a.value().into(), a.parent().unwrap()))
            .collect();

        Ok(dict)
    } else {
        panic!()
    }

shepmaster · 2018-03-01T15:21:26Z

What's painful about caching in the full case is that there's nowhere really good to store the cache. The closest place right now would be something like the Context, but it feels really awkward to require passing a Context to methods like Value::string.

The best alternative within the current structure I can think of would be some ugly hidden thing inside of a Document. That leads to issues around staleness of the cache if the document gets updated.
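To make the staleness concern concrete, here is a rough sketch of a cache hidden inside a document. Everything in it is an assumption for illustration: `Doc`, its `version` counter, and `position` are hypothetical and do not exist in sxd_document. The idea is that every mutation bumps a version number, and the cached order is rebuilt whenever its recorded version no longer matches.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

/// Hypothetical node handle; not an sxd_document type.
type NodeId = u32;

/// Hypothetical document that hides an order cache inside itself.
pub struct Doc {
    nodes: Vec<NodeId>, // stored in document order
    version: u64,       // bumped on every mutation
    cache: RefCell<Option<(u64, HashMap<NodeId, usize>)>>,
    pub rebuilds: Cell<usize>, // only here to observe cache rebuilds
}

impl Doc {
    pub fn new(nodes: Vec<NodeId>) -> Doc {
        Doc { nodes, version: 0, cache: RefCell::new(None), rebuilds: Cell::new(0) }
    }

    /// Any mutation bumps the version, which marks the cache as stale.
    pub fn insert(&mut self, index: usize, node: NodeId) {
        self.nodes.insert(index, node);
        self.version += 1;
    }

    /// Position of `node` in document order; rebuilds the cache only if stale.
    pub fn position(&self, node: NodeId) -> Option<usize> {
        let mut cache = self.cache.borrow_mut();
        let stale = match &*cache {
            Some((v, _)) => *v != self.version,
            None => true,
        };
        if stale {
            self.rebuilds.set(self.rebuilds.get() + 1);
            let order: HashMap<NodeId, usize> =
                self.nodes.iter().enumerate().map(|(i, &n)| (n, i)).collect();
            *cache = Some((self.version, order));
        }
        match &*cache {
            Some((_, order)) => order.get(&node).copied(),
            None => None,
        }
    }
}
```

The cost of this design is exactly the awkwardness described above: every mutating method has to remember to bump the version, and the interior mutability is invisible to callers.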

shepmaster · 2018-03-01T15:24:10Z

A pragmatic compromise could be to have a pair of functions, such as fn string(&self) and fn string_with_cache(&self, DocOrder). This would allow the library to use the optimized version, but not force end users to deal with it unless they needed it.
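A sketch of what that pair of functions might look like, using simplified stand-in types. `Doc`, `DocOrder::compute`, and this `Nodeset` are hypothetical and far simpler than the real sxd_xpath types; taking `&DocOrder` by reference is also an assumption, since the signature above does not say how the order is passed.

```rust
use std::collections::HashMap;

/// Hypothetical node handle; not the sxd_xpath `Node` type.
type NodeId = u32;

/// Hypothetical simplified document: `(id, string-value)` pairs in document order.
pub struct Doc {
    pub nodes: Vec<(NodeId, String)>,
}

/// Hypothetical precomputed document order, mapping each node to its position.
pub struct DocOrder(HashMap<NodeId, usize>);

impl DocOrder {
    /// Stand-in for the expensive traversal of the whole document.
    pub fn compute(doc: &Doc) -> DocOrder {
        DocOrder(doc.nodes.iter().enumerate().map(|(i, (id, _))| (*id, i)).collect())
    }
}

/// Hypothetical nodeset: an unordered collection of nodes from one document.
pub struct Nodeset<'d> {
    pub doc: &'d Doc,
    pub nodes: Vec<NodeId>,
}

impl<'d> Nodeset<'d> {
    /// Convenience form: recomputes the document order on every call.
    pub fn string(&self) -> String {
        self.string_with_cache(&DocOrder::compute(self.doc))
    }

    /// Optimized form: the caller supplies an order it computed once.
    pub fn string_with_cache(&self, order: &DocOrder) -> String {
        // In XPath 1.0, a nodeset converts to the string-value of its first
        // node in document order, or to "" when the nodeset is empty.
        let first = self
            .nodes
            .iter()
            .min_by_key(|n| order.0.get(*n).copied().unwrap_or(usize::MAX));
        first
            .and_then(|id| self.doc.nodes.iter().find(|(n, _)| n == id))
            .map(|(_, s)| s.clone())
            .unwrap_or_default()
    }
}
```

Library internals that convert many nodesets from the same document could compute one `DocOrder` and call `string_with_cache` each time, while end users keep calling plain `string`.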


shepmaster added a commit that referenced this issue Oct 31, 2018

Avoid computing the document order when there's only one node

5a4f6a3

Partially addresses #121

shepmaster mentioned this issue Oct 31, 2018

Avoid computing the document order when there's only one node #125

Merged
