Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Computation of document order needs to be cached #121

Open
shepmaster opened this issue Mar 1, 2018 · 4 comments
Open

Computation of document order needs to be cached #121

shepmaster opened this issue Mar 1, 2018 · 4 comments

Comments

@shepmaster
Copy link
Owner

This program causes intense performance usage. Ultimately, it's because Value::into_string is being called repeatedly, which triggers the computation of nodes in document order. "Thankfully", I realized this problem when adding it:

// Rebuilding this multiple times cannot possibly be performant,

image

extern crate sxd_document;
extern crate sxd_xpath;

use std::fs::File;
use std::io::Read;
use std::collections::HashMap;
use std::borrow::Cow;
use sxd_document::dom::{Document, Element};
use sxd_xpath::{Context, Factory, Value};
use sxd_xpath::nodeset::Node;
use sxd_document::parser;

type DynResult<T> = Result<T, Box<::std::error::Error>>;

fn main() {
    let filename = "radlex.owl";
    println!("Reading file");
    let mut f = File::open(filename).unwrap();
    let mut data = String::new();
    f.read_to_string(&mut data).unwrap();
    let package = parser::parse(&data).unwrap();
    build_rid_index(&package.as_document()).unwrap();
}

/// Build a dictionary of an RID to its respective XML element
fn build_rid_index<'d>(
    radlex: &'d Document<'d>,
) -> DynResult<HashMap<Cow<'d, str>, Element<'d>>> {
    let root = radlex.root();

    let mut ctx = Context::new();
    ctx.set_namespace("xsp", "http://www.owl-ontologies.com/2005/08/07/xsp.owl#");
    ctx.set_namespace("xsd", "http://www.w3.org/2001/XMLSchema#");
    ctx.set_namespace("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
    ctx.set_namespace("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
    ctx.set_namespace("owl", "http://www.w3.org/2002/07/owl#");
    ctx.set_namespace("swrl", "http://www.w3.org/2003/11/swrl#");
    ctx.set_namespace("swrlb", "http://www.w3.org/2003/11/swrlb#");

    println!("Building query");
    let factory = Factory::new();
    let xpath = factory.build("/rdf:RDF/*[starts-with(@rdf:ID, 'RID')]")?;
    let xpath = xpath.expect("No XPath was compiled");

    println!("Evaluating query");
    let value = xpath.evaluate(&ctx, root)?;

    println!("Building dictionary");
    if let Value::Nodeset(nodeset) = value {
        let dict: HashMap<_, _> = nodeset
            .into_iter()
            .filter_map(|x| match x {
                Node::Element(e) => Some(e),
                _ => None,
            })
            .map(|e| {
                let rid = e.attributes()
                    .into_iter()
                    .find(|x| x.name().local_part() == "ID")
                    .unwrap()
                    .value();
                (rid.into(), e)
            })
            .collect();

        Ok(dict)
    } else {
        panic!()
    }
}

Source file

/cc @Enet4

@shepmaster
Copy link
Owner Author

It was also pointed out that we can have a "short cut" if the nodeset only has a single node. In that case, the computation of document order can be avoided entirely.

@shepmaster
Copy link
Owner Author

Using the short cut does fix this case 👍

real	0m2.409s
user	0m2.168s
sys	0m0.221s

I also cleaned up the original code a bit:

    let root = radlex.root();

    let mut ctx = Context::new();
    ctx.set_namespace("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");

    let factory = Factory::new();
    let xpath = factory.build("/rdf:RDF/*/@rdf:ID[starts-with(., 'RID')]")?;
    let xpath = xpath.expect("No XPath was compiled");

    let value = xpath.evaluate(&ctx, root)?;

    if let Value::Nodeset(nodeset) = value {
        let dict = nodeset
            .into_iter()
            .filter_map(|node| node.attribute())
            .map(|a| (a.value().into(), a.parent().unwrap()))
            .collect();

        Ok(dict)
    } else {
        panic!()
    }

@shepmaster
Copy link
Owner Author

What's painful about caching in the full case is that there's nowhere really good to store the cache. The closest place right now would be something like the Context, but it feels really awkward to require passing a Context to methods like Value::string.

The best alternative within the current structure I can think of would be some ugly hidden thing inside of a Document. That leads to issues around staleness of the cache if the document gets updated.

@shepmaster
Copy link
Owner Author

A pragmatic compromise could be to have a pair of functions, such as fn string(&self) and fn string_with_cache(&self, DocOrder). This would allow the library to use the optimized version, but not force end users to deal with it unless they needed it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant