Binary OWL vs. compression: some analysis #6
Experimenting with a simple delta coding for IRI identifiers, with minimal changes to the output format, shows some interesting results. The encoder maintains a table of previously used identifiers whose size is 2^k. When looking up a key, the table is scanned to find the entry whose value has the smallest absolute difference from the key. If the difference is zero, that entry is shuffled to the start of the table; otherwise the table is shifted right, and the first entry in the table is set to the key being looked up. The return value is a long containing the delta left-shifted by k bits, with the index of the value used as the base for the delta stored in the lower k bits. To simplify testing, the index length/flag byte is used to indicate whether the coded value fits in a byte, a short, or an int, and the coded value is stored using the appropriate data output form. This simple approach wastes some space, as there is room available in the flag byte to store the delta-base code for small k. For the GO example above:
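As a concrete illustration, here's a minimal sketch of the history table just described (the class name and method signature are mine, not the actual Binary OWL API):

```java
// A minimal sketch of the delta-coding history table described above.
final class DeltaHistoryTable {
    private final int k;
    private final long[] entries;   // 2^k previously seen ids, most recent first

    DeltaHistoryTable(int k) {
        this.k = k;
        this.entries = new long[1 << k];
    }

    /** Codes id as (delta << k) | indexOfBaseEntry, updating the table. */
    long encode(long id) {
        // Scan for the entry with the smallest absolute difference from id.
        int best = 0;
        for (int i = 1; i < entries.length; i++) {
            if (Math.abs(id - entries[i]) < Math.abs(id - entries[best])) {
                best = i;
            }
        }
        long delta = id - entries[best];
        if (delta == 0) {
            // Exact hit: shuffle the matching entry to the start of the table.
            System.arraycopy(entries, 0, entries, 1, best);
        } else {
            // Miss: shift the whole table right; the oldest entry falls off.
            System.arraycopy(entries, 0, entries, 1, entries.length - 1);
        }
        entries[0] = id;
        return (delta << k) | best;
    }
}
```

The decoder keeps an identical table but never scans: it takes the base index from the low k bits, recovers id = entries[index] + delta, and applies the same table update, which is why decoding is cheap.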
This adds an additional ~8% saving in uncompressed size, and a ~13% saving in compressed size, relative to no delta coding for IRIs. There is a tiny overhead for table maintenance (especially when decoding), and the table is likely to fit in a single cache line. The relatively poor performance with only a single table entry (k=0) is to be expected. For example, a reference to the IRI id for xsd:string from a Literal used in an annotation assertion is likely to generate a large negative delta, followed by a large positive delta; the annotation property is also likely to generate a similar pair of deltas. It is possible that separate history tables for properties, classes, and datatypes might be useful, as might different tables for the LHS and RHS of subclass axioms, etc. However, all this is probably secondary to moving strings & literals into their own segment.
Simon, do you have any idea of how this affects parse time?
Matthew - parse time is basically unaffected (it may be slightly reduced, but there is sufficient noise in the I/O code to make it hard to measure). The delta coding is especially cheap when decoding, as there's no need to search the history table (which is quite likely to be in a single cache line). Output time is increased due to the cost of sorting; this is not hideously expensive. For the possible big win for ontology load times (vs. parse times), see my next issue.
Moving literals to their own block, with indexes generated by walking the axioms in the same order as the final axiom output, using k = 4, gives a small win.
It's possible that sorting instead of using axiom order will give sufficiently better results for the string table compression to make up for the more random deltas; separating the textual data from the binary data (i.e. the lengths and datatypes) may also be beneficial.
Changing the way that deltas are stored (for literals only, so far) gives another win.
As above, but with the LiteralLookupTable serialized in one of two ways, with fields converted from AoS to SoA, and the strings written either length-prefixed or null-terminated.
Null termination doesn't really add much to the cost of decoding, since there's a mandatory array allocation and copy when the String object is finally built. Note that the uncompressed size increases in both cases, since I haven't done any special casing for the literals (which are, to a first approximation, all xsd:string). Since the datatype array precedes the language tag array, language tags can only be emitted for langString / plainLiteral. The long run of empty strings is eaten by the compression, but there's a little bit of parse time efficiency to be gained by not writing them in the first place. The datatype IRIs can be used to enable other datatype-specific handling. (The long run of zero deltas for all the IRIs also compresses well.) I haven't changed the delta base encoding for IRIs yet, but it wouldn't really save much in this case. Variable length ints could save ~0.5MB for these IRIs, but there may be issues.
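To make the SoA layout concrete, here's a minimal sketch of the null-terminated variant (the record type and names are hypothetical, not the actual Binary OWL types):

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical per-literal record.
final class Lit {
    int datatypeIndex;   // index into the IRI table
    String langTag;      // "" except for langString / plainLiteral
    String lexicalForm;
}

final class LiteralTableWriter {
    // SoA layout: each field is written as its own run, so the compressor sees
    // long homogeneous runs (e.g. mostly-identical datatype indexes and the
    // long run of empty language tags) instead of interleaved records.
    static void write(DataOutputStream out, List<Lit> literals) throws IOException {
        out.writeInt(literals.size());
        for (Lit l : literals) out.writeInt(l.datatypeIndex);  // datatype column
        for (Lit l : literals) out.writeUTF(l.langTag);        // language-tag column
        for (Lit l : literals) {                               // text column, null-terminated
            out.write(l.lexicalForm.getBytes(StandardCharsets.UTF_8));
            out.write(0);
        }
    }
}
```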
Encoding langString only for plain literals: 2.5% saving for uncompressed, 0.035% saving for compressed...
If IRIs are assigned indexes in the order in which they will be encountered when processing (with Declaration axioms handled last, as otherwise the axiom sort imposes an IRI sort), the compressed size increases to 2,120.2 KiB. Omitting the declaration axioms entirely barely changes this (2,104.7 KiB). (Improperly) writing the IRI table in sorted order, but using the in-order numbering, changes the size to 2,092.6 KiB, indicating that the table compression is not the real issue. Switching to the IRI base encoding scheme used for literals does not help in-order numbering much (2,116.9 KiB). With sorted IRIs (and correct numbering), switching the encoding scheme gives sizes of 23.2 MiB uncompressed and 1,970.2 KiB compressed.
Using an SoA layout for the IRI table doesn't make any real difference, though there are some potential savings in the uncompressed size if IRIs are grouped by start/prefix. OBO-derived ontologies typically have so little entropy in their IRIs that most compressors can handle them easily (though I've wondered about whether some kind of reverse sprintf matching could make some difference - especially with deltas).
So basically we have a fairly easy path to saving about 16% off the uncompressed size, and 53% off the compressed size. Since xz decompression time is almost linear in the compressed size, this gives a saving in decompression time of ~47% (0.440s vs. 0.242s). Since decompression can be done in parallel with loading, and loading times are dominated by indexing times, this isn't really a huge performance win, though if network bandwidth is an issue, the savings may be enough to make a difference.

The rejiggered output format shows improvements for smaller block and dictionary sizes as well, and degrades more gracefully as xz block sizes are reduced: with 256K blocks, 2,554.2 KiB vs. 7,296.3 KiB (0.35 : 1).

Note: xz allows for random access by block, but only if the index at the end of the file is readable. Of course, delta tables need to be reset or saved at each block, and blocks need to start at axiom boundaries. This only really matters for parallel parsing, or if loading on demand is wanted. That really requires persisted indexes, and may also require a different output order; e.g. one might want to group all annotation assertion axioms together in property-major order, so that rdfs:labels are grouped together. This might work particularly well for WebProtégé.
A few more notes: sorting SubClassOf axioms by superclass then subclass, and sorting annotation assertion axioms by literal id, then property, then subject, saves about 3% off the compressed size. Writing coded IRI deltas as variable length integers, with the delta base in the lower bits, and no explicit length word, reduces the uncompressed size by about 5%, but increases the compressed size slightly. This is probably due to alignment issues.
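For reference, the superclass-then-subclass sort can be expressed against the OWLAPI like this (a sketch; the class name is illustrative, and OWLObject's natural ordering is reused for each component):

```java
import java.util.Comparator;
import java.util.List;
import org.semanticweb.owlapi.model.OWLSubClassOfAxiom;

final class AxiomSorts {
    // Superclass-major ordering: axioms sharing a superclass become adjacent,
    // which is what produces the small runs of repeated IRI references.
    static void sortBySuperThenSub(List<OWLSubClassOfAxiom> axioms) {
        axioms.sort(
                Comparator.comparing((OWLSubClassOfAxiom ax) -> ax.getSuperClass())
                          .thenComparing(ax -> ax.getSubClass()));
    }
}
```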
Some more notes: There are about 442.4 KiB worth of IRI fragments/remainders. The alternate ordering is:
It seems possible that generating IRI and literal tables, and then writing axioms, using an ordering based on a walk where unseen entities used in the RHS of a subclass / equivalence axiom are processed before those used in the LHS, may give good results for ontologies that have a lot of precoordinated strings, where the phrases that make up parts of the literals associated with a class are associated with entities that make up part of the superclass / defining class expression. These strings may not be close lexicographically. OBO ontologies seem to use this pattern a lot. How common this style of precoordination is in other large ontologies is the interesting question.
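A sketch of what that numbering walk might look like for SubClassOf axioms (names are hypothetical; a real implementation would walk the class expressions structurally rather than via getSignature()):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.semanticweb.owlapi.model.OWLEntity;
import org.semanticweb.owlapi.model.OWLSubClassOfAxiom;

final class QuasiHierarchicalNumbering {
    private final Map<OWLEntity, Integer> ids = new LinkedHashMap<>();

    // Number unseen entities in the RHS (superclass) before those in the
    // LHS (subclass), so a class lands near the entities that define it.
    void visit(OWLSubClassOfAxiom ax) {
        ax.getSuperClass().getSignature().forEach(this::assign);
        ax.getSubClass().getSignature().forEach(this::assign);
    }

    private void assign(OWLEntity e) {
        ids.computeIfAbsent(e, x -> ids.size());
    }
}
```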
Using the quasi-hierarchical ordering has virtually no effect on the compressed size of the literal table, but the improved correlation of entity references makes a major difference in overall compression size. The overall win in compressed size seems to be about 8-10%. The results may be different for different orders of walking the tree. Also, I am using two delta history tables: one for IRIs, and one for literals. This loses some context compared to using separate tables for different uses (e.g. a table for the primary entity, a table for annotation properties, a table for literals in OWLAnnotations, etc.). This is slightly mitigated by the replacement policy in the delta history tables: if the delta for the closest match is != 0, but falls between configurable lower and upper bounds, the lookup is assumed to be part of a reference stream, and the closest entry is replaced. Allowing negative deltas to use replace-closest reduces compression, probably due to interference with filling the cache with frequently used entities. Separate delta history tables for properties may help address this, since negative deltas may signal a stream being reset (and the delta + stream id may be more predictable).
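A sketch of the replace-closest policy, as a variant of the earlier table (the bounds and names are illustrative):

```java
// Same table as the earlier sketch, plus a configurable [lowerBound, upperBound]
// window for detecting sequential reference streams.
final class DeltaHistoryTableWithReplacement {
    private final int k;
    private final long[] entries;
    private final long lowerBound, upperBound;

    DeltaHistoryTableWithReplacement(int k, long lowerBound, long upperBound) {
        this.k = k;
        this.entries = new long[1 << k];
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    long encode(long id) {
        int best = 0;
        for (int i = 1; i < entries.length; i++) {
            if (Math.abs(id - entries[i]) < Math.abs(id - entries[best])) best = i;
        }
        long delta = id - entries[best];
        if (delta == 0) {
            // Exact hit: move to front, as before.
            System.arraycopy(entries, 0, entries, 1, best);
            entries[0] = id;
        } else if (delta >= lowerBound && delta <= upperBound) {
            // Near miss inside the window: assume a sequential reference stream
            // and replace the closest entry in place, so frequently used ids
            // elsewhere in the table are not evicted.
            entries[best] = id;
        } else {
            // Otherwise shift right and insert at the front, as before.
            System.arraycopy(entries, 0, entries, 1, entries.length - 1);
            entries[0] = id;
        }
        return (delta << k) | best;
    }
}
```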
But I think I need to stash my changes, and start applying them cleanly, with readers and writers matched, and in a way that is easier to control with automatic benchmarking. I also need to create a separate version with just the default axiom sorting, which has no backwards compatibility issues, and gives a worthwhile win on its own (4.4M -> 2.8M).
Files encoded using Binary OWL show really poor compression ratios with gzip and xz.
For example, for the Gene Ontology,
- uncompressed FSS uses 58M
- uncompressed binowl uses 28M

But:

- gzipped FSS uses 5.3M; xz uses 3.1M
- gzipped binowl uses 8.8M; xz uses 4.4M
Since the xz-compressed FSS shrinks to 2.7M if the file is lexically sorted, it seems possible that changing the order in which axioms are written by Binary OWL may have a salutary effect on compression ratios.
This seems to be the case:
Better results may be possible if the previous approach of literal interning is revived; this would separate strings from the axiom structure, allowing for both to be optimized.
There may be better orderings than the standard owlapi sorts; e.g. axioms associated with sibling classes may have similar patterns.
The encoding format may benefit from variable length integer encodings, and entity/literal references from delta-encoding. These changes could improve both uncompressed size and performance.
Instead of just using the most recent IRI or literal id, delta coding could be made relative to a small number of the most recent ids (e.g. the four most recent ids). This may create longer matches.
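For the variable length integer idea, a standard LEB128-style varint with a ZigZag mapping for signed deltas would be the obvious starting point (a sketch, not the current Binary OWL format):

```java
import java.io.DataOutputStream;
import java.io.IOException;

final class VarInt {
    // ZigZag-map signed values so small negative deltas also encode in few
    // bytes: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigzag(long v) {
        return (v << 1) ^ (v >> 63);
    }

    // LEB128-style: 7 payload bits per byte, high bit set on all but the last.
    static void write(DataOutputStream out, long v) throws IOException {
        while ((v & ~0x7FL) != 0) {
            out.writeByte((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.writeByte((int) v);
    }
}
```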