Skip to content

Conversation

hmottestad
Copy link
Contributor

GitHub issue resolved: #5447

Briefly describe the changes proposed in this PR:

Currently a collection of various performance improvements. Still WIP.


PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

AGENTS.md Outdated
* `-Dtest=ClassName`
* `-Dtest=ClassName#method`
* `-Dit.test=ITClass#method`
* `-Dit.test=ITClassName[#method]`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert.

}

// @Override
// public boolean equals(Object other){
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: During GRPUP BY I noticed that ArrayBindingSet could have use for it's own equals method, which can be faster.

We should also look into if hashcode is cached in the super class.

Txn txn, Resource subj, IRI pred, Value obj, boolean explicit, Resource... contexts) throws IOException {
if (!explicit && !mayHaveInferred) {
// there are no inferred statements and the iterator should only return inferred statements
return EMPTY_ITERATION;
Copy link
Contributor Author

@hmottestad hmottestad Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a very big difference when not using inferencing. We can very quickly return an empty iterator without any interaction with the underlying LMDB implementation. I have had a similar optimisation for the Memory Store for a while now and it's been working very well.


import java.nio.ByteBuffer;

final class IndexKeyWriters {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class provides hard coded variants of key generation and shouldMatch logic. It's faster to pick a specific method once, instead of having a single method that supports all key combinations using a loop.

Comment on lines 139 to 148
public GroupMatcher getGroupMatcher() {
if (groupMatcher != null)
return groupMatcher;
if (matchValues) {
this.groupMatcher = index.createMatcher(subj, pred, obj, context);
}
return groupMatcher;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lazy group matcher. Not sure this makes much of a difference.

Comment on lines -26 to +27
class LmdbStatementIterator extends LookAheadIteration<Statement> {
class LmdbStatementIterator extends AbstractCloseableIteration<Statement> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From experience we should try to avoid having these very low level iterators use inheritance. Object orientation, inheritance and polymorphism can be unexpectedly expensive. But not sure if it makes that much of a difference here as it did for the MemoryStore when I changed it there.

Comment on lines +85 to +92
static MatcherFactory matcherFactory(String fieldSeq) {
switch (fieldSeq) {
case "spoc":
return IndexKeyWriters::spocShouldMatch;
case "spco":
return IndexKeyWriters::spcoShouldMatch;
case "sopc":
return IndexKeyWriters::sopcShouldMatch;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might benefit from fieldSeq being an enum instead of a string. Not sure if it will make that much of a performance difference though, but might be cleaner code.

Comment on lines 104 to 120
// Fast path for Long.MAX_VALUE (0xFF header + 8 data bytes)
if (value == Long.MAX_VALUE) {
final ByteOrder prev = bb.order();
if (prev != ByteOrder.BIG_ENDIAN) {
bb.order(ByteOrder.BIG_ENDIAN);
}
try {
bb.put((byte) 0xFF); // header for 8 payload bytes
bb.putLong(Long.MAX_VALUE); // 7F FF FF FF FF FF FF FF
} finally {
if (prev != ByteOrder.BIG_ENDIAN) {
bb.order(prev);
}
}
return;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've attempted to optimize both the writeUnsigned and readUnsigned to reduce the usage of loops and also to read/write multiple bytes at a time. I did run into issues with endianness that I'm not sure if I'm managing correctly. The original code didn't really care about endianness, but if I don't do anything to handle it then some tests start failing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmottestad I've added a working version here:
kenwenzel@3f0a6e6

We should decide if we use yours or mine :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmottestad I've just read this:
https://github.com/LWJGL/lwjgl3-wiki/wiki/1.3.-Memory-FAQ#what-javaniobyteorder-should-be-used

I'm not sure if this is an requirement or just a best practice.

int bytes = a0Unsigned - 250 + 3;
return 1 + bytes;
}
return FIRST_TO_LENGTH[a0 & 0xFF];
Copy link
Contributor Author

@hmottestad hmottestad Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a pretty decent performance difference actually. Essentially a pre-calculated lookup table instead of branching.

I think I might need to clean up some of the other code to see if I can use this in other places too.

Comment on lines +473 to +474
values[indexMap[0]] = readUnsigned(bb);
values[indexMap[1]] = readUnsigned(bb);
values[indexMap[2]] = readUnsigned(bb);
values[indexMap[3]] = readUnsigned(bb);
Copy link
Contributor Author

@hmottestad hmottestad Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quads are always 4, so this helps the JVM inline and optimize the byte code.

Comment on lines 491 to 493
private static long readSignificantBitsDirect(ByteBuffer bb, int n) {
return SignificantBytesBE.readDirect(bb, n);
// return VarUInt.readUnsigned(bb);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Find out where this is called from and if we can be certain about the number of bytes somewhere upstream. Then we might be able to use specific lambas like we use in the Bytes class.

final int[] lengths;
final Bytes.RegionComparator[] cmps;

public GroupMatcher(ByteBuffer value, boolean[] shouldMatch) {
Copy link
Contributor Author

@hmottestad hmottestad Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Since this code is only ever called with a ByteBuffer backed by an array, then we can probably change the signature to take the array of bytes instead.

int len = firstToLength(valueArray[baseOffset + pos]);
lengths[i] = len;

cmps[i] = Bytes.capturedComparator(valueArray, baseOffset + pos, len);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we populate an array with lambda comparators where each lambda is hard coded for a specific length. This is possible because the left side in the compare (valueArray) is known already, and we also know the effective position and length,

Comment on lines 560 to 594
{
int len = lengths[0];
int otherLen = firstToLength(other.get(otherPos));

if (shouldMatch[0]) {
if (len != otherLen) {
return false;
}
if (cmps[0].compare(other, otherPos) != 0) {
return false;
}
if (!shouldMatch[1] && !shouldMatch[2] && !shouldMatch[3]) {
return true;
}
}

otherPos += otherLen;
}
{
int len = lengths[1];
int otherLen = firstToLength(other.get(otherPos));

if (shouldMatch[1]) {
if (len != otherLen) {
return false;
}
if (cmps[1].compare(other, otherPos) != 0) {
return false;
}
if (!shouldMatch[2] && !shouldMatch[3]) {
return true;
}
}

otherPos += otherLen;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You wouldn't believe it, but this is considerably faster than a loop. Even when we hard code the loop for 4 iterations. Unrolling the loop also allows us to terminate early if no subsequent parts need to match.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmottestad I've also unrolled the writeSignificantBits method in my other PR.

Comment on lines +64 to +93
public static long readDirect(ByteBuffer bb, int n) {
if (n < 3 || n > 8) {
throw new IllegalArgumentException("n must be in [3,8]");
}

boolean littleEndian = bb.order() == ByteOrder.LITTLE_ENDIAN;

switch (n) {
case 8:
return getLongBE(bb, littleEndian);
case 7:
return ((bb.get() & 0xFFL) << 48)
| ((u32(getIntBE(bb, littleEndian)) << 16))
| (u16(getShortBE(bb, littleEndian)));
case 6:
return (((long) u16(getShortBE(bb, littleEndian)) << 32))
| (u32(getIntBE(bb, littleEndian)));
case 5:
return ((bb.get() & 0xFFL) << 32)
| (u32(getIntBE(bb, littleEndian)));
case 4:
return u32(getIntBE(bb, littleEndian));
case 3:
return (((long) u16(getShortBE(bb, littleEndian)) << 8))
| (bb.get() & 0xFFL);
// TODO: add 1 and 2 byte cases here!!!
default:
throw new AssertionError("unreachable");
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is faster than the original approach, which used a loop, but I really want to figure out how I can provide a lambda directly to the calling code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmottestad This is exactly what I'm using in my other PR for writeSignificantBits. I think the main trick is to reduce the number of calls to the ByteBuffer.

@hmottestad
Copy link
Contributor Author

hmottestad commented Sep 23, 2025

Main branch benchmark results

Benchmark                                                     Mode  Cnt     Score     Error  Units
QueryBenchmark.complexQuery                                   avgt    5     6.496 ±   0.028  ms/op
QueryBenchmark.different_datasets_with_similar_distributions  avgt    5     3.845 ±   0.025  ms/op
QueryBenchmark.groupByQuery                                   avgt    5     1.365 ±   0.006  ms/op
QueryBenchmark.long_chain                                     avgt    5  1107.433 ±  19.781  ms/op
QueryBenchmark.lots_of_optional                               avgt    5   400.529 ±   3.078  ms/op
QueryBenchmark.minus                                          avgt    5   941.617 ± 140.149  ms/op
QueryBenchmark.nested_optionals                               avgt    5   246.380 ±   2.078  ms/op
QueryBenchmark.pathExpressionQuery1                           avgt    5    40.707 ±   0.394  ms/op
QueryBenchmark.pathExpressionQuery2                           avgt    5    10.374 ±   0.026  ms/op
QueryBenchmark.query_distinct_predicates                      avgt    5    59.451 ±   0.410  ms/op
QueryBenchmark.simple_filter_not                              avgt    5    11.192 ±   0.056  ms/op
QueryBenchmarkFoaf.groupByCount                               avgt    5  1225.178 ±  15.143  ms/op
QueryBenchmarkFoaf.groupByCountSorted                         avgt    5  1067.095 ±  10.463  ms/op
QueryBenchmarkFoaf.personsAndFriends                          avgt    5   426.081 ±   5.668  ms/op


Benchmark                           Mode  Cnt   Score   Error  Units
RecordIteratorBenchmark.iterateAll  avgt    5  29.401 ± 0.478  ms/op - 1 thread
RecordIteratorBenchmark.iterateAll  avgt    5  44.083 ± 1.716  ms/op - 8 threads


Current branch (before group matcher rewrite)

Benchmark                                                     Mode  Cnt     Score    Error  Units
QueryBenchmark.complexQuery                                   avgt    5     4.012 ±  0.064  ms/op
QueryBenchmark.different_datasets_with_similar_distributions  avgt    5     2.526 ±  0.104  ms/op
QueryBenchmark.groupByQuery                                   avgt    5     0.998 ±  0.010  ms/op
QueryBenchmark.long_chain                                     avgt    5   694.577 ±  8.078  ms/op
QueryBenchmark.lots_of_optional                               avgt    5   279.962 ±  1.423  ms/op
QueryBenchmark.minus                                          avgt    5   935.986 ± 31.434  ms/op
QueryBenchmark.nested_optionals                               avgt    5   162.784 ±  0.989  ms/op
QueryBenchmark.pathExpressionQuery1                           avgt    5    32.298 ±  0.186  ms/op
QueryBenchmark.pathExpressionQuery2                           avgt    5     9.922 ±  0.901  ms/op
QueryBenchmark.query_distinct_predicates                      avgt    5    57.322 ±  0.918  ms/op
QueryBenchmark.simple_filter_not                              avgt    5     6.618 ±  0.132  ms/op
QueryBenchmarkFoaf.groupByCount                               avgt    5  1075.289 ± 16.802  ms/op
QueryBenchmarkFoaf.groupByCountSorted                         avgt    5   929.522 ± 18.188  ms/op
QueryBenchmarkFoaf.personsAndFriends                          avgt    5   407.621 ±  2.500  ms/op

Benchmark                           Mode  Cnt   Score   Error  Units
RecordIteratorBenchmark.iterateAll  avgt    5  22.677 ± 0.343  ms/op - 1 thread
RecordIteratorBenchmark.iterateAll  avgt    5  41.153 ± 2.515  ms/op - 8 threads



Current (after rewrite of GroupMatcher)

Benchmark                                                     Mode  Cnt     Score    Error  Units
QueryBenchmark.complexQuery                                   avgt    5     4.051 ±  0.084  ms/op
QueryBenchmark.different_datasets_with_similar_distributions  avgt    5     2.539 ±  0.019  ms/op
QueryBenchmark.groupByQuery                                   avgt    5     0.989 ±  0.007  ms/op
QueryBenchmark.long_chain                                     avgt    5   771.359 ± 17.200  ms/op
QueryBenchmark.lots_of_optional                               avgt    5   300.407 ± 20.564  ms/op
QueryBenchmark.minus                                          avgt    5   943.843 ± 45.570  ms/op
QueryBenchmark.nested_optionals                               avgt    5   170.965 ±  0.593  ms/op
QueryBenchmark.pathExpressionQuery1                           avgt    5    34.113 ±  0.334  ms/op
QueryBenchmark.pathExpressionQuery2                           avgt    5     9.709 ±  0.111  ms/op
QueryBenchmark.query_distinct_predicates                      avgt    5    57.224 ±  0.422  ms/op
QueryBenchmark.simple_filter_not                              avgt    5     6.613 ±  0.055  ms/op
QueryBenchmarkFoaf.groupByCount                               avgt    5  1084.447 ± 14.277  ms/op
QueryBenchmarkFoaf.groupByCountSorted                         avgt    5   929.013 ±  7.214  ms/op
QueryBenchmarkFoaf.personsAndFriends                          avgt    5   413.556 ± 43.487  ms/op

Benchmark                           Mode  Cnt   Score   Error  Units
RecordIteratorBenchmark.iterateAll  avgt    5  19.466 ± 0.303  ms/op - 1 thread
RecordIteratorBenchmark.iterateAll  avgt    5  41.212 ± 1.155  ms/op - 8 threads


@hmottestad hmottestad force-pushed the GH-5447-lmdb-query-performance branch from de0fa0b to c5ab948 Compare September 27, 2025 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of queries in the LMDB Store
2 participants