-
Notifications
You must be signed in to change notification settings - Fork 175
GH-5447 LMDB Store query performance improvements #5448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
AGENTS.md
Outdated
* `-Dtest=ClassName` | ||
* `-Dtest=ClassName#method` | ||
* `-Dit.test=ITClass#method` | ||
* `-Dit.test=ITClassName[#method]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert.
} | ||
|
||
// @Override | ||
// public boolean equals(Object other){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: During GRPUP BY I noticed that ArrayBindingSet could have use for it's own equals method, which can be faster.
We should also look into if hashcode is cached in the super class.
Txn txn, Resource subj, IRI pred, Value obj, boolean explicit, Resource... contexts) throws IOException { | ||
if (!explicit && !mayHaveInferred) { | ||
// there are no inferred statements and the iterator should only return inferred statements | ||
return EMPTY_ITERATION; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a very big difference when not using inferencing. We can very quickly return an empty iterator without any interaction with the underlying LMDB implementation. I have had a similar optimisation for the Memory Store for a while now and it's been working very well.
|
||
import java.nio.ByteBuffer; | ||
|
||
final class IndexKeyWriters { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class provides hard coded variants of key generation and shouldMatch logic. It's faster to pick a specific method once, instead of having a single method that supports all key combinations using a loop.
public GroupMatcher getGroupMatcher() { | ||
if (groupMatcher != null) | ||
return groupMatcher; | ||
if (matchValues) { | ||
this.groupMatcher = index.createMatcher(subj, pred, obj, context); | ||
} | ||
return groupMatcher; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lazy group matcher. Not sure this makes much of a difference.
class LmdbStatementIterator extends LookAheadIteration<Statement> { | ||
class LmdbStatementIterator extends AbstractCloseableIteration<Statement> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From experience we should try to avoid having these very low level iterators use inheritance. Object orientation, inheritance and polymorphism can be unexpectedly expensive. But not sure if it makes that much of a difference here as it did for the MemoryStore when I changed it there.
static MatcherFactory matcherFactory(String fieldSeq) { | ||
switch (fieldSeq) { | ||
case "spoc": | ||
return IndexKeyWriters::spocShouldMatch; | ||
case "spco": | ||
return IndexKeyWriters::spcoShouldMatch; | ||
case "sopc": | ||
return IndexKeyWriters::sopcShouldMatch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might benefit from fieldSeq being an enum instead of a string. Not sure if it will make that much of a performance difference though, but might be cleaner code.
// Fast path for Long.MAX_VALUE (0xFF header + 8 data bytes) | ||
if (value == Long.MAX_VALUE) { | ||
final ByteOrder prev = bb.order(); | ||
if (prev != ByteOrder.BIG_ENDIAN) { | ||
bb.order(ByteOrder.BIG_ENDIAN); | ||
} | ||
try { | ||
bb.put((byte) 0xFF); // header for 8 payload bytes | ||
bb.putLong(Long.MAX_VALUE); // 7F FF FF FF FF FF FF FF | ||
} finally { | ||
if (prev != ByteOrder.BIG_ENDIAN) { | ||
bb.order(prev); | ||
} | ||
} | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've attempted to optimize both the writeUnsigned and readUnsigned to reduce the usage of loops and also to read/write multiple bytes at a time. I did run into issues with endianness that I'm not sure if I'm managing correctly. The original code didn't really care about endianness, but if I don't do anything to handle it then some tests start failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hmottestad I've added a working version here:
kenwenzel@3f0a6e6
We should decide if we use yours or mine :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hmottestad I've just read this:
https://github.com/LWJGL/lwjgl3-wiki/wiki/1.3.-Memory-FAQ#what-javaniobyteorder-should-be-used
I'm not sure if this is an requirement or just a best practice.
int bytes = a0Unsigned - 250 + 3; | ||
return 1 + bytes; | ||
} | ||
return FIRST_TO_LENGTH[a0 & 0xFF]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a pretty decent performance difference actually. Essentially a pre-calculated lookup table instead of branching.
I think I might need to clean up some of the other code to see if I can use this in other places too.
values[indexMap[0]] = readUnsigned(bb); | ||
values[indexMap[1]] = readUnsigned(bb); | ||
values[indexMap[2]] = readUnsigned(bb); | ||
values[indexMap[3]] = readUnsigned(bb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quads are always 4, so this helps the JVM inline and optimize the byte code.
private static long readSignificantBitsDirect(ByteBuffer bb, int n) { | ||
return SignificantBytesBE.readDirect(bb, n); | ||
// return VarUInt.readUnsigned(bb); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Find out where this is called from and if we can be certain about the number of bytes somewhere upstream. Then we might be able to use specific lambas like we use in the Bytes class.
final int[] lengths; | ||
final Bytes.RegionComparator[] cmps; | ||
|
||
public GroupMatcher(ByteBuffer value, boolean[] shouldMatch) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Since this code is only ever called with a ByteBuffer backed by an array, then we can probably change the signature to take the array of bytes instead.
int len = firstToLength(valueArray[baseOffset + pos]); | ||
lengths[i] = len; | ||
|
||
cmps[i] = Bytes.capturedComparator(valueArray, baseOffset + pos, len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we populate an array with lambda comparators where each lambda is hard coded for a specific length. This is possible because the left side in the compare (valueArray) is known already, and we also know the effective position and length,
{ | ||
int len = lengths[0]; | ||
int otherLen = firstToLength(other.get(otherPos)); | ||
|
||
if (shouldMatch[0]) { | ||
if (len != otherLen) { | ||
return false; | ||
} | ||
if (cmps[0].compare(other, otherPos) != 0) { | ||
return false; | ||
} | ||
if (!shouldMatch[1] && !shouldMatch[2] && !shouldMatch[3]) { | ||
return true; | ||
} | ||
} | ||
|
||
otherPos += otherLen; | ||
} | ||
{ | ||
int len = lengths[1]; | ||
int otherLen = firstToLength(other.get(otherPos)); | ||
|
||
if (shouldMatch[1]) { | ||
if (len != otherLen) { | ||
return false; | ||
} | ||
if (cmps[1].compare(other, otherPos) != 0) { | ||
return false; | ||
} | ||
if (!shouldMatch[2] && !shouldMatch[3]) { | ||
return true; | ||
} | ||
} | ||
|
||
otherPos += otherLen; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You wouldn't believe it, but this is considerably faster than a loop. Even when we hard code the loop for 4 iterations. Unrolling the loop also allows us to terminate early if no subsequent parts need to match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hmottestad I've also unrolled the writeSignificantBits method in my other PR.
public static long readDirect(ByteBuffer bb, int n) { | ||
if (n < 3 || n > 8) { | ||
throw new IllegalArgumentException("n must be in [3,8]"); | ||
} | ||
|
||
boolean littleEndian = bb.order() == ByteOrder.LITTLE_ENDIAN; | ||
|
||
switch (n) { | ||
case 8: | ||
return getLongBE(bb, littleEndian); | ||
case 7: | ||
return ((bb.get() & 0xFFL) << 48) | ||
| ((u32(getIntBE(bb, littleEndian)) << 16)) | ||
| (u16(getShortBE(bb, littleEndian))); | ||
case 6: | ||
return (((long) u16(getShortBE(bb, littleEndian)) << 32)) | ||
| (u32(getIntBE(bb, littleEndian))); | ||
case 5: | ||
return ((bb.get() & 0xFFL) << 32) | ||
| (u32(getIntBE(bb, littleEndian))); | ||
case 4: | ||
return u32(getIntBE(bb, littleEndian)); | ||
case 3: | ||
return (((long) u16(getShortBE(bb, littleEndian)) << 8)) | ||
| (bb.get() & 0xFFL); | ||
// TODO: add 1 and 2 byte cases here!!! | ||
default: | ||
throw new AssertionError("unreachable"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is faster than the original approach, which used a loop, but I really want to figure out how I can provide a lambda directly to the calling code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hmottestad This is exactly what I'm using in my other PR for writeSignificantBits. I think the main trick is to reduce the number of calls to the ByteBuffer.
Main branch benchmark results
Current branch (before group matcher rewrite)
Current (after rewrite of GroupMatcher)
|
de0fa0b
to
c5ab948
Compare
GitHub issue resolved: #5447
Briefly describe the changes proposed in this PR:
Currently a collection of various performance improvements. Still WIP.
PR Author Checklist (see the contributor guidelines for more details):
mvn process-resources
to format from the command line)