-
Notifications
You must be signed in to change notification settings - Fork 175
GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 #5345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
e9dab06
to
ff4a793
Compare
This is quite good. We could even get better if we don't materialize anything. |
Let me further explore the materialize-nothing solution to see how fast it can be. |
Maybe we can parameterize the cardinality() function with |
Ok.
|
My cardinalityExact implementation. protected long cardinalityExact(final long subj, final long pred, final long obj, final long context)
throws IOException {
// get size of entire db
if (subj == LmdbValue.UNKNOWN_ID && pred == LmdbValue.UNKNOWN_ID && obj == LmdbValue.UNKNOWN_ID
&& context == LmdbValue.UNKNOWN_ID) {
return txnManager.doWith((stack, txn) -> {
long cardinality = 0;
final TripleIndex index = getBestIndex(subj, pred, obj, context);
for (boolean explicit : new boolean[] { true, false }) {
int dbi = index.getDB(explicit);
MDBStat stat = MDBStat.mallocStack(stack);
mdb_stat(txn, dbi, stat);
cardinality += stat.ms_entries();
}
return cardinality;
});
}
try (TxnManager.Txn txn = txnManager.createReadTxn()) {
final RecordIterator explicitIter = getTriples(txn, subj, pred, obj, context, true);
final RecordIterator implicitIter = getTriples(txn, subj, pred, obj, context, false);
long size = 0;
try {
for (long[] quad = explicitIter.next(); quad != null; quad = explicitIter.next()) {
size++;
}
for (long[] quad = implicitIter.next(); quad != null; quad = implicitIter.next()) {
size++;
}
} finally {
explicitIter.close();
implicitIter.close();
}
return size;
}
} |
ff4a793
to
6b638e0
Compare
2be6b32
to
c8c107a
Compare
For changeset: If there are no approved or deprecated changes, it uses a fast direct lookup. Otherwise, it falls back to iterating over matching statements, which remains performant since it avoids LMDB’s lazy evaluation. @Override
public long size(final Resource subj, final IRI pred, final Value obj, final Resource... contexts) {
// Fast path: no approved or deprecated
if (!changes.hasApproved() && !changes.hasDeprecated()) {
return derivedFrom.size(subj, pred, obj, contexts);
}
// Fallback path: iterate over all matching statements; remains fast due to bypassing LMDB's lazy evaluation
return getStatements(subj, pred, obj, contexts).stream().count();
} |
c8c107a
to
8c4b328
Compare
Hi @odysa, the context size is also already tracked:
|
protected long calculateSize(final boolean includeInferred, final Resource... contexts) throws SailException { | ||
try (SailSource branch = branch(IncludeInferred.fromBoolean(includeInferred))) { | ||
return branch.dataset(getIsolationLevel()).size(null, null, null, contexts); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for why we need a new method here?
From a quick glance I saw that there is protected long sizeInternal(Resource... contexts) throws SailException {
which I wonder if would be a better place for this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This optimization is only for LMDB, so I made 2 different pathes for size.
LMDB gets size directly from banch.dataset
Other sails call getStatementsInternal
to count.
I don't want to accidently break other sails, so leave sizeInternal
unchanged.
8c4b328
to
87a524d
Compare
Done getting size from tracked context size |
core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailDatasetImpl.java
Outdated
Show resolved
Hide resolved
Besides my comment it looks good to me. |
I need to take a closer look. It would be best to not have to make any changes to the common sail code. We have the ElasticsearchStore that could also leverage a faster way to count by offloading it to Elasticsearch. Another aspect is to make sure that this new way of counting follows all the transactional isolation levels that are supported by the LMDB Store. Finally I would like to see if we can't add a benchmark to our current suite so that this new optimisation won't get forgotten when we make changes in the future. Im a bit pressed for time right now though. I'll do my best to prioritise this. |
Lemme add more test cases to it. |
87a524d
to
44a16d2
Compare
Added isolationLevel test. |
44a16d2
to
e4c1475
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the good work. I only have some minor comments.
} | ||
|
||
// Fast path: if only context is specified, return the size of the given context | ||
if (subj == LmdbValue.UNKNOWN_ID && pred == LmdbValue.UNKNOWN_ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be combined with previous if where context variable is checked inside. This would also avoid the slower call to stackGet().
import java.util.Iterator; | ||
import java.util.List; | ||
import java.util.Map; | ||
import java.util.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hmottestad Do we have any rules when using wildcard imports are allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually don't use wildcard, but it's unfortunately not picked up by the formatter.
} | ||
|
||
// Fallback path: iterate over all matching statements | ||
return getStatements(subj, pred, obj, contexts).stream().count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@odysa Does this call materialize/resolve all values in LmdbStore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. It calls getStatements
here
rdf4j/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/LmdbSailStore.java
Lines 927 to 941 in d78001b
public CloseableIteration<? extends Statement> getStatements(Resource subj, IRI pred, Value obj, | |
Resource... contexts) throws SailException { | |
try { | |
return createStatementIterator(txn, subj, pred, obj, explicit, contexts); | |
} catch (IOException e) { | |
try { | |
logger.warn("Failed to get statements, retrying", e); | |
// try once more before giving up | |
Thread.yield(); | |
return createStatementIterator(txn, subj, pred, obj, explicit, contexts); | |
} catch (IOException e2) { | |
throw new SailException("Unable to get statements", e); | |
} | |
} | |
} |
And eventually calls
getNextElement
rdf4j/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/LmdbStatementIterator.java
Lines 53 to 78 in d78001b
public Statement getNextElement() throws SailException { | |
try { | |
long[] quad = recordIt.next(); | |
if (quad == null) { | |
return null; | |
} | |
long subjID = quad[TripleStore.SUBJ_IDX]; | |
Resource subj = (Resource) valueStore.getLazyValue(subjID); | |
long predID = quad[TripleStore.PRED_IDX]; | |
IRI pred = (IRI) valueStore.getLazyValue(predID); | |
long objID = quad[TripleStore.OBJ_IDX]; | |
Value obj = valueStore.getLazyValue(objID); | |
Resource context = null; | |
long contextID = quad[TripleStore.CONTEXT_IDX]; | |
if (contextID != 0) { | |
context = (Resource) valueStore.getLazyValue(contextID); | |
} | |
return valueStore.createStatement(subj, pred, obj, context); | |
} catch (IOException e) { | |
throw causeIOException(e); | |
} |
e4c1475
to
093a8f8
Compare
This is still on my list of things to do. I've just been a bit too busy with other things, RDF4J is a side project that I work on next to my regular job. |
Thank you so much. @hmottestad I really appreciate all the work that’s gone into RDF4J, and it's impressive that you're maintaining such a big project in your free time. If there’s anything I can do to help with maintaining the project, such as reviewing PRs or anything else, I’d be happy to contribute. |
LGTM |
@hmottestad Can this be merged? |
} | ||
|
||
// Fallback path: iterate over all matching statements | ||
return getStatements(subj, pred, obj, contexts).stream().count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return getStatements(subj, pred, obj, contexts).stream().count();
should probably we wrapped in a try-with-resource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in:
try(var stream = getStatements(subj, pred, obj, contexts).stream()){
return stream.size();
}
|
||
@Experimental | ||
default long size(final Resource subj, final IRI pred, final Value obj, final Resource... contexts) { | ||
return getStatements(subj, pred, obj, contexts).stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this also needs to be closed.
* @throws SailException if an error occurs while accessing the Sail store | ||
*/ | ||
@Experimental | ||
protected long getSizeFromSnapshot(final boolean includeInferred, final Resource... contexts) throws SailException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need includeInferred
? I can't see that the other size related methods support differentiating between inferred or explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kinda wish we could just use the existing sizeInternal(...) method, but I get that it's better to be safe and introduce a separate method for now and then later on we can consider changing up sizeInternal(...).
GitHub issue resolved: #
#5343
Briefly describe the changes proposed in this PR:
This PR introduces an optimization for the
size(...)
method in the LMDBstore implementation. Introduce acardinalityExact
to calcualte the exact size, leverageingmdb_stats
when possible.Key Changes
ChangeSet
Fast-path Optimization
approved
ordeprecated
changes in the current transaction or not in a transcation, the method directly delegates to thederivedFrom
store for fast cardinality estimation.Fallback to Iterator
getStatements(...).stream().count()
. This bypasses LMDB’s lazy evaluation to ensure consistency, even with uncommitted changes.Low-level Size Calculation (
cardinalityExact
)mdb_stat
to return the total size efficiently.Perf
I created a LMDBSail with 10M triples.

Original
size()
: 21802msOptimized
size()
:685.6 μs to get the full size by leveraging
mdb_stats
.274.2 ms to get the size of a context of 5000000 triples.
PR Author Checklist (see the contributor guidelines for more details):
mvn process-resources
to format from the command line)