GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 #5345

odysa · 2025-05-31T16:12:11Z

GitHub issue resolved: #
#5343
Briefly describe the changes proposed in this PR:

This PR introduces an optimization for the size(...) method in the LMDBstore implementation. Introduce a cardinalityExact to calcualte the exact size, leverageingmdb_stats when possible.

Key Changes

ChangeSet
- Fast-path Optimization
  - When there are no approved or deprecated changes in the current transaction or not in a transcation, the method directly delegates to the derivedFrom store for fast cardinality estimation.
- Fallback to Iterator
  - When changes exist in the transaction, the method falls back to streaming through matching statements using getStatements(...).stream().count(). This bypasses LMDB’s lazy evaluation to ensure consistency, even with uncommitted changes.
Low-level Size Calculation (cardinalityExact)
- If the pattern is completely unspecified (i.e. all wildcards), the method uses LMDB's mdb_stat to return the total size efficiently.
- For specific patterns, it iterates over both explicit and implicit triples and counts the results.

Perf

I created a LMDBSail with 10M triples.
Original size(): 21802ms

Optimized size():
685.6 μs to get the full size by leveraging mdb_stats.
274.2 ms to get the size of a context of 5000000 triples.

Total Size: 10000000, Time taken: 685.6 μs
Size in context: 5000000, Time taken: 274.2 ms

PR Author Checklist (see the contributor guidelines for more details):

my pull request is self-contained
I've added tests for the changes I made
I've applied code formatting (you can use mvn process-resources to format from the command line)
I've squashed my commits where necessary
every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

kenwenzel · 2025-05-31T17:48:14Z

This is quite good. We could even get better if we don't materialize anything.
But for now we could could/should stay with your solution?!

odysa · 2025-05-31T18:32:10Z

Let me further explore the materialize-nothing solution to see how fast it can be.

kenwenzel · 2025-05-31T21:21:12Z

Let me further explore the materialize-nothing solution to see how fast it can be.

Maybe we can parameterize the cardinality() function with exact and reuse the existing code?

odysa · 2025-05-31T22:54:52Z

Ok.
Counting 10M triples without context filter only took 685.2 μs
Counting triples within a context requires scanning all entires. It took 274.2ms to scan 5000000 triples.

Total Size: 10000000, Time taken: 685.6 μs
Size in context: 5000000, Time taken: 274.2 ms

odysa · 2025-05-31T22:55:22Z

My cardinalityExact implementation.

protected long cardinalityExact(final long subj, final long pred, final long obj, final long context)
			throws IOException {

		// get size of entire db
		if (subj == LmdbValue.UNKNOWN_ID && pred == LmdbValue.UNKNOWN_ID && obj == LmdbValue.UNKNOWN_ID
				&& context == LmdbValue.UNKNOWN_ID) {
			return txnManager.doWith((stack, txn) -> {
				long cardinality = 0;
				final TripleIndex index = getBestIndex(subj, pred, obj, context);
				for (boolean explicit : new boolean[] { true, false }) {
					int dbi = index.getDB(explicit);
					MDBStat stat = MDBStat.mallocStack(stack);
					mdb_stat(txn, dbi, stat);
					cardinality += stat.ms_entries();
				}
				return cardinality;
			});
		}

		try (TxnManager.Txn txn = txnManager.createReadTxn()) {
			final RecordIterator explicitIter = getTriples(txn, subj, pred, obj, context, true);
			final RecordIterator implicitIter = getTriples(txn, subj, pred, obj, context, false);
			long size = 0;
			try {
				for (long[] quad = explicitIter.next(); quad != null; quad = explicitIter.next()) {
					size++;
				}
				for (long[] quad = implicitIter.next(); quad != null; quad = implicitIter.next()) {
					size++;
				}
			} finally {
				explicitIter.close();
				implicitIter.close();
			}
			return size;
		}
	}

odysa · 2025-06-01T02:18:28Z

For changeset:

If there are no approved or deprecated changes, it uses a fast direct lookup. Otherwise, it falls back to iterating over matching statements, which remains performant since it avoids LMDB’s lazy evaluation.

@Override
	public long size(final Resource subj, final IRI pred, final Value obj, final Resource... contexts) {
		// Fast path: no approved or deprecated
		if (!changes.hasApproved() && !changes.hasDeprecated()) {
			return derivedFrom.size(subj, pred, obj, contexts);
		}

		// Fallback path: iterate over all matching statements; remains fast due to bypassing LMDB's lazy evaluation
		return getStatements(subj, pred, obj, contexts).stream().count();
	}

kenwenzel · 2025-06-01T08:16:54Z

Hi @odysa,

the context size is also already tracked:

rdf4j/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/TripleStore.java

Line 913 in 157627b

    
           private void incrementContext(MemoryStack stack, long context) throws IOException {

hmottestad · 2025-06-01T12:13:23Z

core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailSourceConnection.java

+	protected long calculateSize(final boolean includeInferred, final Resource... contexts) throws SailException {
+		try (SailSource branch = branch(IncludeInferred.fromBoolean(includeInferred))) {
+			return branch.dataset(getIsolationLevel()).size(null, null, null, contexts);
+		}
+	}
+


Any reason for why we need a new method here?

From a quick glance I saw that there is protected long sizeInternal(Resource... contexts) throws SailException { which I wonder if would be a better place for this code?

This optimization is only for LMDB, so I made 2 different pathes for size.
LMDB gets size directly from banch.dataset
Other sails call getStatementsInternal to count.

I don't want to accidently break other sails, so leave sizeInternal unchanged.

odysa · 2025-06-02T03:01:56Z

Done getting size from tracked context size

core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailDatasetImpl.java

kenwenzel · 2025-06-02T14:31:38Z

Besides my comment it looks good to me.
@hmottestad What do you think about the changes in the SAIL implementation?

hmottestad · 2025-06-02T16:54:50Z

I need to take a closer look. It would be best to not have to make any changes to the common sail code. We have the ElasticsearchStore that could also leverage a faster way to count by offloading it to Elasticsearch.

Another aspect is to make sure that this new way of counting follows all the transactional isolation levels that are supported by the LMDB Store.

Finally I would like to see if we can't add a benchmark to our current suite so that this new optimisation won't get forgotten when we make changes in the future.

Im a bit pressed for time right now though. I'll do my best to prioritise this.

odysa · 2025-06-03T03:03:42Z

Another aspect is to make sure that this new way of counting follows all the transactional isolation levels that are supported by the LMDB Store.

Lemme add more test cases to it.

odysa · 2025-06-12T03:18:25Z

Added isolationLevel test.
Fixed the comment issue.

kenwenzel

Thank you for the good work. I only have some minor comments.

kenwenzel · 2025-06-24T06:13:35Z

core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/TripleStore.java

+		}
+
+		// Fast path: if only context is specified, return the size of the given context
+		if (subj == LmdbValue.UNKNOWN_ID && pred == LmdbValue.UNKNOWN_ID


Could be combined with previous if where context variable is checked inside. This would also avoid the slower call to stackGet().

kenwenzel · 2025-06-24T06:15:09Z

core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailDatasetImpl.java

-import java.util.Iterator;
-import java.util.List;
-import java.util.Map;
+import java.util.*;


@hmottestad Do we have any rules when using wildcard imports are allowed?

We usually don't use wildcard, but it's unfortunately not picked up by the formatter.

kenwenzel · 2025-06-24T06:16:19Z

core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailDatasetImpl.java

+		}
+
+		// Fallback path: iterate over all matching statements
+		return getStatements(subj, pred, obj, contexts).stream().count();


@odysa Does this call materialize/resolve all values in LmdbStore?

No. It calls getStatements here

rdf4j/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/LmdbSailStore.java

Lines 927 to 941 in d78001b

public CloseableIteration<? extends Statement> getStatements(Resource subj, IRI pred, Value obj,

Resource... contexts) throws SailException {

try {

return createStatementIterator(txn, subj, pred, obj, explicit, contexts);

} catch (IOException e) {

try {

logger.warn("Failed to get statements, retrying", e);

// try once more before giving up

Thread.yield();

return createStatementIterator(txn, subj, pred, obj, explicit, contexts);

} catch (IOException e2) {

throw new SailException("Unable to get statements", e);

}

}

}

And eventually calls getNextElement

rdf4j/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/LmdbStatementIterator.java

Lines 53 to 78 in d78001b

public Statement getNextElement() throws SailException {

try {

long[] quad = recordIt.next();

if (quad == null) {

return null;

}

long subjID = quad[TripleStore.SUBJ_IDX];

Resource subj = (Resource) valueStore.getLazyValue(subjID);

long predID = quad[TripleStore.PRED_IDX];

IRI pred = (IRI) valueStore.getLazyValue(predID);

long objID = quad[TripleStore.OBJ_IDX];

Value obj = valueStore.getLazyValue(objID);

Resource context = null;

long contextID = quad[TripleStore.CONTEXT_IDX];

if (contextID != 0) {

context = (Resource) valueStore.getLazyValue(contextID);

}

return valueStore.createStatement(subj, pred, obj, context);

} catch (IOException e) {

throw causeIOException(e);

}

hmottestad · 2025-07-14T11:00:15Z

This is still on my list of things to do. I've just been a bit too busy with other things, RDF4J is a side project that I work on next to my regular job.

odysa · 2025-07-14T21:13:41Z

This is still on my list of things to do. I've just been a bit too busy with other things, RDF4J is a side project that I work on next to my regular job.

Thank you so much. @hmottestad I really appreciate all the work that’s gone into RDF4J, and it's impressive that you're maintaining such a big project in your free time. If there’s anything I can do to help with maintaining the project, such as reviewing PRs or anything else, I’d be happy to contribute.

kenwenzel · 2025-07-14T21:22:54Z

LGTM
@hmottestad Same with me - that's why I asked for maybe using funding opportunities.

kenwenzel · 2025-09-22T07:42:48Z

@hmottestad Can this be merged?

hmottestad · 2025-09-22T13:03:23Z

core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailDatasetImpl.java

+		}
+
+		// Fallback path: iterate over all matching statements
+		return getStatements(subj, pred, obj, contexts).stream().count();


return getStatements(subj, pred, obj, contexts).stream().count(); should probably we wrapped in a try-with-resource.

As in:

try(var stream = getStatements(subj, pred, obj, contexts).stream()){ return stream.size(); }

hmottestad · 2025-09-22T13:13:28Z

core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailDataset.java


+	@Experimental
+	default long size(final Resource subj, final IRI pred, final Value obj, final Resource... contexts) {
+		return getStatements(subj, pred, obj, contexts).stream()


I believe that this also needs to be closed.

hmottestad · 2025-09-22T13:16:39Z

core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailSourceConnection.java

+	 * @throws SailException if an error occurs while accessing the Sail store
+	 */
+	@Experimental
+	protected long getSizeFromSnapshot(final boolean includeInferred, final Resource... contexts) throws SailException {


Do we need includeInferred? I can't see that the other size related methods support differentiating between inferred or explicit.

I kinda wish we could just use the existing sizeInternal(...) method, but I get that it's better to be safe and introduce a separate method for now and then later on we can consider changing up sizeInternal(...).

odysa force-pushed the feat/make-size-faster branch from e9dab06 to ff4a793 Compare May 31, 2025 16:18

odysa changed the title ~~GH-5343 Make LMDBSail Size 30x Faster 🚀🚀~~ GH-5343 Make LMDBSail Size() 30x Faster 🚀🚀 May 31, 2025

odysa force-pushed the feat/make-size-faster branch from ff4a793 to 6b638e0 Compare June 1, 2025 00:33

odysa changed the title ~~GH-5343 Make LMDBSail Size() 30x Faster 🚀🚀~~ GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 Jun 1, 2025

odysa force-pushed the feat/make-size-faster branch 6 times, most recently from 2be6b32 to c8c107a Compare June 1, 2025 02:15

odysa force-pushed the feat/make-size-faster branch from c8c107a to 8c4b328 Compare June 1, 2025 02:19

hmottestad reviewed Jun 1, 2025

View reviewed changes

odysa force-pushed the feat/make-size-faster branch from 8c4b328 to 87a524d Compare June 1, 2025 18:26

kenwenzel reviewed Jun 2, 2025

View reviewed changes

core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailDatasetImpl.java Outdated Show resolved Hide resolved

odysa force-pushed the feat/make-size-faster branch from 87a524d to 44a16d2 Compare June 12, 2025 03:18

odysa requested review from kenwenzel and hmottestad June 12, 2025 03:18

odysa force-pushed the feat/make-size-faster branch from 44a16d2 to e4c1475 Compare June 15, 2025 01:34

kenwenzel reviewed Jun 24, 2025

View reviewed changes

eclipse-rdf4jGH-5343 speed-up size calculation in LmdbStore

093a8f8

odysa force-pushed the feat/make-size-faster branch from e4c1475 to 093a8f8 Compare July 14, 2025 01:16

hmottestad reviewed Sep 22, 2025

View reviewed changes

	public CloseableIteration<? extends Statement> getStatements(Resource subj, IRI pred, Value obj,
	Resource... contexts) throws SailException {
	try {
	return createStatementIterator(txn, subj, pred, obj, explicit, contexts);
	} catch (IOException e) {
	try {
	logger.warn("Failed to get statements, retrying", e);
	// try once more before giving up
	Thread.yield();
	return createStatementIterator(txn, subj, pred, obj, explicit, contexts);
	} catch (IOException e2) {
	throw new SailException("Unable to get statements", e);
	}
	}
	}

	public Statement getNextElement() throws SailException {
	try {
	long[] quad = recordIt.next();
	if (quad == null) {
	return null;
	}

	long subjID = quad[TripleStore.SUBJ_IDX];
	Resource subj = (Resource) valueStore.getLazyValue(subjID);

	long predID = quad[TripleStore.PRED_IDX];
	IRI pred = (IRI) valueStore.getLazyValue(predID);

	long objID = quad[TripleStore.OBJ_IDX];
	Value obj = valueStore.getLazyValue(objID);

	Resource context = null;
	long contextID = quad[TripleStore.CONTEXT_IDX];
	if (contextID != 0) {
	context = (Resource) valueStore.getLazyValue(contextID);
	}

	return valueStore.createStatement(subj, pred, obj, context);
	} catch (IOException e) {
	throw causeIOException(e);
	}

GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 #5345

Are you sure you want to change the base?

GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 #5345

Uh oh!

Conversation

odysa commented May 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key Changes

Perf

Uh oh!

kenwenzel commented May 31, 2025

Uh oh!

odysa commented May 31, 2025

Uh oh!

kenwenzel commented May 31, 2025

Uh oh!

odysa commented May 31, 2025

Uh oh!

odysa commented May 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

odysa commented Jun 1, 2025

Uh oh!

kenwenzel commented Jun 1, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

odysa commented Jun 2, 2025

Uh oh!

Uh oh!

kenwenzel commented Jun 2, 2025

Uh oh!

hmottestad commented Jun 2, 2025

Uh oh!

odysa commented Jun 3, 2025

Uh oh!

odysa commented Jun 12, 2025

Uh oh!

kenwenzel left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hmottestad commented Jul 14, 2025

Uh oh!

odysa commented Jul 14, 2025

Uh oh!

kenwenzel commented Jul 14, 2025

Uh oh!

kenwenzel commented Sep 22, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hmottestad Sep 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

odysa commented May 31, 2025 •

edited

Loading

odysa commented May 31, 2025 •

edited

Loading

hmottestad Sep 22, 2025 •

edited

Loading