Skip to content

Conversation

odysa
Copy link

@odysa odysa commented May 31, 2025

GitHub issue resolved: #
#5343
Briefly describe the changes proposed in this PR:

This PR introduces an optimization for the size(...) method in the LMDBstore implementation. Introduce a cardinalityExact to calcualte the exact size, leverageingmdb_stats when possible.

Key Changes

  • ChangeSet

    • Fast-path Optimization

      • When there are no approved or deprecated changes in the current transaction or not in a transcation, the method directly delegates to the derivedFrom store for fast cardinality estimation.
    • Fallback to Iterator

      • When changes exist in the transaction, the method falls back to streaming through matching statements using getStatements(...).stream().count(). This bypasses LMDB’s lazy evaluation to ensure consistency, even with uncommitted changes.
  • Low-level Size Calculation (cardinalityExact)

    • If the pattern is completely unspecified (i.e. all wildcards), the method uses LMDB's mdb_stat to return the total size efficiently.
    • For specific patterns, it iterates over both explicit and implicit triples and counts the results.

Perf

I created a LMDBSail with 10M triples.
Original size(): 21802ms
Screenshot 2025-05-31 at 11 57 46 AM

Optimized size():
685.6 μs to get the full size by leveraging mdb_stats.
274.2 ms to get the size of a context of 5000000 triples.

Total Size: 10000000, Time taken: 685.6 μs
Size in context: 5000000, Time taken: 274.2 ms

PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

@odysa odysa force-pushed the feat/make-size-faster branch from e9dab06 to ff4a793 Compare May 31, 2025 16:18
@odysa odysa changed the title GH-5343 Make LMDBSail Size 30x Faster 🚀🚀 GH-5343 Make LMDBSail Size() 30x Faster 🚀🚀 May 31, 2025
@kenwenzel
Copy link
Contributor

This is quite good. We could even get better if we don't materialize anything.
But for now we could could/should stay with your solution?!

@odysa
Copy link
Author

odysa commented May 31, 2025

Let me further explore the materialize-nothing solution to see how fast it can be.

@kenwenzel
Copy link
Contributor

Let me further explore the materialize-nothing solution to see how fast it can be.

Maybe we can parameterize the cardinality() function with exact and reuse the existing code?

@odysa
Copy link
Author

odysa commented May 31, 2025

Ok.
Counting 10M triples without context filter only took 685.2 μs
Counting triples within a context requires scanning all entires. It took 274.2ms to scan 5000000 triples.

Total Size: 10000000, Time taken: 685.6 μs
Size in context: 5000000, Time taken: 274.2 ms

@odysa
Copy link
Author

odysa commented May 31, 2025

My cardinalityExact implementation.

protected long cardinalityExact(final long subj, final long pred, final long obj, final long context)
			throws IOException {

		// get size of entire db
		if (subj == LmdbValue.UNKNOWN_ID && pred == LmdbValue.UNKNOWN_ID && obj == LmdbValue.UNKNOWN_ID
				&& context == LmdbValue.UNKNOWN_ID) {
			return txnManager.doWith((stack, txn) -> {
				long cardinality = 0;
				final TripleIndex index = getBestIndex(subj, pred, obj, context);
				for (boolean explicit : new boolean[] { true, false }) {
					int dbi = index.getDB(explicit);
					MDBStat stat = MDBStat.mallocStack(stack);
					mdb_stat(txn, dbi, stat);
					cardinality += stat.ms_entries();
				}
				return cardinality;
			});
		}

		try (TxnManager.Txn txn = txnManager.createReadTxn()) {
			final RecordIterator explicitIter = getTriples(txn, subj, pred, obj, context, true);
			final RecordIterator implicitIter = getTriples(txn, subj, pred, obj, context, false);
			long size = 0;
			try {
				for (long[] quad = explicitIter.next(); quad != null; quad = explicitIter.next()) {
					size++;
				}
				for (long[] quad = implicitIter.next(); quad != null; quad = implicitIter.next()) {
					size++;
				}
			} finally {
				explicitIter.close();
				implicitIter.close();
			}
			return size;
		}
	}

@odysa odysa force-pushed the feat/make-size-faster branch from ff4a793 to 6b638e0 Compare June 1, 2025 00:33
@odysa odysa changed the title GH-5343 Make LMDBSail Size() 30x Faster 🚀🚀 GH-5343 Make LMDBSail Size() 36000x Faster 🚀🚀🚀🚀 Jun 1, 2025
@odysa odysa force-pushed the feat/make-size-faster branch 6 times, most recently from 2be6b32 to c8c107a Compare June 1, 2025 02:15
@odysa
Copy link
Author

odysa commented Jun 1, 2025

For changeset:

If there are no approved or deprecated changes, it uses a fast direct lookup. Otherwise, it falls back to iterating over matching statements, which remains performant since it avoids LMDB’s lazy evaluation.

@Override
	public long size(final Resource subj, final IRI pred, final Value obj, final Resource... contexts) {
		// Fast path: no approved or deprecated
		if (!changes.hasApproved() && !changes.hasDeprecated()) {
			return derivedFrom.size(subj, pred, obj, contexts);
		}

		// Fallback path: iterate over all matching statements; remains fast due to bypassing LMDB's lazy evaluation
		return getStatements(subj, pred, obj, contexts).stream().count();
	}

@odysa odysa force-pushed the feat/make-size-faster branch from c8c107a to 8c4b328 Compare June 1, 2025 02:19
@kenwenzel
Copy link
Contributor

Hi @odysa,

the context size is also already tracked:

private void incrementContext(MemoryStack stack, long context) throws IOException {

Comment on lines 1038 to 1053
protected long calculateSize(final boolean includeInferred, final Resource... contexts) throws SailException {
try (SailSource branch = branch(IncludeInferred.fromBoolean(includeInferred))) {
return branch.dataset(getIsolationLevel()).size(null, null, null, contexts);
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for why we need a new method here?

From a quick glance I saw that there is protected long sizeInternal(Resource... contexts) throws SailException { which I wonder if would be a better place for this code?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This optimization is only for LMDB, so I made 2 different pathes for size.
LMDB gets size directly from banch.dataset
Other sails call getStatementsInternal to count.

I don't want to accidently break other sails, so leave sizeInternal unchanged.

@odysa odysa force-pushed the feat/make-size-faster branch from 8c4b328 to 87a524d Compare June 1, 2025 18:26
@odysa
Copy link
Author

odysa commented Jun 2, 2025

Done getting size from tracked context size

@kenwenzel
Copy link
Contributor

Besides my comment it looks good to me.
@hmottestad What do you think about the changes in the SAIL implementation?

@hmottestad
Copy link
Contributor

I need to take a closer look. It would be best to not have to make any changes to the common sail code. We have the ElasticsearchStore that could also leverage a faster way to count by offloading it to Elasticsearch.

Another aspect is to make sure that this new way of counting follows all the transactional isolation levels that are supported by the LMDB Store.

Finally I would like to see if we can't add a benchmark to our current suite so that this new optimisation won't get forgotten when we make changes in the future.

Im a bit pressed for time right now though. I'll do my best to prioritise this.

@odysa
Copy link
Author

odysa commented Jun 3, 2025

Another aspect is to make sure that this new way of counting follows all the transactional isolation levels that are supported by the LMDB Store.

Lemme add more test cases to it.

@odysa odysa force-pushed the feat/make-size-faster branch from 87a524d to 44a16d2 Compare June 12, 2025 03:18
@odysa
Copy link
Author

odysa commented Jun 12, 2025

Added isolationLevel test.
Fixed the comment issue.

@odysa odysa requested review from kenwenzel and hmottestad June 12, 2025 03:18
@odysa odysa force-pushed the feat/make-size-faster branch from 44a16d2 to e4c1475 Compare June 15, 2025 01:34
Copy link
Contributor

@kenwenzel kenwenzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the good work. I only have some minor comments.

}

// Fast path: if only context is specified, return the size of the given context
if (subj == LmdbValue.UNKNOWN_ID && pred == LmdbValue.UNKNOWN_ID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be combined with previous if where context variable is checked inside. This would also avoid the slower call to stackGet().

import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmottestad Do we have any rules when using wildcard imports are allowed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually don't use wildcard, but it's unfortunately not picked up by the formatter.

}

// Fallback path: iterate over all matching statements
return getStatements(subj, pred, obj, contexts).stream().count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@odysa Does this call materialize/resolve all values in LmdbStore?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It calls getStatements here

public CloseableIteration<? extends Statement> getStatements(Resource subj, IRI pred, Value obj,
Resource... contexts) throws SailException {
try {
return createStatementIterator(txn, subj, pred, obj, explicit, contexts);
} catch (IOException e) {
try {
logger.warn("Failed to get statements, retrying", e);
// try once more before giving up
Thread.yield();
return createStatementIterator(txn, subj, pred, obj, explicit, contexts);
} catch (IOException e2) {
throw new SailException("Unable to get statements", e);
}
}
}

And eventually calls getNextElement
public Statement getNextElement() throws SailException {
try {
long[] quad = recordIt.next();
if (quad == null) {
return null;
}
long subjID = quad[TripleStore.SUBJ_IDX];
Resource subj = (Resource) valueStore.getLazyValue(subjID);
long predID = quad[TripleStore.PRED_IDX];
IRI pred = (IRI) valueStore.getLazyValue(predID);
long objID = quad[TripleStore.OBJ_IDX];
Value obj = valueStore.getLazyValue(objID);
Resource context = null;
long contextID = quad[TripleStore.CONTEXT_IDX];
if (contextID != 0) {
context = (Resource) valueStore.getLazyValue(contextID);
}
return valueStore.createStatement(subj, pred, obj, context);
} catch (IOException e) {
throw causeIOException(e);
}

@odysa odysa force-pushed the feat/make-size-faster branch from e4c1475 to 093a8f8 Compare July 14, 2025 01:16
@hmottestad
Copy link
Contributor

This is still on my list of things to do. I've just been a bit too busy with other things, RDF4J is a side project that I work on next to my regular job.

@odysa
Copy link
Author

odysa commented Jul 14, 2025

This is still on my list of things to do. I've just been a bit too busy with other things, RDF4J is a side project that I work on next to my regular job.

Thank you so much. @hmottestad I really appreciate all the work that’s gone into RDF4J, and it's impressive that you're maintaining such a big project in your free time. If there’s anything I can do to help with maintaining the project, such as reviewing PRs or anything else, I’d be happy to contribute.

@kenwenzel
Copy link
Contributor

LGTM
@hmottestad Same with me - that's why I asked for maybe using funding opportunities.

@kenwenzel
Copy link
Contributor

@hmottestad Can this be merged?

}

// Fallback path: iterate over all matching statements
return getStatements(subj, pred, obj, contexts).stream().count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return getStatements(subj, pred, obj, contexts).stream().count(); should probably we wrapped in a try-with-resource.

Copy link
Contributor

@hmottestad hmottestad Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in:

try(var stream = getStatements(subj, pred, obj, contexts).stream()){
  return stream.size();
}


@Experimental
default long size(final Resource subj, final IRI pred, final Value obj, final Resource... contexts) {
return getStatements(subj, pred, obj, contexts).stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this also needs to be closed.

* @throws SailException if an error occurs while accessing the Sail store
*/
@Experimental
protected long getSizeFromSnapshot(final boolean includeInferred, final Resource... contexts) throws SailException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need includeInferred? I can't see that the other size related methods support differentiating between inferred or explicit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda wish we could just use the existing sizeInternal(...) method, but I get that it's better to be safe and introduce a separate method for now and then later on we can consider changing up sizeInternal(...).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants