Skip to content

[WIP][df] Introduce RDatasetSpec for RNTuple #19341

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

martamaja10
Copy link
Contributor

This PR introduces functionality of RDatasetSpec for RNTuple. Main changes are inside RLoopManager and RNTupleDS - mainly the introduction of WithGlobalRanges.

Marked as WIP to make sure all tests pass but the main development should be ready for the review.

@martamaja10 martamaja10 self-assigned this Jul 11, 2025
@martamaja10 martamaja10 force-pushed the datasetspec-rntuple branch from 3645459 to 48101c1 Compare July 11, 2025 09:51
Copy link

Test Results

    18 files      18 suites   3d 21h 48m 57s ⏱️
 3 198 tests  3 194 ✅ 0 💤  4 ❌
56 086 runs  56 035 ✅ 0 💤 51 ❌

For more details on these failures, see this check.

Results for commit 48101c1.

Copy link
Member

@hageboeck hageboeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, thanks for the development!

I left already a few comments.

@@ -104,12 +104,18 @@ class RNTupleDS final : public ROOT::RDF::RDataSource {
std::vector<std::vector<ROOT::Internal::RDF::RNTupleColumnReader *>> fActiveColumnReaders;

ULong64_t fSeenEntries = 0; ///< The number of entries so far returned by GetEntryRanges()
ULong64_t fSeenEntriesRange = 0;
ULong64_t fSeenEntriesNonRange = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the name hard to understand. A small comment could help or if it's not too much work, one could rename it to something like fSeenEntriesOutOfRange (if that's what it does).

Copy link
Contributor Author

@martamaja10 martamaja10 Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it's not what it does, it counts entries if no global range is requested, I'll add a comment, I could also rename it but I'm not sure what to rename it to to be more evident. It used to be just fSeenEntries which was simply for counting processed entries but now I need to have two separate counters depending on the case. I still need the fSeenEntries to count correctly at the end of processing but depending on whether I have the global ranges or not, I need to count differently first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe fSeenEntriesNonRange --> fSeenEntriesNoGlobalRange and fSeenEntriesRange --> fSeenEntriesWithGlobalRange

Comment on lines +582 to +584
throw std::runtime_error(
"More than one RNTuple name was found, please make sure to use RNTuples with the same RnTuple name.");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a temporary limitation?

Copy link
Contributor Author

@martamaja10 martamaja10 Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but this is what we have in any RNTuple chaining at this point.

auto start = fCurrentRanges[i].fFirstEntry + fSeenEntries;
auto end = fCurrentRanges[i].fLastEntry + fSeenEntries;

start = fCurrentRanges[i].fFirstEntry + fSeenEntriesNonRange;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to be out of the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I use it also after the loop in this if statement block:

r->Connect(*fCurrentRanges[0].fSource, start);

Copy link
Member

@hageboeck hageboeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a few shadow warnings in the CI, but you probably saw them, too.

catch (const std::runtime_error &err) {
EXPECT_EQ(
std::string(err.what()),
"More than one RNTuple name was found, please make sure to use RNTuples with the same RnTuple name.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not specifically for this PR, but general for RDF devs:
I always wonder if it makes sense to really test the entire error message. In this way, we always have to touch two places if we wanted to correct small things, such as in this case:
RnTuple --> RNTuple

I would either just test that there is any exception (possibly of a specific type), or just test a substring such as exc.startsWith("More than one RNTuple name");. In this way, small adjustments or making the error message more helpful doesn't immediately fail the test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants