-
Notifications
You must be signed in to change notification settings - Fork 1.4k
[WIP][df] Introduce RDatasetSpec for RNTuple #19341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
3645459
to
48101c1
Compare
Test Results 18 files 18 suites 3d 21h 48m 57s ⏱️ For more details on these failures, see this check. Results for commit 48101c1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, thanks for the development!
I left already a few comments.
@@ -104,12 +104,18 @@ class RNTupleDS final : public ROOT::RDF::RDataSource { | |||
std::vector<std::vector<ROOT::Internal::RDF::RNTupleColumnReader *>> fActiveColumnReaders; | |||
|
|||
ULong64_t fSeenEntries = 0; ///< The number of entries so far returned by GetEntryRanges() | |||
ULong64_t fSeenEntriesRange = 0; | |||
ULong64_t fSeenEntriesNonRange = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the name hard to understand. A small comment could help or if it's not too much work, one could rename it to something like fSeenEntriesOutOfRange
(if that's what it does).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it's not what it does, it counts entries if no global range is requested, I'll add a comment, I could also rename it but I'm not sure what to rename it to to be more evident. It used to be just fSeenEntries which was simply for counting processed entries but now I need to have two separate counters depending on the case. I still need the fSeenEntries to count correctly at the end of processing but depending on whether I have the global ranges or not, I need to count differently first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe fSeenEntriesNonRange --> fSeenEntriesNoGlobalRange and fSeenEntriesRange --> fSeenEntriesWithGlobalRange
throw std::runtime_error( | ||
"More than one RNTuple name was found, please make sure to use RNTuples with the same RnTuple name."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a temporary limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but this is what we have in any RNTuple chaining at this point.
auto start = fCurrentRanges[i].fFirstEntry + fSeenEntries; | ||
auto end = fCurrentRanges[i].fLastEntry + fSeenEntries; | ||
|
||
start = fCurrentRanges[i].fFirstEntry + fSeenEntriesNonRange; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it need to be out of the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I use it also after the loop in this if statement block:
root/tree/dataframe/src/RNTupleDS.cxx
Line 776 in 058f1bb
r->Connect(*fCurrentRanges[0].fSource, start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a few shadow warnings in the CI, but you probably saw them, too.
catch (const std::runtime_error &err) { | ||
EXPECT_EQ( | ||
std::string(err.what()), | ||
"More than one RNTuple name was found, please make sure to use RNTuples with the same RnTuple name."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not specifically for this PR, but general for RDF devs:
I always wonder if it makes sense to really test the entire error message. In this way, we always have to touch two places if we wanted to correct small things, such as in this case:
RnTuple --> RNTuple
I would either just test that there is any exception (possibly of a specific type), or just test a substring such as exc.startsWith("More than one RNTuple name");
. In this way, small adjustments or making the error message more helpful doesn't immediately fail the test.
This PR introduces functionality of RDatasetSpec for RNTuple. Main changes are inside RLoopManager and RNTupleDS - mainly the introduction of WithGlobalRanges.
Marked as WIP to make sure all tests pass but the main development should be ready for the review.