[RDF] Add internal GetClusterRanges utility to retrieve cluster entry ranges #21768

Open
siliataider wants to merge 5 commits into root-project:master from siliataider:rdf-clusterranges

@siliataider
Contributor

@siliataider siliataider commented Apr 1, 2026

This pull request:

Adds GetClusterRanges as an internal utility to retrieve the entry range of each cluster in a TTree- or RNTuple-based RDataFrame.

It returns the list of cluster boundaries across all files as global entry numbers: each file's local boundaries are shifted by the cumulative entry count of the files that precede it.

This utility is required by the RDataLoader to shuffle and prefetch data for ML training.
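
To illustrate the global-offset convention, here is a minimal, ROOT-free sketch. It is not the code in this PR: the helper name ToGlobalClusterRanges, the input layout (local cluster start entries followed by the file's entry count), the half-open [begin, end) ranges, and the use of std::uint64_t in place of ULong64_t are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed convention for this sketch: half-open range [begin, end).
using EntryRange = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical helper. perFileBoundaries[i] holds the local cluster start
// entries of file i followed by that file's total entry count, e.g.
// {0, 100, 250} describes two clusters: [0, 100) and [100, 250).
std::vector<EntryRange>
ToGlobalClusterRanges(const std::vector<std::vector<std::uint64_t>> &perFileBoundaries)
{
   std::vector<EntryRange> ranges;
   std::uint64_t offset = 0; // cumulative entry count of all previous files
   for (const auto &bounds : perFileBoundaries) {
      if (bounds.empty())
         continue; // nothing known about this file, it contributes no entries
      assert(bounds.front() == 0 && "local boundaries must start at entry 0");
      // Each pair of consecutive boundaries is one cluster, shifted by offset.
      for (std::size_t i = 0; i + 1 < bounds.size(); ++i)
         ranges.emplace_back(offset + bounds[i], offset + bounds[i + 1]);
      offset += bounds.back(); // last boundary is the file's entry count
   }
   return ranges;
}
```

With two files whose local boundaries are {0, 100, 250} and {0, 50}, the second file's single cluster is reported as (250, 300) rather than (0, 50).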

Changes

  • RInterfaceBase: make GetLoopManager() public to allow access from internal utilities
  • RNTupleDS: add GetClusterRanges as a friend function so it can access the private members fNTupleName and fFileNames
  • RNTupleDS: set fNTupleName in the single-file constructor (as the multi-file constructor already does)
  • RDFUtils: add the GetClusterRanges() implementation for both TTree and RNTuple data sources

@siliataider siliataider self-assigned this Apr 1, 2026
@siliataider siliataider added the in:RDataFrame and in:ML labels Apr 1, 2026
@siliataider
Contributor Author

@vepadulano the multithreaded approach may not be so straightforward: the global offset of each file depends on the cumulative entry count of the previous files, which are opened sequentially.

We could maybe think of a two-step approach: collect the clusters of each file in parallel, then sequentially apply the global offset while preserving the original order of the files.
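
A rough sketch of what such a two-step approach could look like, using only the standard library. Everything here is hypothetical: the CollectGlobalRanges function, the LocalReader callback standing in for "open one file and return its local cluster ranges", and the half-open [begin, end) convention are assumptions for illustration, not part of this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Assumed convention for this sketch: half-open range [begin, end).
using EntryRange = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical callback: opens one file and returns its *local* cluster ranges.
using LocalReader = std::function<std::vector<EntryRange>(const std::string &)>;

std::vector<EntryRange>
CollectGlobalRanges(const std::vector<std::string> &files, const LocalReader &readLocal)
{
   // Step 1: read the local ranges of every file concurrently.
   // futures[i] belongs to files[i], so the original file order is kept
   // no matter which task finishes first.
   std::vector<std::future<std::vector<EntryRange>>> futures;
   futures.reserve(files.size());
   for (const auto &f : files)
      futures.push_back(std::async(std::launch::async, readLocal, f));

   // Step 2: walk the results in file order and shift each local range
   // by the cumulative entry count of the preceding files.
   std::vector<EntryRange> global;
   std::uint64_t offset = 0;
   for (auto &fut : futures) {
      const std::vector<EntryRange> local = fut.get();
      assert(local.empty() || local.front().first == 0);
      for (const auto &r : local)
         global.emplace_back(offset + r.first, offset + r.second);
      if (!local.empty())
         offset += local.back().second; // end of last cluster == entries in file
   }
   return global;
}
```

Only the cheap offset adjustment stays serial in this sketch; the file opening happens in the tasks. Whether one task per file is appropriate for many remote files is a separate question that it does not address.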

@github-actions

github-actions bot commented Apr 1, 2026

Test Results

    22 files      22 suites   3d 6h 4m 35s ⏱️
 3 831 tests  3 829 ✅  1 💤 1 ❌
76 513 runs  76 494 ✅ 18 💤 1 ❌

For more details on these failures, see this check.

Results for commit 395d045.

♻️ This comment has been updated with latest results.

@siliataider siliataider force-pushed the rdf-clusterranges branch 4 times, most recently from 22ca922 to cef5c00 April 7, 2026 14:48
* \brief Function to retrieve the entry ranges for each cluster in the dataset,
* across files, with a global offset.
*/
std::vector<std::pair<ULong64_t, ULong64_t>> GetClusterRanges(Detail::RDF::RLoopManager &lm);
Member

The documentation should clearly note that this is a slow operation, as it (serially) opens all the files, loads all the TTree objects, and then deletes/closes them. Those operations can be noticeably slow when done on many (large) remote files, so this function should be used sparingly.
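
One way a caller could keep usage sparing is to compute the ranges once and reuse them, for instance reshuffling the same list for every training epoch instead of querying the files again. The sketch below is only an illustration of that idea: the ClusterSchedule class and its members are hypothetical and do not exist in ROOT, and the ranges are assumed to have been obtained from a single earlier call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

using EntryRange = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical wrapper: stores cluster ranges that were computed once up
// front and hands out a freshly shuffled copy for each epoch, so the
// expensive file scan is never repeated.
class ClusterSchedule {
   std::vector<EntryRange> fRanges; // computed once, never re-read from files
   std::mt19937_64 fRng;            // seeded generator for reproducible order

public:
   ClusterSchedule(std::vector<EntryRange> ranges, std::uint64_t seed)
      : fRanges(std::move(ranges)), fRng(seed)
   {
      for (const auto &r : fRanges)
         assert(r.first <= r.second && "range begin must not exceed its end");
   }

   // Returns the stored ranges in a new random order; fRanges is untouched.
   std::vector<EntryRange> NextEpochOrder()
   {
      std::vector<EntryRange> order = fRanges;
      std::shuffle(order.begin(), order.end(), fRng);
      return order;
   }
};
```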
