[RDF] Add internal GetClusterRanges utility to retrieve cluster entry ranges #21768
siliataider wants to merge 5 commits into root-project:master
@vepadulano the multithreaded approach may not be so straightforward: the global offset depends on the cumulative entry count from the previous files, which are opened sequentially. We could maybe think of a two-step approach: collect the clusters in parallel, then sequentially adjust them with a global offset, preserving the original order of the files.
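A minimal sketch of the second, sequential step of that two-step idea. It assumes the per-file results have already been collected (possibly in parallel) into a vector indexed by file position, so that file order is preserved no matter which file finished first. `FileClusters`, `EntryRange` and `ApplyGlobalOffsets` are hypothetical names, not part of this PR or of ROOT; ranges are assumed to be half-open `[begin, end)`, and `std::uint64_t` stands in for `ULong64_t`.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Assumption: a cluster range is half-open, [begin, end).
using EntryRange = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical per-file result of step 1: cluster ranges expressed in
// file-local entry numbers, plus the total number of entries in that file.
struct FileClusters {
   std::vector<EntryRange> fLocalRanges;
   std::uint64_t fNEntries = 0;
};

// Step 2: walk the files in their original order and shift every local
// range by the number of entries contained in all preceding files.
std::vector<EntryRange> ApplyGlobalOffsets(const std::vector<FileClusters> &perFile)
{
   std::vector<EntryRange> global;
   std::uint64_t offset = 0;
   for (const auto &file : perFile) {
      for (const auto &range : file.fLocalRanges) {
         // Sanity check: a local range must lie inside its own file.
         assert(range.first <= range.second && range.second <= file.fNEntries);
         global.emplace_back(range.first + offset, range.second + offset);
      }
      offset += file.fNEntries;
   }
   return global;
}
```

Only this pass has to be serial, and it touches no files: the expensive open/read work would stay in step 1, where each task writes into its own slot of `perFile`.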
Test Results: 22 files, 22 suites, 3d 6h 4m 35s ⏱️. For more details on these failures, see this check. Results for commit 395d045.
 * \brief Function to retrieve the entry ranges for each cluster in the dataset,
 * across files, with a global offset.
 */
std::vector<std::pair<ULong64_t, ULong64_t>> GetClusterRanges(Detail::RDF::RLoopManager &lm);
The documentation should clearly note that this is a slow operation, as it (serially) opens all the files and loads all the TTree objects (and then deletes/closes them). Those operations can be noticeably slow when done on many (large) remote files, so this function should be used sparingly.
This Pull request:
Adds GetClusterRanges as an internal utility to retrieve entry ranges for each cluster in a TTree or RNTuple based RDataFrame. It returns a list of cluster boundaries across files, using a global offset.
This utility is required by the RDataLoader to shuffle and prefetch data for ML training.

Changes
- RInterfaceBase: make GetLoopManager() public to allow access from internal utilities
- RNTupleDS: add GetClusterRanges as a friend function to access private members fNTupleName and fFileNames
- RNTupleDS: set fNTupleName in the single file constructor (like the multi file constructor)
- RDFUtils: add GetClusterRanges() implementation for both TTree and RNTuple datasources
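
To illustrate why globally numbered cluster ranges are convenient for shuffling, here is a rough sketch of how a consumer might randomize the order in which clusters are visited while keeping each cluster contiguous. This is only an assumption about the use case, not the RDataLoader implementation: `ShuffledClusterOrder` and `CountEntries` are hypothetical helpers, ranges are assumed to be half-open `[begin, end)`, and `std::uint64_t` stands in for `ULong64_t`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Assumption: a cluster range is half-open, [begin, end), in global entry numbers.
using EntryRange = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical helper: return the same ranges in a pseudo-random order.
// Entries inside a cluster stay contiguous, so each cluster can still be
// read (or prefetched) in one go.
std::vector<EntryRange> ShuffledClusterOrder(std::vector<EntryRange> ranges, std::uint64_t seed)
{
   std::mt19937_64 rng(seed);
   std::shuffle(ranges.begin(), ranges.end(), rng);
   return ranges;
}

// Hypothetical helper: total number of entries covered by the ranges.
std::uint64_t CountEntries(const std::vector<EntryRange> &ranges)
{
   std::uint64_t n = 0;
   for (const auto &range : ranges) {
      assert(range.first <= range.second);
      n += range.second - range.first;
   }
   return n;
}
```

Because the ranges already carry the global offset, the shuffled list can be handed to a reader without tracking which file each cluster came from.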