Add files_processed and files_scanned metrics to FileStreamMetrics #20592
Open
adriangb wants to merge 3 commits into apache:main
Track file-level progress in FileStream with two new counters:
- files_opened: incremented when a file is successfully opened
- files_scanned: incremented when a file's reader stream is fully consumed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename `files_opened` metric to `files_processed` so it reflects all files assigned to the partition, not just those that were opened. When a LIMIT terminates the stream early, the remaining files (including any prefetched next file) are counted so that `files_processed` always equals the total partition file count. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
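The accounting described in this commit message can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions, not the code in this PR: `FileCounters` and `simulate_scan` are hypothetical names, each file is modeled by its row count alone, and a file in which the limit is reached is assumed not to count as scanned. An empty file is treated as fully consumed here, which is also an assumption.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the two counters; not DataFusion's `FileStreamMetrics`.
#[derive(Debug, Default, PartialEq)]
pub struct FileCounters {
    pub files_processed: usize,
    pub files_scanned: usize,
}

/// Simulates scanning one partition. Each entry of `rows_per_file` is the
/// number of rows a file yields; `limit` optionally caps the rows emitted.
pub fn simulate_scan(rows_per_file: &[usize], limit: Option<usize>) -> FileCounters {
    let mut counters = FileCounters::default();
    let mut queue: VecDeque<usize> = rows_per_file.iter().copied().collect();
    let mut remaining = limit;

    while let Some(rows) = queue.pop_front() {
        // Every file taken from the queue counts as processed.
        counters.files_processed += 1;

        // Assumption: if the limit is reached inside this file, its reader is
        // dropped before it is fully consumed.
        let limit_hit = matches!(remaining, Some(left) if rows >= left);
        if limit_hit {
            // Count the files that will never be opened, so the total still
            // equals the number of files assigned to the partition.
            counters.files_processed += queue.len();
            queue.clear();
        } else {
            // The reader stream was read to the end.
            counters.files_scanned += 1;
            remaining = remaining.map(|left| left - rows);
        }
    }
    counters
}
```

With four files of 10 rows each and a limit of 15, the sketch reports all four files as processed but only the first as scanned.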
Summary
- Adds a `files_processed` counter to `FileStreamMetrics`, incremented for every file assigned to the partition, whether it was opened, pruned (returned an empty stream), or skipped due to a LIMIT. When the stream completes, this equals the total number of files in the partition.
- Adds a `files_scanned` counter to `FileStreamMetrics`, incremented when a file's reader stream is fully consumed (all batches read).

Motivation
These metrics enable tracking query progress during long-running scans. Today, there is no way to monitor how far along a file scan is. The existing `FileStreamMetrics` only provide:
- Timing metrics (`time_elapsed_opening`, `time_elapsed_scanning_total`, etc.): these measure duration but don't indicate progress. You can't tell whether a scan is 10% or 90% done from elapsed time alone.
- Error counts (`file_open_errors`, `file_scan_errors`): these only count failures, not successful progress.
- `output_rows` (from `BaselineMetrics`): counts rows emitted, but the total number of rows is not known upfront, so it is a poor progress metric. With filters, for example, it never converges to 100%.

In contrast, `files_processed` and `files_scanned`, combined with the known number of files in `file_groups`, give a clear progress indicator: `files_processed / total_files`. This is the most natural and reliable way to track scan progress, since the file count is known at plan time.

Edge case behavior
[Table comparing `files_processed` and `files_scanned`; rows not recoverable]

Test plan
- `file_stream` tests pass (8/8)
- `cargo check -p datafusion-datasource` compiles cleanly

🤖 Generated with Claude Code
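
The progress indicator from the Motivation section, `files_processed / total_files`, could be computed along these lines. This is a minimal sketch with hypothetical helpers (`total_files` and `scan_progress` are not DataFusion APIs), and it models file groups as plain slices of file names rather than the real plan types. Treating an empty scan as complete is an assumption of the sketch.

```rust
/// Counts the files across all file groups (one group per partition).
/// Hypothetical helper: file groups are modeled as vectors of file names.
pub fn total_files(file_groups: &[Vec<&str>]) -> usize {
    file_groups.iter().map(|group| group.len()).sum()
}

/// Returns scan progress as a fraction in [0.0, 1.0].
/// `files_processed` is the counter value summed over all partitions.
pub fn scan_progress(files_processed: usize, total_files: usize) -> f64 {
    if total_files == 0 {
        // Assumption: a scan with no files is reported as complete.
        return 1.0;
    }
    // Clamp defensively so the fraction never exceeds 1.0.
    files_processed.min(total_files) as f64 / total_files as f64
}
```

For example, with 12 files known at plan time and a summed `files_processed` of 3, `scan_progress(3, 12)` reports a quarter of the scan as done.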