
Add files_processed and files_scanned metrics to FileStreamMetrics#20592

Open
adriangb wants to merge 3 commits into apache:main from pydantic:file-open-metrics

Conversation


@adriangb commented Feb 27, 2026

Summary

  • Add files_processed counter to FileStreamMetrics, incremented for every file assigned to the partition — whether it was opened, pruned (returned an empty stream), or skipped due to a LIMIT. When the stream completes, this equals the total number of files in the partition.
  • Add files_scanned counter to FileStreamMetrics, incremented when a file's reader stream is fully consumed (all batches read).
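A minimal sketch of how the two cumulative counters behave. This is illustrative only: the `Count` and `FileStreamMetrics` types below are simplified stand-ins, not DataFusion's actual metric implementation (which lives in its metrics module and carries more machinery).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Illustrative stand-in for a shared, clone-able cumulative counter metric.
#[derive(Clone, Default)]
pub struct Count(Arc<AtomicUsize>);

impl Count {
    pub fn add(&self, n: usize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    pub fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical subset of FileStreamMetrics holding only the two new counters.
#[derive(Clone, Default)]
pub struct FileStreamMetrics {
    pub files_processed: Count,
    pub files_scanned: Count,
}

fn main() {
    let m = FileStreamMetrics::default();
    // One file assigned to the partition and fully read:
    m.files_processed.add(1);
    m.files_scanned.add(1);
    // A LIMIT fires; two remaining files are never opened,
    // but they still count toward files_processed:
    m.files_processed.add(2);
    println!(
        "processed={} scanned={}",
        m.files_processed.value(),
        m.files_scanned.value()
    );
}
```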

Motivation

These metrics enable tracking query progress during long-running scans. Today there is no way to monitor how far along a file scan is. The existing FileStreamMetrics provides only:

  • Timing metrics (time_elapsed_opening, time_elapsed_scanning_total, etc.) — these measure duration but don't indicate progress. You can't tell whether a scan is 10% or 90% done from elapsed time alone.
  • Error counters (file_open_errors, file_scan_errors) — these only count failures, not successful progress.
  • output_rows (from BaselineMetrics) — counts rows emitted, but since the total number of rows to be emitted is not known up front, it is a poor progress metric: with filters it never converges to 100%.

In contrast, files_processed and files_scanned combined with the known number of files in file_groups give a clear progress indicator: files_processed / total_files. This is the most natural and reliable way to track scan progress since the file count is known at plan time.
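The progress computation described above can be sketched as a small helper. `scan_progress` is a hypothetical function, not part of this PR; it only illustrates the files_processed / total_files ratio, where total_files is known at plan time from the partition's file groups.

```rust
/// Hypothetical helper: fraction of the partition's files fully processed.
/// `total_files` is the file count known at plan time.
fn scan_progress(files_processed: usize, total_files: usize) -> f64 {
    if total_files == 0 {
        1.0 // an empty partition is trivially complete
    } else {
        files_processed as f64 / total_files as f64
    }
}

fn main() {
    // 3 of 10 files processed so far:
    println!("{:.0}%", scan_progress(3, 10) * 100.0);
}
```

Because files_processed always reaches total_files when the stream completes, this ratio converges to 100% even when filters or limits suppress output rows.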

Edge case behavior

| Scenario | files_processed | files_scanned |
| --- | --- | --- |
| File-level pruning (FilePruner / dynamic filter) | +1 (open resolves with an empty stream) | +1 (empty stream yields None) |
| All row groups pruned (stats/bloom/TopK) | +1 | +1 |
| EarlyStoppingStream terminates mid-scan | +1 | +1 (stream yields None) |
| LIMIT hit mid-file | +1 for the current file, +N for remaining files never opened | 0 for the current file |
| Normal full scan | +1 | +1 |
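The table's accounting can be modeled compactly. The enum and `count` function below are an illustrative model of the counting rules, not DataFusion's stream implementation; the key invariant is that every outcome contributes +1 to files_processed, so the sum always equals the partition's file count.

```rust
/// Illustrative model of how each per-file outcome affects the counters.
enum FileOutcome {
    FullyScanned,    // normal scan, pruned-empty stream, or early stop: stream yields None
    LimitHitMidFile, // opened, but abandoned before the stream was drained
    NeverOpened,     // remaining file skipped after a LIMIT
}

/// Returns the (files_processed delta, files_scanned delta) for one file.
fn count(outcome: &FileOutcome) -> (usize, usize) {
    match outcome {
        FileOutcome::FullyScanned => (1, 1),
        FileOutcome::LimitHitMidFile => (1, 0),
        FileOutcome::NeverOpened => (1, 0),
    }
}

fn main() {
    // Partition of 4 files: one full scan, one pruned to an empty stream,
    // a LIMIT fires midway through the third, and the fourth is never opened.
    let outcomes = [
        FileOutcome::FullyScanned,
        FileOutcome::FullyScanned, // pruned file: empty stream still drains to None
        FileOutcome::LimitHitMidFile,
        FileOutcome::NeverOpened,
    ];
    let (processed, scanned) = outcomes
        .iter()
        .map(count)
        .fold((0, 0), |(p, s), (dp, ds)| (p + dp, s + ds));
    assert_eq!(processed, outcomes.len()); // invariant: equals total partition files
    println!("processed={processed} scanned={scanned}");
}
```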

Test plan

  • Existing file_stream tests pass (8/8)
  • cargo check -p datafusion-datasource compiles cleanly

🤖 Generated with Claude Code

Track file-level progress in FileStream with two new counters:
- files_opened: incremented when a file is successfully opened
- files_scanned: incremented when a file's reader stream is fully consumed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the datasource Changes to the datasource crate label Feb 27, 2026
adriangb and others added 2 commits February 27, 2026 12:11
Rename `files_opened` metric to `files_processed` so it reflects
all files assigned to the partition, not just those that were opened.
When a LIMIT terminates the stream early, the remaining files
(including any prefetched next file) are counted so that
`files_processed` always equals the total partition file count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@adriangb adriangb changed the title Add files_opened and files_scanned metrics to FileStreamMetrics Add files_processed and files_scanned metrics to FileStreamMetrics Mar 1, 2026
