
Add files_processed and files_scanned metrics to FileStreamMetrics#20592

Open
adriangb wants to merge 3 commits into apache:main from pydantic:file-open-metrics

Conversation


@adriangb commented Feb 27, 2026

Summary

  • Add files_processed counter to FileStreamMetrics, incremented for every file assigned to the partition — whether it was opened, pruned (returned an empty stream), or skipped due to a LIMIT. When the stream completes, this equals the total number of files in the partition.
  • Add files_scanned counter to FileStreamMetrics, incremented when a file's reader stream is fully consumed (all batches read).
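A minimal sketch of how the two cumulative counters behave. This is illustrative only: the `Count` and `FileStreamMetrics` types below are simplified stand-ins, not DataFusion's actual metric implementation (which lives in its metrics module and carries more machinery).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Illustrative stand-in for a shared, clone-able cumulative counter metric.
#[derive(Clone, Default)]
pub struct Count(Arc<AtomicUsize>);

impl Count {
    pub fn add(&self, n: usize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    pub fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical subset of FileStreamMetrics holding only the two new counters.
#[derive(Clone, Default)]
pub struct FileStreamMetrics {
    pub files_processed: Count,
    pub files_scanned: Count,
}

fn main() {
    let m = FileStreamMetrics::default();
    // One file assigned to the partition and fully read:
    m.files_processed.add(1);
    m.files_scanned.add(1);
    // A LIMIT fires; two remaining files are never opened,
    // but they still count toward files_processed:
    m.files_processed.add(2);
    println!(
        "processed={} scanned={}",
        m.files_processed.value(),
        m.files_scanned.value()
    );
}
```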

Motivation

These metrics enable tracking query progress during long-running scans. Today there is no way to monitor how far along a file scan is. The existing FileStreamMetrics provides only:

  • Timing metrics (time_elapsed_opening, time_elapsed_scanning_total, etc.) — these measure duration but don't indicate progress. You can't tell whether a scan is 10% or 90% done from elapsed time alone.
  • Error counters (file_open_errors, file_scan_errors) — these only count failures, not successful progress.
  • output_rows (from BaselineMetrics) — counts rows emitted, but since the total number of rows to be emitted is not known up front, it is a poor progress metric: with filters it never converges to 100%.

In contrast, files_processed and files_scanned combined with the known number of files in file_groups give a clear progress indicator: files_processed / total_files. This is the most natural and reliable way to track scan progress since the file count is known at plan time.
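The progress computation described above can be sketched as a small helper. `scan_progress` is a hypothetical function, not part of this PR; it only illustrates the files_processed / total_files ratio, where total_files is known at plan time from the partition's file groups.

```rust
/// Hypothetical helper: fraction of the partition's files fully processed.
/// `total_files` is the file count known at plan time.
fn scan_progress(files_processed: usize, total_files: usize) -> f64 {
    if total_files == 0 {
        1.0 // an empty partition is trivially complete
    } else {
        files_processed as f64 / total_files as f64
    }
}

fn main() {
    // 3 of 10 files processed so far:
    println!("{:.0}%", scan_progress(3, 10) * 100.0);
}
```

Because files_processed always reaches total_files when the stream completes, this ratio converges to 100% even when filters or limits suppress output rows.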

Edge case behavior

| Scenario | files_processed | files_scanned |
| --- | --- | --- |
| File-level pruning (FilePruner / dynamic filter) | +1 (open resolves with an empty stream) | +1 (empty stream yields None) |
| All row groups pruned (stats/bloom/TopK) | +1 | +1 |
| EarlyStoppingStream terminates mid-scan | +1 | +1 (stream yields None) |
| LIMIT hit mid-file | +1 for the current file, +N for remaining files never opened | 0 for the current file |
| Normal full scan | +1 | +1 |
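The table's accounting can be modeled compactly. The enum and `count` function below are an illustrative model of the counting rules, not DataFusion's stream implementation; the key invariant is that every outcome contributes +1 to files_processed, so the sum always equals the partition's file count.

```rust
/// Illustrative model of how each per-file outcome affects the counters.
enum FileOutcome {
    FullyScanned,    // normal scan, pruned-empty stream, or early stop: stream yields None
    LimitHitMidFile, // opened, but abandoned before the stream was drained
    NeverOpened,     // remaining file skipped after a LIMIT
}

/// Returns the (files_processed delta, files_scanned delta) for one file.
fn count(outcome: &FileOutcome) -> (usize, usize) {
    match outcome {
        FileOutcome::FullyScanned => (1, 1),
        FileOutcome::LimitHitMidFile => (1, 0),
        FileOutcome::NeverOpened => (1, 0),
    }
}

fn main() {
    // Partition of 4 files: one full scan, one pruned to an empty stream,
    // a LIMIT fires midway through the third, and the fourth is never opened.
    let outcomes = [
        FileOutcome::FullyScanned,
        FileOutcome::FullyScanned, // pruned file: empty stream still drains to None
        FileOutcome::LimitHitMidFile,
        FileOutcome::NeverOpened,
    ];
    let (processed, scanned) = outcomes
        .iter()
        .map(count)
        .fold((0, 0), |(p, s), (dp, ds)| (p + dp, s + ds));
    assert_eq!(processed, outcomes.len()); // invariant: equals total partition files
    println!("processed={processed} scanned={scanned}");
}
```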

Test plan

  • Existing file_stream tests pass (8/8)
  • cargo check -p datafusion-datasource compiles cleanly

🤖 Generated with Claude Code

Track file-level progress in FileStream with two new counters:
- files_opened: incremented when a file is successfully opened
- files_scanned: incremented when a file's reader stream is fully consumed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the datasource Changes to the datasource crate label Feb 27, 2026
adriangb and others added 2 commits February 27, 2026 12:11
Rename `files_opened` metric to `files_processed` so it reflects
all files assigned to the partition, not just those that were opened.
When a LIMIT terminates the stream early, the remaining files
(including any prefetched next file) are counted so that
`files_processed` always equals the total partition file count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@adriangb adriangb changed the title Add files_opened and files_scanned metrics to FileStreamMetrics Add files_processed and files_scanned metrics to FileStreamMetrics Mar 1, 2026
