Skip to content

Fix duplicate filtering path in Arrow task batches#3448

Open
GayathriSrividya wants to merge 2 commits into
apache:mainfrom
GayathriSrividya:fix/issue-3272-task-to-record-batches-segfault
Open

Fix duplicate filtering path in Arrow task batches#3448
GayathriSrividya wants to merge 2 commits into
apache:mainfrom
GayathriSrividya:fix/issue-3272-task-to-record-batches-segfault

Conversation

@GayathriSrividya
Copy link
Copy Markdown

Closes #3272

What this changes

This PR updates the Arrow scan path in _task_to_record_batches to avoid redundant filtering when there are no positional deletes.

  • Keeps predicate pushdown in Scanner.from_fragment as the only filter path when positional_deletes is absent.
  • Applies current_batch.filter(pyarrow_filter) only in the positional-delete path, after deletes are applied.
  • Preserves empty-batch handling after both delete application and conditional filtering.

Why

The previous flow could perform an extra table-level refilter even when the scanner already applied the predicate. This change removes that stale workaround path while keeping correct behavior for positional delete scenarios.

Tests

Added regression coverage in tests/io/test_pyarrow.py:

  • test_task_to_record_batches_filter_without_positional_deletes_avoids_table_refilter
  • test_task_to_record_batches_filter_with_positional_deletes_handles_empty_batch

Validated locally:

  • python -m pytest tests/io/test_pyarrow.py -q -k "task_to_record_batches_nanos or filter_without_positional_deletes_avoids_table_refilter or filter_with_positional_deletes_handles_empty_batch"
  • make lint

Comment thread tests/io/test_pyarrow.py Outdated
Replace the two weak tests that passed on the unfixed code with proper
regression coverage:

1. test_task_to_record_batches_does_not_use_table_filter_without_positional_deletes
   Replaces pyarrow.Table with a sentinel class whose from_batches raises
   AssertionError.  A custom metaclass preserves isinstance checks so the
   rest of the code-path is unaffected.  With the fix the sentinel is never
   reached; without the fix it fires immediately, detecting the old
   pa.Table.from_batches / filter / combine_chunks / to_batches[0] path
   that caused the SIGSEGV on Apple Silicon.

2. test_task_to_record_batches_filter_applied_after_positional_deletes
   Uses data [1,2,3,4,5] with positional deletes on positions 1 and 3
   (removing values 2 and 4), then applies filter id > 2.  Expected
   result [3, 5] would be wrong if either deletes or the subsequent
   filter were skipped.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

_task_to_record_batches stale PyArrow workaround causes SIGSEGV on Apple Silicon

2 participants