[SPARK-55885][SQL] Optimize vectorized Parquet boolean reading with lookup-table expansion and batch buffer reads #54818
LuciferYang wants to merge 3 commits into apache:master
Conversation
Testing first; after the benchmark results are supplemented, the code unrelated to this PR will be reverted.
This reverts commit 35dd829.
dongjoon-hyun left a comment
+1, LGTM. Thank you, @LuciferYang .
BTW, @LuciferYang. Although I understand why you used the term, spark/.github/PULL_REQUEST_TEMPLATE lines 55 to 56 in 7eef6f7
Anyway, merged to master~
Thank you for your correction.
What changes were proposed in this pull request?
This PR optimizes the vectorized Parquet plain boolean reading path in two ways:
- Lookup-table-based bit expansion: Replace 8 individual byte writes per packed boolean byte with a single 64-bit `Platform.putLong` write, using a precomputed 256-entry lookup table (`BOOL_BYTE_TO_LONG_TABLE`) that expands each bit into a separate byte within a `long`. Big-endian platforms are handled via `Long.reverseBytes()`.
- Batch buffer reads: Replace per-byte `in.read()` calls in `VectorizedPlainValuesReader.readBooleans` with a single `getBuffer(fullBytes)` call (backed by `ByteBufferInputStream.slice`), reducing I/O overhead from ~N/8 individual read calls to one bulk acquisition per batch.
Why are the changes needed?
In the current implementation, reading a batch of N boolean values from Parquet requires:
- ~N/8 `in.read()` calls (each going through `ByteBufferInputStream`)
- ~N individual byte writes (`Platform.putByte` calls)
For a typical batch size of 4096, this means ~512 `in.read()` calls and ~4096 individual byte writes. The optimized path reduces this to:
- 1 `getBuffer()` call for the entire batch
- ~N/8 `Platform.putLong` writes (one per packed byte)
This is a meaningful improvement on the hot path of Parquet boolean column scanning.
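The lookup-table idea described above can be reproduced outside Spark. The sketch below builds the 256-entry table exactly as described (bit k of each packed byte becomes byte k of a `long`) but writes through a little-endian `ByteBuffer` instead of `Platform.putLong`; that substitution is illustrative only, and it also sidesteps the `Long.reverseBytes()` big-endian fix-up because `ByteBuffer` handles byte order explicitly. Apart from `BOOL_BYTE_TO_LONG_TABLE`, all class and method names here are hypothetical, not Spark's.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BoolExpand {
    // 256-entry table: entry b is a long whose k-th (least-significant-first)
    // byte holds bit k of b, so one 8-byte store expands 8 packed booleans.
    static final long[] BOOL_BYTE_TO_LONG_TABLE = new long[256];
    static {
        for (int b = 0; b < 256; b++) {
            long v = 0L;
            for (int bit = 0; bit < 8; bit++) {
                v |= ((long) ((b >>> bit) & 1)) << (8 * bit);
            }
            BOOL_BYTE_TO_LONG_TABLE[b] = v;
        }
    }

    // Expand packed booleans (LSB-first within each byte, as in Parquet PLAIN
    // encoding) into one 0/1 byte per value.
    static byte[] expand(byte[] packed, int numValues) {
        ByteBuffer buf = ByteBuffer.allocate(packed.length * 8)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (byte p : packed) {
            buf.putLong(BOOL_BYTE_TO_LONG_TABLE[p & 0xFF]);  // 8 values per store
        }
        byte[] out = new byte[numValues];
        System.arraycopy(buf.array(), 0, out, 0, numValues);
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = { (byte) 0b00000011 };  // bits 0 and 1 set
        byte[] vals = expand(packed, 8);
        StringBuilder sb = new StringBuilder();
        for (byte v : vals) sb.append(v);
        System.out.println(sb);  // 11000000
    }
}
```

Note that `BOOL_BYTE_TO_LONG_TABLE[0xFF]` is `0x0101010101010101L`: every bit set expands to a byte of value 1, which is why a single lookup plus one long store replaces eight conditional byte writes.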
Does this PR introduce any user-facing change?
No.
How was this patch tested?
ColumnVectorSuite, plus a micro-benchmark.
Run build/sbt "sql/Test/runMain org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReaderBenchmark" to conduct the test.
Benchmark results:
In the micro-benchmark, the new implementation shows more than a 3x latency improvement over the old one.
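Independently of Spark's benchmark harness, the bulk-read half of the change (replacing ~N/8 per-byte `in.read()` calls with one `getBuffer`-style acquisition) can be sketched with plain java.io streams. The class and method names below are hypothetical stand-ins, not Spark's API, and `ByteArrayInputStream` stands in for `ByteBufferInputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class BatchRead {
    // Old path: one stream call per packed byte (~N/8 calls for N booleans).
    static byte[] readPerByte(InputStream in, int fullBytes) throws IOException {
        byte[] packed = new byte[fullBytes];
        for (int i = 0; i < fullBytes; i++) {
            int b = in.read();           // one call per byte
            if (b < 0) throw new EOFException();
            packed[i] = (byte) b;
        }
        return packed;
    }

    // New path: one bulk acquisition for the whole batch, looping only to
    // cover short reads (mirrors the single getBuffer(fullBytes) call).
    static byte[] readBatch(InputStream in, int fullBytes) throws IOException {
        byte[] packed = new byte[fullBytes];
        int off = 0;
        while (off < fullBytes) {
            int n = in.read(packed, off, fullBytes - off);
            if (n < 0) throw new EOFException();
            off += n;
        }
        return packed;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[512];     // 4096 booleans, packed 8 per byte
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        byte[] a = readPerByte(new ByteArrayInputStream(data), data.length);
        byte[] b = readBatch(new ByteArrayInputStream(data), data.length);
        System.out.println(java.util.Arrays.equals(a, b));  // true
    }
}
```

Both paths return identical bytes; the difference is purely in the number of stream calls, which is what the benchmark above is measuring on the decode hot path.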
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.6