Skip to content

Conversation

@lyne7-sc
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

The current hex function implementation uses format! macro and StringArray::from iterator pattern, which causes:

  1. Per-element String allocations: Each value allocates a new String via format!
  2. Unnecessary conversions: Multiple intermediate conversions between types
  3. Inefficient dictionary type handling: Collects all values into vectors before building the result

What changes are included in this PR?

This PR optimizes the hex encoding by:

  • Replacing format!("{num:X}") with a fast lookup table approach
  • Building results directly using StringBuilder
  • Reusing a single vec buffer per iteration to avoid re-allocation
  • Optimizing dictionary array handling by building results iteratively

Benchmark Results

Group Size Before After Speedup
hex_binary 1024 89.9 µs 51.9 µs 1.73x
hex_binary 4096 385.2 µs 218.7 µs 1.76x
hex_binary 8192 741.6 µs 451.6 µs 1.64x
hex_int64 1024 32.0 µs 12.4 µs 2.57x
hex_int64 4096 132.4 µs 59.7 µs 2.22x
hex_int64 8192 258.5 µs 120.6 µs 2.14x
hex_int64_dict 1024 75.2 µs 12.4 µs 6.04x
hex_int64_dict 4096 313.2 µs 60.5 µs 5.18x
hex_int64_dict 8192 614.7 µs 129.0 µs 4.76x
hex_utf8 1024 88.5 µs 53.5 µs 1.66x
hex_utf8 4096 357.6 µs 211.1 µs 1.69x
hex_utf8 8192 698.7 µs 424.8 µs 1.64x

Are these changes tested?

Yes. Existing units and sqllogictest tests pass. New benchmarks added.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the spark label Jan 11, 2026
let hex_string = hex_encode(bytes, lowercase);
Ok(hex_string)
/// Generic hex encoding for int64 type
fn hex_encode_int64<I>(iter: I, len: usize) -> Result<ColumnarValue, DataFusionError>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this doesnt need to be generic since its only used for int64 arrays

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Int64Array and Dictionary<Int64> paths need hex encoding and yield different iterator types. Perhaps using a generic iterator here could help share the same encoding logic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't see that dictionary used it like that. However that made me realise this edge case:

#[test]
fn test_dict_values_null() {
    let keys = Int32Array::from(vec![Some(0), None, Some(1)]);
    let vals = Int64Array::from(vec![Some(32), None]);
    // [32, null, null]
    let dict = DictionaryArray::new(keys, Arc::new(vals));

    let columnar_value = ColumnarValue::Array(Arc::new(dict));
    let result = super::spark_hex(&[columnar_value]).unwrap();

    let result = match result {
        ColumnarValue::Array(array) => array,
        _ => panic!("Expected array"),
    };

    let result = as_string_array(&result);
    let expected = StringArray::from(vec![Some("20"), None, None]);

    assert_eq!(&expected, result);
}
  • When both key & value arrays have nulls

On main this succeeds because the 2nd element is null from the key (aka from dictionary) but 3rd element is null from the underlying values array.

On this PR it seems it will ignore any nulls in the values and treat it as non-null.

See doc on dictionaries: https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, I've refactored the dictionary handling to explicitly check the validity bitmap of the values array using is_valid.
And I've also incorporated your test case to ensure this edge case is covered. Thanks for the detailed explanation and the doc link!

for v in iter {
if let Some(num) = v {
buffer.clear();
hex_int64(num, &mut buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a little interesting how we have this buffer across iterations here, but then inside hex_int64 we have another buffer, which is used to copy into this outer buffer. Can we unify them somehow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I have refactored the code to eliminate one redundant memory copy by unifying the buffer logic. Benchmarks show a significant performance boost of approximately 14-37%. Thanks for catching that.

group                       before                                after
-----                       -------                                -------
hex_int64/size=1024         1.37     10.6±1.49µs        ? ?/sec    1.00      7.8±0.32µs        ? ?/sec
hex_int64/size=4096         1.19     46.1±1.78µs        ? ?/sec    1.00     38.8±1.00µs        ? ?/sec
hex_int64/size=8192         1.14     97.9±3.75µs        ? ?/sec    1.00     85.8±3.83µs        ? ?/sec
hex_int64_dict/size=1024    1.34     11.0±0.54µs        ? ?/sec    1.00      8.2±0.29µs        ? ?/sec
hex_int64_dict/size=4096    1.24     51.1±6.34µs        ? ?/sec    1.00     41.2±2.05µs        ? ?/sec
hex_int64_dict/size=8192    1.18    104.2±3.37µs        ? ?/sec    1.00     88.4±2.19µs        ? ?/sec

Comment on lines 251 to 263
let int_values = as_int64_array(values)?;
hex_encode_int64(
keys.iter().map(|k| {
k.and_then(|idx| {
if int_values.is_valid(idx as usize) {
Some(int_values.value(idx as usize))
} else {
None
}
})
}),
dict.len(),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More compact way of iterating over dictionaries using a TypedDictionaryArray:

let arr =
    dict.downcast_dict::<arrow::array::Int64Array>().unwrap();
hex_encode_int64(arr.into_iter(), dict.len())

(Alternatively could look into preserving the dictionary type as an output; so we can apply the hex only on the values and return a dict with the same keys, instead of expanding this dict into a regular string array)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I’ve updated the implementation to use TypedDictionaryArray first. Preserving the dictionary output seems to require some refactoring and test updates, so I’ll try that next and follow up with another PR if there’s an improvement. Does that sound good?

@Jefffrey Jefffrey added this pull request to the merge queue Jan 14, 2026
Merged via the queue into apache:main with commit 1d19c52 Jan 14, 2026
31 checks passed
@Jefffrey
Copy link
Contributor

Thanks @lyne7-sc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants