-
Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: improve performance of spark hex function
#19738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| let hex_string = hex_encode(bytes, lowercase); | ||
| Ok(hex_string) | ||
| /// Generic hex encoding for int64 type | ||
| fn hex_encode_int64<I>(iter: I, len: usize) -> Result<ColumnarValue, DataFusionError> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this doesnt need to be generic since its only used for int64 arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both Int64Array and Dictionary<Int64> paths need hex encoding and yield different iterator types. Perhaps using a generic iterator here could help share the same encoding logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I didn't see that dictionary used it like that. However that made me realise this edge case:
#[test]
fn test_dict_values_null() {
let keys = Int32Array::from(vec![Some(0), None, Some(1)]);
let vals = Int64Array::from(vec![Some(32), None]);
// [32, null, null]
let dict = DictionaryArray::new(keys, Arc::new(vals));
let columnar_value = ColumnarValue::Array(Arc::new(dict));
let result = super::spark_hex(&[columnar_value]).unwrap();
let result = match result {
ColumnarValue::Array(array) => array,
_ => panic!("Expected array"),
};
let result = as_string_array(&result);
let expected = StringArray::from(vec![Some("20"), None, None]);
assert_eq!(&expected, result);
}- When both key & value arrays have nulls
On main this succeeds because the 2nd element is null from the key (aka from dictionary) but 3rd element is null from the underlying values array.
On this PR it seems it will ignore any nulls in the values and treat it as non-null.
See doc on dictionaries: https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, I've refactored the dictionary handling to explicitly check the validity bitmap of the values array using is_valid.
And I've also incorporated your test case to ensure this edge case is covered. Thanks for the detailed explanation and the doc link!
| for v in iter { | ||
| if let Some(num) = v { | ||
| buffer.clear(); | ||
| hex_int64(num, &mut buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a little interesting how we have this buffer across iterations here, but then inside hex_int64 we have another buffer, which is used to copy into this outer buffer. Can we unify them somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I have refactored the code to eliminate one redundant memory copy by unifying the buffer logic. Benchmarks show a significant performance boost of approximately 14-37%. Thanks for catching that.
group before after
----- ------- -------
hex_int64/size=1024 1.37 10.6±1.49µs ? ?/sec 1.00 7.8±0.32µs ? ?/sec
hex_int64/size=4096 1.19 46.1±1.78µs ? ?/sec 1.00 38.8±1.00µs ? ?/sec
hex_int64/size=8192 1.14 97.9±3.75µs ? ?/sec 1.00 85.8±3.83µs ? ?/sec
hex_int64_dict/size=1024 1.34 11.0±0.54µs ? ?/sec 1.00 8.2±0.29µs ? ?/sec
hex_int64_dict/size=4096 1.24 51.1±6.34µs ? ?/sec 1.00 41.2±2.05µs ? ?/sec
hex_int64_dict/size=8192 1.18 104.2±3.37µs ? ?/sec 1.00 88.4±2.19µs ? ?/sec
| let int_values = as_int64_array(values)?; | ||
| hex_encode_int64( | ||
| keys.iter().map(|k| { | ||
| k.and_then(|idx| { | ||
| if int_values.is_valid(idx as usize) { | ||
| Some(int_values.value(idx as usize)) | ||
| } else { | ||
| None | ||
| } | ||
| }) | ||
| }), | ||
| dict.len(), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More compact way of iterating over dictionaries using a TypedDictionaryArray:
let arr =
dict.downcast_dict::<arrow::array::Int64Array>().unwrap();
hex_encode_int64(arr.into_iter(), dict.len())(Alternatively could look into preserving the dictionary type as an output; so we can apply the hex only on the values and return a dict with the same keys, instead of expanding this dict into a regular string array)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I’ve updated the implementation to use TypedDictionaryArray first. Preserving the dictionary output seems to require some refactoring and test updates, so I’ll try that next and follow up with another PR if there’s an improvement. Does that sound good?
|
Thanks @lyne7-sc |
Which issue does this PR close?
Rationale for this change
The current
hexfunction implementation usesformat!macro andStringArray::fromiterator pattern, which causes:Stringviaformat!What changes are included in this PR?
This PR optimizes the
hexencoding by:format!("{num:X}")with a fast lookup table approachStringBuilderBenchmark Results
Are these changes tested?
Yes. Existing units and sqllogictest tests pass. New benchmarks added.
Are there any user-facing changes?
No.