Update reverse UDF to emit utf8view when input is utf8view#20604
Update reverse UDF to emit utf8view when input is utf8view#20604Omega359 wants to merge 4 commits intoapache:mainfrom
Conversation
|
|
||
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | ||
| utf8_to_str_type(&arg_types[0], "reverse") | ||
| if arg_types[0] == Utf8View { |
There was a problem hiding this comment.
Once all the functions that use the utf8_to_str_type function are updated to emit utf8view appropriately a followup pr to update that function to include utf8view -> utf8view would clean this up.
There was a problem hiding this comment.
Alternatively just
Ok(arg_types[0].clone())|
Moved back to draft to switch StringArrayBuilderType to StringLikeArrayBuilder |
|
Ready for review again. |
|
Hey @Omega359, Maybe I'm missing something, but what is the actual advantage of returning a For functions like I'm happy to be corrected if there's something I'm misunderstanding here. |
In a word: consistency. What you outline is true if you look at the function in isolation. For my use case where I use DataFusion in an ETL environment it's rarely just one function being applied to a column, rather it's a chain of them. I've been attempting to enabled Utf8View in my app for the better part of the last month and I'm continuing to encounter issues where df calls are emitting utf8view in one case and utf8 in another. Works fine till you try and merge the schemas and then .. boom. Finding all of the instances where something is consuming utf8view and emitting utf8 and throwing a cast after it is tiresome and expensive both in my time and actual test runs. |
|
@Omega359 Ah, that does make sense. So, if I understand correctly, the primary pain point is that you can wind up in situations where you've got two dataframes with the same logical schema, but different physical representations (e.g. I think this points to a more fundamental tension between physical and logical types I think is worth addressing in a higher-level discussion. I'll have a look around to see if the topic has come up before. |
|
|
||
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | ||
| utf8_to_str_type(&arg_types[0], "reverse") | ||
| if arg_types[0] == Utf8View { |
There was a problem hiding this comment.
Alternatively just
Ok(arg_types[0].clone())| @@ -107,20 +113,38 @@ impl ScalarUDFImpl for ReverseFunc { | |||
| /// Reverses the order of the characters in the string `reverse('abcde') = 'edcba'`. | |||
| /// The implementation uses UTF-8 code points as characters | |||
| fn reverse<T: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> { | |||
There was a problem hiding this comment.
We probably should take this opportunity to refactor the T generic out of reverse here, since it reads a bit confusing to match on datatype in invoke_with_args to determine the T for reverse but then we again match inside reverse anyway.
I am not 100% certain where the issue lies but I don't think I've seen it at the logical layer at all, nor with union. I think it's that I'm merging into a single table files that happen to have been written out with slightly different schemas resulting in |
Which issue does this PR close?
Part of #20585
Rationale for this change
Functions ideally should emit strings in the same format as the input and previously the reverse function was emitting using utf8 for input that was in utf8view.
What changes are included in this PR?
Code, tests.
Are these changes tested?
Yes.
Are there any user-facing changes?