Skip to content

Update reverse UDF to emit utf8view when input is utf8view#20604

Open
Omega359 wants to merge 4 commits intoapache:mainfrom
Omega359:reverse_utf8view
Open

Update reverse UDF to emit utf8view when input is utf8view#20604
Omega359 wants to merge 4 commits intoapache:mainfrom
Omega359:reverse_utf8view

Conversation

@Omega359
Copy link
Contributor

Which issue does this PR close?

Part of #20585

Rationale for this change

Functions ideally should emit strings in the same format as the input and previously the reverse function was emitting using utf8 for input that was in utf8view.

What changes are included in this PR?

Code, tests.

Are these changes tested?

Yes.

Are there any user-facing changes?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Feb 27, 2026
@Omega359 Omega359 marked this pull request as ready for review February 27, 2026 16:13

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
utf8_to_str_type(&arg_types[0], "reverse")
if arg_types[0] == Utf8View {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once all the functions that use the utf8_to_str_type function are updated to emit utf8view appropriately a followup pr to update that function to include utf8view -> utf8view would clean this up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively just

Ok(arg_types[0].clone())

@Omega359 Omega359 marked this pull request as draft February 27, 2026 19:47
@Omega359
Copy link
Contributor Author

Moved back to draft to switch StringArrayBuilderType to StringLikeArrayBuilder

@Omega359 Omega359 marked this pull request as ready for review February 27, 2026 22:08
@Omega359
Copy link
Contributor Author

Ready for review again.

@Rafferty97
Copy link
Contributor

Rafferty97 commented Mar 1, 2026

Hey @Omega359,

Maybe I'm missing something, but what is the actual advantage of returning a Utf8View from a UDF like reverse?

For functions like trim or substr it makes sense to return a Utf8View regardless of the input type, as it allows for reuse of the underlying string buffer, given that each output value is just a slice of existing string data. But for a function like reverse that creates brand new string values, surely returning Utf8View just adds extra overhead with no benefit?

I'm happy to be corrected if there's something I'm misunderstanding here.

@Omega359
Copy link
Contributor Author

Omega359 commented Mar 1, 2026

Maybe I'm missing something, but what is the actual advantage of returning a Utf8View from a UDF like reverse?

For functions like trim or substr it makes sense to return a Utf8View regardless of the input type, as it allows for reuse of the underlying string buffer, given that each output value is just a slice of existing string data. But for a function like reverse that creates brand new string values, surely returning Utf8View just adds extra overhead with no benefit?

In a word: consistency.

What you outline is true if you look at the function in isolation. For my use case where I use DataFusion in an ETL environment it's rarely just one function being applied to a column, rather it's a chain of them.

I've been attempting to enabled Utf8View in my app for the better part of the last month and I'm continuing to encounter issues where df calls are emitting utf8view in one case and utf8 in another. Works fine till you try and merge the schemas and then .. boom. Finding all of the instances where something is consuming utf8view and emitting utf8 and throwing a cast after it is tiresome and expensive both in my time and actual test runs.

@Rafferty97
Copy link
Contributor

@Omega359 Ah, that does make sense.

So, if I understand correctly, the primary pain point is that you can wind up in situations where you've got two dataframes with the same logical schema, but different physical representations (e.g. Utf8 v Uft8View), preventing them from being unioned together without some explicit casts?

I think this points to a more fundamental tension between physical and logical types I think is worth addressing in a higher-level discussion. I'll have a look around to see if the topic has come up before.


fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
utf8_to_str_type(&arg_types[0], "reverse")
if arg_types[0] == Utf8View {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively just

Ok(arg_types[0].clone())

@@ -107,20 +113,38 @@ impl ScalarUDFImpl for ReverseFunc {
/// Reverses the order of the characters in the string `reverse('abcde') = 'edcba'`.
/// The implementation uses UTF-8 code points as characters
fn reverse<T: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should take this opportunity to refactor the T generic out of reverse here, since it reads a bit confusing to match on datatype in invoke_with_args to determine the T for reverse but then we again match inside reverse anyway.

@Omega359
Copy link
Contributor Author

Omega359 commented Mar 2, 2026

@Omega359 Ah, that does make sense.

So, if I understand correctly, the primary pain point is that you can wind up in situations where you've got two dataframes with the same logical schema, but different physical representations (e.g. Utf8 v Uft8View), preventing them from being unioned together without some explicit casts?

I think this points to a more fundamental tension between physical and logical types I think is worth addressing in a higher-level discussion. I'll have a look around to see if the topic has come up before.

I am not 100% certain where the issue lies but I don't think I've seen it at the logical layer at all, nor with union. I think it's that I'm merging into a single table files that happen to have been written out with slightly different schemas resulting in Field::try_merge failing with Fail to merge schema field '<field name>' because the from data_type = Utf8 does not equal Utf8View.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants