feat: Support for StringSplit #2772

Open
Shekharrajak wants to merge 13 commits into apache:main from Shekharrajak:feature/add-string-split-support

Conversation

@Shekharrajak
Contributor

@Shekharrajak Shekharrajak commented Nov 13, 2025

Contributor

@comphead comphead left a comment

Thanks @Shekharrajak for your contribution. Please add a function to the fuzz testing kit, similar to #2755.

@mbutrovich
Contributor

mbutrovich commented Nov 13, 2025

In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?
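
As an illustration of the kind of inputs this request points at, here is a minimal sketch in plain Scala that uses only the JVM's `java.util.regex.Pattern`. The object name `SplitUtf8Samples`, the helper `jvmSplit`, and the sample strings are hypothetical and are not taken from the PR's test suite. It records what the JVM regex engine returns for multi-byte input, which is the baseline a native implementation would need to match.

```scala
import java.util.regex.Pattern

object SplitUtf8Samples {
  // Hypothetical sample inputs (input, regex); not copied from the PR.
  val samples: Seq[(String, String)] = Seq(
    ("a,b,c", ","),                // ASCII baseline
    ("日本語,中文,한국어", ","),     // code points that take 3 bytes in UTF-8
    ("😀|😃|😄", "\\|"),            // 4-byte UTF-8, surrogate pairs on the JVM
    ("e\u0301xe\u0301", "x")       // 'e' plus a combining accent: one grapheme, two code points
  )

  // Split with limit = -1 so trailing empty strings are kept,
  // which is also the default limit of Spark's split function.
  def jvmSplit(input: String, regex: String): Seq[String] =
    Pattern.compile(regex).split(input, -1).toSeq

  def main(args: Array[String]): Unit =
    samples.foreach { case (input, regex) =>
      val parts = jvmSplit(input, regex)
      val rendered = parts.mkString("[", ", ", "]")
      println(s"${parts.length} parts: $rendered")
    }
}
```

Comparing these JVM results against the native output for the same inputs is one way to surface engine differences; the sketch itself makes no claim about what the Rust side returns.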

@andygrove
Member

> In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?

We probably need to fall back to Spark unless this config is enabled:

  val COMET_REGEXP_ALLOW_INCOMPATIBLE: ConfigEntry[Boolean] =
    conf("spark.comet.regexp.allowIncompatible")
      .category(CATEGORY_EXEC)
      .doc("Comet is not currently fully compatible with Spark for all regular expressions. " +
        s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
      .booleanConf
      .createWithDefault(false)
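
The suggested gating can be sketched in plain Scala as follows. This is only an illustration of the decision logic, not Comet's actual serde code: the object `RegexpFallbackSketch`, the `Decision` types, and the `decide` helper are hypothetical, and a plain `Map[String, String]` stands in for the session configuration. Only the config key and its default of `false` come from the snippet above.

```scala
object RegexpFallbackSketch {
  // Key taken from the config definition quoted above; it defaults to false.
  val AllowIncompatibleKey = "spark.comet.regexp.allowIncompatible"

  // Hypothetical result type for the planning decision.
  sealed trait Decision
  case object RunNatively extends Decision
  final case class FallBackToSpark(reason: String) extends Decision

  // Hypothetical helper: `conf` stands in for the session's settings.
  // A missing key is treated as the default (false).
  def decide(conf: Map[String, String]): Decision = {
    val allowed = conf.get(AllowIncompatibleKey).exists(_.trim.equalsIgnoreCase("true"))
    if (allowed) RunNatively
    else FallBackToSpark(
      "Regexp-based split may not match Spark for all patterns. " +
        s"Set $AllowIncompatibleKey=true to allow it.")
  }
}
```

With this shape, `decide(Map.empty)` yields a `FallBackToSpark` carrying a reason string, and only an explicit `true` for the key yields `RunNatively`.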

@Shekharrajak
Contributor Author

> Thanks @Shekharrajak for your contribution. Please add a function to the fuzz testing kit, similar to #2755.

Thanks! Added in commit 8eddd29

@Shekharrajak
Contributor Author

> In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?

Added tests 987b646

@Shekharrajak
Contributor Author

> In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?
>
> We probably need to fall back to Spark unless this config is enabled:
>
>   val COMET_REGEXP_ALLOW_INCOMPATIBLE: ConfigEntry[Boolean] =
>     conf("spark.comet.regexp.allowIncompatible")
>       .category(CATEGORY_EXEC)
>       .doc("Comet is not currently fully compatible with Spark for all regular expressions. " +
>         s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
>       .booleanConf
>       .createWithDefault(false)

How can we check that it is not falling back to Spark's JVM execution? @andygrove
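
One possible way to approach this question is to look at the operator names in the executed plan. The sketch below is plain Scala and rests on a loudly stated assumption: that Comet's native operators have node names starting with `Comet` (for example `CometProject` in place of Spark's `Project`). The object `FallbackCheckSketch` and the helper `ranNatively` are hypothetical, and the node names are assumed to have been gathered from the plan beforehand.

```scala
object FallbackCheckSketch {
  // Assumption: a natively executed operator appears in the plan under
  // the Spark operator's name prefixed with "Comet".
  // `nodeNames` is assumed to be the list of operator names collected
  // from the executed physical plan.
  def ranNatively(nodeNames: Seq[String], sparkOperator: String): Boolean =
    nodeNames.contains("Comet" + sparkOperator) && !nodeNames.contains(sparkOperator)
}
```

For example, `ranNatively(Seq("CometProject", "CometScan"), "Project")` is true, while a plan that still contains a plain `Project` node returns false, which would indicate a fallback for that operator under the stated assumption.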

@wForget changed the title from "Support for StringSplit" to "feat: Support for StringSplit" on Nov 17, 2025
@Shekharrajak force-pushed the feature/add-string-split-support branch from dbb34d5 to 1f8f2b2 on November 17, 2025 18:52
@codecov-commenter

codecov-commenter commented Nov 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.03%. Comparing base (f09f8af) to head (e7b267b).
⚠️ Report is 918 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2772      +/-   ##
============================================
+ Coverage     56.12%   60.03%   +3.90%     
- Complexity      976     1428     +452     
============================================
  Files           119      170      +51     
  Lines         11743    15809    +4066     
  Branches       2251     2608     +357     
============================================
+ Hits           6591     9491    +2900     
- Misses         4012     5000     +988     
- Partials       1140     1318     +178     

@kazuyukitanimura
Contributor

Thanks @Shekharrajak.
It looks like there are Rust check failures:
https://github.com/apache/datafusion-comet/actions/runs/19441578149/job/55638326879?pr=2772

Perhaps you can try cargo fix?

@Shekharrajak
Contributor Author

> Perhaps you can try cargo fix?

I ran it, but I am not sure why the checks keep failing.

@Shekharrajak
Contributor Author

Please trigger the workflow.

@Shekharrajak
Contributor Author

All checks are looking fine. This is ready to merge.

@Shekharrajak force-pushed the feature/add-string-split-support branch from e7b267b to c0a22b8 on February 5, 2026 18:54
@Shekharrajak
Contributor Author

Updated the branch with the latest main. Please trigger the workflow.

Development

Successfully merging this pull request may close these issues.

Add support for StringSplit

7 participants