Skip to content

💥 Use Temporal Failures for Nexus Error Serialization#2773

Open
Quinn-With-Two-Ns wants to merge 8 commits intotemporalio:masterfrom
Quinn-With-Two-Ns:NEXUS-38
Open

💥 Use Temporal Failures for Nexus Error Serialization#2773
Quinn-With-Two-Ns wants to merge 8 commits intotemporalio:masterfrom
Quinn-With-Two-Ns:NEXUS-38

Conversation

@Quinn-With-Two-Ns
Copy link
Contributor

@Quinn-With-Two-Ns Quinn-With-Two-Ns commented Feb 5, 2026

💥 Use Temporal Failures for Nexus Error Serialization. The Java SDK now responds to Nexus operation failures and task failures with plain Temporal Failures. This change also allows OperationException and HandlerException to have their own message and stacktrace independent of their cause. If the Server does not support the new format the SDK converts back to the old format before sending the response.

Requires:


Note

Medium Risk
Touches Nexus error/failure serialization and server reply paths, which can affect cross-version interoperability and how failures are surfaced/attributed. Backward-compat fallback and expanded tests reduce risk but edge cases could still change failure shapes and metrics tags.

Overview
Switches Nexus task failure reporting to the new API format by sending plain Temporal Failure protos for both operation failures (StartOperationResponse.failure) and handler failures (RespondNexusTaskFailedRequest.failure), enabling independent message/stacktrace on OperationException/HandlerException.

Adds compatibility logic to downgrade to the old HandlerError/UnsuccessfulOperationError format when the server doesn’t advertise Request.Capabilities.temporal_failure_responses (or when forced via temporal.nexus.forceOldFailureFormat). Updates NexusUtil and DefaultFailureConverter to round-trip Temporal failures through Nexus metadata/details and preserve stack traces.

Test server and SDK tests are updated/expanded to handle both formats (including forced-old-format suites) and validate metrics tagging for failed vs canceled operations; CI bumps the Temporal CLI version and build config moves nexusVersion to 0.5.0-SNAPSHOT (plus mavenLocal() for resolution).

Written by Cursor Bugbot for commit 9545fd2. This will update automatically on new commits. Configure here.

@Quinn-With-Two-Ns Quinn-With-Two-Ns requested a review from a team as a code owner February 5, 2026 22:21
@Quinn-With-Two-Ns Quinn-With-Two-Ns marked this pull request as draft February 5, 2026 22:22
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b2a0c78ca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}
return Failure.newBuilder()
.setMessage(failure.getMessage())
.setMessage(temporalFailure.getMessage())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated failure serialization logic in NexusUtil

Low Severity

temporalFailureToNexusFailure and temporalFailureToNexusFailureInfo share nearly identical logic for serializing a Temporal failure to JSON. Both call PROTO_JSON_PRINTER.print() with the same arguments, handle InvalidProtocolBufferException identically, and set the same metadata. Only the return type differs (Failure vs FailureInfo). The JSON serialization logic could be extracted to a shared helper.

Fix in Cursor Fix in Web

@Quinn-With-Two-Ns Quinn-With-Two-Ns marked this pull request as ready for review February 19, 2026 00:27
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 349588549b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

return new HandlerException(info.getType(), cause, retryBehavior);
if (failure
.getMessage()
.startsWith(String.format("handler error (%s)", info.getType()))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this is here to allow proper decoding of legacy errors, but for which languages exactly? If we mean to support legacy errors coming from Go or others, are we sure the casing returned by .getType() here would match the error type of other SDKs?, or should we make this condition a little bit more resilient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any old failure serialized by Go or Java. This is the handler error type so that should match the Nexus spec and consistent across Go and Java

}

@SuppressWarnings("deprecation")
private void sendReply(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure of this function is very hard to reason about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this, are you still concerned about the structure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That one is certainly the most patent culprit, but I'd also try to restructure this to avoid mixing happy paths and failure cases.

Like, instead of:

      if (taskResponse != null) {
        if (!supportTemporalFailure && taskResponse.getStartOperation().hasFailure()) {   
            // some failure case on old server
            reuturn
        }
        // happy path AND some failure case on newer servers
      }
      // other failure cases

I'd suggest restructuring the function along the line of:

if (taskResponse != null && !taskResponse.getStartOperation().hasFailure()) {
    // Operation succeeded

} else if (taskResponse != null) {
    // Operation failed - OperationError

} else if (response.getHandlerException()) {
    // Operation failed - HandlerError

} else {
    // throw ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that will really be clearer since we use the same gRPC method to respond to the server ion the first and second case here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an attempt to refactor to make it clearer let me push and we can discuss

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.


allprojects {
repositories {
mavenLocal()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SNAPSHOT dependency and mavenLocal not suitable for merge

Medium Severity

mavenLocal() was added to the allprojects repositories block alongside nexusVersion = '0.5.0-SNAPSHOT'. Both are development-time configurations — mavenLocal() can cause non-reproducible builds by resolving artifacts from the local cache, and SNAPSHOT dependencies are inherently unstable. These need to be replaced with a released version before merging.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link

@VegetarianOrc VegetarianOrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple of minor things

String details;
try {
details = JSON_PRINTER.print(failure.toBuilder().setMessage("").build());
details = PROTO_JSON_PRINTER.print(temporalFailure.toBuilder().setMessage("").build());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Go & Python, we're clearing the stack trace out of the details as well as the message. Seems like it'd make sense to do it here as well.

HandlerError.newBuilder()
.setErrorType(e.getErrorType().toString())
.setRetryBehavior(mapRetryBehavior(e.getRetryBehavior()));
// TODO: check if this works on old server

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we checked this? If so, let's remove this todo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants