8342977: [lworld] skynet fail intermittently #2283

Closed
marc-chevalier wants to merge 5 commits into openjdk:lworld from marc-chevalier:JDK-8342977

Conversation


@marc-chevalier marc-chevalier commented Mar 31, 2026

This is going to take a while to explain.

Patricio did quite a bit of investigation, which is worth reading in a comment of JDK-8342977. Even if the conclusion doesn't look quite correct to me (though that may be a difference of point of view), it was very useful.

I'm going to summarize the important elements for people who are not familiar with this test or the queues involved, and then dive into the story.

The Test

The important part of the test is here:

static void skynet(Channel<Long> result, int num, int size, int div) {
    if (size == 1) {
        result.send((long)num);
    } else {
        var chan = new Channel<Long>();
        for (int i = 0; i < div; i++) {
            int subNum = num + i * (size / div);
            Thread.startVirtualThread(() -> skynet(chan, subNum, size / div, div));
        }
        long sum = 0;
        for (int i = 0; i < div; i++) {
            sum += chan.receive();
        }
        result.send(sum);
    }
}

The method skynet calls itself recursively through virtual threads. The leaves each send a Long on a custom Channel<Long> (backed by a SynchronousQueue<Long>); each parent receives these numbers, sums them, and sends the sum upward. Eventually, only the total sum is left. The expected result is thus constant. The intermittent failure is a result that is wrong by a small amount.

Let's give a bit of context about these queues for those who are unfamiliar with them (like me before this issue). Do not take this explanation as 100% exact! It is probably just good enough to understand this bug, but it doesn't pretend to be perfectly correct or exhaustive. These queues can keep two kinds of items in them: either elements, or requests. When some thread is trying to receive an object and the queue contains some objects, an object is "matched" and given to the requesting thread. But if the queue has no elements, then the request is enqueued instead, and the thread is made to wait. When an object is sent on the queue, if a request is already there, the element is given to the requesting thread; otherwise, the element is added to the queue. Clearly, a queue should contain only requests or only objects (ignoring matched items that are still in the queue and removed lazily for performance reasons).

The queue we have here is an extreme case of that: it has no capacity; it just matches threads offering a value with threads wanting to receive one.
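As a quick illustration of that rendezvous behavior, here is a minimal standalone sketch (not code from the test itself): put blocks until a matching take arrives, and vice versa.

```java
import java.util.concurrent.SynchronousQueue;

// Minimal demo of the zero-capacity rendezvous: put() blocks until a
// matching take() arrives, and vice versa.
public class RendezvousDemo {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<Long> q = new SynchronousQueue<>();
        Thread sender = new Thread(() -> {
            try {
                q.put(42L);   // blocks here until someone takes
            } catch (InterruptedException ignored) {}
        });
        sender.start();
        long v = q.take();    // matches the pending put
        sender.join();
        System.out.println(v);
    }
}
```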

These queues have many good properties of lock-freedom, performance and so on, but in our context, the interesting thing is that the main operations (putting an object, or taking one) can be implemented with a single function, here called xfer. The implementation of put(e) calls (something like) xfer(e), and take() calls xfer(null). xfer(e) should return null unless the putting thread was interrupted (then it returns e, I think, but I don't think it is specified). Similarly, xfer(null) should not return null unless the taking thread was interrupted.

See

public void put(E e) throws InterruptedException {
    Objects.requireNonNull(e);
    if (!Thread.interrupted()) {
        if (xfer(e, Long.MAX_VALUE) == null)
            return;
        Thread.interrupted(); // failure possible only due to interrupt
    }
    throw new InterruptedException();
}

and

public E take() throws InterruptedException {
    Object e;
    if (!Thread.interrupted()) {
        if ((e = xfer(null, Long.MAX_VALUE)) != null)
            return (E) e;
        Thread.interrupted();
    }
    throw new InterruptedException();
}

In case of interruption, the implementation of the custom queue in the test simply retries until it works (receive is similar but less relevant here, so let's save space):

void send(T e) {
    boolean interrupted = false;
    while (true) {
        try {
            q.put(e);
            break;
        } catch (InterruptedException x) {
            interrupted = true;
        }
    }
    if (interrupted)
        Thread.currentThread().interrupt();
}

Useful Scriptures: the Matter of Patricio

What Patricio discovered is that the call q.put(e); throws InterruptedException, the loop loops, and we try to put again. But nothing in this test interrupts any thread! Something is wrong here. What happens is that the put adds an element to the queue but fails to report its success, so we enqueue the element again, and the same element is then given twice to receiving threads, instead of another element. At the end of the program, we have enqueued more elements than we received, so some virtual threads must still be blocked on their own send, waiting for a receiver, while the main thread has already gathered its result and is terminating. Still according to Patricio, put throws InterruptedException because instead of returning null, xfer(e) returns e. This happens only when xfer is C2-compiled, and SynchronousQueue$Transferer::xferLifo and LinkedTransferQueue$DualNode::await are inlined:

final Object xferLifo(Object e, long ns) {

final Object await(Object e, long ns, Object blocker, boolean spin) {

It also seems that in xferLifo, after the call to await, e is returned, and e is replaced by null (which should not happen). Somehow, await behaves as if it were taking instead of putting, which leads to the wrong conclusion that the thread was interrupted.

The Argument of the Tragedy

I'm not sure we need SynchronousQueue::xfer itself to be compiled (it's really small and I didn't try to spare it), but we do need SynchronousQueue$Transferer::xferLifo to be compiled, with LinkedTransferQueue$DualNode::<init> and LinkedTransferQueue$DualNode::await inlined into it. In the end, something like

794   42       4       java.util.concurrent.SynchronousQueue::xfer (31 bytes)
                              @ 27   java.util.concurrent.SynchronousQueue$Transferer::xferLifo (258 bytes)   inline (hot)
                                @ 65   java.util.concurrent.LinkedTransferQueue::cmpExHead (10 bytes)   failed to inline: disallowed by CompileCommand
                                @ 101   java.util.concurrent.LinkedTransferQueue$DualNode::cmpExItem (10 bytes)   failed to inline: disallowed by CompileCommand
                                @ 133   java.util.concurrent.LinkedTransferQueue::cmpExHead (10 bytes)   failed to inline: disallowed by CompileCommand
                                @ 139   java.util.concurrent.locks.LockSupport::unpark (31 bytes)   failed to inline: disallowed by CompileCommand
                                @ 172   java.util.concurrent.LinkedTransferQueue$DualNode::<init> (18 bytes)   inline (hot)
                                  @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
                                  @ 9   java.lang.invoke.VarHandleGuards::guard_LL_V (123 bytes)   failed to inline: disallowed by CompileCommand
                                @ 191   java.util.concurrent.LinkedTransferQueue::cmpExHead (10 bytes)   failed to inline: disallowed by CompileCommand
               !                @ 223   java.util.concurrent.LinkedTransferQueue$DualNode::await (252 bytes)   force inline by CompileCommand
                                  @ 36   java.lang.Thread::currentThread (0 bytes)   (intrinsic)
                                  @ 46   java.util.concurrent.ForkJoinWorkerThread::hasKnownQueuedWork (110 bytes)   failed to inline: disallowed by CompileCommand
                                  @ 104   java.lang.Thread::onSpinWait (1 bytes)   (intrinsic)
                                  @ 118   java.util.concurrent.LinkedTransferQueue$DualNode::checkForUniprocessor (37 bytes)   failed to inline: too big
                                  @ 123   java.util.concurrent.locks.LockSupport::setCurrentBlocker (14 bytes)   failed to inline: disallowed by CompileCommand
                                  @ 132   java.lang.invoke.VarHandle::fullFence (7 bytes)   failed to inline: disallowed by CompileCommand
                                  @ 140   java.lang.VirtualThread::isInterrupted (5 bytes)   accessor   callee changed to  java.util.concurrent.LinkedTransferQueue$DualNode::await (252 bytes)    \\-> TypeProfile (10192/10192 counts) = java/lang/VirtualThread
                                  @ 229   java.util.concurrent.locks.LockSupport::park (29 bytes)   failed to inline: disallowed by CompileCommand
                                  @ 241   java.util.concurrent.locks.LockSupport::setCurrentBlocker (14 bytes)   failed to inline: disallowed by CompileCommand
                                @ 236   java.util.concurrent.SynchronousQueue$Transferer::unspliceLifo (131 bytes)   failed to inline: too big
                                @ 249   java.util.concurrent.LinkedTransferQueue$DualNode::selfLinkItem (9 bytes)   failed to inline: disallowed by CompileCommand

is enough, with xfer being the only compiled method. Even better: inlining less seems to increase the likelihood of failure (from ~3% with a simple CompileOnly plus inline directives for xferLifo and await, to ~12% with dontinline everywhere)!

In xferLifo, we have

if (s == null)                      // try to push node and wait
    s = new DualNode(e, haveData);
s.next = p;
if (p == (p = cmpExHead(p, s))) {
    if ((m = s.await(e, ns, this,   // spin if (nearly) empty
                     p == null || p.waiter == null)) == e)

The constructor of DualNode is not very interesting

DualNode(Object item, boolean isData) {
    ITEM.set(this, item); // relaxed write before publication
    this.isData = isData;
}

but here, e is profiled as always null (in the failing cases). That is correct so far: e (or item in the ctor) is an unknown Object whose speculative type is null.

Thanks to inlining, this is propagated into await

final Object await(Object e, long ns, Object blocker, boolean spin) {
    Object m;                              // the match or e if none
    boolean timed = (ns != Long.MAX_VALUE);
    long deadline = (timed) ? System.nanoTime() + ns : 0L;
    boolean upc = isUniprocessor;          // don't spin but later recheck
    Thread w = Thread.currentThread();
    if (spin && ForkJoinWorkerThread.hasKnownQueuedWork())
        spin = false;                      // don't spin
    int spins = (spin & !upc) ? SPINS : 0; // negative when may park
    while ((m = item) == e) {

Before the head of the loop, e can indeed be non-null at runtime, despite the profile. Then comes the loop condition, item == e: the now (to me) famous acmp. In the Valhalla world, acmp is hard. Here, we end up doing something like

// pointer equality
if item == e then equal
// e is speculated to be null, but then, item must be null to be equal to e, and thus would have passed already
else if e == null then not_equal
else trap
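To make the unsoundness concrete, here is a plain-Java model (hypothetical illustration, not compiler code) of what substituting null for e in the "equal" branch computes, compared to the correct value:

```java
// Models the broken branch: after "item == e" succeeds, later uses of e
// were replaced by null, which is only valid when item is itself null.
public class AcmpModel {
    static Object broken(Object item, Object e) {
        if (item == e) {
            Object eAssumedNull = null;  // the wrong substitution
            return eAssumedNull;
        }
        return e;
    }

    static Object correct(Object item, Object e) {
        return e;                        // e stays e, in every branch
    }

    public static void main(String[] args) {
        Object x = "payload";               // non-null, and item == e
        System.out.println(broken(x, x));   // the bogus null result
        System.out.println(correct(x, x));
    }
}
```

With item == e both non-null, the broken version yields null where e was expected, which is exactly the shape of xfer(e) appearing to return something other than e.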

The first branch takes us into the loop, without any extra hypothesis. The second branch makes us skip the loop, knowing additionally that e is null. That is correct. So when does it go wrong? We need to look at the C++ code implementing this, and at the generated graph. It will be clearer with pictures. This is done in do_acmp:

if (right_ptr == ProfileAlwaysNull) {
    // Comparison with null. Assert the input is indeed null and we're done.
    acmp_always_null_input(right, tright, btest, eq_region);
    return;
}

and then in

void Parse::acmp_always_null_input(Node* input, const TypeOopPtr* tinput, BoolTest::mask btest, Node* eq_region) {
    inc_sp(2);
    Node* cast = null_check_common(input, T_OBJECT, true, nullptr,
                                   !too_many_traps_or_recompiles(Deoptimization::Reason_speculate_null_check) &&
                                   speculative_ptr_kind(tinput) == ProfileAlwaysNull);
    dec_sp(2);
    if (btest == BoolTest::ne) {
        {
            PreserveJVMState pjvms(this);
            replace_in_map(input, cast);
            int target_bci = iter().get_dest();
            merge(target_bci);
        }
        record_for_igvn(eq_region);
        set_control(_gvn.transform(eq_region));
    } else {
        replace_in_map(input, cast);
    }
}

Let's look at what the graph looks like:

[IR graph screenshot]

This partial graph is taken shortly after parsing acmp.

The node 761 If is the pointer comparison: if equal (IfFalse projection), we go into the loop; if not, we go to 820 If, which checks whether 413 CheckCastPP (which represents e) is indeed null, as speculated. If the check fails, we deopt in 823 CallStaticJava. If e is indeed null, 830 Region is the "not equal" case (which leads nowhere in this screenshot, but parsing isn't done yet).

Before the call to null_check_common in acmp_always_null_input, the control is 762 IfTrue (the IfTrue under 761 If, condensed on the screenshot), and the current map is 747 SafePoint. At this point, the input of the map for e is indeed 413 CheckCastPP.

After null_check_common, the map is the same, but the control is now 821 IfTrue (under 820 If, condensed on the screenshot). That is, we are in the case where e == null (and so item != e). At this point, the input of the map for e has been replaced by 48 ConP #ptr:null. This is still correct, but, I suspect, unexpected. In our case, we have an if_acmpne (bci=113 in await), so we enter the first branch. Here, we save the state, get a new cloned map 829 SafePoint (with control 812 IfTrue), replace the input (e) by the casted result (null) in this map, merge this state with the destination of the acmp, and restore the JVM state. The map is again 747 SafePoint, whose control is still 812 IfTrue. But then, we change the control to eq_region, a region under 763 IfFalse (under 761 If), that is, the case where item == e as pointers. In this branch, it is incorrect to assume that e == null. That holds if item == null, but it can be false if both objects are equal and non-null. Yet, the current map is now in this branch, with the assumption e == null coming from the call to null_check_common, from here:

if (assert_null) {
    // Cast obj to null on this path.
    replace_in_map(value, zerocon(type));
    return zerocon(type);
}

So, there are two ways to solve this:

  1. we can make null_check_common not necessarily do the replace_in_map. This is doable, following the way it's done with null_check_oop:
    Node* GraphKit::null_check_oop(Node* value, Node* *null_control,
                                   bool never_see_null,
                                   bool safe_for_replace,
                                   bool speculative) {
      // Initial null check taken path
      (*null_control) = top();
      Node* cast = null_check_common(value, T_OBJECT, false, null_control, speculative);
      // Generate uncommon_trap:
      if (never_see_null && (*null_control) != top()) {
        // If we see an unexpected null at a check-cast we record it and force a
        // recompile; the offending check-cast will be compiled to handle nulls.
        // If we see more than one offending BCI, then all checkcasts in the
        // method will be compiled to handle nulls.
        PreserveJVMState pjvms(this);
        set_control(*null_control);
        replace_in_map(value, null());
        Deoptimization::DeoptReason reason = Deoptimization::reason_null_check(speculative);
        uncommon_trap(reason,
                      Deoptimization::Action_make_not_entrant);
        (*null_control) = top(); // null path is dead
      }
      if ((*null_control) == top() && safe_for_replace) {
        replace_in_map(value, cast);
      }
      // Cast away null-ness on the result
      return cast;
    }

    We can give null_check_common a new parameter bool safe_for_replace that would inhibit the replace_in_map. It would kick in only when assert_null == true, which is the case only in the acmp usage we are looking at and in
    Node* null_assert(Node* value, BasicType type = T_OBJECT) {
        return null_check_common(value, type, true, nullptr, _gvn.type(value)->speculative_always_null());
    }

    where it would be OK to just pass true for this new parameter.
  2. we simply let null_check_common do the replace_in_map, but on a copy of the map that we won't touch.

I went for the second solution, as it doesn't introduce new execution paths, but simply uses the existing functions to do what they should. In the fixed version, null_check_common is called on the saved state, and 747 SafePoint is not changed by this call. After restoring the previous state, the old safepoint is unchanged, and it is safe to move it. It also means that we don't need any replace_in_map in acmp_always_null_input at all: we delegate it.

Contemplating Hotspot's Mysteries

On null_check_common

As I said, right now, null_check_common is used with assert_null == true only in the acmp logic and in null_assert. In the usages of null_assert, we are not trying to do any fancy branching: we have an object, we want to make sure the value is null, trap otherwise, and continue straight on. In this case, editing the current map is exactly what we expect. The sin of acmp is to take this side jump after calling null_check_common.

In the Test

Why does this test fail so rarely, and fail more with less compilation and inlining? Of course, it's speculative types, it's profiling, it's timing all the way down. But I explained earlier that xfer is called with either null or a real object, and that we have as many sends as receives.

My understanding is that the function skynet spawns div virtual threads, then does div receives. Probably because of how the scheduler works, the thread hits and blocks on the first receive before anything is in the queue. Then, only the sub-threads can execute. They do the same: spawn threads and block on receive. Eventually, size == 1 and a thread sends something, but by then, on the order of 1_000_000 / 10 = 100_000 intermediate threads are already blocked in receive. So we probably have thousands of threads that tried to receive something (calling xfer(null)) before anything was put in the queue. That is more than enough to profile e == null. If we restrict compilation to fewer methods, xfer is likely to get compiled earlier, before any send has run; compiling it later would let some sends run first, changing the profile to "e might be null, but not always". I think that is also why this problem is much harder to reproduce when logging too much or in slowdebug: anything that leaves time for the threads to run at least one send reduces the likelihood of hitting the problem.
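For a rough accounting of the recursion tree (assuming the usual skynet parameters size = 1_000_000 and div = 10; these values are an assumption here, not quoted from the test):

```java
// Counts recursion levels and blocked intermediate threads in the
// skynet tree, assuming size = 1_000_000 and div = 10.
public class SkynetAccounting {
    public static void main(String[] args) {
        long size = 1_000_000, div = 10;
        long internal = 0, width = 1;   // width = calls at the current level
        int levels = 0;
        for (long s = size; s > 1; s /= div) {
            internal += width;          // these all block in receive
            width *= div;
            levels++;
        }
        System.out.println(levels + " levels, " + internal
                + " blocked receivers, " + width + " leaves");
    }
}
```

The recursion is only a handful of levels deep, but roughly size / (div - 1) threads sit blocked in receive (each having called xfer(null)) before the first leaf ever sends, which is what skews the profile of e so heavily toward null.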

Another experiment that supports this explanation: if one inhibits the path that uses the "always null" profile in do_acmp, I can't reproduce the failure anymore either.

Isn't it crazy we never found this problem before?! The problem is that in a == b, we wrongly assume that a is null when it is profiled to be null, in the case where a == b but a != null (otherwise, either we deopt, or the assumption is correct). That seems oddly specific. There probably aren't that many cases where an operand is null during the whole warmup and then suddenly isn't, but is equal to the other operand. Also, we need this a to become non-null without triggering any trap before reaching the acmp. I can see why it doesn't happen a lot in the wild. Besides, assuming something isn't null when it is null easily leads to segfaults and NPEs, but assuming something is null when it isn't is less likely to make anything crash.

To Make the Raven Happy: Nevermore

While there are many steps between the cause and the test failure, it is rather straightforward to write a test for this bug once one knows what to look for.

I tried to see whether the symmetric situation is also buggy: we profile b to be a specific identity class (instead of null), but the pointer comparison a == b fails, so we check whether the profiling was correct. If b is indeed of the said identity class, the equality doesn't hold. Otherwise, we need to check further, and in this "otherwise" branch, we shouldn't assume that b is of that identity class. So the situation is very similar, but the code involved is not the same at all. It doesn't look wrong on inspection. I've still added a test that would trigger if we assumed the profiling in the wrong branch.

Thanks,
Marc


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • JDK-8342977: [lworld] skynet fail intermittently (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/valhalla.git pull/2283/head:pull/2283
$ git checkout pull/2283

Update a local copy of the PR:
$ git checkout pull/2283
$ git pull https://git.openjdk.org/valhalla.git pull/2283/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 2283

View PR using the GUI difftool:
$ git pr show -t 2283

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/valhalla/pull/2283.diff

Using Webrev

Link to Webrev Comment


bridgekeeper bot commented Mar 31, 2026

👋 Welcome back mchevalier! A progress list of the required criteria for merging this PR into lworld will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Mar 31, 2026

@marc-chevalier This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8342977: [lworld] skynet fail intermittently

Reviewed-by: thartmann

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 2 new commits pushed to the lworld branch:

  • f0c25c3: 8381004: [lworld] FieldLayoutInfo use an aggregate for nonoop_acmp_map entries
  • 8035506: 8293280: [lworld] IdentityHashmap spec when key is a Value class

Please see this link for an up-to-date comparison between the source branch of this pull request and the lworld branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the lworld branch, type /integrate in a new comment.

@marc-chevalier marc-chevalier marked this pull request as ready for review March 31, 2026 16:02
@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 31, 2026

mlbridge bot commented Mar 31, 2026

Webrevs

@dcubed-ojdk

Wow! Quite the write-up on this issue! I read the original e-mail for this PR
first, but I see that the github PR is a MUCH better way to digest this story.


@TobiHartmann TobiHartmann left a comment


Impressive investigative work and a nice short story Marc! The fix looks good to me and it's great that we have tests for these cases now.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 1, 2026
@marc-chevalier

/integrate

Thanks for the review! I'm sure the PR description was longer to read than to write.


openjdk bot commented Apr 1, 2026

Going to push as commit 22c7841.
Since your change was applied there have been 2 commits pushed to the lworld branch:

  • f0c25c3: 8381004: [lworld] FieldLayoutInfo use an aggregate for nonoop_acmp_map entries
  • 8035506: 8293280: [lworld] IdentityHashmap spec when key is a Value class

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 1, 2026
@openjdk openjdk bot closed this Apr 1, 2026
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 1, 2026

openjdk bot commented Apr 1, 2026

@marc-chevalier Pushed as commit 22c7841.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

