8342977: [lworld] skynet fail intermittently#2283
8342977: [lworld] skynet fail intermittently#2283marc-chevalier wants to merge 5 commits intoopenjdk:lworldfrom
Conversation
|
👋 Welcome back mchevalier! A progress list of the required criteria for merging this PR into |
|
@marc-chevalier This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 2 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the ➡️ To integrate this PR with the above commit message to the |
|
Wow! Quite the write-up on this issue! I read the original e-mail for this PR |
TobiHartmann
left a comment
There was a problem hiding this comment.
Impressive investigative work and a nice short story Marc! The fix looks good to me and it's great that we have tests for these cases now.
|
/integrate Thanks for review! I'm sure the PR description was longer to read than to write. |
|
Going to push as commit 22c7841.
Your commit was automatically rebased without conflicts. |
|
@marc-chevalier Pushed as commit 22c7841. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
That is going to take a bit long to explain.
Patricio did quite some investigation, that is worth reading in a comment of JDK-8342977. Even if the end doesn't look quite correct to me (but it might be a point of view difference), it was very useful.
I'm going to summarize the important elements, for people who don't know so much about this tests, or the queues involved, then dive in the story.
The Test
The important part of the test is there:
valhalla/test/jdk/java/lang/Thread/virtual/stress/Skynet.java
Lines 65 to 80 in 559e0e4
the method
skynetcalls itself recursively through virtual threads, and eventually sends someLongon a customChannel<Long>(backed by aSynchronousQueue<Long>), received these numbers, sum them, until send the sum. Eventually, only the sum of everything is left. The expected result is thus constant. The intermittent failure is a wrong result, by a little bit.Let's give a bit of context about these queue for those who would be unfamiliar with them (like me before this issue). Do not take this explanation as 100% exact! It is probably just good enough to understand this bug, but it doesn't pretend to be perfectly correct or exhaustive. These queues can keep two kind of items in them: either elements, or requests. When some thread is trying to received an object, and the queue contains some objects, the objects is "matched" and given to the requesting thread. But if the queue is empty of elements, then the request is enqueued instead, and the thread is made to wait. When an object is sent on the queue, if a request is already there, the element is given to the requesting thread, otherwise, the element is added to the queue. Clearly, a queue should contain only requests or objects (ignoring matched items that are still in the queue and removed lazily for performance reasons).
The queue we are having is an extreme case of that: it has no capacity, it will just match threads offering a value with threads wanting to receive one.
These queues have many good properties of lock-freeness, performance and so on, but in our context, the interesting thing is that the main operations (putting an object, or taking one) can be implemented with a single function, here called
xfer. The implementation ofput(e)will call (something like)xfer(e), andtake()will callxfer(null).xfer(e)should returnnullexcept if the putting thread was interrupted (then, it returnseI think, but I don't think it is specified). Similarly,xfer(null)should not returnnullexcept if the taking thread was interrupted.See
valhalla/src/java.base/share/classes/java/util/concurrent/SynchronousQueue.java
Lines 263 to 271 in 559e0e4
and
valhalla/src/java.base/share/classes/java/util/concurrent/SynchronousQueue.java
Lines 315 to 323 in 559e0e4
In case of interruption, the implementation of the custom queue in the test is just retrying until it works (same for
receivebut less relevant here, and let's save space):valhalla/test/jdk/java/lang/Thread/virtual/stress/Skynet.java
Lines 89 to 101 in 559e0e4
Useful Scriptures: the Matter of Patricio
What Patricio discovered is that the call
q.put(e);throwsInterruptedException, then the loop loops, and we try toputagain. But nothing in this test is interrupting any thread! Something is wrong here. What happens is that theputadds an element to the queue, but fails to show its success, so we enqueue it again, and then the same element is given twice to receiving threads instead of another element. At the end of the program, we enqueued more elements than we received, so some virtual threads must still be blocked on their ownsend, waiting for areceiver, but the main thread already gathered its result and terminating. Always according to Patricio,putthrowsInterruptedExceptionbecause instead of returningnull,xfer(e)returnse. This happens only whenxferis C2-compiled, andSynchronousQueue$Transferer::xferLifoandLinkedTransferQueue$DualNode::awaitare inlinedvalhalla/src/java.base/share/classes/java/util/concurrent/SynchronousQueue.java
Line 167 in 559e0e4
valhalla/src/java.base/share/classes/java/util/concurrent/LinkedTransferQueue.java
Line 423 in 559e0e4
It also seems that in
xferLifoafter the call toawait,eis returned, andeis replaced bynull(which should not happen). Somehow,awaitbehave like it was taking instead of putting, which leads to the wrong idea that the thread was interrupted.The Argument of the Tragedy
I'm not sure we need
SynchronousQueue::xferto be compiled (but it's really small and I didn't try to spare it), but we needSynchronousQueue$Transferer::xferLifoto be, andLinkedTransferQueue$DualNode::<init>andLinkedTransferQueue$DualNode::awaitto be inlined. At the end something likeis enough, with
xferbeing the only compiled method. Even better: inlining less seems to increase the likelihood of crash (from ~3% with a simpleCompileOnlyandinlineforxferLifoandawait, to ~12% with alldontinline)!In
xferLifo, we havevalhalla/src/java.base/share/classes/java/util/concurrent/SynchronousQueue.java
Lines 190 to 195 in 559e0e4
The constructor of
DualNodeis not very interestingvalhalla/src/java.base/share/classes/java/util/concurrent/LinkedTransferQueue.java
Lines 364 to 367 in 559e0e4
but there,
eis profiled to be alwaysnull(in the failing cases). This is correct so far:e(oritemin the ctor) is an unknown Object, whose speculative type isnull.Thanks to inlining, this is propagated into
awaitvalhalla/src/java.base/share/classes/java/util/concurrent/LinkedTransferQueue.java
Lines 423 to 432 in 559e0e4
Before the head of the loop,
ecan indeed be observed not to be null. Then comes the loop condition:item == e, this is the now famous (to me)acmp. In the Valhalla world,acmpis hard. Here, we end up doing something likeThe first branch gets us in the loop, without much hypothesis. The second branch makes us skip the loop, but knowing that
eis null. That is correct. When what is not? We need to look at the C++ code implementing that, and the generated graph. It will be clearer with pictures. This is done indo_acmpvalhalla/src/hotspot/share/opto/parse2.cpp
Lines 2213 to 2217 in 559e0e4
and then in
valhalla/src/hotspot/share/opto/parse2.cpp
Lines 1992 to 2010 in 559e0e4
Let's look at how the graph looks like:
This partial graph is taken shortly after parsing
acmp.The node
761 Ifis the pointer comparison, if equal (IfFalseprojection), we go in the loop, but if not, we go to820 Ifwhich is checking whether413 CheckCastPP(that representse) is indeed null, as speculated. If false, we deopt in823 CallStaticJava. Ifeis indeed null830 Regionis the "not equal" case (which leads nowhere in this screenshot, but parsing isn't done yet).Before the call to
null_check_commoninacmp_always_null_input, the control is762 IfTrue(theIfTrueunder761 If, condensed on the screenshot), and the currentmapis747 SafePoint. At this point, the input of themapforeis indeed413 CheckCastPP.After
null_check_common,mapis the same, but the control is now821 IfTrue(under820 Ifcondensed on the screenshot). That is we are in the case wheree == null(and soitem != e). At this point, the input of themapforehas been replaced by48 ConP #ptr:null. This is still correct, but I suspect, unexpected. In our case, we have aif_acmpne(bci=113 inawait), so we enter the first branch. Here, we save the state, get a new cloned map829 SafePoint(with control812 IfTrue) replace the input (e) by the casted result (null) in this map, we merge this state with the destination of theacmpand we restore the JVM state. Themapis again747 SafePoint, whose control is still812 IfTrue. But then, we change the control toeq_region, that is a region under763 IfFalse(under761 If), that is in the case whereitem == eas pointers. In this branch, it is incorrect to assume thate == null. It is possible ifitem == null, but it could also be false if both objects are equal and non-null. Yet, the currentmapis now in this branch, with the assumptione == nullcoming from the call tonull_check_common, from here:valhalla/src/hotspot/share/opto/graphKit.cpp
Lines 1556 to 1560 in 4bdc262
So, there are 2 ways to solve that:
null_check_commonnot necessarily do thereplace_in_map. This is doable, for in the way it's done withnull_check_oopvalhalla/src/hotspot/share/opto/graphKit.cpp
Lines 2795 to 2823 in 4bdc262
we can give
null_check_commona new parameterbool safe_for_replacethat would inhibit thereplace_in_map. That would only kick in whenassert_null == true. That is the case only in the acmp usage we are looking at and invalhalla/src/hotspot/share/opto/graphKit.hpp
Lines 409 to 411 in 4bdc262
where it would be ok to just add a
truefor this new parameter.null_check_commondo thereplace_in_map, but with a copy of the map we won't touch.I went for the second solution, as it doesn't introduce new execution paths, but simply make use of functions to do what they should. In the fixed version
null_check_commonis called in the saved state, and747 SafePointis not changed by this call. After restoring the previous state, the old safepoint is unchanged, and it is safe to move it. It also means that we don't need anyreplace_in_mapinacmp_always_null_inputat all, we are delegating it.Contemplating Hotspot's Mysteries
On
null_check_commonAs I said, right now,
null_check_commonis used withnull_assert == trueonly in theacmplogic and innull_assert. In the usages ofnull_assertwe are not trying to make some fancy branching: we have an object, we want to make sure the value isnull, trap otherwise, but we continue straight. In this case, editing the currentmapis exactly what we expect. The sin ofacmpis to do this side jump after callingnull_check_common.In the Test
Why does this test crash so rarely, and crash rather more with fewer compilation and inlining? Of course, it's speculative types, it's profiling, it's timing all the way down. But I explained earlier that
xferis called withnullor real object, and that we have as manysendasreceive.My understanding is that the function
skynetwill spowndivvirtual threads, then dodivreceive. Probably because of how the scheduler words, the thread hit and blocks on the firstreceivebefore anything is in the queue. Then, only the sub-thread can execute. They will do the same, spawn threads and block onreceive. Eventually,size == 1and a thread sends something, but it took1_000_000 / 10 = 100_000levels of recursions. So, we probably have thousands of threads that tried toreceivesomething (callingxfer(null)) before anything was put in the queue. That is more than enough to profilee == null. If we compile fewer methods, it's more likely to come to compilexferearlier, before anysendis run, and changing the profile to "emight be null, but not always". I think that is also why this problem is much harder to reproduce when logging too much or in slowdebug: anything that leave time to the threads to run at least onesendreduce the likelihood of hitting the problem.Another experiment that support the explanation is that if one inhibits the path using the "always null" profile in
do_acmp, I also can't reproduce the failure anymore.Isn't it crazy we never found this problem before?! The problem is that in
a == b, we wrongly assume thataisnullif it is profiled to benullin the case wherea == bbuta != null(otherwise, either we deopt, or the assumption is correct). That seems oddly specific. There probably aren't that many cases where an operand is null during the whole warmup, then suddenly, it isn't, but it is equal to the other operand. Also, we need to have a case where thisabecomes non-null without triggering any trap before reaching theacmp. I can see why in the wild, it doesn't happen a lot. Also, assuming something isn't null when it's null can easily lead to segfault and NPE, but assuming something is null when it's not is less likely to make something crash.To Make the Raven Happy: Nevermore
While there are many steps between the cause and the test failure, it is rather direct to write a test for this bug, when one knows what to look for.
I tried to see whether the symmetric situation is also buggy: we profile
bto be an identiy object (instead of null), but the pointer comparisona == bfails, so we try to see if the profiling was correct. Ifbis indeed the said object type (instead of null), the equality doesn't hold. Otherwise, we need to check better, and in this "otherwise" branch, we shouldn't assume thatbis an identity object (instead ofnull). So, this situation is very similar, but not the same code at all is involved. It doesn't look wrong when just looking at it. I've still added a test that would trigger is we assume the profiling in the wrong branch.Thanks,
Marc
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/valhalla.git pull/2283/head:pull/2283$ git checkout pull/2283Update a local copy of the PR:
$ git checkout pull/2283$ git pull https://git.openjdk.org/valhalla.git pull/2283/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 2283View PR using the GUI difftool:
$ git pr show -t 2283Using diff file
Download this PR as a diff file:
https://git.openjdk.org/valhalla/pull/2283.diff
Using Webrev
Link to Webrev Comment