Hello LTP team!
My team has noticed that cpuhotplug03.sh experiences intermittent failures. We often encounter the following error message:
TFAIL: No cpuhotplug_do_spin_loop processes found on CPU1
I looked into the test and found this:
ltp/testcases/kernel/hotplug/cpu_hotplug/functional/cpuhotplug03.sh, lines 123 to 125 at commit 443a59c:

```shell
sleep 1

# Verify at least one process has migrated to the new CPU
```
It appears that the test only waits 1 second for tasks to move between cpus before checking the final result. My understanding of the modern Linux scheduler is limited, but I believe it tends to be conservative about migrating processes across cpus, especially across core or NUMA boundaries, because of migration costs. So I am concerned that the test in its current form may be checking too early, leading to false negatives in our test results. (By false negative, I mean a test failure in an environment (HW config, kernel, and test implementation) that behaves correctly and thus should have passed.)
I have some ideas on how to maximize the chance that a process moves to the newly-online cpu, but to judge whether they actually improve this test, I first need a clearer picture of what the test is meant to verify. My current understanding is: the test brings one cpu offline, then back online, and checks whether the scheduler naturally moves a process to it.
Here are a few proposals I have that I believe might improve the false-negative rate:

1. Sleep longer.
   - This increases the chance that the process might move to the newly-online cpu, and thus might reduce the chance of checking before the process has moved.
   - But, it assumes a moved process will stay on the newly-online cpu, which may or may not be true.
2. Instead of a `sleep 1`, check for moved processes in a loop, with a specified timeout value.
   - This allows us to sample at a higher frequency, which means if a process is moved to the newly-online cpu, and then for some reason moved off, we are more likely to catch it.
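To make the loop-with-timeout idea concrete, here is a rough sketch; the function name, the ~10 Hz polling interval, and the use of `ps -C` with the `psr` column are my assumptions, not code from the existing test:

```shell
# Sketch: poll for a migrated process instead of a fixed `sleep 1`.
# Returns 0 as soon as a cpuhotplug_do_spin_loop task is observed on the
# target cpu, or 1 if none appears within the timeout.
wait_for_migration() {
    cpu="$1"                 # cpu we just brought back online
    timeout="$2"             # total seconds to wait
    tries=$((timeout * 10))  # sample at ~10 Hz (fractional sleep assumed)

    i=0
    while [ "$i" -lt "$tries" ]; do
        # `ps -o psr=` prints the processor each matching task last ran on
        if ps -C cpuhotplug_do_spin_loop -o psr= | grep -qw "$cpu"; then
            return 0         # observed at least one task on the target cpu
        fi
        sleep 0.1
        i=$((i + 1))
    done
    return 1                 # timed out without observing a migration
}
```

Even if a task briefly lands on the cpu and is then moved off again, a 10 Hz sampler is far more likely to catch it than a single check after one second.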
The next ones technically change what we are testing, so I need some input as to if they are acceptable for the purposes of this test or not.
3. Set a cpu affinity on the processes so that the scheduler must move them to the newly-online cpu.
   - This no longer checks whether the scheduler naturally moves the process to the new core, which may be unacceptable/out of scope for this test plan.
   - But, it would essentially guarantee no false negatives.
4. Offline all but one cpu, start the idle processes on it, and then online all other cpus.
   - Technically this differs from the original test, which brings just one cpu off and back on, so it may not be appropriate.
   - I don't know whether this is a better situation in the scheduler's eyes than just onlining one additional core (the original test).
   - It's definitely possible that processes won't move, so this may also require a timeout loop that samples each core.
5. Offline all but 2 cpus, then offline one of them, start idle processes on the other, and bring that 1 cpu back online.
   - This is essentially proposal 4 but with an additional first step of offlining all but 2 cpus.
   - Interestingly, this is actually sort of conformant to the original test plan, because we only bring one cpu offline and online; we've just changed the setup environment.
   - Unless there is some complex hotplug state I don't know about that this test changes that could affect the result.
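For the affinity and offline/online variants, the building blocks would be the standard sysfs hotplug interface plus `taskset` from util-linux; a rough sketch with illustrative helper names (the hotplug writes require root):

```shell
# Illustrative helpers; names are mine, not from the LTP tree.

cpu_is_online() {
    f="/sys/devices/system/cpu/cpu$1/online"
    # cpu0 frequently has no `online` file because it cannot be
    # hot-removed; treat a missing file as "online".
    [ ! -f "$f" ] || [ "$(cat "$f")" = "1" ]
}

set_cpu_online() {
    # Requires root: write 1 to online a cpu, 0 to offline it.
    echo "$2" > "/sys/devices/system/cpu/cpu$1/online"
}

pin_to_cpu() {
    # Affinity variant: pin pid $1 to cpu $2 via taskset, so the
    # scheduler has no choice but to run it there.
    taskset -pc "$2" "$1"
}
```

With these, the "offline all but one/two cpus" setups reduce to a loop over `cpu_is_online`/`set_cpu_online` before starting the spin-loop processes.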
Other considerations:
- I would also like to explore other knobs that could push the scheduler to move the process, such as nice values. Is there a configuration of nice values that increases the likelihood of the scheduler migrating a process across cores?
- Does it make sense to specifically target cpus within the same NUMA node? The same core (thread siblings)?
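On the topology question: sysfs already exposes the sibling and NUMA relationships, so the test could deliberately pick (or avoid) a thread sibling of the cpu the spin loops run on. A sketch using the standard sysfs paths (helper names are mine):

```shell
# Illustrative helpers over the standard sysfs topology files.

thread_siblings_of() {
    # SMT siblings of a cpu (same physical core), e.g. "0,4" or "0-1".
    cat "/sys/devices/system/cpu/cpu$1/topology/thread_siblings_list"
}

numa_node_of() {
    # Each cpuN directory contains a nodeX symlink on NUMA-aware kernels;
    # prints the node number, or nothing if the kernel exposes no nodes.
    ls -d "/sys/devices/system/cpu/cpu$1"/node* 2>/dev/null | sed 's/.*node//'
}
```

Comparing `thread_siblings_of` and `numa_node_of` for the offlined cpu against the cpus the spin loops occupy would let the test distinguish "cheap" intra-core migrations from cross-NUMA ones.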
Can you please help me better understand the goals of the test, so that together we can narrow down a solution that reduces these false negatives?
I am happy to whip up some example implementations of these as well if we'd like to see how they run.
Thank you so much! I look forward to discussing. 😄