Hello LTP team!
My team has noticed that cpuhotplug03.sh experiences intermittent failures. We often encounter the following error message:
TFAIL: No cpuhotplug_do_spin_loop processes found on CPU1
I looked into the test and found this:
ltp/testcases/kernel/hotplug/cpu_hotplug/functional/cpuhotplug03.sh, lines 123 to 125 at commit 443a59c:

```shell
sleep 1

# Verify at least one process has migrated to the new CPU
```
It appears that the test only waits 1 second for tasks to move between cpus before checking the final result. My understanding of the modern Linux scheduler is limited, but I believe it tends to be conservative about migrating processes across cpus, especially across core or NUMA boundaries, because of migration costs. So I am concerned that the test in its current form may be checking too early, leading to false negatives in our test results. (By false negative, I mean a test failure in an environment (HW config, kernel, and test implementation) that behaves correctly and thus should have passed.)
I have some ideas on how to maximize the chance that a process moves to the newly-online cpu, but to judge whether they actually improve this test, I first need a clearer picture of what the test is meant to verify. My current understanding is: the test brings one cpu offline, then back online, and checks whether the scheduler naturally moves a process to it.
Here are a few proposals I have that I believe might improve the false-negative rate:

1. Sleep longer.
   - This increases the chance that the process might move to the newly-online cpu, and thus might reduce the chance of checking before the process has moved.
   - But, it assumes a moved process will stay on the newly-online cpu, which may or may not be true.
2. Instead of a `sleep 1`, check for moved processes in a loop, with a specified timeout value.
   - This allows us to sample at a higher frequency, which means if a process is moved to the newly-online cpu, and then for some reason moved off, we are more likely to catch it.
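To make the loop-with-timeout idea concrete, here is a rough sketch; the function name, the ~10 Hz polling interval, and the use of `ps -C` with the `psr` column are my assumptions, not code from the existing test:

```shell
# Sketch: poll for a migrated process instead of a fixed `sleep 1`.
# Returns 0 as soon as a cpuhotplug_do_spin_loop task is observed on the
# target cpu, or 1 if none appears within the timeout.
wait_for_migration() {
    cpu="$1"                 # cpu we just brought back online
    timeout="$2"             # total seconds to wait
    tries=$((timeout * 10))  # sample at ~10 Hz (fractional sleep assumed)

    i=0
    while [ "$i" -lt "$tries" ]; do
        # `ps -o psr=` prints the processor each matching task last ran on
        if ps -C cpuhotplug_do_spin_loop -o psr= | grep -qw "$cpu"; then
            return 0         # observed at least one task on the target cpu
        fi
        sleep 0.1
        i=$((i + 1))
    done
    return 1                 # timed out without observing a migration
}
```

Even if a task briefly lands on the cpu and is then moved off again, a 10 Hz sampler is far more likely to catch it than a single check after one second.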
The next ones technically change what we are testing, so I need some input as to if they are acceptable for the purposes of this test or not.
3. Set a cpu affinity on the processes so that the scheduler must move them to the newly-online cpu.
   - This no longer checks whether the scheduler naturally moves the process to the new core, which may be unacceptable/out of scope for this test plan.
   - But, it would essentially guarantee no false negatives.
4. Offline all but one cpu, start the idle processes on it, and then online all other cpus.
   - Technically this differs from the original test, which brings just one cpu off and back on, so it may not be appropriate.
   - I don't know whether this is a better situation in the scheduler's eyes than just onlining one additional core (the original test).
   - It's definitely possible that processes won't move, so this may also require a timeout loop that samples each core.
5. Offline all but 2 cpus, then offline one of them, start idle processes on the other, and bring that 1 cpu back online.
   - This is essentially proposal 4 but with an additional first step of offlining all but 2 cpus.
   - Interestingly, this is actually sort of conformant to the original test plan, because we only bring one cpu offline and online; we've just changed the setup environment.
   - Unless there is some complex hotplug state I don't know about that this test changes that could affect the result.
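For the affinity and offline/online variants, the building blocks would be the standard sysfs hotplug interface plus `taskset` from util-linux; a rough sketch with illustrative helper names (the hotplug writes require root):

```shell
# Illustrative helpers; names are mine, not from the LTP tree.

cpu_is_online() {
    f="/sys/devices/system/cpu/cpu$1/online"
    # cpu0 frequently has no `online` file because it cannot be
    # hot-removed; treat a missing file as "online".
    [ ! -f "$f" ] || [ "$(cat "$f")" = "1" ]
}

set_cpu_online() {
    # Requires root: write 1 to online a cpu, 0 to offline it.
    echo "$2" > "/sys/devices/system/cpu/cpu$1/online"
}

pin_to_cpu() {
    # Affinity variant: pin pid $1 to cpu $2 via taskset, so the
    # scheduler has no choice but to run it there.
    taskset -pc "$2" "$1"
}
```

With these, the "offline all but one/two cpus" setups reduce to a loop over `cpu_is_online`/`set_cpu_online` before starting the spin-loop processes.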
Other considerations:
- I would also like to explore other knobs that could push the scheduler to move the process, such as nice values. Is there a configuration of nice values that increases the likelihood of the scheduler migrating a process across cores?
- Does it make sense to specifically target cpus within the same NUMA node? The same core (thread siblings)?
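On the topology question: sysfs already exposes the sibling and NUMA relationships, so the test could deliberately pick (or avoid) a thread sibling of the cpu the spin loops run on. A sketch using the standard sysfs paths (helper names are mine):

```shell
# Illustrative helpers over the standard sysfs topology files.

thread_siblings_of() {
    # SMT siblings of a cpu (same physical core), e.g. "0,4" or "0-1".
    cat "/sys/devices/system/cpu/cpu$1/topology/thread_siblings_list"
}

numa_node_of() {
    # Each cpuN directory contains a nodeX symlink on NUMA-aware kernels;
    # prints the node number, or nothing if the kernel exposes no nodes.
    ls -d "/sys/devices/system/cpu/cpu$1"/node* 2>/dev/null | sed 's/.*node//'
}
```

Comparing `thread_siblings_of` and `numa_node_of` for the offlined cpu against the cpus the spin loops occupy would let the test distinguish "cheap" intra-core migrations from cross-NUMA ones.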
Can you please help me better understand the goals of the test, so that together we can narrow down a solution that reduces these false negatives?
I am happy to whip up some example implementations of these as well if we'd like to see how they run.
Thank you so much! I look forward to discussing. 😄