scx_bpfland: primary domain hurts burst multi-core performance and prevents some workloads from using all cores #1145
Hi, bpfland always tries to keep tasks running in the primary domain as much as possible. If all the CPUs in the primary domain are busy, tasks can overflow to the other CPUs. By default the primary domain is determined as a function of the energy profile, so on systems with P-cores and E-cores, if the current energy profile is performance oriented, bpfland will try to run tasks only on the P-cores (with some of them potentially overflowing to the E-cores). The E-cores are not used because, by default, it prioritizes performance predictability over maximum throughput. With a powersave profile you should see the opposite: tasks running mostly on the E-cores (with a few tasks overflowing to the P-cores). So it's a bit odd that in your case `powersave` gives the same result as `all`. For maximum throughput, if you don't care about predictability, you should always use `--primary-domain all`. Moreover, the difference between `all` and `none` should be negligible in terms of throughput.
The logic is the following: the "primary domain" feature is intended to provide stable performance in the presence of hybrid cores, so tasks will likely run on a subset of cores of the same type, instead of mixing E-cores and P-cores for example, which could lead to inconsistent performance / power consumption. So it is not intended to maximize throughput. If you need throughput you can use `--primary-domain all`.
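To make the policy above concrete, here is a minimal sched_ext sketch of the "primary domain first, overflow via the shared queue" idea. This is not the actual bpfland code: `primary_cpumask`, the op name, and the omitted boilerplate (ops struct, init, license) are assumptions for illustration, using kernel 6.12-era kfunc names from the scx common headers.

```c
#include <scx/common.bpf.h>

/* Hypothetical primary-domain cpumask, populated from user space
 * according to the energy profile (e.g. only the P-cores). */
private(PRIMARY) struct bpf_cpumask __kptr *primary_cpumask;

s32 BPF_STRUCT_OPS(sketch_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	const struct cpumask *primary = cast_mask(primary_cpumask);
	s32 cpu;

	/* Prefer an idle CPU inside the primary domain (task affinity
	 * checks are omitted for brevity). */
	if (primary) {
		cpu = scx_bpf_pick_idle_cpu(primary, 0);
		if (cpu >= 0)
			return cpu;
	}

	/*
	 * Primary domain fully busy: stay on the previous CPU. The task
	 * is queued on a shared DSQ at enqueue time, so any awake CPU
	 * (including non-primary ones) can still pull it; this is the
	 * "overflow" path described above.
	 */
	return prev_cpu;
}
```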
Thanks for your clear explanation. Can I summarize the cause of the "delay" this way:
I am unsure whether my guess is true. It also seems unable to explain why changing the time slice doesn't help much, since a shorter slice should have made the accumulation of probability grow faster, IIUC.
Yes, I understand that and didn't mean to oppose the design. Is there a possibility that it could be optimized/refined so that the "task overflow" process happens sooner once the primary domain is fully loaded?
Thanks for your suggestion. Though it does eliminate the "delayed overflow" issue, P-cores will not be prioritized even when scheduling single-threaded workloads in this case; I understand this is intended. That being said, I am not criticizing the design.
Correct.
Tasks in the shared DSQ are consumed on the first CPU that is not idle. This means it's more likely to be a primary domain CPU. However, if a non-primary CPU is awake and about to go idle, it can consume a task from the shared DSQ.
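As a rough illustration of that consumption path (again a hypothetical fragment, not the bpfland source; `SHARED_DSQ` is assumed to be a custom DSQ created in ops.init()):

```c
#include <scx/common.bpf.h>

#define SHARED_DSQ 0	/* assumed: created with scx_bpf_create_dsq() in ops.init() */

void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
{
	/*
	 * Whichever CPU runs out of local work first pulls the next task
	 * from the shared DSQ into its local DSQ. In practice that is most
	 * often a primary-domain CPU, but a non-primary CPU that is awake
	 * and about to go idle can consume from here as well.
	 */
	scx_bpf_consume(SHARED_DSQ);
}
```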
Keep in mind that the minimum granularity of the time slice is limited by CONFIG_HZ (so 1ms if you have CONFIG_HZ=1000), but a context switch can also happen earlier due to a voluntary preemption event, like cond_resched() in the kernel, etc. I'm not sure I understand your comment about the accumulation of probability, but reducing the time slice should give tasks more opportunities to be processed by the primary CPUs.
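A tiny standalone check of that granularity arithmetic (plain C, with CONFIG_HZ assumed to be 1000 and a 250us slice as an example):

```c
#include <stdio.h>

int main(void)
{
	const unsigned int hz = 1000;		/* assumed CONFIG_HZ */
	const unsigned int slice_us = 250;	/* example requested time slice */
	const unsigned int tick_us = 1000000 / hz;

	printf("tick period: %u us\n", tick_us);	/* 1000 us */
	/* A slice shorter than the tick can only end early via voluntary
	 * preemption (e.g. cond_resched()); otherwise it lasts until the
	 * next tick. */
	printf("tick-enforced slice: %u us\n",
	       slice_us < tick_us ? tick_us : slice_us);
	return 0;
}
```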
Tasks overflow immediately when there's no idle CPU in the primary domain. But then the scheduler tries to bring them back to the primary domain when they wake up or are re-enqueued (by kicking idle CPUs in the primary domain).
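A sketch of that "bring tasks back" step (hypothetical fragment, reusing the assumed `primary_cpumask` and `SHARED_DSQ` from the earlier sketches):

```c
#include <scx/common.bpf.h>

#define SHARED_DSQ 0	/* assumed shared DSQ id */

private(PRIMARY) struct bpf_cpumask __kptr *primary_cpumask;

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	const struct cpumask *primary = cast_mask(primary_cpumask);
	s32 cpu;

	/* Queue the task on the shared DSQ with the default time slice. */
	scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);

	/* Wake up an idle primary-domain CPU, if there is one, so the task
	 * is pulled back into the primary domain as soon as possible. */
	if (primary) {
		cpu = scx_bpf_pick_idle_cpu(primary, 0);
		if (cpu >= 0)
			scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
	}
}
```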
Right, with Maybe we could try to pick idle non-primary CPUs as a last resort in ops.select_cpu() (task wakeup) and keep kicking only idle primary CPUs in ops.enqueue(). This should make tasks less conservative and improve overall core utilization while still prioritizing the primary - domain. That's how bpfland was working initially, but this wasn't that great on performance stability because tasks were more likely moving between slow / fast cores, making performance less predictable.Maybe we should add a new option to set this behavior? |
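The proposed tweak would roughly amount to adding a last-resort fallback in the wakeup path, something like the following (a hypothetical sketch of the idea, not the actual bpfland-next patch):

```c
#include <scx/common.bpf.h>

private(PRIMARY) struct bpf_cpumask __kptr *primary_cpumask;

s32 BPF_STRUCT_OPS(sketch_select_cpu_aggressive, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	const struct cpumask *primary = cast_mask(primary_cpumask);
	s32 cpu;

	/* 1) Idle CPU in the primary domain, as before. */
	if (primary) {
		cpu = scx_bpf_pick_idle_cpu(primary, 0);
		if (cpu >= 0)
			return cpu;
	}

	/* 2) Last resort: any idle CPU the task is allowed to run on, so
	 * bursty multi-threaded workloads spread beyond the primary domain
	 * immediately. ops.enqueue() would still kick only idle primary
	 * CPUs, preserving the preference for the primary domain. */
	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
	if (cpu >= 0)
		return cpu;

	return prev_cpu;
}
```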
Make it easier for tasks to overflow beyond the primary domain in a more aggressive way, using all available idle CPUs as a last resort while still prioritizing idle CPUs within the primary domain. This should address issue #1145. Signed-off-by: Andrea Righi <[email protected]>
@Rongronggg9 can you test whether the changes that I pushed to bpfland-next make this any better? Thanks!
Update: I've made some wrong assumptions here, so the original description has been collapsed to avoid further misunderstanding. Check #1145 (comment) for the latest ones.
I noticed a weird phenomenon on my laptop when `scx_bpfland` is enabled (with default options): when many CPU-consuming processes are running, all CPU cores get used, which is fine. However, CPU-consuming threads from the same process cannot use all CPU cores.
Geekbench 6 is such a program: it uses many threads instead of processes to measure multi-core performance. Here's how this behavior hurts the multi-core score:
I am using the Xanmod kernel (6.12.7-x64v3-xanmod1) with the "x86 Heterogeneous design identification" patchset applied (otherwise the Zen 5 P-cores are not turbo boosted and run at a lower frequency).
After some digging, I guess the issue is in the "primary domain" feature. By default, it is automatically calculated according to the current energy profile. If the current energy profile is `performance` or `balance_performance` (my case), it will be all P-cores. On my laptop, `scx_bpfland` selects `0x03c0f` (CPUs 0-3 and 10-13, i.e. all P-cores; see the check after the list below) out of the box, resulting in such programs only consuming all P-cores and several E-cores.

Here is what I observed with different configurations:

- `scx_bpfland` (default): a process with CPU-consuming threads only consumes all P-cores and several E-cores (poor throughput).
- `scx_bpfland --primary-domain performance`: identical to the default (poor throughput).
- `scx_bpfland --primary-domain all`: everything seems fine (good throughput).
- `scx_bpfland --primary-domain none`: identical to `all` (good throughput). I am confused about the purpose of introducing such an option. Is there any practical difference between `all` and `none`? Is it just for debugging purposes?
- `scx_bpfland --primary-domain powersave`: I thought this would be the opposite of `performance` (poor throughput, but in a different manner), but in fact it is identical to `all` (good throughput).
- Vanilla scheduler: everything seems fine (good throughput).
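For completeness, the quick check referenced above, confirming which CPUs `0x03c0f` covers (plain C, illustration only):

```c
#include <stdio.h>

int main(void)
{
	const unsigned long mask = 0x03c0f;	/* primary domain picked by scx_bpfland */

	printf("CPUs in 0x%05lx:", mask);
	for (int cpu = 0; cpu < 16; cpu++)
		if (mask & (1UL << cpu))
			printf(" %d", cpu);
	printf("\n");	/* CPUs in 0x03c0f: 0 1 2 3 10 11 12 13 */
	return 0;
}
```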