
scx_bpfland: primary domain hurts burst multi-core performance and prevents some workloads from using all cores #1145

Open
Rongronggg9 opened this issue Jan 1, 2025 · 6 comments

Comments

@Rongronggg9

Rongronggg9 commented Jan 1, 2025

Update: I've made some wrong assumptions here, so the original description has been collapsed to avoid further misunderstanding. Check #1145 (comment) for the latest ones.

I noticed a weird phenomenon on my laptop when scx_bpfland is enabled (with default options):
When many CPU-consuming processes are running, all CPU cores get used, which is fine. However, CPU-consuming threads belonging to the same process cannot use all CPU cores.

Geekbench 6 is such a program: it uses many threads instead of processes to measure multi-core performance. Here's how this behavior hurts the multi-core score:

I am using the Xanmod kernel (6.12.7-x64v3-xanmod1) with the "x86 Heterogeneous design identification" patchset applied (otherwise Zen 5 P-cores are not turbo-boosted and run at a lower frequency).


After some digging, I suspect the issue is in the "primary domain" feature. By default, the primary domain is calculated automatically from the current energy profile: if the profile is performance or balance_performance (my case), it will be all P-cores. On my laptop, scx_bpfland selects 0x03c0f (CPUs 0-3,10-13, i.e. all P-cores) out of the box, so such programs end up consuming only the P-cores and a few E-cores.
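
For reference, the energy profile that drives this auto-detection can be inspected via sysfs (this assumes the amd-pstate/intel_pstate EPP interface; paths may differ with other cpufreq drivers):

$ cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
balance_performance
$ # Per-CPU max frequencies, useful for telling P-cores and E-cores apart:
$ grep . /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq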

scx_bpfland

[INFO] scx_bpfland 1.0.9-ge68c4ded x86_64-unknown-linux-gnu SMT on
[INFO] primary CPU domain = 0x03c0f
[INFO] cpufreq performance level: auto
[INFO] L2 cache ID 0: sibling CPUs: [0, 10]
[INFO] L2 cache ID 1: sibling CPUs: [1, 11]
[INFO] L2 cache ID 2: sibling CPUs: [2, 12]
[INFO] L2 cache ID 3: sibling CPUs: [3, 13]
[INFO] L2 cache ID 8: sibling CPUs: [4, 14]
[INFO] L2 cache ID 9: sibling CPUs: [5, 15]
[INFO] L2 cache ID 10: sibling CPUs: [6, 16]
[INFO] L2 cache ID 11: sibling CPUs: [7, 17]
[INFO] L2 cache ID 12: sibling CPUs: [8, 18]
[INFO] L2 cache ID 13: sibling CPUs: [9, 19]
[INFO] L3 cache ID 0: sibling CPUs: [0, 10, 1, 11, 2, 12, 3, 13]
[INFO] L3 cache ID 1: sibling CPUs: [4, 14, 5, 15, 6, 16, 7, 17, 8, 18, 9, 19]

A process with CPU-consuming threads only consumes all P-cores and several E-cores (poor throughput).

$ perf stat -e 'task-clock' -- stress-ng -q -t 10 --loadavg 1 --loadavg-max 100  # 1 process * 100 threads

 Performance counter stats for 'stress-ng -q -t 10 --loadavg 1 --loadavg-max 100':

        106,513.34 msec task-clock:u                     #   10.619 CPUs utilized

      10.029988136 seconds time elapsed

       2.756511000 seconds user
     103.757580000 seconds sys
$ perf stat -e 'task-clock' -- stress-ng -q -t 10 --loadavg 20 --loadavg-max 100  # 20 processes * 5 threads

 Performance counter stats for 'stress-ng -q -t 10 --loadavg 20 --loadavg-max 100':

        174,936.73 msec task-clock:u                     #   17.430 CPUs utilized

      10.036645675 seconds time elapsed

       6.555938000 seconds user
     167.949288000 seconds sys

scx_bpfland --primary-domain performance

[INFO] primary CPU domain = 0x03c0f
[INFO] cpufreq performance level: max

Identical to default (poor throughput).

scx_bpfland --primary-domain all

[INFO] primary CPU domain = 0xfffff
[INFO] cpufreq performance level: auto

Everything seems fine (good throughput).

$ perf stat -e 'task-clock' -- stress-ng -q -t 10 --loadavg 1 --loadavg-max 100

 Performance counter stats for 'stress-ng -q -t 10 --loadavg 1 --loadavg-max 100':

        190,531.64 msec task-clock:u                     #   18.990 CPUs utilized

      10.033456690 seconds time elapsed

       2.494023000 seconds user
     188.034495000 seconds sys
$ perf stat -e 'task-clock' -- stress-ng -q -t 10 --loadavg 20 --loadavg-max 100

 Performance counter stats for 'stress-ng -q -t 10 --loadavg 20 --loadavg-max 100':

        191,256.78 msec task-clock:u                     #   19.051 CPUs utilized

      10.039446520 seconds time elapsed

       6.765876000 seconds user
     184.333524000 seconds sys

scx_bpfland --primary-domain none

[INFO] primary CPU domain = 0x00000
[INFO] cpufreq performance level: auto

Identical to all (good throughput).

I am confused about the purpose of introducing such an option. Is there any practical difference between all and none? Is it just for debugging purposes?

scx_bpfland --primary-domain powersave

[INFO] primary CPU domain = 0xfc3f0
[INFO] cpufreq performance level: min

I thought this would be the opposite of performance (poor throughput, but in a different manner). In fact, it behaves identically to all (good throughput).

Vanilla

Everything seems fine (good throughput).

$ perf stat -e 'task-clock' -- stress-ng -q -t 10 --loadavg 1 --loadavg-max 100  # 1 process * 100 threads

 Performance counter stats for 'stress-ng -q -t 10 --loadavg 1 --loadavg-max 100':

        183,721.47 msec task-clock:u                     #   18.309 CPUs utilized

      10.034233477 seconds time elapsed

       1.273312000 seconds user
     182.154045000 seconds sys
$ perf stat -e 'task-clock' -- stress-ng -q -t 10 --loadavg 20 --loadavg-max 100  # 20 processes * 5 threads

 Performance counter stats for 'stress-ng -q -t 10 --loadavg 20 --loadavg-max 100':

        194,427.18 msec task-clock:u                     #   19.283 CPUs utilized

      10.083031087 seconds time elapsed

       4.449802000 seconds user
     189.932841000 seconds sys
@arighi
Contributor

arighi commented Jan 1, 2025

Hi, bpfland always tries to keep tasks running in the primary domain as much as possible. If all the CPUs in the primary domain are busy, tasks can overflow to the other CPUs.

By default the primary domain is determined based on the energy profile, so on systems with P-cores and E-cores, if the current energy profile is "performance oriented", bpfland will try to run tasks only on the P-cores (with some of them potentially overflowing to the E-cores). E-cores are mostly left unused because, by default, bpfland prioritizes performance predictability over maximum throughput.

With a "powersave profile" you should see the opposite: tasks running mostly on the E-cores (with few tasks overflowing to the P-cores). So, it's a bit odd that in your case --primary-domain powersave behaves in the same way as --primary-domain all. The cpumasks seem correct (0xfc3f0 vs 0xfffff). Do you see the same cores being allocated with any workload?
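
For example, something like this can show which CPUs the worker threads actually land on while the stressor runs (a sketch assuming sysstat's mpstat and procps are installed):

$ stress-ng -q -t 30 --loadavg 1 --loadavg-max 100 &
$ mpstat -P ALL 1                                      # per-CPU utilization, sampled once per second
$ ps -Lo tid,psr,pcpu,comm -p "$(pgrep -o stress-ng)"  # psr = CPU each thread last ran on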

For maximum throughput, if you don't care about predictability, you should always use --primary-domain all. This ensures that bpfland will always attempt to pick idle CPUs from all the CPUs available in the system.

Moreover, the difference between all (0xfffff) and none (0x0) is that in the first case waking tasks will be dispatched to any idle CPU, while in the second case tasks will be queued to a shared queue and dispatched to the first available CPU (one that is already up and running).

@Rongronggg9
Author

Rongronggg9 commented Jan 3, 2025

Thanks for your timely reply! It did answer some of my questions.

Sadly, I found my assumption was wrong (sorry for that!) after reading your reply and doing more experiments... So let's start the story again.

GeekBench 6 multi-core score loss when the primary domain is effective (!= all/none)

When GeekBench 6 measures the multi-core score, it can eventually consume all CPU cores. However, most of its sub-benchmarks need <1s to finish, and it then waits several seconds before starting the next sub-benchmark.

It turns out that <1s is simply not enough for "If all the CPUs in the primary domain are busy, tasks can overflow to the other CPUs" to happen.

(Screenshot: the CPU consumption pattern of GeekBench 6 multi-core sub-benchmarks when using scx_bpfland --primary-domain performance.)

So the primary domain feature actually hurts burst multi-core performance. Is this intended?

Given that the default value of --slice-us is 20000 (20ms), I assume the wait time before "overflow" can happen shouldn't be that long... And scx_bpfland --primary-domain performance --slice-us 1 --slice-us-min 1 --slice-us-lag -1000 didn't make things better.

The "overflow" pattern is reproducible using 1.0.9-gdda32c72 (scx v1.0.8) and 1.0.9-g423c860d.

stress-ng --loadavg 1 --loadavg-max 100 cannot consume all CPU cores when using scx_bpfland --primary-domain performance

The weird behavior is probably related to the intensive use of some syscalls (especially sched_yield?). The worker process also sets itself (and thus all its threads) to the highest niceness (19).

Every worker thread loops over these syscalls:

lseek(4, 33576, SEEK_SET)  = 33576
write(4, "\22", 1)         = 1
sched_yield()              = 0
rt_sigpending([], 8)       = 0
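
(For reference, this loop can be observed by running the stressor under strace; something like the following, assuming strace is available:)

$ strace -f -e trace=lseek,write,sched_yield,rt_sigpending -o /tmp/stress.trace \
    stress-ng -q -t 10 --loadavg 1 --loadavg-max 100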

Using 1.0.9-gdda32c72 (scx v1.0.8), it can only consume ~50% of the overall CPUs. If I renice -n -20 all its threads, it can consume ~70%.

The issue is not reproducible (the stressor can constantly consume >90% of the CPUs) as long as one of these conditions is met (using 1.0.9-gdda32c72, i.e., scx v1.0.8):

  • Change the scheduler
    • CFS
    • scx_bpfland --primary-domain all
    • scx_bpfland --primary-domain none
    • scx_bpfland --primary-domain powersave
    • scx_lavd
  • Change the stressor
    • stress-ng --loadavg 20 --loadavg-max 100: 20 processes * 5 threads, niceness 19, syscall intensive
    • nice -n19 sysbench --threads=100 cpu run: 1 process * 100 threads, niceness 19, pure userspace stressor
    • nice -n19 stress-ng --cpu 100: 100 processes * 1 thread, niceness 19, pure userspace stressor

Using 1.0.9-g423c860d, it can consume more after some time, but the consumption becomes extremely unstable.

@arighi
Contributor

arighi commented Jan 5, 2025

The logic is the following: in ops.select_cpu() we pick idle CPUs only from the primary domain; if we can't find an idle CPU in the primary domain, the task will be queued to a global shared DSQ and dispatched to the first CPU that becomes available (which could also be a CPU outside the primary domain, depending on which one is ready to dispatch tasks first). So the overflow is not time based, and changing the time slice etc. won't help much here, I think.

The "primary domain" feature is intended to provide stable performance in presence of hybrid cores, so tasks will likely run on the subset of cores of the same type, instead of mixing E-cores and P-cores for example, which could lead to inconsistent performance / power consumption. So, it is not intended to maximize throughput.

If you need throughput you can use -m all and force the primary domain to include all the CPUs in the system.

@Rongronggg9
Author

Rongronggg9 commented Jan 9, 2025

if we can't find an idle CPU in the primary domain, the task will be queued to a global shared DSQ and dispatched to the first CPU that becomes available... So the overflow is not time based, and changing the time slice etc. won't help much here, I think.

Thanks for your clear explanation. Can I summarize the cause of the "delay" this way:

  • Even when the primary domain is highly loaded, there is a chance that some primary-domain CPUs are available somehow when the scheduling tick arrives. As a result, the "task overflow" process is more like an accumulation of probability.
  • What makes it worse is that tasks in the global shared DSQ are dispatched to the first available CPU, which is not necessarily a non-primary-domain one. Thus, there is another chance that the tasks will again be stuck in the above process for some time, until they are lucky enough to hit the scenario where no primary-domain CPU is available.

I am unsure whether my guess is correct. It also does not seem to explain why changing the time slice won't help much, since a shorter slice should make the accumulation of probability grow faster, IIUC.

The "primary domain" feature is intended to provide stable performance in presence of hybrid cores, so tasks will likely run on the subset of cores of the same type, instead of mixing E-cores and P-cores for example, which could lead to inconsistent performance / power consumption. So, it is not intended to maximize throughput.

Yes, I understand that and didn't mean to oppose the design.

Is there a possibility that it could be optimized/refined so that the "task overflow" process can happen sooner after the primary domain is loaded?

If you need throughput you can use -m all and force the primary domain to include all the CPUs in the system.

Thanks for your suggestion. Though it does eliminate the "delayed overflow" issue, P-cores will not be prioritized even when scheduling single-thread workloads in this case - I understand this is intended.

Besides that, using -m all seems to give me neither better responsiveness nor better throughput than CFS (EEVDF) when the system load is high, precisely because of the pick-the-first-available-CPU behavior. It introduces a lot of CPU migrations: for stress-ng --cpu 20 -t 10, it is 38.341 migrations per second versus 0.344. In fact, not only stress-ng but all tasks get migrated randomly between P-cores and E-cores in this scenario. Given that P-cores and E-cores are on different chiplets on my CPU, the migration latency hurts.
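
(The exact tool behind the numbers above isn't stated; one way to measure the migration rate is perf's software event counter:)

$ perf stat -e cpu-migrations -- stress-ng --cpu 20 -t 10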

That being said, I am not criticizing bpfland. When I don't need burst throughput and the system is not heavily loaded, it fulfills some of my demands. Thanks again for your wonderful work and reply.

@Rongronggg9 Rongronggg9 changed the title scx_bpfland: primary domain prevents processes with many threads from consuming all CPU cores scx_bpfland: primary domain hurts burst multi-core performance and prevent some workload from using all cores Jan 12, 2025
@Rongronggg9 Rongronggg9 changed the title scx_bpfland: primary domain hurts burst multi-core performance and prevent some workload from using all cores scx_bpfland: primary domain hurts burst multi-core performance and prevents some workloads from using all cores Jan 12, 2025
@arighi
Contributor

arighi commented Jan 16, 2025

Thanks for your clear explanation. Can I summarize the cause of the "delay" this way:

  • Even when the primary domain is highly loaded, there is a chance that some primary-domain CPUs are available somehow when the scheduling tick arrives. As a result, the "task overflow" process is more like an accumulation of probability.

Correct.

  • What makes it worse is that tasks in the global shared DSQ are dispatched to the first available CPU, which is not necessarily a non-primary-domain one. Thus, there is another chance that the tasks will again be stuck in the above process for some time, until they are lucky enough to hit the scenario where no primary-domain CPU is available.

Tasks in the shared DSQ are consumed on the first CPU that is not idle. This means it's more likely to be a primary domain CPU. However, if a non-primary CPU is awake and about to go idle, it can consume a task from the shared DSQ.

I am unsure whether my guess is correct. It also does not seem to explain why changing the time slice won't help much, since a shorter slice should make the accumulation of probability grow faster, IIUC.

Keep in mind that the min granularity of the time slice is limited by CONFIG_HZ (so 1ms if you have CONFIG_HZ=1000), but a context switch can also happen due to a voluntary preemption event, like cond_resched() in the kernel, etc.
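
(To check the tick frequency of the running kernel, assuming the config is exposed in one of the usual places:)

$ grep 'CONFIG_HZ=' /boot/config-"$(uname -r)"
$ zgrep 'CONFIG_HZ=' /proc/config.gz   # alternative, if CONFIG_IKCONFIG_PROC is enabled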

I'm not sure I understand your comment about the accumulation of probability, but reducing the time slice should give more opportunities to tasks to be processed by the primary CPUs.

The "primary domain" feature is intended to provide stable performance in presence of hybrid cores, so tasks will likely run on the subset of cores of the same type, instead of mixing E-cores and P-cores for example, which could lead to inconsistent performance / power consumption. So, it is not intended to maximize throughput.

Yes, I understand that and didn't mean to oppose the design.

Is there a possibility that it could be optimized/refined so that the "task overflow" process can happen sooner after the primary domain is loaded?

Tasks immediately overflow when there's no idle CPU in the primary domain. But then the scheduler tries to bring them back to the primary domain when they wake up or are re-enqueued (by kicking idle CPUs in the primary domain).

If you need throughput you can use -m all and force the primary domain to include all the CPUs in the system.

Thanks for your suggestion. Though it does eliminate the "delayed overflow" issue, P-cores will not be prioritized even when scheduling single-thread workloads in this case - I understand this is intended.

Right, with -m all P-cores are not prioritized.

Maybe we could try to pick idle non-primary CPUs as a last resort in ops.select_cpu() (task wakeup), and keep kicking only idle primary CPUs in ops.enqueue(). This should make tasks less conservative and improve overall core utilization while still prioritizing the primary domain. That's how bpfland worked initially, but it wasn't that great for performance stability, because tasks were more likely to move between slow / fast cores, making performance less predictable. Maybe we should add a new option to enable this behavior?

arighi added a commit that referenced this issue Jan 18, 2025
Make it easier for tasks to overflow beyond the primary domain in a more
aggressive way, using all available idle CPUs as a last resort while
still prioritizing idle CPUs within the primary domain.

This should address issue #1145.

Signed-off-by: Andrea Righi <[email protected]>
@arighi
Contributor

arighi commented Jan 18, 2025

@Rongronggg9 can you test if the changes that I pushed to bpfland-next are making this any better? Thanks!

https://github.com/sched-ext/scx/tree/bpfland-next
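
(A rough way to try that branch, assuming the Rust schedulers in the repo build through the top-level cargo workspace; see the repository's build documentation if this differs:)

$ git clone -b bpfland-next https://github.com/sched-ext/scx.git
$ cd scx
$ cargo build --release -p scx_bpfland
$ sudo ./target/release/scx_bpfland --primary-domain performance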
