
[RFC] scx_rusty: Perform task migration immediately #994

Open · wants to merge 2 commits into main
Conversation

@vax-r (Contributor) commented Nov 27, 2024

Summary

According to #611, the load balancer produces a new domain id into "lb_data" every load-balancing period; however, sometimes those tasks aren't scheduled during the coming scheduling period, so no task migration is actually performed.

Utilize BPF_PROG_TYPE_SYSCALL so that the load balancer can update the task context's domain right away and transfer the related load information between the push domain and the pull domain immediately.
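The flow above can be sketched roughly as follows. This is a minimal userspace model of the idea, not the PR's actual code: `TaskCtx`, `apply_migration`, and the fixed-point load values are illustrative stand-ins, and the real update happens through a BPF_PROG_TYPE_SYSCALL program rather than a plain function call.

```rust
use std::collections::HashMap;

// Hypothetical model of the task context kept per tptr.
#[derive(Debug)]
struct TaskCtx {
    dom_id: u32,
    load: i64, // task load in fixed-point units (illustrative)
}

// Stand-in for invoking the BPF_PROG_TYPE_SYSCALL program: instead of
// recording a target domain in `lb_data` and waiting until the task is
// next scheduled, update the task's domain and move its load from the
// push domain to the pull domain immediately.
fn apply_migration(
    tasks: &mut HashMap<u64, TaskCtx>,
    dom_loads: &mut HashMap<u32, i64>,
    tptr: u64,
    pull_dom: u32,
) -> bool {
    let Some(task) = tasks.get_mut(&tptr) else { return false };
    let push_dom = task.dom_id;
    if push_dom == pull_dom {
        return false; // nothing to migrate
    }
    // Transfer the task's load between the two domains right away.
    *dom_loads.entry(push_dom).or_insert(0) -= task.load;
    *dom_loads.entry(pull_dom).or_insert(0) += task.load;
    task.dom_id = pull_dom;
    true
}

fn main() {
    let mut tasks = HashMap::from([(0x1000u64, TaskCtx { dom_id: 0, load: 500 })]);
    let mut dom_loads = HashMap::from([(0u32, 1200i64), (1u32, 200i64)]);
    assert!(apply_migration(&mut tasks, &mut dom_loads, 0x1000, 1));
    println!(
        "dom={} push_load={} pull_load={}",
        tasks[&0x1000].dom_id, dom_loads[&0], dom_loads[&1]
    );
    // prints "dom=1 push_load=700 pull_load=700"
}
```

The point of the sketch is that the push/pull bookkeeping happens at load-balancing time, so a task that never gets scheduled in the next period no longer leaves the migration unapplied.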

The implementation still has some points to be considered:

  • When the task picked for migration is currently running, should we ignore it, or how should we stop it?
  • Currently dom_active_tptrs has some problems. According to scx_rusty: Temporary fix of duplicate active tptr #941, we should redesign the structure and keep it synchronized with each load-balancing decision. Once dom_active_tptrs is always up to date, the task information in userspace can easily keep up, and it should be safe to regenerate dom.tasks each time.

(Personally, I think we should fix the dom_active_tptrs problem first, as it will eventually need to be synchronized with task migrations.)

Test

(The test cases and metrics considered may not provide enough coverage yet; please let me know if you have any suggestions or ideas, and I'll be happy to add more tests to verify everything.)
The tests were run on an AMD Ryzen 7 5700X3D 8-core processor (x86_64).

Kernel Compilation workload

We ran the rusty scheduler with the following command:

$ sudo ./build/debug/scx_rusty --stats 300 --cpumasks 0xf000 0x0f00 0x00f0 0x000f

Taking kernel compilation as the test workload, the number of migrations and the total compilation time are shown below:

###### Wed, 27 Nov 2024 17:56:41 +0800, load balance @  -677.0ms ######
cpu=  24.64 load=    4.15 mig=1907 task_err=0 lb_data_err=0 time_used=928.6ms
tot= 213982 sync_prev_idle= 1.05 wsync= 1.06
prev_idle=57.46 greedy_idle= 4.90 pin= 0.00
dir=23.10 dir_greedy= 0.10 dir_greedy_far= 0.44
dsq=11.07 greedy_local= 0.82 greedy_xnuma= 0.00
kick_greedy= 0.34 rep= 0.15
dl_clamp= 7.87 dl_preset= 4.02
slice=20000us
direct_greedy_cpus=ffff
  kick_greedy_cpus=ffff
  NODE[00] load=  4.15 imbal=  +0.00 delta=  +0.00
   DOM[00] load=  1.13 imbal=  +0.09 delta=  -0.06
   DOM[01] load=  0.93 imbal=  -0.11 delta=  +0.00
   DOM[02] load=  1.28 imbal=  +0.25 delta=  -0.02
   DOM[03] load=  0.81 imbal=  -0.22 delta=  +0.08

And the total kernel compilation time:

real	6m3.684s
user	0m0.027s
sys	0m0.111s

After the change, the stats are shown below:

###### Wed, 27 Nov 2024 17:05:13 +0800, load balance @  -689.0ms ######
cpu=  24.69 load=    4.19 mig=1583 task_err=0 lb_data_err=0 time_used=1008.6ms
tot= 225426 sync_prev_idle= 0.97 wsync= 0.92
prev_idle=56.13 greedy_idle= 6.49 pin= 0.00
dir=24.05 dir_greedy= 0.11 dir_greedy_far= 0.46
dsq=10.52 greedy_local= 0.35 greedy_xnuma= 0.00
kick_greedy= 0.04 rep= 0.19
dl_clamp= 0.00 dl_preset=10.87
slice=20000us
direct_greedy_cpus=ffff
  kick_greedy_cpus=ffff
  NODE[00] load=  4.19 imbal=  +0.00 delta=  +0.00
   DOM[00] load=  1.40 imbal=  +0.36 delta=  -0.04
   DOM[01] load=  0.93 imbal=  -0.12 delta=  +0.00
   DOM[02] load=  0.75 imbal=  -0.30 delta=  +0.44
   DOM[03] load=  1.10 imbal=  +0.06 delta=  -0.40
real	5m59.663s
user	0m0.030s
sys	0m0.113s

While the kernel compilation time shrinks only a little, we can observe that the number of migrations decreased a lot, which implies that each migration is more effective at reducing the load imbalance between domains, so the load balancer doesn't have to perform as many useless operations.

Stress test

Run the rusty scheduler (collecting metrics every 50 seconds):

$ sudo ./build/debug/scx_rusty --stats 50 --cpumasks 0xf000 0x0f00 0x00f0 0x000f

Test the change with stress-ng:

$ stress-ng --cpu 12 --iomix 12 --timeout 60s --metrics

The metrics are shown below:

###### Wed, 27 Nov 2024 18:08:13 +0800, load balance @  -676.4ms ######
cpu=  78.26 load=   14.72 mig=14 task_err=0 lb_data_err=0 time_used=142.1ms
tot= 183819 sync_prev_idle= 0.01 wsync= 0.02
prev_idle= 6.47 greedy_idle= 1.70 pin= 0.00
dir=34.49 dir_greedy= 2.15 dir_greedy_far= 1.84
dsq=31.31 greedy_local=22.00 greedy_xnuma= 0.00
kick_greedy= 0.22 rep= 0.66
dl_clamp= 0.01 dl_preset=53.31
slice=20000us
direct_greedy_cpus=f0ff
  kick_greedy_cpus=ffff
  NODE[00] load= 14.72 imbal=  -0.00 delta=  +0.00
   DOM[00] load=  3.61 imbal=  -0.07 delta=  +0.00
   DOM[01] load=  3.55 imbal=  -0.13 delta=  +0.00
   DOM[02] load=  3.89 imbal=  +0.21 delta=  +0.00
   DOM[03] load=  3.67 imbal=  -0.01 delta=  +0.00

Performance is almost the same with and without the change under a heavy workload on all cores.

Related Issue

#611

Under severe load-imbalance scenarios, such as mixtures of CPU-intensive and I/O-intensive workloads, the same tptr may be written into the same dom_active_tptrs array.

This leads to load balancer failures: when the tptr's task carries a large enough load, it tends to be selected repeatedly, so warnings about the same tptr being set in "lb_data" keep popping up.

Use a workaround for now, which is to keep a HashSet in userspace recording the currently active tptrs under a domain, and not to generate the same task repeatedly.
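The dedup workaround can be sketched as follows. This is an illustrative sketch only: `Domain` and `push_task` are hypothetical stand-ins for scx_rusty's internal types, not the actual patch.

```rust
use std::collections::HashSet;

// Userspace dedup: track which tptrs are already active in a domain so
// the same task is not generated twice in one load-balancing pass.
struct Domain {
    active_tptrs: HashSet<u64>, // tptrs already recorded for this domain
    tasks: Vec<u64>,            // candidate tasks for load balancing
}

impl Domain {
    fn new() -> Self {
        Self { active_tptrs: HashSet::new(), tasks: Vec::new() }
    }

    // Only add the task if its tptr has not been seen for this domain.
    // `HashSet::insert` returns false when the value was already present.
    fn push_task(&mut self, tptr: u64) -> bool {
        if self.active_tptrs.insert(tptr) {
            self.tasks.push(tptr);
            true
        } else {
            false // duplicate tptr: skip to avoid selecting it twice
        }
    }
}

fn main() {
    let mut dom = Domain::new();
    assert!(dom.push_task(0xdead));
    assert!(!dom.push_task(0xdead)); // duplicate is rejected
    assert!(dom.push_task(0xbeef));
    println!("active tasks: {}", dom.tasks.len());
    // prints "active tasks: 2"
}
```

Since the set mirrors what is already in dom_active_tptrs, a duplicate tptr coming out of the BPF side is simply dropped instead of triggering the repeated "lb_data" warnings.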

Signed-off-by: I Hsin Cheng <[email protected]>
According to sched-ext#611, the load balancer produces a new domain id into "lb_data" every load-balancing period; however, sometimes those tasks aren't scheduled during the coming scheduling period, so no task migration is actually performed.

Utilize BPF_PROG_TYPE_SYSCALL so the load balancer can update the task
context's domain right away and transfer related information between
push domain and pull domain immediately.

Signed-off-by: I Hsin Cheng <[email protected]>
@vax-r force-pushed the rusty_migrate_immediately branch from 3fab6d0 to d5c28e2 on November 27, 2024 at 10:13