
[WIP] Use spawner processes to speed up spawning #3719

Closed
wants to merge 6 commits into from

Conversation

@Jongy Jongy commented Jun 28, 2020

SCons tends to grow large in RAM usage. It's expensive for large processes to fork().
The best solution would be vfork(), but CPython doesn't support it properly (see this issue for latest status).

Until CPython has it (and in order to support older versions as well), I added the --process-spawner SCons option, which lets each Worker create its own private "spawner" process. Further exec_subprocess() calls by that Worker send the Popen requests to the spawner via simple IPC (pickle messages over stdin/stdout), and the spawner Popens the requested process, waits for its result and sends it back to the Worker.

Overall it's quite simple, yet for RAM-heavy SCons it adds a huge speedup.
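For illustration, here is a minimal sketch of the spawner side of that IPC (the names, message format and error handling are illustrative assumptions, not the exact code in this PR, and it glosses over details such as where the spawned child's own stdout goes):

import pickle
import subprocess
import sys

def spawner_main():
    # Read pickled Popen requests from stdin, run each one to completion,
    # and pickle the exit status back over stdout.
    while True:
        try:
            cmd, env = pickle.load(sys.stdin.buffer)  # one request per pickle message
        except EOFError:
            break  # the Worker closed the pipe; nothing more to spawn
        proc = subprocess.Popen(cmd, env=env)
        pickle.dump(proc.wait(), sys.stdout.buffer)
        sys.stdout.buffer.flush()

if __name__ == "__main__":
    spawner_main()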

Discussed (with some performance benchmarking) in https://pairlist4.pair.net/pipermail/scons-users/2020-June/008286.html and #3703 (specifically this comment, this one and this)

Tasks:

  • Take a look at replacing the IPC with the multiprocessing module
  • Pipe back exceptions raised in the spawner (and add a test for that)
  • Run a full-build benchmark, on a real project (and not partial builds of MongoDB / SCons artificial benchmarks)
  • Write tests (or: modify some existing tests to use --process-spawner)
  • Make sure all spawner extra code is only executed if on POSIX

Contributor Checklist:

I still have to finish these (I just want an ack from one of the maintainers that this looks okay)

  • I have created a new test or updated the unit tests to cover the new/changed functionality.
  • I have updated CHANGES.txt (and read the README.rst)
  • I have updated the appropriate documentation

SCons' process (and all Python stuff you push into it) may become
quite heavy in RAM. It's not cheap for RAM-heavy processes to do
fork()s and exec()s.

This PoC commit lets each worker of SCons have a private "spawner"
process which has a tiny memory footprint. Instead of using
subprocess.Popen directly, exec_subprocess() asks the spawner process
private to the current thread to do the spawning.
This makes spawning much faster.
pickle.load reads one pickle message from the buffer object and returns,
so there's no need to precede the message with its size.
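A tiny standalone illustration of that framing property (not code from this PR):

import io
import pickle

buf = io.BytesIO()
pickle.dump(("gcc", "-c", "a.c"), buf)   # first message, no length prefix
pickle.dump(("gcc", "-c", "b.c"), buf)   # second message appended right after
buf.seek(0)
print(pickle.load(buf))   # reads exactly one message: ('gcc', '-c', 'a.c')
print(pickle.load(buf))   # the next call picks up where the previous one stopped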
if process_spawner:
    from SCons.Job import spawner_tls

def exec_subprocess(l, env):
Author

I modified only exec_subprocess, and not exec_popen3 / _subproc, because they are more complex to wrap (although not impossible), and the "big win" for performance comes from this function, which is the most commonly called (AFAIK).
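Roughly, the wrapped worker-side path could look like the sketch below; spawner_tls and process_spawner are names from this PR, while the spawn() helper and the fallback body are assumptions for illustration only:

def exec_subprocess(l, env):
    # Hypothetical: use the thread's private spawner process if one was set up.
    spawner = getattr(spawner_tls, "spawner", None)
    if process_spawner and spawner is not None:
        return spawner.spawn(l, env)   # forward the request over the pickle pipe
    # Otherwise behave like stock SCons.
    proc = subprocess.Popen(l, env=env, close_fds=True)
    return proc.wait()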

@@ -54,6 +54,10 @@
import SCons.Subst
import SCons.Tool

# Gets set to True in Main._main() when the '--process-spawner'
# option was given on the command-line for speeding up process spawning.
process_spawner = False
Author

Not sure if this is the right spot to place it? It's POSIX only so it should be in posix.py maybe? I followed #3703 on this one.

@bdbaddog
Contributor

Please fix sider issues.
Is this PR in your opinion ready to merge, or still a work in progress?

@Jongy
Author

Jongy commented Jun 29, 2020

Is this PR in your opinion ready to merge, or still a work in progress?

The concept is not going to change much. I am planning to add a few more features before I think it's ready to merge, though. I'll update the description with tasks and complete them later this week.

@Jongy Jongy changed the title Use spawner processes to speed up spawning WIP: Use spawner processes to speed up spawning Jun 29, 2020
@Jongy
Author

Jongy commented Jul 3, 2020

About multiprocessing: I created a modified version that uses multiprocessing.Pool with the spawn context (so workers are not heavy forks of the main SCons Python process, but lightweight scripts). You can see it here. According to my tests, it performs a bit worse (benchmarks of perl scons_bld.pl builds, default settings).

multiprocessing

real    0m50.058s
user    2m38.115s
sys     1m0.131s

current version

real    0m45.071s
user    2m32.627s
sys     0m55.708s

Given these results I'll stick with the plain version.
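For reference, a minimal sketch of the spawn-context Pool approach compared above (the function names are illustrative, not the actual branch code):

import multiprocessing
import subprocess

def _run_one(args):
    # Run a single command in a pool worker and return its exit status.
    cmd, env = args
    return subprocess.call(cmd, env=env)

if __name__ == "__main__":
    # "spawn" starts workers as fresh interpreters rather than forks of the
    # (potentially large) parent process.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=8) as pool:
        rc = pool.apply(_run_one, ((["/bin/true"], None),))
        print("exit status:", rc)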

@Jongy
Author

Jongy commented Jul 3, 2020

I benched the build of Godot:

godot with spawner

real    4m52.019s
user    32m16.345s
sys     2m14.654s

godot master

real    4m53.794s
user    32m13.860s
sys     2m27.330s

This amounts to a tiny 0.5% improvement (user+sys time). I wasn't expecting to see much because Godot's SCons doesn't use too much RAM (starts at 60MB, climbs up to 150MB). The impact becomes more obvious as RAM usage grows.

@@ -264,6 +268,37 @@ def run(self):

self.resultsQueue.put((task, ok))

class Spawner:
Author

Should move to posix.py

@bdbaddog bdbaddog changed the title WIP: Use spawner processes to speed up spawning [WIP] Use spawner processes to speed up spawning Jul 3, 2020
@bdbaddog
Contributor

I'm not seeing any compelling results above.
Do you have any examples where you have notable improvements?
Also, probably good to show peak memory usage as that's the main stated reason to use vfork?

@Jongy
Author

Jongy commented Jul 11, 2020

Do you have any examples where you have notable improvements?

I don't have an open-source example with heavy memory consumption at hand (if you know of any project that might fit, please let me know and I'll try on it).

I was planning this week to take a more accurate measurement: the problem with a simple time is that it also counts the time spent by GCC and all the other tools executed by SCons. What we really want to measure here is the time of SCons only (and, when using the process spawner, the time of SCons plus all spawner processes). This will give the percentage of improvement over current SCons, which is really the metric we seek.

Obviously, if that metric shows no notable improvement, then it's not worth it. Though, like I said, for projects with heavy memory consumption this has proved useful (I can share some results of a Godot build where I forced SCons to take, say, 500MB; the difference becomes notable).

@bdbaddog
Contributor

Did you try --debug=time ?

@Jongy
Author

Jongy commented Jul 11, 2020

Did you try --debug=time ?

Nope, cool! Will try that, thanks for the tip.

@Jongy
Author

Jongy commented Jul 20, 2020

Okay, I tried --debug=time, but I didn't completely understand how it apportions time between SCons and the executed commands, and in any case it is based on wall-clock time rather than CPU time.

So I tried the following - I added this small snippet:

import resource
import sys

r = resource.getrusage(resource.RUSAGE_SELF)
print("main", r.ru_utime, r.ru_stime, file=sys.stderr)

to SCons' main() and to the spawner.py code - right before they exit. This should give accurate accounting for CPU time for the SCons process and the child spawners.

Trying again with Godot. On this branch without --process-spawner, the result is: main 34.841785 8.357901, so a total of 43.20 CPU seconds for SCons.

On this branch with --process-spawner, the result is main 33.671169 1.498248 plus 8 spawner processes (I used -j 8):

spawner 0.095034 0.14987699999999998
spawner 0.142122 0.13424999999999998
spawner 0.103977 0.096798
spawner 0.091567 0.122125
spawner 0.12446299999999999 0.111069
spawner 0.084918 0.127081
spawner 0.107025 0.121252
spawner 0.083006 0.162874

which account for 1.85 seconds, so the total is 37.02.

I admit this doesn't look notable compared to the total build time for Godot (which is about 2k CPU seconds), but it is a notable portion of the SCons time (about 15%), and as SCons' time grows, this part grows as well. With SCons at ~700MB, the --process-spawner version cuts the SCons time roughly in half (it remains at 37-38 seconds, while the standard version climbs to 75-80).

@acmorrow
Contributor

@Jongy - The MongoDB build might be another one to try if you want some additional data points. We too would like to see SCons be faster.

@Jongy
Author

Jongy commented Jul 20, 2020

@acmorrow I tested a build of mongo-r4.2.8.

The build command line I used: SCONS_VERSION=test python3 buildscripts/scons.py MONGO_VERSION=4.2.8 mongos --disable-warnings-as-errors (I made sure to place my SCons under src/third_party/scons-test/scons-local-test).

Without --process-spawner, SCons time is main 59.585993 16.084736 - 75.67 seconds in SCons.

With --process-spawner, the result is main 56.936538 3.723293 (plus 8 spawners totaling 1.8559) - 62.51 seconds in SCons.

@Jongy
Author

Jongy commented Jul 23, 2020

@bdbaddog What do you think about these results? There is no huge effect on the build time, but that's as far as it could go, since SCons itself is only a small portion of that time. But the improvement in SCons-only time is definitely notable, and like I said this is aimed at projects consuming more RAM, where this change is much more effective.

Also, as you can see in this comment, when SCons is under stress this change has positive effects.

Waiting for your approval (or disapproval :) before I go on with this.

@bdbaddog
Contributor

Curious how much RAM you're seeing SCons take.
I've seen cmake take > 5G of RAM.
There's a chance that other things you're doing in your build are causing a larger-than-necessary SCons footprint.

@Jongy
Author

Jongy commented Jul 24, 2020

IIRC it was about 1.5G, but I don't have access to the code at the moment, so I can't verify. It's indeed very likely the build doesn't really need those 1.5G - like every other program, it can be optimized. Nonetheless, the idea proposed here had a notable effect on the build time, at practically zero cost.

@grossag
Contributor

grossag commented Feb 8, 2023

Just FYI: as of Python 3.10, on Linux, Python's subprocess module uses vfork() instead of fork() unless the caller specifically disables it. So this pull request may not be needed anymore.
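For reference, the toggles involved are private module-level flags in CPython (implementation details that may change between versions, so treat this as an assumption rather than a public API):

import subprocess

# CPython 3.10+ on Linux prefers vfork()/posix_spawn() where it is safe to do so.
# These private, undocumented flags force the old fork() path:
subprocess._USE_VFORK = False        # added in 3.10 alongside the vfork support
subprocess._USE_POSIX_SPAWN = False  # present since 3.8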

@bdbaddog
Contributor

bdbaddog commented Feb 8, 2023

@grossag - Thanks for the update and for keeping an eye on developments related to this PR. :)

@Jongy
Author

Jongy commented Feb 8, 2023

Just FYI: as of Python 3.10, on Linux, Python's subprocess module uses vfork() instead of fork() unless the caller specifically disables it. So this pull request may not be needed anymore.

Yup, as noted in the description - the tracking ticket was https://bugs.python.org/issue35823, and it got merged in 3.10. This PR is far less relevant now.

@bdbaddog
Contributor

bdbaddog commented Feb 8, 2023

@Jongy - so safe to say the recommended fix now is to use python 3.10+? If so, let's close this PR.

@Jongy
Author

Jongy commented Feb 8, 2023

@Jongy - so safe to say the recommended fix now is to use python 3.10+? If so, let's close this PR.

Yes. It will cover the SCons case (that is, the subprocess usage of SCons).

I'll close the PR; I see no need to introduce the extra complexity given that people can just upgrade to 3.10.

@Jongy Jongy closed this Feb 8, 2023
@Jongy Jongy deleted the process-spawner branch February 8, 2023 23:01
@Jongy Jongy restored the process-spawner branch February 8, 2023 23:02
@mwichmann
Collaborator

I'm generally fine with this resolution, too, but... it's not true that everyone can just upgrade to 3.10+. We keep hearing from people who can't "because" (usually an IT/Security policy restricting them to the Python provided by the Linux distribution in question). We also need to get off using os.spawnve on Windows, but that's got its own trackers...

@bdbaddog
Contributor

bdbaddog commented Feb 9, 2023

@Jongy - One question for you. On the system(s) where the spawner was making a notable difference, how much RAM did the system(s) have?

@Jongy
Author

Jongy commented Feb 9, 2023

@Jongy - One question for you. On the system(s) where the spawner was making a notable difference, how much RAM did the system(s) have?

16GB. However, the physical RAM size doesn't matter; what matters is the number of pages in the app's page tables, which is the app's RSS divided by the page size. Using hugepages, for example, will very likely reduce the slowdown drastically. The RSS is the one I described here.
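As a rough, hypothetical illustration (assuming 4 KiB pages and the ~1.5G RSS mentioned earlier):

rss_bytes = 1.5 * 1024**3            # ~1.5 GB resident in the SCons process
page_size = 4 * 1024                 # 4 KiB pages, i.e. no hugepages
print(int(rss_bytes / page_size))    # ~393,000 page-table entries each fork() must copy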
