
train rf model in background thread #1178

Open · wants to merge 21 commits into base: main
Conversation

@bbudescu commented Dec 5, 2024

work in progress on #1170

@eddiebergman (Contributor) commented Dec 5, 2024

Heyo, just a question: why is this done as a thread and not a subprocess? A thread is easier by many stretches, but it has quite a few downsides compared to a process, at least in Python:

Thread:

  • Will not be truly parallel, i.e., it fights with the main process for computation (it's a Python thing). Execution goes back and forth between main-process computation and this thread. This is due to the GIL.
  • Uses same memory as main process, easier to transfer data for training.
  • Threads cannot be interrupted and can cause the main process to not close; handling Ctrl+C or other system signals can be difficult.

Process:

  • Will be parallel.
  • Has separate memory from the main process, so data must be sent to the subprocess. Usually quite cheap and handled trivially using something like the builtin ProcessPoolExecutor (see the sketch after this list).
  • System-level signals propagate to subprocesses; no special handling required.
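
For illustration, here is a minimal sketch of the process-based variant using the builtin ProcessPoolExecutor; `train_forest` and its arguments are hypothetical placeholders, not SMAC's actual API:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def train_forest(X: np.ndarray, y: np.ndarray):
    """Hypothetical stand-in for fitting the random forest surrogate."""
    # ... fit the pyrfr forest on (X, y) and return it; the result must be picklable ...
    return {"n_samples": len(X)}  # placeholder result


if __name__ == "__main__":  # guard needed for spawn-based start methods (e.g. Windows)
    with ProcessPoolExecutor(max_workers=1) as executor:
        # X and y are pickled and sent to the subprocess; the future lets the main
        # optimization loop keep running while training happens truly in parallel
        future = executor.submit(train_forest, np.random.rand(100, 4), np.random.rand(100))

        # ... the main loop can do other work here ...

        model = future.result()  # blocks only if training has not finished yet
```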

@bbudescu (Author) commented Dec 5, 2024

Will not be truly parallel, i.e., it fights with the main process for computation (it's a Python thing). Execution goes back and forth between main-process computation and this thread. This is due to the GIL.

Yes, that's what I was on about in #1170 (comment):

Of course, because of the GIL, the main optimization loop gets less CPU time, but although I haven't tested it, I expect that the main thread would spend most of its time in the training section anyway. And I don't think the other parts of the loop are very CPU intensive. Or are they? Do you know if, perhaps, the inference or something else takes really long?

So, actually, I think that by running training in a separate thread, instead of having, e.g., 90% of the main process's time spent on training, we might end up giving it only 50%, and leaving the other 50% for other work like reporting and inference, etc., if there's demand for that.

Also, using threads instead of processes (either through multiprocessing or dask) doesn't take a core away from the workers doing cost function evaluation. Perhaps that's not much of an advantage, though.

Threads cannot be interrupted and can cause the main process to not close; handling Ctrl+C or other system signals can be difficult.

I think this should already be fixed by daemonizing the thread (docs, further explanation).
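
For reference, a minimal sketch of what the daemonized training thread looks like (`train_loop` is a hypothetical placeholder, not the code in this PR):

```python
import threading


def train_loop():
    # hypothetical: repeatedly retrain the surrogate model on incoming data
    ...


# daemon=True means the interpreter does not wait for this thread at shutdown,
# so a still-running training loop does not block Ctrl+C / normal exit
trainer = threading.Thread(target=train_loop, daemon=True)
trainer.start()
```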

Has separate memory from the main process, so data must be sent to the subprocess. Usually quite cheap and handled trivially using something like the builtin ProcessPoolExecutor.

I'm not sure how random forest serialization is handled. I know that objects passed through Queues are pickled, but I'm not sure how concurrent.futures passes return values. I know that pyrfr.binary_rss_forest exposes ascii_string_representation(), which can be used for serialization.

LE: Oh, wait... I just noticed that it also implements __getstate__ and __setstate__, which rely on ascii_string_representation internally, so it should be pickleable.

Also, for the training data (X, y, and the data derived from them), as I wrote in #1170 (comment), one could use shared memory, because they're contiguous and homogeneous numpy arrays, which can be backed by multiprocessing.Array-like buffers.
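
As a rough illustration of that idea (a sketch only, using the stdlib multiprocessing.shared_memory rather than multiprocessing.Array; all names are hypothetical):

```python
import numpy as np
from multiprocessing import shared_memory

# "Parent" side: copy the training data into a named shared-memory block
X = np.random.rand(1000, 8)
shm = shared_memory.SharedMemory(create=True, size=X.nbytes)
X_shared = np.ndarray(X.shape, dtype=X.dtype, buffer=shm.buf)
X_shared[:] = X  # one copy into shared memory; the data itself is never pickled

# "Child" side: re-attach by name (shape/dtype must be communicated separately,
# e.g. through a Queue); here both sides run in one process just for the demo
shm_child = shared_memory.SharedMemory(name=shm.name)
X_view = np.ndarray(X.shape, dtype=X.dtype, buffer=shm_child.buf)
# ... train on X_view ...

shm_child.close()
shm.close()
shm.unlink()  # release the block once no process needs it anymore
```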

@bbudescu (Author) commented Dec 5, 2024

Threads cannot be interrupted and can cause the main process to not close; handling Ctrl+C or other system signals can be difficult.

Ah, damn, I wasn't aware of python/cpython#80116. (also this)

@eddiebergman (Contributor) commented

Heyo, some extra feedback:

  • Daemonizing threads is a dangerous route too. They effectively detach themselves from the main process, and so if there is a bug and the thread runs forever, well, good luck. I've also explored a lot of this in pynisher, which is consequently what SMAC uses to run evaluations in a subprocess with time (and memory?) limits. I just remember threads being the bane of lots of issues in the tests for that library.
  • Re serialization for processes: you're indeed right that mmap-ing arrays is efficient and useful. Rather than reinvent the wheel, you could use joblib, which is already a dependency of scikit-learn. It already wraps a lot of the Python-native ways of running things concurrently, as well as implementing its own wrapper around ProcessPoolExecutor which natively handles mmap-ing of objects that can be memory-mapped. It's a bit less low level and may prevent certain operations, but given the ubiquitous use of scikit-learn and how stable they've made it, it's probably best to rely on their implementations. (They also have threads, fyi.)

TL;DR: concurrent programming in Python is a mystical art with so many caveats, and really, experimentation is the only way to identify what works consistently. Try out joblib, and if it works, I think that's the simplest solution with minimal chances of surprises later on.
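
To make the suggestion concrete, here is a small hedged example of joblib's automatic memory-mapping of large arrays (`fit_surrogate` is a made-up placeholder, not SMAC code):

```python
import numpy as np
from joblib import Parallel, delayed


def fit_surrogate(X, y):
    # hypothetical training routine; large arrays arrive as read-only memmaps
    return X.shape[0]


X = np.random.rand(100_000, 16)
y = np.random.rand(100_000)

# Arrays bigger than max_nbytes are dumped to disk once and memory-mapped into
# the worker processes instead of being pickled over a pipe for every call.
results = Parallel(n_jobs=2, max_nbytes="1M")(
    delayed(fit_surrogate)(X, y) for _ in range(2)
)
```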

Bogdan Budescu added 6 commits December 10, 2024 17:04
…ry (backed by mmap on posix-compliant OSs) for data, and queues for synchronization
- work in progress: switch from threads to processes
…ed within background thread

- finish first version of switch from threads to processes
@bbudescu (Author) commented Dec 11, 2024

Hi @eddiebergman,

I couldn't figure out a way to use joblib to streamline the implementation of the behavior I wanted, namely, a loop that continuously trains on new data as it arrives and that always makes the current model immediately available whenever _predict is called (i.e., even if another model is currently being trained on more recent data). So I ended up keeping the old code structure, but switched from threading to the builtin multiprocessing.
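
To make the intended behavior concrete, here is a rough sketch of that pattern (not the actual code in this PR; `train` and the queue names are hypothetical):

```python
import multiprocessing as mp
import queue


def train(data):
    """Hypothetical placeholder for fitting the pyrfr random forest."""
    return data


def training_loop(data_q, model_q):
    """Background loop: always train on the newest data, discarding stale submissions."""
    while True:
        data = data_q.get()                # block until new training data arrives
        try:
            while True:                    # drain the queue, keep only the latest item
                data = data_q.get_nowait()
        except queue.Empty:
            pass
        if data is None:                   # sentinel: shut down
            return
        model_q.put(train(data))           # publish; _predict picks up the newest model


if __name__ == "__main__":
    data_q, model_q = mp.Queue(), mp.Queue()
    proc = mp.Process(target=training_loop, args=(data_q, model_q), daemon=True)
    proc.start()

    data_q.put([1, 2, 3])                  # submit fresh training data
    print(model_q.get())                   # the optimization loop grabs the latest model

    data_q.put(None)                       # ask the training loop to stop
    proc.join()
```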

Now, the code I pushed isn't tested yet, either manually/visually or by unit tests. Also, I'd like to add a flag that would allow the user to switch to the old, synchronous behavior, which I think would help with testing and also allow users to fall back on the old code until the new code is reliable enough. However, all the logic should be there (unless I made some gross mistakes), so it should give us at least a basis for discussion.

At this stage, I wanted to ask you how the AbstractModel._rng member (a np.random.RandomState that is overwritten by a pyrfr.regression.default_random_engine within RandomForest.__init__) is supposed to be used outside of the training logic (I imagine it might be serialized when saving the experiment state?), and how to handle it best. Currently, I've basically made the main thread agnostic of the rng used for training (except for the seed that is fed to it on initialization). Should I pass it back to the main process through the Queue I'm using for models? In that case, do you have any idea how to handle serialization / [un]pickling?

Also, given the non-deterministic nature of OS scheduling, running the same optimization session twice, even with the same rng seed, might produce different results, especially when running with a different number of workers. But I understand that's already the case, as per the warning in the Parallelism section of the Advanced Usage / Parallelism page in the docs:

Warning: When using multiple workers, SMAC is not reproducible anymore.

Other than that, if you'd like to take a look at what I've implemented so far and see whether there are any flagrant mistakes, any feedback and support would be greatly appreciated.

Thanks,

Bogdan

Bogdan Budescu added 10 commits December 11, 2024 17:58
… wait until training is done before being able to query the model to suggest a new config to try
- better synchronization / signalling between optimization loop and training loop
- refactor:
  - improve shared array semantics
  - encapsulation: reuse more allocation / cleanup code
  - defensive: extra checks
- other minor fixes / improvements
… multiple sessions in parallel, e.g., when running tests
…ests to terminate the training loop process gracefully (to as high as an extent as possible)

- add (and then disable) some code that prints to console to help debug inter-process synchronization
- refactor: renames for improved legibility
- refactor: encapsulate and reuse
- add option to run testing code without pytest for debug
- modify some testing code to avoid deprecation warnings
- refactor:
  - renames: improved legibility
  - easier debug printing for sync between opt and train loop processes
@bbudescu (Author) commented Dec 13, 2024

Hi @eddiebergman,

I managed to bring the code to a point where it passes all the tests without hanging. Indeed, it wasn't a pretty job :).

Could you now, please, advise on what's needed for this PR to be merged?

Here are some potential improvements I could think of:

1. UPDATE: DONE: Option to Disable Concurrency

Add a flag to revert to the old implementation, not just the old behavior. (Note: I've already added the option to wait for training to complete after every submission of fresh training data, but I was thinking of disabling multiprocessing altogether.)

2. Unit Tests

Would you like me to add some unit tests? What do you think they should look like?

3. Training Data IPC

In order to decrease risks, I could eliminate my own homebrew mmap-backed implementation of a dynamic array and use some other way to pass the training data between the optimization loop and the training loop process.

You suggested joblib above, but I don't think the futures-like interface can really suit my need to implement a loop discarding all but the latest data (or, at least, I wasn't able to figure out how to use it to attain my goals). However, upon further investigation, I found out that, under the hood, joblib uses cloudpickle (pypi, src) for more complex objects (like numpy.arrays) that can't be handled by the builtin pickle. An alternative to cloudpickle would be dill (pypi, docs, src), which has been around for quite some time.

I know you would prefer not to add further dependencies to the project, but I thought it doesn't hurt to ask :). I found the multiprocess package (pypi, docs, src), which acts as a drop-in replacement for the builtin multiprocessing module and uses dill for serialization. Both the project's PyPI release history and GitHub commit history show that it still gets regular maintenance updates, and it has been active since 2015 as part of another project, with some NGO backing and some academic connections (at least in the beginning, the project's author and maintainer was affiliated with Caltech and other institutes).

This would also simplify the inter-process signalling required to release and unlink the shared memory, and would avoid potential leaks (see 5., below).
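
For reference, a tiny example of what makes multiprocess attractive as a drop-in replacement (a hedged sketch, not code from this PR): because it serializes with dill, it can ship things the stdlib pickle cannot, such as lambdas.

```python
import multiprocess as mp  # mirrors the multiprocessing API, but pickles with dill

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        # mapping a lambda works here because it is serialized with dill;
        # the stdlib multiprocessing.Pool would fail to pickle it
        print(pool.map(lambda x: x * x, range(5)))  # [0, 1, 4, 9, 16]
```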

4. RNG

I still haven't heard back from you about the RandomForest._rng member that overrides AbstractModel._rng.

5. Leaks, hangs

As per the code here, in some complex dependency graphs the garbage collector doesn't get to call the __del__ methods of various objects, so execution just hangs unless the user explicitly calls close() on the model or the objects containing it.

This shouldn't normally be an issue, either in the common use cases or when running tests with pytest (I needed to add code to explicitly kill the background process in the fixture teardown; see the sketch below), but still, I think this should be fixed if possible.
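
A minimal sketch of the fixture-teardown workaround (the dummy class below is just a placeholder for a model that owns a background training process, not SMAC's actual class):

```python
import pytest


class _DummyBackgroundModel:
    """Stand-in for a model that owns a background training process."""

    def close(self) -> None:
        # in the real model: terminate the training process and unlink shared memory
        pass


@pytest.fixture
def model():
    m = _DummyBackgroundModel()
    yield m
    m.close()  # teardown: call this explicitly; relying on __del__ alone may hang


def test_model_is_usable(model):
    assert model is not None
```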

@bbudescu (Author) commented Dec 13, 2024

Hi again,

I just added an option to disable multiprocessing altogether and switch back to the old behavior. Would this, perhaps, help allow the merging of the pull request, exposing this to users as an experimental feature, at their own risk?

Thanks!

@bbudescu (Author) commented Dec 18, 2024

I ran a session similar to the one from which the data for the CPU load plot in #1170 (comment) was generated, and whose performance is reported in the motivation section of #1169 (comment):

I was checking out an optimization session running on 64 cores and noticed the CPU load drop below 50% after 12 hours. Not sure yet exactly how many trials it had gotten through at that point (after 22 hours it got through 12600 trials, and CPU load is under 40%).

Currently, it doesn't seem that the code I wrote has managed to fix it.

So here are some plots:

1. The original 46h session

[Screenshot: EC2 eu-west-1 instances CPU load, 2024-11-23]

2. Another 48h session (smooth)

[Screenshot: EC2 eu-west-1 instances CPU load, 2024-12-16]

LE: 20.4k trials on 17k distinct configs in 48h

3.a. UPDATED: Current 24h Session

[Screenshot: EC2 eu-west-1 instances CPU load, 2024-12-19]

3.b. UPDATED: Current 24h Session (smooth)

[Screenshot: EC2 eu-west-1 instances CPU load, 2024-12-19]

Observation

After 24h the session got through 14450 trials on 64 cores, and CPU load is about 42%.

LE: 22k trials on 18.4k distinct configs in 48h

Conclusion

Either the code is broken, or the bottleneck happens for a different reason, e.g., because of Hyperband.

LE: There's a non-negligible gain of about 10% in the number of trials/configs when enabling the code in this PR (i.e., between 2 and 3 above). So, in light of these new observations, a new hypothesis seems more likely: this code fixes one bottleneck, but there might be another one somewhere else.

Next steps:

  • reproduce the behavior more quickly, i.e., without requiring a 64-core machine and waiting at least 12 hours to get a definite observation of a drop in CPU load
  • do some profiling to measure which lock most workers spend the most time waiting on (LE: [Feature Request] Profiling #1181); EVEN LATER EDIT: actually, given the execution model, I think it's enough to profile the main process running the optimization loop (see the sketch after this list)
  • LE: a quick way to rule out the Hyperband intensifier would be to run the same session without it
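
For the profiling item above, one low-effort option is the stdlib cProfile, run in the main process around the optimization loop only (a sketch; `run_optimization` is a hypothetical entry point, not SMAC's API):

```python
import cProfile
import pstats


def run_optimization():
    # hypothetical: the main SMAC optimization loop
    pass


cProfile.run("run_optimization()", "optloop.prof")
pstats.Stats("optloop.prof").sort_stats("cumulative").print_stats(30)
```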

…more easily identifiable in monitoring tools
@bbudescu (Author) commented Dec 21, 2024

Additional CPU load graphs

Hi guys. I ran another bunch of optimization sessions turning off various components of SMAC to try and identify where the bottleneck lies. Here are the results:

1. Hyperband disabled

1.1. With Background Training

[Screenshot: EC2 eu-west-1 instances CPU load, 2024-12-21]
6.3k trials in 48 hours

1.2. Without Background Training (my smac3 clone)

[Screenshot: EC2 eu-west-1 instances CPU load, 2024-12-21]
6.16k trials in 48 hours

1.4. Without Background Training (original smac3 repo)

This was just to test if my code introduces any bottlenecks.
[Screenshot: EC2 eu-west-1 instances CPU load, 2024-12-21]
6.24k trials in 48 hours

2. Random Sampling

  • IN PROGRESS
  • no Random Forest Surrogate Model

2.1. Hyperband Disabled

[Screenshot: EC2 eu-west-1 instances CPU load, 2024-12-21]

  • ~ 5.5k-6k trials in 46 hours
  • still running - has another 2 hours budget

2.2. With Hyperband Enabled

  • not started yet

Conclusions

It now seems that the problem lies neither with training the RF model (when doing random sampling, fewer trials are completed than when using the RF; mind, however, that trial time, although deterministic, varies with the chosen params), nor with Hyperband. I guess some profiling is really necessary to identify the performance bottlenecks in a parallel setting.

Later, I will update the above graphs.
