Skip to content

Conversation

@marius311
Copy link
Contributor

@marius311 marius311 commented Feb 25, 2021

Here's my maybe naive way to fix #882.

Trying to aquire the GIL from the finalizer led to deadlocks. This is instead basically option 3 from here. Tests pass locally and this fixes the original issue, but I'm not an expert so maybe this has other consequences I'm not thinking of. Comments welcome.

@codecov-io
Copy link

codecov-io commented Feb 25, 2021

Codecov Report

Merging #883 (bdc0edf) into master (5d227fc) will decrease coverage by 0.07%.
The diff coverage is 88.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #883      +/-   ##
==========================================
- Coverage   67.61%   67.54%   -0.08%     
==========================================
  Files          20       20              
  Lines        1967     1981      +14     
==========================================
+ Hits         1330     1338       +8     
- Misses        637      643       +6     
Flag Coverage Δ
unittests 67.54% <88.23%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/pyinit.jl 82.24% <77.77%> (-0.42%) ⬇️
src/PyCall.jl 70.30% <100.00%> (+0.54%) ⬆️
src/pybuffer.jl 61.44% <100.00%> (-0.46%) ⬇️
src/gc.jl 69.23% <0.00%> (-30.77%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5d227fc...bdc0edf. Read the comment docs.

@stevengj
Copy link
Member

stevengj commented Feb 25, 2021

See my comment here: #882 (comment)

I think we probably need to copy the FFTW.jl solution here, which was rather painfully developed with the help of @vtjnash.

(I wonder if we should refactor that solution out into a package so that both FFTW.jl and PyCall.jl can use the same code? The logic is pretty subtle to get right. Maybe put the package in https://github.com/JuliaParallel?)

@stevengj
Copy link
Member

Hmm, I'm not sure we can factor the FFTW.jl solution that cleanly into a package, since trylock(fftwlock) probably needs to be replaced with the GIL lock (or we need to protect every call to Python with our own lock, which would be unfortunate). Unfortunately, there doesn't seem to be any non-blocking analogue of trylock for the GIL? So, we might still need to re-think the logic a bit.

@stevengj
Copy link
Member

stevengj commented Feb 25, 2021

Reading the Python documentation some more, it seems like the solution may be simpler — we're not really allowed to call Python from anything other than the main thread unless we do a lot more work, so if Threads.threadid() != 1 (we aren't in the main thread) (or maybe check PyGILState_Check() == 1?) in the finalizer, just push the PyPtr to a queue of objects to decref (protected by a lock as in FFTW.jl), and otherwise both decref the objects and flush the queue.

@marius311
Copy link
Contributor Author

marius311 commented Feb 26, 2021

I'm not sure I like the thought of putting the CPU in a spinloop

Yea youre right, I'm showing some of my ignorance on this stuff here.

Re: the last comment, I think I see your point and maybe a simple way to do it, also basically amounts (I think) to option two from that Julia doc link instead of option 3. Will give it a go.

@marius311
Copy link
Contributor Author

Ok, this is quite simple and seems to work.

@marius311
Copy link
Contributor Author

Actually, sorry, I realize this doesn't actually solve the segfault I was seeing in #881 (comment). I was getting confused since that only happens on this one system. So fwiw, I don't have an evironment where this PR fixes anything, either its not needed, or I get those segfaults regardless 🤷 granted it seems like a right thing to do.

@mkitti
Copy link
Member

mkitti commented Feb 26, 2021

Is it sufficient to guarantee that Thread.threadid() == 1? Perhaps we also need confirm that Base.current_task() === Base.roottask to ensure the stack is in a valid state for Python.

In JavaCall.jl, this is a necessary condition for some operations due to how Julia addresses the stack for each Task and how Java uses signalling. See the JULIA_COPY_STACKS environment variable and ALWAYS_COPY_STACKS option.
https://julialang.org/blog/2019/07/multithreading/#allocating_and_switching_task_stacks

@marius311
Copy link
Contributor Author

Fixed the return value. Up to you to decide what to do with this PR. Personally I'm far from sure on what the right solution is, if one is needed at the moment at all .

@stevengj
Copy link
Member

What does PythonCall.jl do, @cjdoris?

@cjdoris
Copy link

cjdoris commented Aug 19, 2023

What does PythonCall.jl do, @cjdoris?

All the finalisers in PythonCall go via enqueue or enqueue_all here.

These functions either immediately decref the pointer, or add it to a list of pointers to decref later. However, this is currently purely controlled by explicit PythonCall.GC.disable()/enable(), and like most other PythonCall functions should only be called from thread 1. It could be changed to check the thread and only immediately decref on thread 1 - which I think is the subject of the current PR, so I'll wait to see how you do it safely!

@stevengj
Copy link
Member

stevengj commented Aug 19, 2023

So fwiw, I don't have an environment where this PR fixes anything, either its not needed, or I get those segfaults regardless

@marius311, is there any update on whether this PR observably fixes anything?

I'm guessing that it may be more crucial now that Julia's GC is multi-threaded (JuliaLang/julia#48600) in Julia 1.10.

@MilesCranmer
Copy link
Contributor

I've been seeing a lot more segfaults on recent Julia + Python versions (x-ref #1072) so I was wondering if I could help revive this PR – @marius311 are you up for rebasing this and getting it ready for some new reviews?

@MilesCranmer
Copy link
Contributor

I also think even though this PR might not be the optimal GC strategy, at least it's safe. We can always improve it later.

@ericphanson
Copy link

btw I have a similar PR for PythonCall here: JuliaPy/PythonCall.jl#520. My version is a bit more complicated as I'm using a dedicated task on thread 1 to process the GC queue (and some of the complication is just around testing that it's working properly, and since PythonCall has enable/disable functionality for it's GC).

@codecov-commenter
Copy link

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 67.61%. Comparing base (5d227fc) to head (c707032).
⚠️ Report is 50 commits behind head on master.

Files with missing lines Patch % Lines
src/PyCall.jl 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #883      +/-   ##
==========================================
- Coverage   67.61%   67.61%   -0.01%     
==========================================
  Files          20       20              
  Lines        1967     1970       +3     
==========================================
+ Hits         1330     1332       +2     
- Misses        637      638       +1     
Flag Coverage Δ
unittests 67.61% <85.71%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

thread safety

9 participants