Independent SubInterpreters are still not concurrent with python 3.12+ #593

Open
novos40 opened this issue Jan 23, 2025 · 13 comments

@novos40

novos40 commented Jan 23, 2025

Describe the bug
I'm running the latest Python 3.13.1 on an 8-core VM. I'm starting 8 Java threads. Each thread creates its own SubInterpreter and runs a CPU-only function like

def cpu_bound(number):
    # print(f">>> cpu_bound({number = })")                   # debug print
    # res = sum(i * i for i in range(number))                # CPU load
    print(f">>> cpu_bound({number = }): while loop")
    res = 0
    i = 0
    while(i < number):
        res += i * i
        i += 1
    print(f"<<< cpu_bound({number = }): while loop")
    return res

I've removed all function calls, even built-ins like sum(). It seems that SubInterpreter creation and even the function starts are concurrent, but overall execution is still synchronized on something (despite the fact that Python 3.12+ should support a per-interpreter GIL). The output looks like this (threads are started with consecutive numbers so we can match start and stop)

>>> cpu_bound(number = 100000001): while loop
>>> cpu_bound(number = 100000003): while loop
>>> cpu_bound(number = 100000004): while loop
>>> cpu_bound(number = 100000005): while loop
>>> cpu_bound(number = 100000007): while loop
>>> cpu_bound(number = 100000002): while loop
>>> cpu_bound(number = 100000006): while loop
>>> cpu_bound(number = 100000000): while loop
<<< cpu_bound(number = 100000003): while loop
<<< cpu_bound(number = 100000001): while loop
<<< cpu_bound(number = 100000007): while loop
<<< cpu_bound(number = 100000005): while loop
<<< cpu_bound(number = 100000004): while loop
<<< cpu_bound(number = 100000006): while loop
<<< cpu_bound(number = 100000002): while loop
<<< cpu_bound(number = 100000000): while loop

which tells me that all functions do start concurrently, but then they wait for each other in some random order. Across multiple runs the order of starts and stops varies, but the first thread always finishes only after the last thread has started. CPU utilization never exceeds 12% (except for a very short startup period, for code compilation, I guess), i.e. about one core out of eight, exactly like single-threaded execution. For that matter, you can start Python threads and get exactly the same performance, or rather lack thereof.
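
The comparison with plain Python threads can be tried without Jep. The following is a minimal sketch, assuming a regular (GIL-enabled) CPython build; the helper name `run_in_threads` and the reduced workload are made up for illustration and are not part of the original report.

```python
import threading
import time


def cpu_bound(number):
    # Same pure-Python loop as in the report, without the debug prints.
    res = 0
    i = 0
    while i < number:
        res += i * i
        i += 1
    return res


def run_in_threads(numbers):
    """Run cpu_bound(n) for each n in its own thread; return (results, seconds)."""
    results = {}

    def worker(n):
        results[n] = cpu_bound(n)

    threads = [threading.Thread(target=worker, args=(n,)) for n in numbers]
    started = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - started


if __name__ == "__main__":
    # Hypothetical workload, much smaller than in the report so it finishes quickly.
    numbers = [200_000 + k for k in range(8)]
    results, elapsed = run_in_threads(numbers)
    print(f"{len(results)} threads finished in {elapsed:.2f}s")
```

On a build with a single shared GIL, the elapsed time for eight threads would be expected to stay close to the time of running the eight calls one after another, which matches the roughly one-core utilization described above.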

What am I doing wrong here? How do I make sub-interpreters execute concurrently?

To Reproduce

  1. Create and start java platform thread
  2. Create independent SubInterpreter in each thread
  3. Call python function

Expected behavior
Fully independent SubInterpreters should run concurrently allowing for 100% CPU utilization.

Environment (please complete the following information):

  • Windows 10, Linux
  • Python v3.12.1, v3.13.1
  • java 21
  • Jep v4.2.2
  • Python packages used (e.g. numpy, pandas, tensorflow): pure python CPU-only code


@ndjensen
Member

Just to be clear, what SubInterpreterOptions did you set on your JepConfig when creating the SubInterpreters? Jep is just passing the options to the CPython interpreter so if there really is a problem I don't know if there's much we can do about it.

@novos40
Author

novos40 commented Jan 23, 2025

Oh, sorry, I totally missed the sub-interpreter options.
However, using SubInterpreterOptions.isolated() just kills the JVM process without any message or memory dump. It looks like somebody just called System.exit().
Using legacy options and manually setting

    final SubInterpreterOptions sio = SubInterpreterOptions.legacy();
    sio.setCheckMultiInterpExtensions(true);
    sio.setUseMainObmalloc(false);
    sio.setOwnGIL(true);
    jepConfig.setSubInterpreterOptions(sio);

produces the same result: the JVM just dies on the first Interpreter.set(name, value) call that sets a non-primitive Java value. That is, in the following sequence (cntx is an instance of Interpreter)

    cntx.set("pythonEngine", "JEP");     // set String value
    cntx.set("undefined", noValue);      // set java object instance value
    cntx.set("appContext", appContext);  // set java lambda value
    cntx.set("ML", ML.class);            // set java class value

the first two sets work fine and the JVM dies on the third (the Java lambda value). It also dies on setting the ML.class value if I flip the last two lines.
It looks like it dies in the java.lang.Class.getDeclaredFields() method while reflecting on the value, or at least that's the last place I can see in the Java debugger.
FYI:
appContext is just a reference to a static method of some java class. Nothing from me, all pure java.
ML class is actually quite large with 1500+ methods (it's our universal data structure, most of the methods are for JIT optimizations), but it should not matter unless there is some limited space allocated somewhere.

Everything works fine with legacy options so I doubt that it's a buffer overflow issue.

Any ideas how to trace the reason?

@bsteffensmeier
Member

Thank you so much for the code examples you included. I initially tried to replicate the problem by adding some calls to set() in the isolated interpreter test, but I could not get it to crash. When I took exactly what you have and pasted it into a Java main, it crashed immediately. It took some digging to find the problem, but eventually I found that setting PYTHONMALLOC=pymalloc_debug would allow Java to create an hs_err_pid file, which had a native stack trace with some helpful clues in it.

The problem is that PyJObject (the Python class that represents java.lang.Object) is a statically allocated type which is shared between sub-interpreters. I had thought sharing would be safe because immutable objects are allowed to be shared, but I failed to take into account that creating a subclass mutates the superclass. The failure occurs when a subclass of PyJObject is created in an isolated sub-interpreter. We create a subclass every time a new Java class is used in Python. Not every subclass causes a crash; it seems to happen only when the dict holding subclasses reaches a capacity threshold where it needs to grow. I assume the reason I couldn't crash it in our test case is that the tests do a lot of testing in SharedInterpreters and the subclass map has already grown large enough that further growth is infrequent. With an isolated main program it crashes after only a handful of subclasses. PEP-684 specifically discusses how subclasses can cause problems, but until you found this crash I didn't realize it was going to impact us.
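
The point that creating a subclass mutates the superclass can be observed from pure Python. The sketch below is only an analogy: `Base` and the `Wrapper…` names are hypothetical stand-ins, and Jep's real types are defined in C rather than in Python code like this.

```python
class Base:
    """Stand-in for a shared base type such as PyJObject (hypothetical)."""


def subclass_names(cls):
    # __subclasses__() reads the per-type registry of direct subclasses.
    return sorted(sub.__name__ for sub in cls.__subclasses__())


before = subclass_names(Base)

# Create subclasses dynamically, similar in spirit to creating one Python
# type per Java class. Keeping references in a list keeps the types alive.
created = [type(f"Wrapper{i}", (Base,), {}) for i in range(3)]

after = subclass_names(Base)
print(before, "->", after)
```

Nothing in this snippet assigns to `Base`, yet its subclass registry changes as a side effect of each `type(...)` call. When the base type is shared by interpreters that each hold their own GIL, those side effects are unsynchronized writes to shared state.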

Unfortunately I don't think we can fix this in Jep 4.2, mostly because it is going to be a pretty big change but also because we don't want to drop support for older Python versions in a point release. We use the Python buffer protocol in subclasses of PyJObject, and before Python 3.8 classes using the buffer protocol had to be statically defined. On the dev_4.3 branch I've already made changes so the buffer type is allocated on the heap, but I was not planning on changing the way PyJObject and PyJClass are allocated. Your discovery definitely puts changing the way those are allocated on the agenda for the 4.3 release.

If you want to experiment with isolated interpreters I found that setting PYTHONMALLOC to a different allocator would prevent the crash for me. I think there is still a possibility of problems with other allocators so I do not recommend doing this in production but if you want to test how isolated interpreters perform it might give you some idea.
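
One way to experiment with different allocators is to set PYTHONMALLOC in the environment of the process that starts the JVM. The sketch below assumes a Python launcher script; the helper `env_with_allocator` is hypothetical, and the set of names lists only values documented for PYTHONMALLOC, not a recommendation of which one avoids the crash.

```python
import os

# Allocator names documented for the PYTHONMALLOC environment variable.
KNOWN_ALLOCATORS = {
    "default", "debug",
    "malloc", "malloc_debug",
    "pymalloc", "pymalloc_debug",
}


def env_with_allocator(allocator, base_env=None):
    """Return a copy of the environment with PYTHONMALLOC set."""
    if allocator not in KNOWN_ALLOCATORS:
        raise ValueError(f"unknown PYTHONMALLOC value: {allocator!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["PYTHONMALLOC"] = allocator
    return env


if __name__ == "__main__":
    env = env_with_allocator("pymalloc_debug")
    # A launcher could pass this dict as env= to subprocess.run([...]);
    # the java command line is application specific and left out here.
    print(env["PYTHONMALLOC"])
```

Because the function copies the environment instead of modifying `os.environ`, the launcher's own Python process keeps its original allocator setting.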

@novos40
Author

novos40 commented Jan 24, 2025

Thanks a lot for a quick response on this!
I'm certainly willing to try whatever you can give me. One of the main goals of bringing python into java ecosystem for us is real multithreading. I understand all this is very new and unstable, but this is too important for us. Could you please provide instructions for me how I can do this? Can I just take dev_4.3 and try or do I need to tweak something? I'm a bit rusty on C/C++ but I can figure things out provided with some guidance. We already have a potential client lined up for this capability so I'd like to make it work ASAP even if it's not a production quality yet. In any case it will take some time for them to evaluate and setup their dev process for that. By that time, hopefully, we can have a permanent fix.

From your comment about the dict crashing when it needs to grow, here are some quick-fix ideas:

  1. Can you just pre-allocate the dict to some large enough capacity so it does not need to grow? The initial size can be a parameter provided via a variable or something. Clearly temporary and not pretty, but it allows us to move forward before a permanent fix is in place
  2. Can you just move the registry to a [concurrent] Java map? This is what I usually do. Whenever I have problems with Python code limitations, I move the functionality to Java and simply provide a Python wrapper. Works like a charm. Practically all shared objects in our system are Java objects wrapped in Python interfaces, including the dict API. I find Python implementations to be very capricious most of the time so I just get rid of them. Nobody has to know the implementation details of some internal data structure :-)

Please let me know if I can help in any way.

P.S. It's just occurred to me: Can I fix the problem by first creating a shared interpreter with all initializations (i.e. allow dict to grow to correct size) and then work with sub-interpreters?

@novos40
Author

novos40 commented Jan 24, 2025

Just tried creating a shared interpreter first and then working with sub-interpreters. It works! :-) Kinda :-(

  • Good: I got the expected 100% CPU utilization and a corresponding 8x reduction in test run time
  • Bad: After a few consecutive runs of the same test the JVM still died because of an access violation, but this time with a log file (see attached)

hs_err_pid2160.log

Can you please take a look and tell me what else I can do to make it work.
Thanks

P.S. first log might not be totally useful b/c I was using visualvm and it would instrument the code potentially changing classes
Here are a couple more logs

hs_err_pid8900.log
hs_err_pid15656.log

I'm not an expert, but it seems both of them are trying to write something to a null pointer (freed memory?). The last one took about 10 test runs before dying. A pause of a few seconds between runs seems to help keep it running, so it might be related to GC, I guess.

@bsteffensmeier
Copy link
Member

Could you please provide instructions for me how I can do this? Can I just take dev_4.3 and try or do I need to tweak something? I'm a bit rusty on C/C++ but I can figure things out provided with some guidance.

The changes currently on the dev_4.3 branch are only a small step in the right direction. In Jep 4.2 I think there are 4 statically allocated types that need to be moved to heap-allocated types to be safe in sub-interpreters. Those types are PyJObject, PyJClass, PyJArray, and PyJBuffer. The existing changes handle PyJBuffer, leaving only 3 more. Unfortunately PyJBuffer is by far the easiest of the 4, since the type is only referenced during initialization. Since the other 3 are referenced more often, we will have to save off the types in an interpreter-specific data structure. I am still trying to understand what needs to change myself, but from what I have seen I think PEP-630 describes all the things we need to do.

  1. Can you just pre-allocate the dict to some large enough capacity so it does not need to grow? The initial size can be a parameter provided via a variable or something. Clearly temporary and not pretty, but it allows us to move forward before a permanent fix is in place

I am not aware of any API for pre-sizing dicts. Also, the crash while resizing is just a symptom of a larger problem: it is not safe to concurrently access a dict, and even if pre-sizing prevents crashes it will not prevent interpreters from concurrently modifying the dict and potentially creating invalid state. I suspect this is why you still see crashes in your tests where you use a shared interpreter.

2. Can you just move the registry to a [concurrent] Java map? This is what I usually do. Whenever I have problems with Python code limitations, I move the functionality to Java and simply provide a Python wrapper. Works like a charm. Practically all shared objects in our system are Java objects wrapped in Python interfaces, including the dict API. I find Python implementations to be very capricious most of the time so I just get rid of them. Nobody has to know the implementation details of some internal data structure :-)

The problem is occurring in the tp_subclasses dict, which is in the CPython code. It is marked as internal to CPython. Jep does not allocate, modify, or even access it. You would have to change the CPython code to use anything other than a dict.

P.S. It's just occurred to me: Can I fix the problem by first creating a shared interpreter with all initializations (i.e. allow dict to grow to correct size) and then work with sub-interpreters?

Every sub-interpreter creates a hierarchy of Java types that mirrors the Java class hierarchy of any Java class used from Python. Right now the java.lang.Object type is shared by all sub-interpreters. When sub-interpreters create their own types for subclasses of java.lang.Object, everything breaks. If you could create a type hierarchy beforehand for every Java class that will ever be used from Python and ensure all of those types are immutable, then those types could be shared between sub-interpreters and the problems should go away. I don't think that is a very practical solution for most use cases, but it might be something you could get working if you happen to know every Java class needed in Python beforehand.

@bsteffensmeier
Member

@novos40 Since you have mentioned on other issues that you are using shared modules and that compatibility with other Python modules is important to you, I also want to make sure you are aware that there are no plans to support shared modules in isolated sub-interpreters. While I can't speak authoritatively for the entire Python ecosystem, I suspect that a vast majority of Python modules that include native code will not work with isolated sub-interpreters. I know that numpy is a popular extension module that has been hesitant to support sub-interpreters in the past, so I was curious whether they are moving to support isolated sub-interpreters. I found this issue saying they do not currently support it and are not actively working to support it, and also this post, which shares my opinion that most extension modules do not work with isolated sub-interpreters.

@bsteffensmeier
Member

#594 has been merged into the dev_4.3 branch which resolves the problems with isolated sub-interpreters when calling set() multiple times. I don't currently have a use case that needs sub-interpreters so my testing is limited to the problem case presented here. If you are planning to use isolated sub-interpreters I would recommend you continue to test against the dev_4.3 branch.

@novos40
Author

novos40 commented Feb 22, 2025

Thanks! I'll definitely check it out. Please keep me posted about further developments.

@novos40
Author

novos40 commented Feb 25, 2025

I did some testing with dev_4.3 branch:

  • Isolated sub-interpreters seem to be stable on Windows now (at least with pre-loading of all code into a shared interpreter first so all internal structures grow to the correct size before isolated interpreters are created). I had one crash, but I could not reproduce it. It still does not work on ARM Mac: it crashes immediately
  • Non-isolated sub-interpreters and shared interpreters seem to be stable on both platforms

Here are some other problems I did not mention before

  • Is it possible to implement Interpreter.enter() and leave() methods to attach and detach an interpreter instance to/from the current thread (same as in Graal)? I could not find any references to the current thread kept in the C code, so I guess it should be possible. My initial idea of a thread-local pool of pre-initialized interpreters seems to be working, but it depends heavily on how standard Java thread pools manage their threads. It appears that a pool often prefers to destroy the current thread and create a new one. Since it's a brand new thread instance, it triggers creation and initialization of a new interpreter instance, which renders the whole idea rather moot. Having enter()/leave() methods would allow the same interpreter instance to be used by multiple threads, but only one at a time. I'll probably try to implement it myself, so any pointers would be appreciated
  • Build script on Mac generated *.so library while original jep package provides *.jnilib file. I did rename *.so to *.jnilib and it seems to work, but maybe you want to modify build script
  • It does not seem to work with Python virtual environments at all. This is especially hard to solve on Mac as it has its own unique way of installing Python :-(. The only way to make it work was to manually resolve all [shim] links to actual physical locations and explicitly set the Python home, executable and all import directories. I've just made a first run on Linux and I see the same problem: Python failed to initialize because some system modules/libraries were not found. I guess it takes the given Python executable location and infers PYTHONHOME from it. If you use a shim link, the actual location never gets resolved properly. If the library resolved symbolic links first and then used the actual locations internally, it would solve the problem, I guess. Right now it's a major pain to investigate the correct locations in split installations (which is what pyenv does) and configure everything manually.
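
The manual investigation described in the last bullet can be partly scripted by asking the target interpreter itself where it lives. This is a minimal sketch using only the standard library; `python_locations` is a hypothetical helper, and how the reported values are then passed to Jep is not covered here.

```python
import os
import sys


def python_locations():
    """Collect the paths an embedding application may need to set explicitly."""
    exe = sys.executable
    return {
        "executable": exe,
        # realpath follows symbolic links to the physical file.
        "resolved_executable": os.path.realpath(exe) if exe else None,
        # In a venv, prefix points at the venv and base_prefix at the base install.
        "prefix": sys.prefix,
        "base_prefix": sys.base_prefix,
        "in_venv": sys.prefix != sys.base_prefix,
        "import_path": list(sys.path),
    }


if __name__ == "__main__":
    for key, value in python_locations().items():
        print(f"{key}: {value}")
```

Run it with the interpreter from the environment in question. Note that `os.path.realpath` only follows symbolic links; if a shim is a small launcher script rather than a link, the resolved path will still point at the shim, and `base_prefix` is then the more useful value.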

@bsteffensmeier
Member

  • Isolated sub-interpreters seem to be stable on Windows now (at least with pre-loading of all code into a shared interpreter first so all internal structures grow to the correct size before isolated interpreters are created). I had one crash, but I could not reproduce it. It still does not work on ARM Mac: it crashes immediately

I have not done any testing with isolated sub-interpreters on Windows, so I am glad that is working. I have done all my testing on an ARM Mac and I am no longer encountering crashes on dev_4.3 after the recent changes. If you could provide a test case that is crashing for you, then I can test it on my machine. It should not be necessary to preload code in a shared interpreter. Do you see more crashes if you don't do that?

  • Is it possible to implement Interpreter.enter() and leave() methods to attach and detach an interpreter instance to/from the current thread (same as in Graal)? I could not find any references to the current thread kept in the C code, so I guess it should be possible. My initial idea of a thread-local pool of pre-initialized interpreters seems to be working, but it depends heavily on how standard Java thread pools manage their threads. It appears that a pool often prefers to destroy the current thread and create a new one. Since it's a brand new thread instance, it triggers creation and initialization of a new interpreter instance, which renders the whole idea rather moot. Having enter()/leave() methods would allow the same interpreter instance to be used by multiple threads, but only one at a time. I'll probably try to implement it myself, so any pointers would be appreciated

This looks like the idea presented in #589. I think it would take substantial effort to get Jep working that way, but I would prefer to keep that discussion in #589.

  • Build script on Mac generated *.so library while original jep package provides *.jnilib file. I did rename *.so to *.jnilib and it seems to work, but maybe you want to modify build script

We have code to link libjep.jnilib to the .so that is generated during the build. I just tested it on my Mac using python3 setup.py install and also pip install . and in both cases libjep.jnilib exists in the jep python directory. What commands are you using to install jep?

  • It does not seem to work with Python virtual environments at all. This is especially hard to solve on Mac as it has its own unique way of installing Python :-(. The only way to make it work was to manually resolve all [shim] links to actual physical locations and explicitly set the Python home, executable and all import directories. I've just made a first run on Linux and I see the same problem: Python failed to initialize because some system modules/libraries were not found. I guess it takes the given Python executable location and infers PYTHONHOME from it. If you use a shim link, the actual location never gets resolved properly. If the library resolved symbolic links first and then used the actual locations internally, it would solve the problem, I guess. Right now it's a major pain to investigate the correct locations in split installations (which is what pyenv does) and configure everything manually.

In my experience setting PYTHONHOME when using a virtual environment never works. To use jep in a virtual environment you need to activate the environment before starting java by sourcing the activate script. Jep works fine for me on linux and Mac as long as the venv is active before starting java.

@novos40
Author

novos40 commented Feb 27, 2025

We have code to link libjep.jnilib to the .so that is generated during the build. I just tested it on my Mac using python3 setup.py install and also pip install . and in both cases libjep.jnilib exists in the jep python directory. What commands are you using to install jep?

Ah.. I guess I'm spoiled by java/osgi where you just drop a bundle jar into auto-launch folder and everything works. It never occurred to me that I need to do formal install to get libraries. I only did python setup.py build and then repackage the jar. I'm running everything in OSGi so I can't use jar/library generated by build script. I need to repackage it into an OSGi bundle with proper manifest headers and include binaries for all platforms (i.e. windows, linux and mac) into the bundle jar itself so the same jar works everywhere. Library loading is done by OSGi container based on detected OS/hardware so I don't really need binaries installed elsewhere.
This raises another question though: do I need a formal pip install jep so that everything else besides the binaries also works? Maybe this is the reason for the crash on Mac, since I still have Jep 4.2 installed in Python?

To use jep in a virtual environment you need to activate the environment before starting java by sourcing the activate script. Jep works fine for me on linux and Mac as long as the venv is active before starting java.

OK, I see how it would work from command line. How do I properly activate python environment if I run java from eclipse or other IDE for that matter?

Finally, since we're getting close to a working system, the next level question: How do I debug python code with jep? Graal provides google chrome and vscode debug hooks. I can't testify about vscode since I'm not using it, but chrome browser interface, while not stellar, is rather decent. It provides basic source code level breakpoints and inspections. Can I debug python part in eclipse somehow? I remember debugging python with print statements in 2001, but I don't think anybody would agree to this in 2025 :-)

If you prefer to keep debugging discussion separate I can start another thread for that

@bsteffensmeier
Member

This raises another question though: do I need a formal pip install jep so that everything else besides the binaries also works? Maybe this is the reason for the crash on Mac, since I still have Jep 4.2 installed in Python?

That is the only officially supported way of installing jep. I know many people have tried to package the python code and libraries in jars for easier distribution in Java but as far as I know none of them have made their work open source. You might be able to find more details on what others have done by searching through the jep issues.

OK, I see how it would work from command line. How do I properly activate python environment if I run java from eclipse or other IDE for that matter?

You could probably activate the venv before starting the IDE, and subprocesses should inherit the settings. I think the activate script just sets up some env vars, so you could also try to set the same env vars in your run configuration. I don't really use venv and have never tried from an IDE, so I can't offer much more help than that. In general, using venv from an embedded interpreter is not as easy as it should be, but I don't think it is within the scope of Jep to fix that, although we will try to expose new options as they become available in CPython.
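
As a rough illustration of "set the same env vars", the sketch below approximates the main effects of a standard venv activate script: it sets VIRTUAL_ENV, puts the venv's executable directory first on PATH, and clears PYTHONHOME. The helper name `venv_environment` and the example path are hypothetical, and whether these variables are sufficient for Jep in a given IDE has not been verified here.

```python
import os


def venv_environment(venv_dir, base_env=None):
    """Approximate the variables set by a venv's activate script."""
    env = dict(os.environ if base_env is None else base_env)
    # venv uses "Scripts" on Windows and "bin" elsewhere.
    bin_dir = os.path.join(venv_dir, "Scripts" if os.name == "nt" else "bin")
    env["VIRTUAL_ENV"] = venv_dir
    old_path = env.get("PATH", "")
    env["PATH"] = os.pathsep.join(p for p in (bin_dir, old_path) if p)
    # The activate script unsets PYTHONHOME, so drop it here as well.
    env.pop("PYTHONHOME", None)
    return env


if __name__ == "__main__":
    # Hypothetical venv location, used only to show the resulting values.
    env = venv_environment(os.path.join(os.path.expanduser("~"), "venvs", "jep"))
    for name in ("VIRTUAL_ENV", "PATH"):
        print(f"{name}={env[name]}")
```

The printed values can be copied into an IDE run configuration, or the returned dict can be passed as `env=` when a script starts the JVM as a subprocess.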

Finally, since we're getting close to a working system, the next level question: How do I debug python code with jep?

I have never done that. Here is a link to a thread on our old mailing list which describes a process that has worked for some others.
