
WIP: improve build script; enable openmp #67

Open · wants to merge 2 commits into base: master

Conversation

@bwinkel commented Jul 4, 2020

Hi,

I use sgp4 in some of my projects and realized that the new accelerated wrapper is OpenMP-enabled. However, at least for the conda packages, which I usually use, parallelization seems to be inactive.

I noticed in your setup.py that the OpenMP-related compiler flag is commented out, next to a comment asking whether there is a way to find the proper compiler arguments for the different platforms. In several of my own Cython projects, I do this based on the platform module from the standard library. This PR is a work-in-progress proposal to add OpenMP support to your setup.py.
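A minimal sketch of that platform-based approach might look like this (the function name and flag choices are illustrative, not the exact code in this PR):

```python
import platform

def get_openmp_compile_args():
    """Return (compile_args, link_args) with the OpenMP flag for the
    host platform.  Illustrative sketch only: a real setup.py must
    also probe the compiler, since Apple's vanilla clang on macOS
    does not support OpenMP at all."""
    system = platform.system()
    if system == "Windows":
        return ["/openmp"], []          # MSVC spelling of the flag
    if system == "Darwin":
        return [], []                   # skip OpenMP on macOS by default
    return ["-fopenmp"], ["-fopenmp"]   # gcc/clang on Linux and friends
```

The two lists would then be passed as extra_compile_args and extra_link_args to the Extension definition.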

Unfortunately, I could not test whether it really works on all platforms, as I have no idea how you build, e.g., your PyPI wheels. It doesn't seem to happen in your .travis.yml - do you do that manually?

I should also add that on my Linux machine the compiler flag works properly (the build-time warning that the omp pragma is ignored went away), but testing with a large array still seems to be bound to a single CPU core. Maybe I'm missing something?

In any case, perhaps we can use this as a base to improve OpenMP handling in sgp4.

Benjamin Winkel added 2 commits July 4, 2020 09:44
TIL: trailing comma for **kwargs is not allowed in old python versions
@brandon-rhodes (Owner)

> I use sgp4 in some of my projects and realized that the new accelerated wrapper is OpenMP-enabled. …

Thanks for your interest in making the project faster! It will be interesting to see if we can enable multiprocessing without breaking the project’s ability to build on some platforms. (Like: what if folks are using clang instead of gcc for their Python? Would the option still work? What about clang on macOS? And so forth.)

> Unfortunately, I could not test whether it really works on all platforms, as I have no idea how you build, e.g., your PyPI wheels. It doesn't seem to happen in your .travis.yml - do you do that manually?

Good question! Check out the .travis.yml over on the release branch, it’s where the building and uploading takes place.

> I should also add that on my Linux machine the compiler flag works properly (the build-time warning that the omp pragma is ignored went away), but testing with a large array still seems to be bound to a single CPU core. Maybe I'm missing something?

Parallel processing would only work with n satellites. It cannot work, so far as I can see, with one satellite and m times, because the process of generating a position updates (I think?) internal fields in the Satrec struct, and if two processors were trying to generate two positions for the same Satrec they would overwrite each other’s work. (I think? Feel free to double check that, it’s been a few months since I read the code.)

@bwinkel (Author) commented Jul 4, 2020

> It will be interesting to see if we can enable multiprocessing without breaking the project’s ability to build on some platforms. (Like: what if folks are using clang instead of gcc for their Python? Would the option still work? What about clang on macOS? And so forth.)

Good point. After all, this is what this PR is about. I should have mentioned that the following platforms should be supported by the new method (it needs testing, of course):

  • Linux (gcc, clang?)
  • macOS (gcc, clang, LLVM; the special macOS treatment in the get_compile_args function is for Apple's vanilla clang, which doesn't support OpenMP)
  • MS Windows (MS compiler)

I didn't think about other architectures, though. (Is ARM something you support?) However, as you define the C extension as optional at this point, it should still install (falling back to the slower Python implementation) if the compilation fails for some reason.

> Good question! Check out the .travis.yml over on the release branch, it’s where the building and uploading takes place.

Thanks for the hint. I wasn't aware that there is a special release branch. What would be the preferred way to test this PR against the wheel building? If I make the PR against the release branch, the travis VM would try to upload wheels (if I'm not mistaken) - which is surely not appreciated by you 😉. Shall I add some other architectures to the .travis.yml on this branch/PR? Perhaps you would even like to keep that in your main branch?

> Parallel processing would only work with n satellites. It cannot work, so far as I can see, with one satellite and m times, because the process of generating a position updates (I think?) internal fields in the Satrec struct, and if two processors were trying to generate two positions for the same Satrec they would overwrite each other’s work. (I think? Feel free to double check that, it’s been a few months since I read the code.)

That is correct. I had overlooked the SatrecArray interface. I can confirm that it will use the available CPU cores (up to the number of satellites in the array). You are probably right about the problem of concurrent member access, which makes parallelizing over the temporal axis difficult. I had the same issue with my cysgp4 library and in the end chose to make deep copies of the satellite instance in each thread. It adds some redundancy, but has a lot of advantages as well.
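The deep-copy pattern can be sketched in plain Python (the propagate method and class are hypothetical stand-ins; cysgp4 does this in Cython, where the threads actually run in parallel without the GIL):

```python
import copy
from concurrent.futures import ThreadPoolExecutor

def propagate_times(satellite, times, workers=4):
    """Propagate one satellite over many times by giving every worker
    its own deep copy, so no two threads mutate the same internal
    state.  `satellite.propagate(t)` is a hypothetical method standing
    in for the real SGP4 call."""
    def run_chunk(chunk):
        sat = copy.deepcopy(satellite)            # private copy per task
        return [sat.propagate(t) for t in chunk]

    size = max(1, len(times) // workers)
    chunks = [times[i:i + size] for i in range(0, len(times), size)]
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(run_chunk, chunks):  # map preserves order
            results.extend(part)
    return results
```

In pure Python the GIL limits the actual speedup, but the design choice is the same one a Cython or OpenMP implementation would make: trade a little memory for thread-private state.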

@bwinkel (Author) commented Jul 4, 2020

> If I make the PR against the release branch, the travis VM would try to upload wheels (if I'm not mistaken) - which is surely not appreciated by you 😉.

Turns out, I was mistaken. Travis does not expose secrets to PRs from forks.

@bwinkel (Author) commented Jul 5, 2020

A little update. I worked on the .travis.yml in my fork of the repo. It turns out that macOS is very difficult to handle correctly. In my own projects, I use the Anaconda tools, which have shipped with their own compilers for a while now; that makes it much easier to achieve proper OpenMP support. I wrongly assumed that my version of the setup.py should also work in principle. While that may or may not be true, getting a given macOS compiler to work with OpenMP from a relatively generic setup.py is complicated enough that perhaps one should not try to enable OpenMP on macOS at all. In the latest version of setup.py (on my travis branch in the fork), I check whether the string "clang" appears in the output of $CXX -v, which is the case for both vanilla clang and LLVM, and remove the -openmp compiler flag if so. If a user has gcc installed, it will work, though (see here).
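That compiler check could look roughly like this (a sketch, assuming $CXX names the C++ compiler and falling back to c++ otherwise):

```python
import os
import subprocess

def compiler_is_clang(cxx=None):
    """Return True if the configured C++ compiler identifies itself as
    clang (covers both Apple's vanilla clang and LLVM builds); in that
    case the OpenMP flag would be dropped on macOS."""
    cxx = cxx or os.environ.get("CXX", "c++")
    try:
        proc = subprocess.run([cxx, "-v"], capture_output=True,
                              text=True, check=False)
    except OSError:
        return False  # compiler not found: be conservative
    # clang prints its version banner to stderr, so check both streams
    return "clang" in (proc.stderr + proc.stdout).lower()
```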

The good news is that it works on both Linux and Windows without problems. I encountered some issues with the tests, though (I guess these slipped through because you only test against Linux). One is that on Windows the line ending is \r\n, which leads to a failing test in the function run_satellite_against_tcppver. The line

```python
if 'xx' in actual_line:
    similar = (actual_line == expected_line)
```

should be replaced with

```python
if 'xx' in actual_line:
    similar = (actual_line.strip() == expected_line.strip())
```

Another issue is with the accuracy of some tests; see, e.g., the output of the Linux test suite. Perhaps one should compare float results only up to a given accuracy? I often use numpy.allclose for this.
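A tolerance-based comparison of the numeric fields could be sketched like this, using the stdlib's math.isclose (numpy.allclose does the same for whole arrays at once; the function name and tolerances here are illustrative):

```python
import math

def lines_similar(actual_line, expected_line, rel_tol=1e-8, abs_tol=1e-9):
    """Compare two whitespace-separated data lines numerically instead
    of textually, tolerating tiny floating-point differences between
    platforms."""
    actual = actual_line.split()
    expected = expected_line.split()
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(float(a), float(e), rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )
```

Splitting on whitespace also sidesteps the \r\n line-ending problem, since split() discards the trailing carriage return.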

Note also that both clang and LLVM complain about linking against a wrong architecture (X64 vs. i386). I'm not sure whether this problem is caused by my changes or was hidden in your own travis logs. I added

```yaml
env:
  global:
    - CIBW_TEST_COMMAND="python -m sgp4.tests"
    - CIBW_BUILD_VERBOSITY=1
    - CIBW_BUILD="cp36-*"  # can restrict to certain python versions for testing
```

in which CIBW_BUILD_VERBOSITY makes sure that the compiler outputs are shown. I'd also recommend that you add the CIBW_TEST_COMMAND to your release .travis.yml, because it will run the tests for every arch.

@brandon-rhodes (Owner)

Neat, I don’t remember seeing CIBW_TEST_COMMAND when I last read the documentation! Yes, it will be helpful to other projects for the tests to be running on all platforms — the Anaconda folks were just working on that, and I landed daf7965 to fix a problem they identified under macOS.

This week I will hopefully have time to take a look at your branch and read over its setup.py.

I wonder if I should think about switching to cibuildwheel over on the main branch, just to run tests? It seemed very heavyweight compared to the current .travis.yml and I wanted to keep CI expense fairly light except when necessary. But I wonder if keeping the master .travis.yml simple will then mean having to go through a whole second round of debugging each time I turn to the release branch and need to get its tests passing again?

@bwinkel (Author) commented Jul 5, 2020

> I wonder if I should think about switching to cibuildwheel over on the main branch, just to run tests? It seemed very heavyweight compared to the current .travis.yml and I wanted to keep CI expense fairly light except when necessary. But I wonder if keeping the master .travis.yml simple will then mean having to go through a whole second round of debugging each time I turn to the release branch and need to get its tests passing again?

My experience is that CI/CD issues take up a lot of development time, even more so if one has a lot of dependencies (which is not really the case for python-sgp4). So, indeed, revealing issues early in the process can be an advantage. It will still take time to fix them, of course. But with the alternative approach, when there is a relatively large delay between implementing a feature and fixing possible bugs, one needs to "dive into" the feature again.

@glangford (Contributor) commented Aug 22, 2020

> getting a given macOS compiler to work with OpenMP from a relatively generic setup.py is complicated enough that perhaps one should not try to enable OpenMP on macOS at all. In the latest version of setup.py (on my travis branch in the fork), I check whether the string "clang" appears in the output of $CXX -v, which is the case for both vanilla clang and LLVM, and remove the -openmp compiler flag if so.

There is an interesting conversation and a possible workaround in a different project, FYI, in case it is helpful:
oneapi-src/oneDNN#591

@brandon-rhodes (Owner)

@bwinkel — Since we last discussed CI, I've simplified the approach ("release" is now simply a tag that lives somewhere on the master branch), and have migrated away from Travis CI and over to GitHub Actions.

If I recall, you found that turning this compile option on didn't actually seem to make the Linux-built library start using multiple cores? Were you ever able to get behavior that felt multi-core?

What are your experiences around libraries that use multiple cores without users asking? Someone opened an issue on Skyfield the other day because, without the user's asking, it tried to use all their CPU cores, such that (a) the program didn't really go any faster, but (b) everything else on the system slowed down. It turns out that it was NumPy's fault, and that needing to turn off its assault on multi-CPU machines is common:

https://stackoverflow.com/questions/30791550/limit-number-of-threads-in-numpy

Given that experience, I wonder if multiple CPUs is something that users should have to ask for, rather than something that the library would attempt unilaterally without asking? Do you have an opinion either way?

@bwinkel (Author) commented Jul 5, 2021

I'm probably the wrong person to ask, because most of my own packages are much less widely used, and few people report back to me.

Personally, I'm fine if a package - especially if it does number crunching, such as numpy - uses all the resources it can get. For GUI software, or stuff that runs in the "background" (e.g., realtime displays), I would agree, though, that single-core use can be better.

Although my packages (number-crunching) use all the resources they can get by default, I added a function to change the number of OpenMP threads used, e.g., in cysgp4. That only works with Cython's prange wrapper; however, it should be possible to do something similar with vanilla OpenMP.

@brandon-rhodes (Owner)

> Although my packages (number-crunching) use all the resources they can get by default, I added a function to change the number of OpenMP threads used, e.g., in cysgp4. That only works with Cython's prange wrapper; however, it should be possible to do something similar with vanilla OpenMP.

Thanks for sharing your thoughts! When I finally have time to circle back and do more investigation here, I'll look specifically into how the OpenMP pragmas might be turned on or off, or parameterized at runtime with a number of CPUs. Maybe after figuring out why large arrays aren't any faster when it's turned on.

@barracuda156

@bwinkel Why hardcode -mmacosx-version-min=10.7? Does something here make it impossible to build on 10.5 and 10.6? Also, it would be desirable to have this flag set by the environment (unless there are reasons to hardcode it).

@brandon-rhodes (Owner)

@bwinkel — I have finally found a bit of time to return to sgp4 and look over its open issues and pull requests. I'm still undecided on whether sgp4 should hammer user CPUs by default by turning on OpenMP, but I wanted to check in about the status of this pull request.

Do we know whether there's a way for tests to detect that a computation was done in parallel? If we do add compile options like this, it would be nice for the test suite to be able to verify that the compile options were, in fact, producing parallel computations when array operations take place.

@bwinkel (Author) commented Apr 28, 2023

Honestly, I have no clue how to check this in a test, especially if it should work platform-independently.

Regarding "hammering user CPUs": one could compile with OpenMP support, but set the number of threads used to 1 by default and expose a function that lets the user increase it at runtime. Cython offers several options for achieving this, which might give you some ideas.
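Sketched in Python, such a knob might look like this (a hypothetical API; a real implementation would forward the value to omp_set_num_threads() in the C++ layer or to prange(num_threads=...) in Cython):

```python
import os

# OpenMP reads this variable when its runtime initializes, so setting
# it before the extension module is first imported is one way to make
# single-core the default:
os.environ.setdefault("OMP_NUM_THREADS", "1")

_num_threads = 1

def set_num_threads(n):
    """Hypothetical user-facing knob; a real implementation would pass
    the value down to the OpenMP runtime in the compiled extension."""
    global _num_threads
    if n < 1:
        raise ValueError("need at least one thread")
    _num_threads = n

def get_num_threads():
    return _num_threads
```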
