Paper: How the Scientific Python ecosystem helps answer fundamental questions of the Universe #924

Merged

Conversation

matthewfeickert
Collaborator

@matthewfeickert matthewfeickert commented May 31, 2024

This PR adds the myst.md paper source for the SciPy 2024 proceedings for the talk "How the Scientific Python ecosystem helps answer fundamental questions of the Universe".

Editor: Hongsup Shin @hongsupshin

Reviewers:

@matthewfeickert
Collaborator Author

This has been placed in draft mode as I'm still continuing to write. I'm opening the PR now to ensure that it is created before the May 31 deadline.

@matthewfeickert
Collaborator Author

The curvenote / publish / check checks are failing as

 ❯ content (22/29 tests passed)
    × DOI Exists: Citation does not have DOI: S2I2HEPSP
    × DOI Exists: Citation does not have DOI: PyHEP
    × DOI Exists: Citation does not have DOI: nanobind
    × DOI Exists: Citation does not have DOI: ATL-SOFT-PUB-2021-001
    × DOI Exists: Citation does not have DOI: Dask
    × DOI Exists: Citation does not have DOI: Dask-awkward
    × DOI Exists: Citation does not have DOI: Dask-histogram

If I'm citing things that don't have DOIs (like software that doesn't have an archive on Zenodo), is there a way to communicate to Curvenote that this is intended?

@matthewfeickert matthewfeickert self-assigned this May 31, 2024
@cbcunc cbcunc added the paper label Jun 1, 2024
@leewujung
Contributor

@matthewfeickert you can add citation keys you want to ignore in myst.yml under error_rules:

  error_rules:
    - rule: doi-exists
      severity: ignore
      keys:
        - Atr03
        - terradesert
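
Applied to this PR, the same rule would presumably list the seven citation keys reported by the failing check above. This is only a sketch built from that check output, not the configuration that was actually committed; the indentation mirrors the snippet above and where it sits in `myst.yml` is left as shown there.

```yaml
  error_rules:
    - rule: doi-exists
      severity: ignore
      keys:
        # Citation keys taken from the failing "DOI Exists" checks in this PR
        - S2I2HEPSP
        - PyHEP
        - nanobind
        - ATL-SOFT-PUB-2021-001
        - Dask
        - Dask-awkward
        - Dask-histogram
```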

@ameyxd
Contributor

ameyxd commented Jun 4, 2024

@matthewfeickert can you make this update so we can run checks?

github-actions bot commented Jun 7, 2024

Curvenote Preview

Directory: papers/matthew_feickert
Checks: 78 checks passed (18 optional)
Updated (UTC): Sep 24, 2024, 8:57 PM

@cbcunc
Member

cbcunc commented Jun 7, 2024

@matthewfeickert Just a reminder that first submissions must be complete by today. Please do what you need to do to get your PR out of the draft state so we can mark it ready for review and assign a reviewer. We assign reviewers to papers all at once and not piecemeal paper by paper.

@matthewfeickert
Collaborator Author

We assign reviewers to papers all at once and not piecemeal paper by paper.

I don't understand this last comment, but I'll assume that it is not something that affects actionable information on my side. Thanks for the reminder that the deadline is today at 23:59 Pacific. 👍

@matthewfeickert matthewfeickert force-pushed the feat/add-atlas-scipy-paper branch 5 times, most recently from 2297ef9 to 1580bc8 Compare June 8, 2024 06:40
@matthewfeickert matthewfeickert marked this pull request as ready for review June 8, 2024 06:42
@hongsupshin
Contributor

We assign reviewers to papers all at once and not piecemeal paper by paper.

I don't understand this last comment, but I'll assume that it is not something that affects actionable information on my side. Thanks for the reminder that the deadline is today at 23:59 Pacific. 👍

No worries @matthewfeickert, we assigned two reviewers to your paper :)

The data structure of each event consists of variable length lists of physics objects (e.g. electrons, collections of tracks from charged objects).
To study the properties of the physics objects in a statistical manner, a fixed event analysis procedure is repeated over billions of events.
This has traditionally motivated the use of "event loops" that implicitly construct event-level quantities of interest and leveraged the `C++` compiler to produce efficient iterative code.
This precedent made it difficult to take advantage of array programming paradigms that are common in Scientific Python given NumPy [@numpy] vector operations.
Collaborator

It would be beneficial to the audience to include a sentence explaining what array programming is and why it is useful in this context.

Collaborator Author

I'm hesitant to have this paper have the responsibility of explaining these concepts to the audience, given that we already provide a reference to the NumPy paper Array programming with NumPy. The NumPy paper doesn't make any attempt to further cite the concept of array programming or to elaborate on its definition any further than the first sentence of the paper, I would assume, given how widespread these concepts are in scientific computing. We do also mention in the introduction

The use of dataframes and array programming for data analysis has enhanced the user experience while providing efficient computations without the need of coding optimized low-level routines.
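
For readers of this thread who are less familiar with the term, the contrast between an "event loop" and array programming in the paper text quoted above can be sketched roughly as follows. This is a minimal illustration with NumPy only. The toy values and the names `lepton_pt`, `scalar_sum_loop`, and `scalar_sum_array` are hypothetical and are not taken from the paper. It also assumes a fixed number of leptons per event, whereas real events hold variable-length lists.

```python
import numpy as np

# Hypothetical toy data: transverse momenta (GeV) of two leptons
# in each of five events. Real events have variable-length lists.
lepton_pt = np.array([
    [45.2, 30.1],
    [60.7, 12.3],
    [25.0, 24.9],
    [80.4, 41.6],
    [33.3, 10.5],
])


def scalar_sum_loop(pt):
    """Event-loop style: visit each event and accumulate explicitly."""
    result = []
    for event in pt:
        total = 0.0
        for value in event:
            total += value
        result.append(total)
    return result


def scalar_sum_array(pt):
    """Array-programming style: one expression over all events at once."""
    return pt.sum(axis=1)


loop_result = scalar_sum_loop(lepton_pt)
array_result = scalar_sum_array(lepton_pt)

# Selections become boolean masks instead of if-statements inside a loop:
# keep only the events in which every lepton has pt above 20 GeV.
selected = lepton_pt[(lepton_pt > 20.0).all(axis=1)]
```

Both functions compute the same per-event quantity. The array version pushes the iteration into compiled NumPy routines, which is the efficiency argument the introduction sentence quoted above makes.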


The most famous and revolutionary discovery in particle physics this century is the discovery of the Higgs boson — the particle corresponding to the quantum field that gives mass to fundamental particles through the Brout-Englert-Higgs mechanism — by the ATLAS and CMS experimental collaborations in 2012. [@HIGG-2012-27;@CMS-HIG-12-028]
This discovery work was done using large amounts of customized `C++` software, but in the following decade the state of the PyHEP community has advanced enough that the workflow can now be done using community Python tooling.
To provide an overview of the tooling and functionality, this section summarizes a simplified analysis workflow of a Higgs "decay" to two intermediate $Z$ bosons that decay to charged leptons $(\ell)$ (i.e. electrons ($e$) and muons ($\mu$)), $H \to Z Z^{*} \to 4 \ell$, on ATLAS open data [@ATLAS-open-data].
Collaborator

(Optional) Restructure the first paragraph in this section by putting the message of the last sentence first. For example, "In this section we will demonstrate how the scientific Python ecosystem allows for ... with an example of a simplified analysis workflow of a Higgs Boson decay." My reasoning for this suggestion is because during my first read through the paper I missed the subtle transition into the example; it felt like the paper jumped straight into it after giving background regarding the Higgs Boson phenomenon. So by moving the message of the final sentence up, I think it will stand-out more to the reader.

Collaborator Author

I understand your point, but I don't think we want to restructure that paragraph as this now means awkwardly introducing the Higgs in the middle of another sentence. The focus of the paper is the software, but we still need to motivate it with the science which is the primary driver and so some text needs to be devoted to the physics. Though to try to make the final sentence of the paragraph more clear we've repeated that it is the Pythonic tooling being used. Though if you feel that this still isn't helpful I'm happy to wordsmith this more.

Collaborator

@Marcdh3 Marcdh3 left a comment

I believe this paper does a good job of telling the story that complex, large-scale physics research is now feasible in Python due to recent advancements in the scientific Python ecosystem. The background information clearly defined the technological gap that needed to be addressed to make this type of research possible. The authors were thorough in indicating how each library addressed the former challenges, and how such code should be used in an analysis workflow. Finally, I believe the Higgs boson example was a good choice because it is a convincing real-world problem that demonstrated the ease with which this research can now be conducted in Python.

@matthewfeickert
Collaborator Author

Thanks very much for your review @Marcdh3! We will get to work on incorporating your feedback shortly (hopefully before SciPy starts).

Collaborator

@cliffckerr cliffckerr left a comment

I thought this paper was clear and easy to follow. Some very minor comments:

  • Should the abstract be updated to refer to the paper instead of the talk?
  • You could include an explanation of why ROOT is preferable to other formats (e.g. Arrow?)
  • Typo, "a specific calorimeter subsystems"
  • The description of the role of "nanobind" probably doesn't need to be repeated between sections.
  • Do all computations currently run on CPUs, and if so, is there interest in exploring GPU implementations?

@matthewfeickert
Collaborator Author

Thanks @cliffckerr for the review! I will address these posthaste, but I am in week 5 of travel for work and teaching at the moment, so it probably won't be until next week.

@hongsupshin
Contributor

Hi @matthewfeickert, just a friendly reminder that all initial reviews are in. I highly recommend you start responding to the comments soon, since it will take time for the reviewers to respond to the changes. Remember that the open review period ends on Sep 2, and you will not be able to make any changes to the manuscript after that point. If you have any questions, please let me know!

matthewfeickert and others added 5 commits August 23, 2024 01:03
Co-authored-by: Vangelis Kourlitis <[email protected]>
Directive options are parsed line-by-line, so if a directive is split across multiple lines then only the first line will be captured.

Co-authored-by: Franklin Koch <[email protected]>
* Add quotes around 'bunches'.
* Reemphasize that the tooling in question is Pythonic.
* Refer to 'paper' instead of 'talk' in the abstract.
* Note ROOT's almost 30 year history of columnar data structures.
* Add 'Awkward Arrays in Python, C++, and Numba' as a reference.
   - c.f. https://inspirehep.net/literature/1776192
* Correct typo of 'subsystems' to 'subsystem'.
* Add citation to IRIS-HEP AGC Zenodo archive.
   - c.f. https://doi.org/10.5281/zenodo.7274936
@matthewfeickert
Collaborator Author

matthewfeickert commented Aug 23, 2024

Some very minor comments:

  • Should the abstract be updated to refer to the paper instead of the talk?

Done.

  • You could include an explanation of why ROOT is preferable to other formats (e.g. Arrow?)

I've added a short mention that it has provided columnar data structures with good serialization compression for almost 30 years (since 1997, whereas Arrow wasn't released until 2016), along with an additional reference. Though the realities are more complex and well outside the scope of this short paper, I will note that when the field has well over an exabyte of data stored in the same file format, which is used for basically everything, there's rather large inertia against switching regardless of motivation.

  • Typo, "a specific calorimeter subsystems"

Thanks! @cliffckerr I'm sorry that there were so many typos that you and @Marcdh3 had to catch (and this is even with the codespell pre-commit hook on). 😬

  • The description of the role of "nanobind" probably doesn't need to be repeated between sections.

Can you elaborate a bit more on what is repeated? "nanobind" only appears in the main text in the following two sentences:

The ATLAS collaboration is further extending this ecosystem of tooling to include high-level custom Python bindings to the low level C++ frameworks using nanobind Jakob, 2022.

and

To expose these C++ libraries to the Pythonic tooling layer, custom Python bindings are written using nanobind for high efficiency, as seen in Figure 1.

The first sentence is introducing the concept of custom Python bindings, and the second is highlighting the fact that nanobind in particular was chosen given that the actual bindings it generates are more efficient than alternative binding generation tools. While connected, these ideas are distinct. Do you feel that these concepts need to be emphasized more?

  • Do all computations currently run on CPUs, and if so, is there interest in exploring GPU implementations?

Yes. At the moment the focus has been scaling out these workflows at dedicated "analysis facilities" to achieve required data processing throughput rates for the next iteration of the LHC (reference presentation). There is ongoing work to be able to utilize hardware acceleration across the tooling ecosystem. This work is still very much in the research stage and at varying stages of maturity across the ecosystem, but Awkward Array computations work on GPUs and the statistical libraries support hardware acceleration.

(Note: I rebased this branch off of the current HEAD of the 2024 branch and then did a git push --force-with-lease, so my apologies if any comments on the file views got removed from the current view. I've tried to go back and respond to any comments that were added on files.)

@cliffckerr
Collaborator

Thanks @matthewfeickert -- my original comment no longer seems so clear to me 😂 It's probably fine as-is. If you do want to revisit the nanobind references, I suppose there is a bit of repetition, with each mention of it going into about the same level of detail:

  • The ATLAS collaboration is further extending this ecosystem of tooling to include high-level custom Python bindings to the low level C++ frameworks using nanobind [@Nanobind].
  • To expose these C++ libraries to the Pythonic tooling layer, custom Python bindings are written using nanobind for high efficiency, as seen in @fig:access_layer_diagram.
  • The interface takes advantage of nanobind's efficient bindings and zero-copy exchange protocols to achieve viable performance. [@Kourlitis:2890478]
  • computed on-the-fly using the nanobind Python bindings to the ATLAS C++ correction tools.

But it's really not an issue, for some reason I thought there was a full paragraph or at least full sentence repeated.

* Apply Cliff Kerr's suggestions of avoiding redundant information on nanobind.
@matthewfeickert
Collaborator Author

matthewfeickert commented Sep 2, 2024

(Sorry for the slow reply @cliffckerr — been doing a week of international work travel so I'm behind on a bit of everything.)

  • The interface takes advantage of nanobind's efficient bindings and zero-copy exchange protocols to achieve viable performance. [@Kourlitis:2890478]
  • computed on-the-fly using the nanobind Python bindings to the ATLAS C++ correction tools.

These two come from figure captions, and I generally like them to be as standalone as possible. Though as you do point out, the sentence preceding Figure 1

  • To expose these C++ libraries to the Pythonic tooling layer, custom Python bindings are written using nanobind for high efficiency, as seen in @fig:access_layer_diagram.

makes the mention in the Figure 1 caption

  • The interface takes advantage of nanobind's efficient bindings and zero-copy exchange protocols to achieve viable performance. [@Kourlitis:2890478]

a bit redundant as the focus of Figure 1 is the interface, and while the interface decisions enable performance, the performance isn't the focus. So I'm going to remove this line from Figure 1. 👍

@hongsupshin
Contributor

@cliffckerr @Marcdh3 Hello reviewers, the author @matthewfeickert has responded to your feedback. If you don't have any further comments, could you please approve the PR? Thanks for your time and effort on the review!

@hongsupshin
Contributor

Thanks @Marcdh3 @cliffckerr for reviewing the paper!

@cliffckerr
Collaborator

cliffckerr commented Sep 23, 2024

I don't seem to have the option to re-request review, but I approve!

@matthewfeickert
Collaborator Author

I don't seem to have the option to re-request review, but I approve!

Thanks @cliffckerr! If you'd like to formally hit the button, I've triggered a review request from you. :) We will take no action to mean "approve" though. 👍

Collaborator

@cliffckerr cliffckerr left a comment

Looks great, nice work!

@cbcunc cbcunc merged commit 3537660 into scipy-conference:2024 Sep 25, 2024
4 checks passed
@matthewfeickert matthewfeickert deleted the feat/add-atlas-scipy-paper branch September 25, 2024 18:49
@matthewfeickert
Collaborator Author

Thanks very much to @cliffckerr @Marcdh3 for their hard work as reviewers to help improve the paper — it is truly appreciated. Thanks also to @hongsupshin and the rest of the Proceedings Committee for the heroic amounts of work that they did to once again serve the SciPy community and provide a wonderful service to the conference. 🙏

Labels
paper (indicates that the PR in question is a paper)
ready-for-review
8 participants