Paper: How the Scientific Python ecosystem helps answer fundamental questions of the Universe #924

matthewfeickert · 2024-05-31T22:57:07Z

This PR adds the myst.md paper source for the SciPy 2024 proceedings for the talk "How the Scientific Python ecosystem helps answer fundamental questions of the Universe".

Editor: Hongsup Shin @hongsupshin

Reviewers:

Cainã Max Couto da Silva @cmcouto-silva
Marcus Hill @Marcdh3

matthewfeickert · 2024-05-31T22:58:04Z

This has been placed in draft mode as I'm still continuing to write. I'm opening the PR now to ensure that it is created before the May 31 deadline.

matthewfeickert · 2024-05-31T23:00:23Z

The curvenote / publish / check checks are failing as

 ❯ content (22/29 tests passed)
    × DOI Exists: Citation does not have DOI: S2I2HEPSP
    × DOI Exists: Citation does not have DOI: PyHEP
    × DOI Exists: Citation does not have DOI: nanobind
    × DOI Exists: Citation does not have DOI: ATL-SOFT-PUB-2021-001
    × DOI Exists: Citation does not have DOI: Dask
    × DOI Exists: Citation does not have DOI: Dask-awkward
    × DOI Exists: Citation does not have DOI: Dask-histogram

If I'm citing things that don't have DOIs (like software that doesn't have an archive on Zenodo) is there a way to communicate this to Curvenote to make it clear this is intended?

leewujung · 2024-06-01T06:38:28Z

@matthewfeickert you can add citation keys you want to ignore in myst.yml under error_rules:

  error_rules:
    - rule: doi-exists
      severity: ignore
      keys:
        - Atr03
        - terradesert
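
As a rough sketch of how that suggestion applies to this PR, the failing citation keys can be pulled out of the check output quoted earlier and rendered as an `error_rules` entry. The helper names (`keys_without_doi`, `error_rules_yaml`) are hypothetical and only for illustration; the rule name, severity, and layout are copied from the snippet above, and the indentation of the result may need adjusting to where `error_rules` sits in your `myst.yml`.

```python
import re

# Check output copied from the failing curvenote run quoted in this thread.
CHECK_LOG = """\
 ❯ content (22/29 tests passed)
    × DOI Exists: Citation does not have DOI: S2I2HEPSP
    × DOI Exists: Citation does not have DOI: PyHEP
    × DOI Exists: Citation does not have DOI: nanobind
    × DOI Exists: Citation does not have DOI: ATL-SOFT-PUB-2021-001
    × DOI Exists: Citation does not have DOI: Dask
    × DOI Exists: Citation does not have DOI: Dask-awkward
    × DOI Exists: Citation does not have DOI: Dask-histogram
"""

# Pattern for the failure lines as they appear in the log above.
DOI_FAILURE = re.compile(r"DOI Exists: Citation does not have DOI: (\S+)")


def keys_without_doi(log: str) -> list[str]:
    """Return the citation keys reported as missing a DOI, in log order."""
    return DOI_FAILURE.findall(log)


def error_rules_yaml(keys: list[str]) -> str:
    """Render an error_rules entry (hypothetical helper) that ignores the given keys."""
    lines = [
        "error_rules:",
        "  - rule: doi-exists",
        "    severity: ignore",
        "    keys:",
    ]
    lines += [f"      - {key}" for key in keys]
    return "\n".join(lines)


if __name__ == "__main__":
    print(error_rules_yaml(keys_without_doi(CHECK_LOG)))
```

Listing only the keys that are known to lack a DOI keeps the check active for every other citation.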

ameyxd · 2024-06-04T18:20:47Z

@matthewfeickert can you make this update so we can run checks?

github-actions · 2024-06-07T05:50:19Z

Curvenote Preview

Directory	Preview	Checks	Updated (UTC)
papers/matthew_feickert	🔍 Inspect	✅ 78 checks passed (18 optional)	Sep 24, 2024, 8:57 PM

cbcunc · 2024-06-07T17:18:21Z

@matthewfeickert Just a reminder that first submissions must be complete by today. Please do what you need to do to get your PR out of the draft state so we can mark it ready for review and assign a reviewer. We assign reviewers to papers all at once and not piecemeal paper by paper.

matthewfeickert · 2024-06-07T17:27:09Z

We assign reviewers to papers all at once and not piecemeal paper by paper.

I don't understand this last comment, but I'll assume that it is not something that affects actionable information on my side. Thanks for the reminder that the deadline is today at 23:59 Pacific. 👍

hongsupshin · 2024-06-12T18:44:38Z

We assign reviewers to papers all at once and not piecemeal paper by paper.

I don't understand this last comment, but I'll assume that it is not something that affects actionable information on my side. Thanks for the reminder that the deadline is today at 23:59 Pacific. 👍

No worries @matthewfeickert, we assigned two reviewers to your paper :)

Marcdh3 · 2024-06-27T22:21:59Z

papers/matthew_feickert/scientific-python-ecosystem.md

+The data structure of each event consists of variable length lists of physics objects (e.g. electrons, collections of tracks from charged objects).
+To study the properties of the physics objects in a statistical manner, a fixed event analysis procedure is repeated over billions of events.
+This has traditionally motivated the use of "event loops" that implicitly construct event-level quantities of interest and leveraged the `C++` compiler to produce efficient iterative code.
+This precedent made it difficult to take advantage of array programming paradigms that are common in Scientific Python given NumPy [@numpy] vector operations.


It would be beneficial to the audience to include a sentence explaining what array programming is and why it is useful in this context.

I'm hesitant to make this paper responsible for explaining these concepts to the audience, given that we already provide a reference to the NumPy paper, Array programming with NumPy. The NumPy paper itself doesn't cite the concept of array programming or elaborate on its definition beyond its first sentence, presumably because these concepts are so widespread in scientific computing. We do also mention in the introduction:

The use of dataframes and array programming for data analysis has enhanced the user experience while providing efficient computations without the need of coding optimized low-level routines.
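
For readers of this thread who want a concrete picture of the contrast being discussed, here is a minimal sketch, not taken from the paper. The toy data and function names are invented for illustration, and the flattened-values-plus-event-index layout is just one simple way to handle variable-length lists with plain NumPy; the paper's tooling handles this differently and far more generally.

```python
import numpy as np

# Toy data invented for illustration: electron transverse momenta per event,
# stored as variable-length lists (one inner list per event).
events = [[25.0, 40.0], [], [10.0, 15.0, 60.0], [33.0]]


def sum_pt_event_loop(events):
    """Event-loop style: visit each event, then each object within it."""
    totals = []
    for electrons in events:
        total = 0.0
        for pt in electrons:
            total += pt
        totals.append(total)
    return totals


def sum_pt_array(events):
    """Array style: flatten once, then use a single vectorized reduction."""
    counts = np.array([len(e) for e in events])
    flat_pt = np.array([pt for e in events for pt in e], dtype=float)
    # Label every flattened value with the index of the event it came from.
    event_index = np.repeat(np.arange(len(events)), counts)
    # Weighted bincount sums the values per event; empty events give 0.0.
    return np.bincount(event_index, weights=flat_pt, minlength=len(events))


if __name__ == "__main__":
    print(sum_pt_event_loop(events))
    print(sum_pt_array(events))
```

Both functions compute the same per-event sums; the second expresses the computation as whole-array operations rather than explicit Python loops over events.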

Marcdh3 · 2024-06-27T22:23:42Z

papers/matthew_feickert/uncovering-higgs.md

+
+The most famous and revolutionary discovery in particle physics this century is the discovery of the Higgs boson &mdash; the particle corresponding to the quantum field that gives mass to fundamental particles through the Brout-Englert-Higgs mechanism &mdash; by the ATLAS and CMS experimental collaborations in 2012. [@HIGG-2012-27;@CMS-HIG-12-028]
+This discovery work was done using large amounts of customized `C++` software, but in the following decade the state of the PyHEP community has advanced enough that the workflow can now be done using community Python tooling.
+To provide an overview of the tooling and functionality, a high level summary of a simplified analysis workflow of a Higgs "decay" to two intermediate $Z$ bosons that decay to charged leptons $(\ell)$ (i.e. electrons ($e$) and muons ($\mu$)), $H \to Z Z^{*} \to 4 \ell$, on ATLAS open data [@ATLAS-open-data] is summarized in this section.


(Optional) Restructure the first paragraph in this section by putting the message of the last sentence first. For example, "In this section we will demonstrate how the scientific Python ecosystem allows for ... with an example of a simplified analysis workflow of a Higgs boson decay." My reasoning is that during my first read-through of the paper I missed the subtle transition into the example; it felt like the paper jumped straight into it after giving background on the Higgs boson. Moving the message of the final sentence up should make it stand out more to the reader.

I understand your point, but I don't think we want to restructure that paragraph, as doing so would mean awkwardly introducing the Higgs in the middle of another sentence. The focus of the paper is the software, but we still need to motivate it with the science, which is the primary driver, so some text needs to be devoted to the physics. To make the final sentence of the paragraph clearer, we've repeated that it is the Pythonic tooling being used. If you feel that this still isn't helpful, I'm happy to wordsmith this more.

Marcdh3

I believe this paper does a good job of telling the story that complex, large-scale physics research is now feasible in Python due to recent advancements in the scientific Python ecosystem. The background information clearly defined the technological gap that needed to be addressed to make this type of research possible. The authors were thorough in indicating how each library addressed the former challenges, and how such code should be used in an analysis workflow. Finally, I believe the Higgs boson example was a good choice because it is a convincing real-world problem that demonstrated the ease with which this research can now be conducted in Python.

matthewfeickert · 2024-06-27T22:28:33Z

Thanks very much for your review @Marcdh3! We will get to work on incorporating your feedback shortly (hopefully before SciPy starts).

cliffckerr

I thought this paper was clear and easy to follow. Some very minor comments:

Should the abstract be updated to refer to the paper instead of the talk?
You could include an explanation of why ROOT is preferable to other formats (e.g. Arrow?)
Typo, "a specific calorimeter subsystems"
The description of the role of "nanobind" probably doesn't need to be repeated between sections.
Do all computations currently run on CPUs, and if so, is there interest in exploring GPU implementations?

matthewfeickert · 2024-07-31T13:46:58Z

Thanks @cliffckerr for the review! I will address these post haste, but am in week 5 of travel for work and teaching at the moment so it probably won't be until next week.

hongsupshin · 2024-08-17T16:30:47Z

Hi @matthewfeickert, just a friendly reminder that all initial reviews are in. I highly recommend you start responding to the comments soon, since it will take time for the reviewers to respond to the changes. Remember that the open review period ends on Sep 2, and you will not be able to make any changes to the manuscript after that point. If you have any questions, please let me know!

Co-authored-by: Vangelis Kourlitis <[email protected]>

Directive options are parsed line-by-line, so if a directive is split across multiple lines then only the first line will be captured.

Co-authored-by: Franklin Koch <[email protected]>

Co-authored-by: Marcus Hill <[email protected]>

* Add quotes around 'bunches'.
* Reemphasize that the tooling in question is Pythonic.
* Refer to 'paper' instead of 'talk' in the abstract.
* Note ROOT's almost 30 year history of columnar data structures.
* Add 'Awkward Arrays in Python, C++, and Numba' as a reference.
  - c.f. https://inspirehep.net/literature/1776192
* Correct typo of 'subsystems' to 'subsystem'.
* Add citation to IRIS-HEP AGC Zenodo archive.
  - c.f. https://doi.org/10.5281/zenodo.7274936

matthewfeickert · 2024-08-23T06:53:30Z

Some very minor comments:

Should the abstract be updated to refer to the paper instead of the talk?

Done.

You could include an explanation of why ROOT is preferable to other formats (e.g. Arrow?)

I've added a short mention that it has provided columnar data structures with good serialization compression for almost 30 years (since 1997; Arrow wasn't released until 2016), along with an additional reference. The realities are more complex and well outside the scope of this short paper, but I will note that when the field has well over an exabyte of data stored in the same file format, which is used for basically everything, there is rather large inertia against switching regardless of motivation.

Typo, "a specific calorimeter subsystems"

Thanks! @cliffckerr I'm sorry that there were so many typos that you and @Marcdh3 had to catch (and this is even with the codespell pre-commit hook on). 😬

The description of the role of "nanobind" probably doesn't need to be repeated between sections.

Can you elaborate a bit more on what is repeated? "nanobind" only appears in the main text in the following two sentences:

The ATLAS collaboration is further extending this ecosystem of tooling to include high-level custom Python bindings to the low level C++ frameworks using nanobind Jakob, 2022.

and

To expose these C++ libraries to the Pythonic tooling layer, custom Python bindings are written using nanobind for high efficiency, as seen in Figure 1.

The first sentence is introducing the concept of custom Python bindings, and the second is highlighting the fact that nanobind in particular was chosen given that the actual bindings it generates are more efficient than alternative binding generation tools. While connected, these ideas are distinct. Do you feel that these concepts need to be emphasized more?

Do all computations currently run on CPUs, and if so, is there interest in exploring GPU implementations?

Yes. At the moment the focus has been scaling out these workflows at dedicated "analysis facilities" to achieve required data processing throughput rates for the next iteration of the LHC (reference presentation). There is ongoing work to be able to utilize hardware acceleration across the tooling ecosystem. This work is still very much in the research stage and at varying stages of maturity across the ecosystem, but Awkward Array computations work on GPUs and the statistical libraries support hardware acceleration.

(Note: I rebased this branch off of the current HEAD of the 2024 branch and then did a git push --force-with-lease, so my apologies if any comments on the file views got removed from the current view. I've tried to go back and respond to any comments that were added on files.)

cliffckerr · 2024-08-28T22:39:49Z

Thanks @matthewfeickert -- my original comment no longer seems so clear to me 😂 It's probably fine as-is. If you do want to revisit the nanobind references, I suppose there is a bit of repetition, with each mention of it going into about the same level of detail:

The ATLAS collaboration is further extending this ecosystem of tooling to include high-level custom Python bindings to the low level C++ frameworks using nanobind [@Nanobind].
To expose these C++ libraries to the Pythonic tooling layer, custom Python bindings are written using nanobind for high efficiency, as seen in @fig:access_layer_diagram.
The interface takes advantage of nanobind's efficient bindings and zero-copy exchange protocols to achieve viable performance. [@Kourlitis:2890478]
computed on-the-fly using the nanobind Python bindings to the ATLAS C++ correction tools.

But it's really not an issue, for some reason I thought there was a full paragraph or at least full sentence repeated.

* Apply Cliff Kerr's suggestions of avoiding redundant information on nanobind.

matthewfeickert · 2024-09-02T21:36:31Z

(Sorry for the slow reply @cliffckerr — been doing a week of international work travel so I'm behind on a bit of everything.)

The interface takes advantage of nanobind's efficient bindings and zero-copy exchange protocols to achieve viable performance. [@Kourlitis:2890478]

computed on-the-fly using the nanobind Python bindings to the ATLAS C++ correction tools.

These two come from figure captions, and I generally like them to be as standalone as possible. Though as you do point out, the sentence preceding Figure 1

To expose these C++ libraries to the Pythonic tooling layer, custom Python bindings are written using nanobind for high efficiency, as seen in @fig:access_layer_diagram.

makes the mention in the Figure 1 caption

The interface takes advantage of nanobind's efficient bindings and zero-copy exchange protocols to achieve viable performance. [@Kourlitis:2890478]

a bit redundant as the focus of Figure 1 is the interface, and while the interface decisions enable performance, the performance isn't the focus. So I'm going to remove this line from Figure 1. 👍

hongsupshin · 2024-09-05T14:17:30Z

@cliffckerr @Marcdh3 Hello reviewers, the author @matthewfeickert has responded to your feedback. If you don't have any further comments, could you please approve the PR? Thanks for your time and effort on the review!

hongsupshin · 2024-09-23T21:20:29Z

Thanks @Marcdh3 @cliffckerr for reviewing the paper!

cliffckerr · 2024-09-23T21:53:21Z

I don't seem to have the option to re-request review, but I approve!

matthewfeickert · 2024-09-23T23:56:12Z

I don't seem to have the option to re-request review, but I approve!

Thanks @cliffckerr! If you'd like to formally hit the button, I've triggered a review request from you. :) We will take no action to mean "approve" though. 👍

cliffckerr

Looks great, nice work!

papers/matthew_feickert/mybib.bib

* c.f. https://indico.cern.ch/event/1330797/contributions/5796636/

matthewfeickert · 2024-09-25T18:52:20Z

Thanks very much to @cliffckerr @Marcdh3 for their hard work as reviewers to help improve the paper — it is truly appreciated. Thanks also to @hongsupshin and the rest of the Proceedings Committee for the heroic amounts of work that they did to once again serve the SciPy community and provide a wonderful service to the conference. 🙏

matthewfeickert self-assigned this May 31, 2024

matthewfeickert force-pushed the feat/add-atlas-scipy-paper branch from 8c70188 to 553b629 Compare May 31, 2024 23:13

cbcunc added the paper This indicates that the PR in question is a paper label Jun 1, 2024

matthewfeickert force-pushed the feat/add-atlas-scipy-paper branch from 553b629 to c37450b Compare June 7, 2024 05:47

matthewfeickert force-pushed the feat/add-atlas-scipy-paper branch 5 times, most recently from 2297ef9 to 1580bc8 Compare June 8, 2024 06:40

matthewfeickert marked this pull request as ready for review June 8, 2024 06:42

matthewfeickert force-pushed the feat/add-atlas-scipy-paper branch from 046bba9 to 14cb4e8 Compare June 8, 2024 07:09

ameyxd added the ready-for-review label Jun 10, 2024

ameyxd unassigned matthewfeickert Jun 11, 2024

cbcunc assigned hongsupshin Jun 11, 2024

fwkoch reviewed Jun 19, 2024
