Skip to content

Commit

Permalink
Minor updates
Browse files Browse the repository at this point in the history
  • Loading branch information
choldgraf committed Jan 7, 2025
1 parent 6fa521d commit 5fd6227
Showing 1 changed file with 31 additions and 18 deletions.
49 changes: 31 additions & 18 deletions content/blog/2025/frictionless-reproducibility/index.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
date: 2024-12-16
title: Towards frictionless reproducibility with BinderHub and 2i2c
title: Towards frictionless, sharable, and sustainable reproducibility with Binder and JupyterHub
---

<style>
Expand All @@ -17,29 +17,34 @@ title: Towards frictionless reproducibility with BinderHub and 2i2c
}
</style>

2i2c has been experimenting with ways to enable communities to share their work with one another more effectively. This post is an exploration of the basic workflow we're interested in, how it could be enabled with open technology in the Jupyter ecosystem, how it might fit in with the broader ecosystem of services around scientific communication, and how we might try to sustain it.
2i2c has been experimenting with empowering communities to share their work with one another. This post is an exploration of the basic workflow we're interested in, how it could be enabled with open technology in the Jupyter ecosystem, how it might fit in with the broader ecosystem of services around scientific communication, and how we might try to sustain it.

## The core pieces of data-driven discovery

For any individual to ask questions and create knowledge with data, they need access to a few key resources in order to do their work.
For any individual to create and share knowledge with data, they need access to a few key resources:

- **Data** is the raw material that drives computational ideas.
- **Computation** provides the core hardware for manipulating data with code.
- **Software** the tools that code uses to leverage data and computation for specific outcomes.
- **Code** utilizes software and computation to reproduce a computation.
- **Narrative** allows all of the above to be communicated in a compelling and understandable way.
- **Computation** provides the mechanism for manipulating data with code.
- **Software** provides tools that leverage data and computation for specific outcomes.
- **Code** orchestrates software and computation to analyze and explore data.
- **Narratives** allow all of the above to be communicated in a compelling and understandable way.
- **Interfaces** bring together all of the above to facilitate workflows in data-driven discovery and sharing.

**Jupyter**'s goal is to weave all of the above together with an ecosystem of tools that are modular, flexible, and interoperable. **2i2c**'s [community hub platform](/platform) gives community leaders the ability to integrate and customize all of these resources to be useful for particular communities in research and education. This usually comes in the form of a **Jupyter Book** to teach communtiy workflows to users, with connections to a **JupyterHub** that provides the data, compute, and software tools needed for data-driven discovery.
**Jupyter**'s goal is to weave all of the above together with an ecosystem of tools that are modular, flexible, and interoperable. **2i2c**'s [community hub platform](/platform) gives community leaders the ability to integrate and customize all of these resources to be useful for communities in research and education. This usually comes in the form of a **Jupyter Book** to teach communtiy workflows to users, with connections to a **JupyterHub** that provides the data, compute, and software tools needed for data-driven discovery.

{{< figure src="images/discover.png" title="2i2c community hubs use Jupyter technology to help members learn community workflows, and access all the resources needed for a community's data-driven discovery.">}}

## Sharing ideas means sharing computation to reproduce them

When a community member discovers something new, they want to share it with others in their community. This is usually an ongoing process by which community members are giving information and receiving information in return to iteratively drive their exploration and development.
When a community member discovers something new, they want to share it with others in their community. This is an ongoing process where community members are giving and receiving information to iteratively drive their exploration and development.

These ideas are driven by data and computation, and so it is critical to **share the resources needed to reproduce the computation** when sharing new ideas with community members. This allows others to reproduce the main ideas in order to validate the core claims, as well as interact with the code and software to better understand and build upon the ideas.
These ideas are driven by data and computation, and so it is critical to **share the resources needed to reproduce the computation**. This has two benefits:

We can leverage cloud infrastructure to facilitate this process. At a basic level, community hubs are a good way to standardize the software and resource environment so that content produced in one user's session can be easily reproduced by another user. [nbgitpuller](https://nbgitpuller.readthedocs.io/en/latest/) is the most common tool for distributing content within a JupyterHub. As long as the content was made on a JupyterHub, and shared with another user on the same hub, they'll be able to get their own copy and interact with it.
1. It allows others to **validate the core claims** by inspecting code and reproducing the results.
2. It allows others to **learn more effectively** by providing a more rich, interactive space to engage with the ideas.
3. It allows others to **build on top of work more easily** by providing an interactive starting point that lets readers get further with new ideas.

We can leverage cloud infrastructure to facilitate this process. At a basic level, community hubs are a good way to standardize the environment so that content produced in one user's session can be easily reproduced by another user. [nbgitpuller](https://nbgitpuller.readthedocs.io/en/latest/) is the most common tool for distributing content within a JupyterHub. As long as the content was made on a JupyterHub, and shared with another user on the same hub, they'll be able to get their own copy and interact with it.

{{< figure src="images/share.png" title="The simplest sharing workflow uses nbgitpuller to distribute content via links that community members click. Doing so will give them their own copy of the content running in the hub's environment.">}}

Expand All @@ -48,20 +53,26 @@ We can leverage cloud infrastructure to facilitate this process. At a basic leve
However, many authors need more control over the environment needed to reproduce a computation. For example, they might need a _specific version_ of a software package, or they may need a non-standard library that isn't available on their community hub's base environment. How can they share their computational narratives in a way that _bundles the environment_ with the content?

This is what [**the Binder Project**](https://mybinder.org) was built for. It provides a tool called [BinderHub](https://binderhub.readthedocs.io/en/latest/), which is an **engine for sharable, reproducible computational environments**. It allows an author to package up their computational narratives and code, and share access to a cloud environment that allows readers to reproduce and interact with the computation.
Earlier this year our team has been prototyping how to [connect BinderHub services with a JupyterHub](../../2024/jupyterhub-binderhub-gesis/) and how to [deploy BinderHub services alongside each of our community hubs](../../2024/nasa-ephemeral-hubs/).
In 2024 our team prototyped how to [connect BinderHub services with a JupyterHub](../../2024/jupyterhub-binderhub-gesis/) and how to [deploy BinderHub services alongside each of our community hubs](../../2024/nasa-ephemeral-hubs/).
Doing so would allow community members to **build an environment needed to reproduce their computation and share it via their hub**.

{{< figure src="images/reproduce.png" title="BinderHub allows an author to package narrative, code, computational resources, and a software environment into a link that others can use to access a fully reproducible environment for their ideas.">}}

## Empowering communities to bring their own reproducibility when publishing
## Could we empower communities to bring their own reproducibility when publishing?

When community members have a big idea that they'd like to share more formally, want to **publish** rather than just **share**. Publishing is more structured, invites particular kinds of feedback, and requires more quality assurance. There's a huge ecosystem of publishers and services that support formal publishing, and they ensure things like discoverability, long-term archivability, versioning, peer review, DOI referencing, etc.
When community members have a polished idea to share more formally, they want to **publish** rather than just **share**. Publishing is more structured, invites particular kinds of feedback, and requires more quality assurance. There's a huge ecosystem of publishers and services that support formal publishing, and they ensure things like discoverability, long-term archivability, versioning, peer review, DOI referencing, etc.

However, publishers don't have access to the core infrastructure used to make the original discovery, nor do they have the expertise to manage this kind of infrastructure. But 2i2c's community members already have a reproducible environment where they've been doing their discovery. Could we leverage this infrastructure for publishing?

2i2c has an opportunity to empower publishers to reproduce computational ideas by **allowing authors to bring their reproducible environments with them**. By leveraging the same core technologies, authors can re-use their hub infrastructure to provide the resources needed for reproducibility.

This would require treating a community's hub as an **external service for reproducibility** that could be leveraged by one of the many publishing platforms out there. For example, [Curvenote](https://curvenote.com) already builds heavily on the Jupyter and MyST ecosystems and could use a community's hub to provide interactivity to a Curvenote document. Authors could publish pre-prints on [Biorxiv](https://www.biorxiv.org/) that included Binder links back to their community hub [Binder links in arxiv has already been piloted with the Loren Frank lab and HHMI](../../2024/hhmi-spyglass-mysql/)). Larger publishers like [PLOS](https://plos.org/) could allow authors to provide metadata about BinderHubs that can reproduce their content, and weave this into their user interface for displaying publications. These are all _possible_ outcomes that haven't been realized, but I don't think they'd be that far off.
This would require treating a community's hub as an **external service for reproducibility** that could be leveraged by one of the many publishing platforms out there. For example:

- [Curvenote](https://curvenote.com) already builds heavily on the Jupyter and MyST ecosystems and could use a community's hub to provide interactivity to a Curvenote document.
- Authors could publish pre-prints on [Biorxiv](https://www.biorxiv.org/) that included Binder links back to their community hub [Binder links in arxiv has already been piloted with the Loren Frank lab and HHMI](../../2024/hhmi-spyglass-mysql/)).
- Larger publishers like [PLOS](https://plos.org/) could allow authors to provide metadata about BinderHubs that can reproduce their content, and weave this into their user interface for displaying publications.

These are all _possible_ outcomes that haven't been realized, but I don't think they'd be that far off.

{{< figure src="images/publish.png" title="We could re-use the same tools for data-driven discovery to facilitate the reproducibility of computational narratives and publications. The same core infrastructure could be re-used to provide comptutational environments that are re-used in a variety of publishing tools and services. (note this is all hypothetical, no-such service integrations yet exist, and these are just a sample of relevant publishing organizations in this space)">}}

Expand All @@ -80,9 +91,9 @@ A demo of [interactive documents with Thebe](https://readthedocs.org/thebe/en/la

## How to sustain these services?

This brings up an important question: how would you sustain services like these? mybinder.org is built as an entirely free service both to build and to use environments. This is massively accessible, but is not scalable nor sustainable. Would community stakeholders pay for privileged access to BinderHubs that could reproduce and share their computational narratives? Would publishers be willing to pay a percentage of the cloud and management costs associated with reproduction?
This brings up an important question: how would you sustain services like these? [mybinder.org](https://mybinder.org) is an entirely free service both to build and to use environments. This is massively accessible, but is not scalable nor sustainable, and doesn't provide rich computation needed for more complex analyses. Would community stakeholders pay for privileged access to BinderHubs that could reproduce and share their computational narratives? Would publishers be willing to pay a percentage of the cloud and management costs associated with reproduction?

One thing seems clear - there is an imperative to make reproducibility and interaction frictionless. This is both to ensure the scientific integrity of the work being done, and also to make computational ideas more accessible to the world. Technologies and service partnerships like these can help ensure the broader communitys right to reproduce the work of others.
One thing seems clear - there is an imperative to make reproducibility and interaction frictionless. This is both to ensure the scientific integrity of the work being done, and also to make computational ideas more accessible to the world. Technologies and service partnerships like these can help ensure the broader community's right to reproduce the work of others.

Some of these ideas were recently explored in a talk recorded for [AGU 2024](https://agu.confex.com/agu/fm24/meetingapp.cgi/Paper/100644).

Expand All @@ -95,4 +106,6 @@ A talk from 2i2c team member [Jim Colliander](https://2i2c.org/author/jim-collia

## Concluding thoughts

Much of the above is already possible, if a little bit clunky to execute. However, given the growing interest in open infrastructure, iterative science, and improving the publishing ecosystem, I'm hopeful that we'll have opportunities to build these service connections soon. I'd love to those from the more "formal" side of scholarly publishing to understand where these ideas have holes or significant new development. I'd also love feedback on sustainability models to ensure these services can be relied on as part of the publishing ecosystem. In the meantime, hopefully these ideas serve as an inspiration for what is possible, and where we might be heading with 2i2c's service and the broader Jupyter ecosystem.
Much of the above is already possible, if a little bit clunky to execute. However, given the growing interest in open infrastructure, iterative science, and improving the publishing ecosystem, I'm hopeful that we'll have opportunities to build these service connections soon.

I'd love to hear from the more "formal" side of scholarly publishing to understand where these ideas have holes or significant new development. I'd also love feedback on sustainability models to ensure these services can be relied on as part of the publishing ecosystem. In the meantime, hopefully these ideas serve as an inspiration for what is possible, and where we might be heading with 2i2c's service and the broader Jupyter ecosystem.

0 comments on commit 5fd6227

Please sign in to comment.