Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Install nbgitpuller in the base image #2000

Closed
wants to merge 9 commits into from
Closed

Conversation

yuvipanda
Copy link
Contributor

Describe your changes

nbgitpuller is a very popular way to distribute content to users, both on mybinder.org as well as on JupyterHubs. On Binder, this allows for fast launches where the environment is separated from the content (see https://discourse.jupyter.org/t/tip-speed-up-binder-launches-by-pulling-github-content-in-a-binder-link-with-nbgitpuller/922). So by putting nbgitpuller in these images, users can simply pick an image to launch and then just pull in the content they need, instead of having to rebuild their own image. It's also extremely popular when used with JupyterHub, and is included by default in the TLJH distribution.

Checklist (especially for first-time contributors)

  • I have performed a self-review of my code
  • If it is a core feature, I have added thorough tests
  • I will try not to use force-push to make the review process easier for reviewers
  • I have updated the documentation for significant changes

[nbgitpuller](https://github.com/jupyterhub/nbgitpuller/) is a very
popular way to distribute content to users, both on mybinder.org
as well as on JupyterHubs. On Binder, this allows for fast launches
where the environment is separated from the content (see
https://discourse.jupyter.org/t/tip-speed-up-binder-launches-by-pulling-github-content-in-a-binder-link-with-nbgitpuller/922).
So by putting nbgitpuller in these images, users can simply pick an
image to launch and then just pull in the content they need, instead
of having to rebuild their own image. It's also extremely popular when
used with JupyterHub, and is included by default in the
[TLJH](https://tljh.jupyter.org) distribution.
@manics
Copy link
Contributor

manics commented Sep 27, 2023

FYI a previous discussion #751

@yuvipanda
Copy link
Contributor Author

Ah I see! I would really love to see it included in the base image though. As evidence of its usefulness, I can present:

@consideRatio
Copy link
Collaborator

Since I think this project doesn't yet have clear acceptance criterias for introducing new software in images, I'd like to make a thorough effort to establish some precedence on what is worth considering.

  1. Software impact for users of docker-stacks images
    I think @yuvipanda has covered this quite well above already, but I'd like to add that I think nbgitpuller can help reduce the need to build new images. Sometimes its enough to be able to start an image and download something little extra, and if nbgitpuller can make that happen, it can offload users of building a custom image and/or requesting that docker-stacks add something to an image.
  2. Software health, details, and maintenance status
    I consider nbgitpuller to be quite healthy based on:
    • it maintains documentation
    • a changelog is actively maintained
    • semver2 versioning is adopted, and its beyond version 1.0.0
    • has passing tests for various python versions (including py311) and git versions, and tests are passing
    • a release procedure with helpful automation is established
    • multiple people are involved in its maintenance
    • has been around for several years
    • provides a conda-forge package besides a pypi package, where both are kept up to date
    • is a noarch package, it won't be troublesome to build on arm64 etc
  3. Installation consequences
    • nbgitpuller relies on git to be available on the system, jupyter-server, and tornado
    • all of these are available in minimal-notebook already, so adding nbgitpuller there doesn't add something extra
    • it adds 1.5MB to a 1.6GB image according to dive inspecting it
    • it took my computer 15 second to run the additional mamba install step in minimal-notebook, but I expect that most of this time stemmed from doing mamba install at all
  4. docker-stacks "image fit"
  5. Why it shouldn't just be a documented recipie
    • If its not part of the image, it won't need a recipie as its simple to install
    • This package makes sense to have pre-installed or not at all pretty much. For example, a teacher could with it pre-installed ask users to click a link to get started, but without having it pre-installed the teacher wouldn't beneift from having nbgitpuller at all - the point can often be to have students just click a link to get started with curriculum material and not need to ask them to do misc other things first.

My conclusion

Initially I wasn't confident what I thought even though I value nbgitpuller, but after reflecting on the things above, I think its a great idea to add it to minimal-notebook.

Action points

  • Reach agreement to install this or not in some image
    @mathbunnyru @manics what do you think about introducing nbgitpuller?
  • Make at least these updates to this PR:
    • Update nbgitpuller to install in the agreed image
    • Update docs about minimal-notebook to reflect it goes beyond apt packages, but also a general purpose jupyter server extension
    • Update tests suite to also verify nbgitpuller can be imported like done for most other packages I think

References

@manics
Copy link
Contributor

manics commented Oct 1, 2023

Since I think this project doesn't yet have clear acceptance criterias for introducing new software in images, I'd like to make a thorough effort to establish some precedence on what is worth considering.

I'd add a 6th point to those criteria: Impact on security: E.g. Does the package open additional ports, or add new web endpoints, that could be exploited?

  • IIRC (haven't looked at nbgitpuller for ages): nbgitpuller allows arbitrary git repositories to be automatically pulled into the filesystem via a URL. A malicious actor could send an nbgitpuller URL to a user who would launch their singleuser server and clone the malicious repo. Although automatic execution is not possible a naive user could easily load the malicious content.

Based on that I'd be reluctant to automatically push it out to every JupyterHub.

@mathbunnyru
Copy link
Member

I will take a look at this next week. I'm not familiar with nbgitpuller, and I will try it.

@yuvipanda
Copy link
Contributor Author

Thanks @mathbunnyru! Let me know if I can help in any way :)

@mathbunnyru
Copy link
Member

mathbunnyru commented Oct 7, 2023

I agree with almost all the points made by @consideRatio (and thank you for sharing these thoughts, they are extremely useful) and I do agree that impact on security is important as well.

But I have some doubts about this one.

This package makes sense to have pre-installed or not at all pretty much. For example, a teacher could with it pre-installed ask users to click a link to get started, but without having it pre-installed the teacher wouldn't beneift from having nbgitpuller at all - the point can often be to have students just click a link to get started with curriculum material and not need to ask them to do misc other things first.

This assumes that the teacher doesn't want/can't change the single-user image, otherwise, the teacher could have had an image derived from one of our images and installed this package on top (and then shared a link).

I think "our images work perfectly in 95% of use cases" is not feasible. There will always be things/features missing and you only need this one more package.
What is feasible (in my opinion) is "we cover 99% of use cases, but you may need to install a couple of things on top of them, and we're going to make this as easy for you as we can".

That being said, I believe our images should move forward and when new things become widespread (or some things get deprecated), we should adapt our images accordingly. Still, drawing a line between what to include and what not is difficult.

Maybe we should have some kind of vote on what to include/exclude. (and the same for the new images like pytorch-notebook).
I might have some bias when I don't want to add new things to make my life as a maintainer easier, and I definitely can't tell what our community needs (because I'm not always as engaged as other Jupyter members), so I shouldn't a final decision on this question.
Voting should allow us to add new things when there is a certain need in the community.
It takes some time to set up, but it should work better than dealing with these things on an individual basis each time.
I will still have the power to close PRs/feature requests which do not meet some eligible criteria (I think the @consideRatio checklist is a nice example of what these criteria might be), so we only will have this vote happening a few times per year (based on the PRs I've seen for the past few years).
Obviously, people will be able to share their ideas about why this package should/shouldn't be included, so people might change their opinions and will be open for discussion.
At the same time, we will have some deadlines for the vote (like 3-4 weeks), to give everyone a chance to become familiar with the change and vote accordingly.
I would also appreciate it if someone could propose a list of voters. I think @consideRatio, @manics and @yuvipanda might have ideas about who should vote for the changes.
We should also be able to introduce new voters, so sometimes we will have votes for introducing new voters.
If we implement this, obviously #1958 will be solved, as it has exactly the same problem.

Please, tell me what you think. Sorry, I haven't just merged the PR 😆

@mathbunnyru
Copy link
Member

@yuvipanda could you please do things suggested in Action points above, and then I will start the voting process?
We need to think what should we do, when one of the voters suggests a new feature though.

@consideRatio
Copy link
Collaborator

@mathbunnyru I don't think we should ask for the PR to finalize before voting, as I don't think it gives us more information of relevance to the decision on the intent to approve a PR to add ngitpuller. It can be the other way around though!

I suggest that we try to decide if nbgitpuller should be installed to minimal-notebook (currently the PR adds to base-notebook that doesn't have git installed which it needs).

@mathbunnyru
Copy link
Member

Ok, makes sense to me, thanks.
How do you think, what should we do in the case one of the voters suggests a new feature?

@consideRatio
Copy link
Collaborator

consideRatio commented Oct 27, 2023

My take is that nbgitpuller is very reasonable to install in minimal-notebook based on all the criterias 1-5 in the policy, but I'm hesitant about criteria 6 about security.

With nbgitpuller, a new web server endpoint is added that can have files be downloaded into a folder. It can also be used to after download redirect the user to some path, for example to have jupyterlab open a file - or anything else that can be prompted by browsing to some URL.

I think the redirection part isn't a notable concern, that can be done without nbgitpuller by directly providing a link that does something first. I think the download is a concern though. I'm thinking about an exploit could be to craft a link that downloads a folder that will be read by other software, such as configuration folders .config, .local, .aws, .jupyter etc. If they are downloaded into the home folder, they may be unseen for the user not always showing hidden folders, and they may probably lead to malicious code execution by the person providing the malicious nbgitpuller link. nbgitpuller wouldn't create a new folder if it already exists, but could create new folders.

Any package installed in the image being malicious would be trouble, so installing anyting is a risk. Installing nbgitpuller though is introducing another kind of risk though, about clicking links that are maliciously crafted.

I think I'll abstain from voting favorably for now, but think a resolution to jupyterhub/nbgitpuller#330 could make me rethink it.

@mathbunnyru
Copy link
Member

Ok, I think your point is valid and the upstream issue is relevant to solving it, nice work 👍

I'm currently not convinced people aren't able to install this package to benefit from it, so I'm not in favor of merging this package currently.

@mathbunnyru
Copy link
Member

Since I think this project doesn't yet have clear acceptance criterias for introducing new software in images, I'd like to make a thorough effort to establish some precedence on what is worth considering.

I'd add a 6th point to those criteria: Impact on security: E.g. Does the package open additional ports, or add new web endpoints, that could be exploited?

  • IIRC (haven't looked at nbgitpuller for ages): nbgitpuller allows arbitrary git repositories to be automatically pulled into the filesystem via a URL. A malicious actor could send an nbgitpuller URL to a user who would launch their singleuser server and clone the malicious repo. Although automatic execution is not possible a naive user could easily load the malicious content.

Based on that I'd be reluctant to automatically push it out to every JupyterHub.

I think @manics has expressed his opinion here. Please, correct me if your opinion changed.

@manics
Copy link
Contributor

manics commented Oct 27, 2023

After considering jupyterhub/nbgitpuller#330 a bit more I'm even more against it.

A potential mitigation (in addition to the suggestions in jupyterhub/nbgitpuller#330) is to install nbgitpuller but make it's activation conditional on an environment variable. That way anyone running a JupyterHub can include something like NBGITPULLER_ENABLED=1 in their DockerSpawner/KubeSpawner to enable it for all their users.

@manics
Copy link
Contributor

manics commented Oct 27, 2023

Another option- repositories cloned by nbgitpuller most likely need other libraries than what's available in the base/minimal images. The https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#jupyter-datascience-notebook image incorporates a lot of libraries, so a datascience-nbgitpuller image is an option, but it means maintaining yet another image.

And one more thought (prob for a different issue?), what can we do to make it easier for people to publish their own images with extensions? For example, could we have a template repository with a GitHub workflow or action where the workflow/action takes the template parameters:

  • image-name
  • docker-stacks image to use as base
  • list of additional conda packages to install
  • optionally list of other custom scripts/config

The workflow could be written to push to ghcr.io/${{ github.repository }} (not sure of the exact var names) which means minimal editing is required when someone uses the template.

@mathbunnyru
Copy link
Member

Another option- repositories cloned by nbgitpuller most likely need other libraries than what's available in the base/minimal images. The https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#jupyter-datascience-notebook image incorporates a lot of libraries, so a datascience-nbgitpuller image is an option, but it means maintaining yet another image.

I definitely don't want to add another image just for nbgitpuller, in my opinion it's a bit too much for this package and it doesn't just increase the maintenance, but from user perspective they will have to spend more time choosing an image in this case.

And one more thought (prob for a different issue?), what can we do to make it easier for people to publish their own images with extensions? For example, could we have a template repository with a GitHub workflow or action where the workflow/action takes the template parameters:

  • image-name
  • docker-stacks image to use as base
  • list of additional conda packages to install
  • optionally list of other custom scripts/config

The workflow could be written to push to ghcr.io/${{ github.repository }} (not sure of the exact var names) which means minimal editing is required when someone uses the template.

We actually already have this (in a bit different implementations though).
We have a cookiecutter template repo and docs how to use it.

I also made an attempt to not to use cookiecutter and just create a simple example repo (which can be forked and changed), but it didn't get enough attention, so I didn't push it.
https://github.com/mathbunnyru/custom-notebook

@yuvipanda
Copy link
Contributor Author

+1 for not adding another image.

My drive comes from these two facts:

  1. Literally every single JupyterHub used for education that I have encountered uses and requires nbgitpuller
  2. I'd love for people to just use jupyter/docker-stacks unmodified for those use cases, as I believe it currently satisfies 95% of the use case, and nbgitpuller is part of the last 4%.

That said, I agree it's important that people not get docker images that are insecure because of features they don't use at all, and non-educational cases often don't have any need for nbgitpuller! So how about we implement the following changes in nbgitpuller:

With these, nbgitpuller links would only create directories (it already fails if the directory is not empty) based on the name of the repo being cloned.

If that sounds ok, I'm happy to work on implementing these in nbgitpuller.

@yuvipanda
Copy link
Contributor Author

I started jupyterhub/nbgitpuller#332 as a draft to show how we could disable targetPath by default.

@mathbunnyru
Copy link
Member

Thanks @yuvipanda. I am not as optimistic about 95% without external packages as you are.

Still, I see your point about education and JupyterHub + these docker images are an essential part of many educational programs. And I'm pretty sure you're right about the use of nbgitpuller there.

That being said, if you fix 2 action points above, I think I would be ok with merging nbgitpuller to our images.
It would be also great if you considered the comment above:

A potential mitigation (in addition to the suggestions in jupyterhub/nbgitpuller#330) is to install nbgitpuller but make it's activation conditional on an environment variable. That way anyone running a JupyterHub can include something like NBGITPULLER_ENABLED=1 in their DockerSpawner/KubeSpawner to enable it for all their users.

I have no idea if it's easy, fits Jupyter Hub well, or something like that, but I think it's a nice option to have, and this way we'll be a bit more secure. But for me, 2 action points above are a must, and this one is nice to have.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Nov 3, 2023
Both the images in use at UCMerced don't have nbgitpuller
installed in them, while the previous combined image did.

I've been working on getting nbgitpuller into upstream
jupyter docker-stacks (jupyter/docker-stacks#2000)
for a while, but it looks like it'll take a bit longer. So
I've temporarily created a repo (https://github.com/2i2c-org/scipy-notebook-with-nbgitpuller/)
to use here. It's based on the same tag as before, but with
nbgitpuller installed. We can get rid of this once nbgitpuller
lands upstream.

I have created https://github.com/2i2c-org/rocker-with-nbgitpuller/
to do the same for use with rocker. This too is a temporary repo.
I hopefully will have a better longer term solution speced out
next week.

This unblocks ucmerced currently teaching.

Ref https://2i2c.freshdesk.com/a/tickets/1089
@mathbunnyru mathbunnyru marked this pull request as draft January 7, 2024 06:07
@mathbunnyru
Copy link
Member

Marked this as a draft, since we won't be merging this until nbgitpuller issues are fixed.

@mathbunnyru
Copy link
Member

Closing this PR, as there is no activity here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants