Provide URLs to individual docker image layers #98
Here are some notes about what I've looked into.
tl;dr I can download a blob for a layer I get from inspecting a manifest or OCI-layout directory, but I haven't figured out if there's a way to download a layer.tar file produced by docker save.
Time conservation notice: These are pretty incomplete notes on incomplete research, and I've probably injected a good dose of confusion into them.
notes
directory with extracted docker save output
docker pull busybox:latest
mkdir bb-dsave
docker save busybox:latest >bb-dsave/bb.tar
(cd bb-dsave && tar -xf bb.tar)
That will result in a directory that is similar to what our docker adapter produces:
The configuration JSON matches the image's hash:
docker images --no-trunc --quiet busybox:latest
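The same check can be sketched in Python. This is only a rough sketch: it assumes the extracted archive has a top-level manifest.json whose first entry names the config file under a "Config" key, and the function names are made up for illustration.

```python
import hashlib
import json
from pathlib import Path


def config_digest(archive_dir):
    """Return 'sha256:<hex>' of the config file named in manifest.json."""
    root = Path(archive_dir)
    # Assumption: manifest.json is a list with one entry per saved image.
    manifest = json.loads((root / "manifest.json").read_text())
    config_path = root / manifest[0]["Config"]
    return "sha256:" + hashlib.sha256(config_path.read_bytes()).hexdigest()


def matches_image_id(archive_dir, image_id):
    """Compare against the output of `docker images --no-trunc --quiet`."""
    return config_digest(archive_dir) == image_id.strip()
```

For the directory above, that would be something like matches_image_id("bb-dsave", image_id), with image_id taken from the docker images call.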
pulling manifest
Based on the documentation for accessing the manifest, I was hoping to be able to use that image ID to pull down a manifest. Here's an unpolished demo script that uses the first argument as the reference.
import sys
import json
import requests
repo = "library/busybox"
# https://docs.docker.com/registry/spec/auth/token/
auth_url = ("https://auth.docker.io/token?service=registry.docker.io"
            "&scope=repository:{repository}:pull")
resp_auth = requests.get(auth_url.format(repository=repo))
if resp_auth.status_code != 200:
    sys.stderr.write("Failed to authenticate: {}\n"
                     .format(resp_auth.status_code))
    sys.exit(1)
headers = {"Authorization": "Bearer " + resp_auth.json()["token"],
           "Accept": "application/vnd.docker.distribution.manifest.v2+json"}
ref = sys.argv[1]
# https://docs.docker.com/registry/spec/api/#manifest
man_url = "https://registry-1.docker.io/v2/{repository}/manifests/{reference}"
resp_man = requests.get(man_url.format(repository=repo, reference=ref),
                        headers=headers)
if resp_man.status_code != 200:
    sys.stderr.write("Failed to download manifest: {}\n"
                     .format(resp_man.status_code))
    json.dump(resp_man.json(), sys.stderr)
    sys.exit(1)
json.dump(resp_man.json(), sys.stdout)
Passing in the image ID doesn't seem to work:
If I use "latest" instead of the sha256 ref, I can pull down the manifest, and it has the matching image ID. So I'm not sure what I'm missing there.
jq .config.digest <gmout
layer IDs in manifest
The manifest lists just one layer:
jq .layers <gmout
Note that the layer doesn't match the hash of any file in the archive directory:
find bb-dsave -type f | xargs sha256sum | cut -f1 -d' ' | grep 7c9d
It's also not in any of the json files in the archive directory:
grep 7c9d bb-dsave/*.json bb-dsave/65836406f9479e26bb2dc27439df3efdae3c298edd1ea781dcb3ac7a7baae542/json
OCI layout
We can convert that docker archive to an OCI-layout with skopeo (gh-106).
skopeo copy docker-archive:bb-dsave/bb.tar oci:bb-oci:latest
Here's how that directory looks:
Notice that this has the 7c9d20b9b6cda1c5… layer that was in the downloaded manifest but not the docker archive. The other blobs are json files.
downloading blob
Similar to the manifest script, here's a script that downloads a blob (spec here):
import sys
import json
import requests
repo = "library/busybox"
# https://docs.docker.com/registry/spec/auth/token/
auth_url = ("https://auth.docker.io/token?service=registry.docker.io"
            "&scope=repository:{repository}:pull")
resp_auth = requests.get(auth_url.format(repository=repo))
if resp_auth.status_code != 200:
    sys.stderr.write("Failed to authenticate: {}\n"
                     .format(resp_auth.status_code))
    sys.exit(1)
headers = {"Authorization": "Bearer " + resp_auth.json()["token"],
           "Accept": "application/vnd.docker.distribution.manifest.v2+json"}
ref = sys.argv[1]
# Blob endpoint; see https://docs.docker.com/registry/spec/api/
blob_url = "https://registry-1.docker.io/v2/{repository}/blobs/{reference}"
resp_blob = requests.get(blob_url.format(repository=repo, reference=ref),
                         headers=headers)
if resp_blob.status_code != 200:
    sys.stderr.write("Failed to download blob: {}\n"
                     .format(resp_blob.status_code))
    json.dump(resp_blob.json(), sys.stderr)
    sys.exit(1)
# Write the raw bytes; the blob is binary, not text.
sys.stdout.buffer.write(resp_blob.content)
Trying many (I think all) of the blobs I can find from the docker archive layer, I get a "BLOB_UNKNOWN" response. I can however download the layer blob from the OCI layout directory:
python getblob.py sha256:7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b >b1
sha256sum b1
The other files in blobs/ (all json files) get "BLOB_UNKNOWN" responses. The OCI layer finally gives us something we could expose through an annex special remote and register as a URL. It seems like using the OCI layer as the default dataset storage for docker then converting to something |
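One thing that may be worth checking here, sketched below under assumptions: as far as I understand the image spec, the layer digests in a v2 manifest refer to the compressed (gzip) blobs, while the rootfs.diff_ids in the image config refer to the uncompressed tar. If the layer.tar files in the docker save output are uncompressed, that might explain why 7c9d… does not show up there. The helper names are hypothetical.

```python
import gzip
import hashlib


def sha256_digest(data):
    """Format a digest the way the registry does: 'sha256:<hex>'."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify_blob(data, expected_digest):
    """True if the downloaded bytes hash to the digest they were requested by."""
    return sha256_digest(data) == expected_digest


def diff_id(blob):
    """Digest of the uncompressed layer (assumes a gzip-compressed blob)."""
    return sha256_digest(gzip.decompress(blob))
```

Reading b1 as bytes, verify_blob(data, "sha256:7c9d20b9…") should hold for the download above, and diff_id(data) could then be compared with the sha256sum of the layer.tar in bb-dsave.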
@vsoch -- maybe you have ideas/knowledge on how to reference (URLs or request to server) specific layers of a docker image from a hub? Would it be different for Docker Hub and https://quay.io/? |
Why not just pull to Singularity and store the SIF binary? That's one clean command, gets all the layers, and handles the metadata too. @yarikoptic why would you want to save only specific layers? If that's the case then retrieving them via the Docker API is the way to go. Singularity used to do this, also using Python, if you look at the 2.x version. |
As for the layers, you generally need to use the OCI distribution API to request the blobs, as @kyleam was hacking together. That does require getting the manifest, and then querying for each layer. Both Quay.io and Docker Hub use the same OCI distribution format so it shouldn't vary that much between them. |
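A minimal sketch of the "get the manifest, then query for each layer" step: given an already downloaded v2 manifest, build the blob URL for every layer. The function name and the registry default are assumptions for illustration, and on Docker Hub each request would still need the bearer token from the scripts above.

```python
REGISTRY = "https://registry-1.docker.io"  # Docker Hub; pass another base for other registries


def layer_urls(manifest, repository, registry=REGISTRY):
    """Map each layer digest in a v2 manifest to its blob URL."""
    return {
        layer["digest"]: "{}/v2/{}/blobs/{}".format(
            registry, repository, layer["digest"])
        for layer in manifest["layers"]
    }
```

For the busybox manifest above, layer_urls(manifest, "library/busybox") would give one entry, keyed by the sha256:7c9d… digest. Whether the same code works unchanged against quay.io is untested here.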
that is what we did for singularity hub... But good idea to look into Singularity 2.x python code to see how to deal with individual docker layers. |
I mean pull a SIF binary from Docker Hub, a la docker layers. You build a Docker container and then kill two birds with one stone, it can be pulled as docker or Singularity. |
Sorry -- I still don't fully get it: Does Docker Hub contain (provide) a SIF binary? Does a SIF binary contain the layered structure of docker image(s)? The target is to later share this git/git-annex repository in such a way that it could pull docker layers from the hub (if still there ;-)), so it could be used by people on machines without singularity, just docker. |
Docker hub has layers, so the Singularity client pulls the layers into a sandbox and builds an image from it. I added this in old Singularity (2.1 or so) and it's still the way it rolls :) You can't build the SIF binary without Singularity, and the resulting SIF binary wouldn't have any record of the previous layers; they are dumped into one filesystem. If you just want to pull docker layers, then just use the API to get the manifest and do that, it's fairly straightforward. |
re original endeavors of @kyleam: did some digging. To request a manifest by "digest" you need to get the digest (not the image id):
here I used your script but modified it to also pass the repo
#!/usr/bin/env python3
import sys
import json
import requests
# repo = "library/busybox"
# repo = "bitnami/wordpress"
# repo = "library/neurodebian" # /sid"
repo = sys.argv[1]
# https://docs.docker.com/registry/spec/auth/token/
auth_url = ("https://auth.docker.io/token?service=registry.docker.io"
            "&scope=repository:{repository}:pull")
resp_auth = requests.get(auth_url.format(repository=repo))
if resp_auth.status_code != 200:
    sys.stderr.write("Failed to authenticate: {}\n"
                     .format(resp_auth.status_code))
    sys.exit(1)
headers = {"Authorization": "Bearer " + resp_auth.json()["token"],
           "Accept": "application/vnd.docker.distribution.manifest.v2+json"}
ref = sys.argv[2]
# https://docs.docker.com/registry/spec/api/#manifest
man_url = "https://registry-1.docker.io/v2/{repository}/manifests/{reference}"
resp_man = requests.get(man_url.format(repository=repo, reference=ref),
                        headers=headers)
if resp_man.status_code != 200:
    sys.stderr.write("Failed to download manifest: {}\n"
                     .format(resp_man.status_code))
    json.dump(resp_man.json(), sys.stderr)
    sys.exit(1)
json.dump(resp_man.json(), sys.stdout)
and then you could follow up for the specific architecture, again by digest (didn't see how to match to image id yet)
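A rough sketch of those two steps, under assumptions: the manifest digest is the sha256 over the exact bytes the registry serves (so hash the raw response content, not re-serialized JSON), and a manifest list (media type application/vnd.docker.distribution.manifest.list.v2+json) carries one entry per platform. The function names are hypothetical.

```python
import hashlib


def manifest_digest(raw_body):
    """Digest of a manifest as served: sha256 over the exact response bytes."""
    return "sha256:" + hashlib.sha256(raw_body).hexdigest()


def platform_digest(manifest_list, architecture="amd64", os_name="linux"):
    """Return the digest of the first entry matching the platform, or None."""
    for entry in manifest_list.get("manifests", []):
        platform = entry.get("platform", {})
        if (platform.get("architecture") == architecture
                and platform.get("os") == os_name):
            return entry["digest"]
    return None
```

With the script above, that would be manifest_digest(resp_man.content), and the value returned by platform_digest could be passed back in as the reference argument. Note that this ignores platform variants (e.g. several arm entries) and just takes the first match.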
then the next question is whether/how those layers from the manifest relate to the ones
unfortunately I have not found a way to associate with any digest we obtain from |
What are you trying to do @yarikoptic ? The digests are generated based on the hash of the config, which even with "the same" image is going to be different with different timestamps. |
overall:
so far the approach was:
|
I think it would be cleaner to go directly from the Registry API, and then retrieve the exact downloads for the layers and config, which already come with the digests. As I understand docker save, it's going to write on the fly, which means new timestamps and thus new hashes. It would be confusing for the user to see a known image locally (e.g., busybox:vX.X.X) and then not see digests that line up with what is on Docker Hub (or another registry). The benefit of not using docker save is that docker does not become a dependency for datalad. It's also messy to have "the same" layers that appear different because of different timestamps.
On a higher level, is it really reasonable to start saving container images / layers to git? Those are a lot of huge files! There is something to be said for having a registry with URIs (which, depending on the registry, can persist) be the provider of the metadata (digests and links of blobs to images and tags). If datalad aims to become a provider of container layers and artifacts, you might consider looking at the OCI distribution spec so it can provide the same standardized / expected interactions to users. I guess it seems like datalad is trying to be the tool for everything instead of serving a more specific or narrow use case. |
Hi all, I see @asmacdo was assigned to that issue, are there any plan to implement that feature? |
@bpinsard I am assigned so I can investigate but I don't have specific plans yet. FWIW, I tend to agree that we should avoid It's worth noting that we would only be storing metadata in git (the hashes) and the blob bits would be moved around with git-annex via datalad. |
Thanks that's great! I recently opened #199 which is somewhat related. I wonder if the
Indeed that is something I realized was not set by default, hence #204 |
@bpinsard sorry for the delay on this one: I don't think that a special custom remote should be necessary. We "just" need to completely redo how we store docker layers and avoid |
note here: discovered https://github.com/indigo-dc/udocker -- which is quite cool since pure python, and doesn't require installation of docker. But underneath there is some "magic"al downloads etc:
but maybe it could also be used just as a library to download/access/manipulate images etc, or even indeed as yet another executor. |
It was postponed until there is some interest in storing/running docker images, but it seems we didn't even create an issue for that.
There was some interest from users expressed so here is this issue: added docker image layers do not have URLs to point to the docker hub so they could later be fetched on another box:
also maybe there should be a .gitattributes created in the image directory to instruct that .json files be committed directly to git, not annex
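A sketch of what that could look like, assuming git-annex's annex.largefiles attribute is the mechanism (with "nothing" meaning matching files go straight to git). The helper name is hypothetical and this is not how the adapter currently works.

```python
from pathlib import Path

# Assumption: with this rule, git-annex adds *.json files to git directly.
RULE = "*.json annex.largefiles=nothing\n"


def write_gitattributes(image_dir):
    """Append the rule to <image_dir>/.gitattributes unless already present."""
    path = Path(image_dir) / ".gitattributes"
    existing = path.read_text() if path.exists() else ""
    if RULE not in existing:
        if existing and not existing.endswith("\n"):
            existing += "\n"
        path.write_text(existing + RULE)
    return path
```

Calling write_gitattributes on the image directory before adding the files should be enough; calling it again leaves the file unchanged.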