RFE: stop oc image mirror creating duplicate files when mirroring to disk for an airgap install #1388

Open
m-g-k opened this issue Mar 26, 2023 · 2 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.


m-g-k commented Mar 26, 2023

When running a command like:
oc image mirror -f images-mapping-to-filesystem.txt --filter-by-os '.*' --skip-multiple-scopes --max-per-registry=1

some manifest and blob files are duplicated into different folders. For example, if I run this command from inside the root v2 folder after the mirror is complete I see:

find -name sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc -printf "%p %s\n"
./v2/<path1/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path2/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path3/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path4/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path5/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path6/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path7/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path8/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path9/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path10/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821

This shows that out of ~438 MB downloaded, ~394 MB are duplicates. Obviously this is an extreme case, but over a whole airgap mirror I'm seeing that, on average, about 1/3 of the size is taken up by duplicate files, and for some large mirrors I see over 100 GB of duplicates.
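As a sanity check on those figures, the arithmetic can be reproduced directly from the `find` output above: ten copies of one blob, each 43875821 bytes, of which nine are redundant. The variable names here are only illustrative.

```shell
# Values taken from the find output above.
size=43875821   # bytes per copy of the blob
copies=10       # number of paths holding the same digest

total=$((size * copies))          # bytes stored on disk for this one blob
wasted=$((size * (copies - 1)))   # bytes that a single shared copy would save

echo "total:  $total bytes"    # 438758210
echo "wasted: $wasted bytes"   # 394882389
```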

If the command below is run from the root of a mirrored folder on disk (inside the v2 folder), it lists every duplicated file, preceded by the number of copies found and followed by the size of each copy in bytes:

find -name 'sha256:*' -printf "%f %s\n" | sort | uniq -dc | sort -n
 ...
 9 blobs/sha256:5d9ff8920718132b2498fcbe2cfd5477e94d38f7f70e4aa319b44df5bf62a9e0 39235316
10 blobs/sha256:2f19a8cf89693277baaa454087d49d95967ad8872e2bcc44741d4046abaf1cd6 37461527
10 blobs/sha256:c3b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38eba 43875821
14 blobs/sha256:fc70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef 1024
 ...

The example above shows there are 9 copies of the first blob starting sha256:5d9ff... and each one is 39235316 bytes in size.

The command below instead adds up the space lost to duplicates across the whole mirror, so you can see how big the problem is on different mirrors:

find -name 'sha256:*' -printf "%f %s\n" | sort | uniq -dc | sed -e "s/^ *\([0-9]*\) .* \([0-9]*\)/((\1-1)*\2)/" | paste -sd+ | bc | numfmt --to=iec
130G
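
The same total can also be computed in a single `awk` pass, which may be easier to adapt than the `sed`/`bc` pipeline. This is only a sketch: the function name `wasted_bytes` is made up for illustration, and like the commands above it assumes GNU `find` (for `-printf`) and file names without newlines.

```shell
# Print the number of bytes occupied by redundant copies under the directory $1.
# Files are treated as duplicates when both the file name (the digest) and the size match.
wasted_bytes() {
  find "$1" -type f -name 'sha256:*' -printf '%f %s\n' |
    awk '
      { count[$0]++; size[$0] = $2 }   # key is "name size"; remember the size per key
      END {
        for (k in count) total += (count[k] - 1) * size[k]   # keep one copy, count the rest
        printf "%.0f\n", total                               # %.0f avoids integer limits in some awks
      }'
}
```

For example, `wasted_bytes . | numfmt --to=iec` run from the mirror root prints a human-readable total.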

Given that the main purpose of oc image mirror is to mirror a registry to prepare for an airgap install, this is a lot of wasted space and time when mirroring large repositories. Therefore, it would be really helpful to eliminate the duplicates, perhaps by using the link file mechanism that some registries use internally, such as the manifestTagIndexEntryLinkPathSpec and the layerLinkPathSpec from distribution.
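
To illustrate the kind of sharing I mean, here is a rough post-processing sketch that replaces duplicate copies on disk with hard links to the first copy. It is not what `oc image mirror` does today and it is not the registry link-file mechanism mentioned above; it is just a stand-in to show the space could be reclaimed. The function name `dedupe_blobs` is hypothetical, it assumes GNU `find`, a single filesystem, and paths without newlines, and I have not verified that every tool that later reads or copies the mirror preserves hard links.

```shell
# Replace duplicate sha256:* files under $1 with hard links to the first copy found.
# Files count as duplicates only if name and size match AND the contents compare equal.
dedupe_blobs() {
  find "$1" -type f -name 'sha256:*' -printf '%f %s %p\n' | sort |
    {
      prev=""    # "name size" key of the current group
      first=""   # path of the first file seen in the current group
      while read -r name size path; do
        key="$name $size"
        if [ "$key" = "$prev" ]; then
          # Skip files that are already the same inode; otherwise verify content and link.
          if ! [ "$first" -ef "$path" ] && cmp -s "$first" "$path"; then
            ln -f "$first" "$path"
          fi
        else
          prev=$key
          first=$path
        fi
      done
    }
}
```

Running `dedupe_blobs .` from the mirror root and then re-running the summary command should show how much space comes back, although `find` will still list every path, so compare `du -sh` before and after instead.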

Happy to provide more information if required.

MGK

@m-g-k m-g-k changed the title oc image mirror creates duplicate files when mirroring to disk for an airgap install RFE: oc image mirror creates duplicate files when mirroring to disk for an airgap install Mar 26, 2023
@m-g-k m-g-k changed the title RFE: oc image mirror creates duplicate files when mirroring to disk for an airgap install RFE: stop oc image mirror creating duplicate files when mirroring to disk for an airgap install Mar 26, 2023
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2023

m-g-k commented Jul 21, 2023

/lifecycle frozen
This is still an issue. Does anyone have any thoughts? Thanks!

@openshift-ci openshift-ci bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 21, 2023