Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Git-LFS support #10

Open
vaygr opened this issue Oct 5, 2015 · 12 comments
Open

Git-LFS support #10

vaygr opened this issue Oct 5, 2015 · 12 comments

Comments

@vaygr
Copy link

vaygr commented Oct 5, 2015

Hello,

is it possible to use lfs-enabled git repositories in the official images' specs? according to that, git-lfs binary must be installed on the side they're pulled/cloned to.

Thanks in advance.

@andyshinn
Copy link

I'm curious about this as well. The current workflow of maintaining tarballs as commits in git repositories is not very ideal. But I'm happy to pay for some git space somewhere to keep the tarballs in check. The workflow actually shouldn't change much. It seems like it would mostly just need the git LFS binaries installed on whatever builder hosts are cloning the repository.

@tianon
Copy link
Member

tianon commented Dec 14, 2015

I'm definitely interested in exploring Git LFS, especially now that GitLab supports it too. I think my biggest concern with it is making sure that it's not too invasive an addition on top of Git itself (ie, trivial to install, etc), since Git is widely packaged and available, but I don't know how easy Git LFS is to get set up with (since I haven't had a chance to play with it yet). I'm not very comfortable requiring a large additional non-trivial installation for re-building the official images from scratch unless there are some obvious and large benefits to outweigh it. For example, https://bugs.debian.org/792075 looks promising, but is still a ways out (which would be how it'd become an actual package in Debian and Ubuntu, because it appears to be written in Ruby, which for me personally is a point against it).

As a side note, if Git LFS could potentially give us the ability to "diff" tarballs, that'd be a real killer feature for me that would prioritize this over our current force-push insanity. 😄

@vaygr
Copy link
Author

vaygr commented Dec 15, 2015

Well, in our source-based Linux distro, for example, the only dependency for git-lfs is go.

@justincormack
Copy link

@tianon at some point it was rewritten in Go https://github.com/github/git-lfs so it is not Ruby any more.

Had pretty good experience with it for a large project that at one time had a lot of binaries checked in.

@tianon
Copy link
Member

tianon commented Oct 25, 2016

Yeah, that's definitely nice! Thanks for chiming in. ❤️

I think this has sat stale long enough that I should write up my current concerns, just so they're clear (not as an argument for or against, just to keep the info current; I'm generally +1 on the idea):

  • what's the simplest way to install git-lfs, especially in Debian and Windows? (native distro packages would be preferred, but the Go dependency chain looks to be holding that back)
  • what's the failure mode like when git-lfs isn't installed? does the filename we actually want exist as metadata, or is it completely nonexistent causing a more clear failure?
  • how difficult is it for external tooling to interact with git-lfs? we currently use git archive to get our context tarball directly from Git itself without ever doing a clone or checkout (just a fetch), so I'm guessing we'd need to invoke git-lfs separately somehow?
    • bashbrew is written in Go, so might it be possible to use a library to include git-lfs support directly in our build tool?

    • https://github.com/github/git-lfs/tree/97306ee58215a3c7730e8588d57d1db133b5df75#using-lfs-from-other-go-code 😢

      At the moment git-lfs is only focussed on the stability of its command line interface, and the server APIs. The contents of the source packages is subject to change. We therefore currently discourage other Go code from depending on the git-lfs packages directly; an API to be used by external Go code may be provided in future.

@tianon
Copy link
Member

tianon commented Oct 25, 2016

More color specifically about git archive:

  • git archive doesn't include LFS files git-lfs/git-lfs#1322: git archive doesn't include LFS files

    I hoped that git archive would use the configured smudge filter, but apparently not. We'd have to modify git archive to support LFS, which probably won't happen since it's a core command, and LFS is an uncommon extension. I think implementing a git lfs archive command would be better.

Which then points to https://github.com/github/git-lfs/blob/master/ROADMAP.md where it's listed under "Possible Features" with a link back to that GitHub issue. 😢

@justincormack
Copy link

Install is pretty simple see https://git-lfs.github.com/ - there are packages for every OS. We had a team across Linux, OSX and Windows with no issues.

Files are actually stored as a small text file that says where the real data is.

There is reasonable tooling to convert history in both directions (we redid our history and converted all the old checkins to git lfs) but if you rewrite history is is of course an issue - it was easier for us as it was a private repo so we could coordinate all the users.

Not sure about git archive...

@tianon
Copy link
Member

tianon commented Oct 26, 2016

Spent a second playing with this a little bit (https://github.com/tianon/test-lfs).

The "failure mode" looks pretty sane, actually:

$ git clone [email protected]:tianon/test-lfs.git
Cloning into 'test-lfs'...
remote: Counting objects: 7, done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 7 (delta 0), reused 4 (delta 0), pack-reused 0
Receiving objects: 100% (7/7), done.
Checking connectivity... done.
git-lfs smudge -- 'git-lfs-linux-amd64-1.4.4.tar.gz': git-lfs: command not found
error: external filter git-lfs smudge -- %f failed -1
error: external filter git-lfs smudge -- %f failed
fatal: git-lfs-linux-amd64-1.4.4.tar.gz: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

Additionally, it looks like git archive does run filters, but since we don't clone a working copy, it fails:

tianon@00f82f5ec123:/tmp/test-lfs$ git init
Initialized empty Git repository in /tmp/test-lfs/.git/
tianon@00f82f5ec123:/tmp/test-lfs$ git fetch https://github.com/tianon/test-lfs.git
remote: Counting objects: 7, done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 7 (delta 0), reused 4 (delta 0), pack-reused 0
Unpacking objects: 100% (7/7), done.
From https://github.com/tianon/test-lfs
 * branch            HEAD       -> FETCH_HEAD
tianon@00f82f5ec123:/tmp/test-lfs$ git archive 42e18259212eae0301fabbd7883fe97625a3eba4 | tar -t   
Downloading git-lfs-linux-amd64-1.4.4.tar.gz (2.98 MB)
Error downloading object: git-lfs-linux-amd64-1.4.4.tar.gz (2c1de8d00759587a93eb78b24c42192a76909d817214d4abc312135c345fbaca)

Errors logged to /tmp/test-lfs/.git/lfs/objects/logs/20161026T140914.513150585.log
Use `git lfs logs last` to view the log.
error: external filter git-lfs smudge -- %f failed 2
error: external filter git-lfs smudge -- %f failed
fatal: git-lfs-linux-amd64-1.4.4.tar.gz: smudge filter lfs failed
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
tianon@00f82f5ec123:/tmp/test-lfs$ cat .git/lfs/objects/logs/20161026T140914.513150585.log 
git-lfs/1.4.4 (GitHub; linux amd64; go 1.7.3; git cbf91a9)
git version 2.1.4

$ git-lfs smudge -- git-lfs-linux-amd64-1.4.4.tar.gz
Error downloading object: git-lfs-linux-amd64-1.4.4.tar.gz (2c1de8d00759587a93eb78b24c42192a76909d817214d4abc312135c345fbaca)

Smudge error: Error downloading 2c1de8d00759587a93eb78b24c42192a76909d817214d4abc312135c345fbaca: http: Post objects/batch: unsupported protocol scheme "": http: Post objects/batch: unsupported protocol scheme ""
github.com/github/git-lfs/errors.newWrappedError
    /Users/ttaylorr/dev/go/src/github.com/github/git-lfs/errors/types.go:166
github.com/github/git-lfs/errors.NewSmudgeError
    /Users/ttaylorr/dev/go/src/github.com/github/git-lfs/errors/types.go:252
github.com/github/git-lfs/lfs.PointerSmudge
    /Users/ttaylorr/dev/go/src/github.com/github/git-lfs/lfs/pointer_smudge.go:69
github.com/github/git-lfs/lfs.(*Pointer).Smudge
    /Users/ttaylorr/dev/go/src/github.com/github/git-lfs/lfs/pointer.go:64
github.com/github/git-lfs/commands.smudgeCommand
    /Users/ttaylorr/dev/go/src/github.com/github/git-lfs/commands/command_smudge.go:66
github.com/github/git-lfs/vendor/github.com/spf13/cobra.(*Command).execute
    /Users/ttaylorr/dev/go/src/github.com/github/git-lfs/vendor/github.com/spf13/cobra/command.go:477
github.com/github/git-lfs/vendor/github.com/spf13/cobra.(*Command).Execute
    /Users/ttaylorr/dev/go/src/github.com/github/git-lfs/vendor/github.com/spf13/cobra/command.go:551
github.com/github/git-lfs/commands.Run
    /Users/ttaylorr/dev/go/src/github.com/github/git-lfs/commands/run.go:65
main.main
    /Users/ttaylorr/dev/go/src/github.com/github/git-lfs/git-lfs.go:33
runtime.main
    /usr/local/Cellar/go/1.7.3/libexec/src/runtime/proc.go:183
runtime.goexit
    /usr/local/Cellar/go/1.7.3/libexec/src/runtime/asm_amd64.s:2086

ENV:
LocalWorkingDir=/tmp/test-lfs
LocalGitDir=/tmp/test-lfs/.git
LocalGitStorageDir=/tmp/test-lfs/.git
LocalMediaDir=/tmp/test-lfs/.git/lfs/objects
LocalReferenceDir=
TempDir=/tmp/test-lfs/.git/lfs/tmp
ConcurrentTransfers=3
TusTransfers=false
BasicTransfersOnly=false
BatchTransfer=true
SkipDownloadErrors=false
FetchRecentAlways=false
FetchRecentRefsDays=7
FetchRecentCommitsDays=0
FetchRecentRefsIncludeRemotes=true
PruneOffsetDays=3
PruneVerifyRemoteAlways=false
PruneRemoteName=origin
AccessDownload=none
AccessUpload=none
DownloadTransfers=basic
UploadTransfers=basic
GIT_PREFIX=

(it looks like git lfs smudge is looking at the local working copy explicitly to find the metadata necessary to fetch the file 😞)

I wonder if there's a way to disable that filters invocation so that we can invoke something after the fact instead.

@tianon
Copy link
Member

tianon commented Oct 26, 2016

http://stackoverflow.com/q/37457367/433558 is not encouraging... 😞

@tianon
Copy link
Member

tianon commented Jul 21, 2017

Checking in to note that I've just tested git-lfs-2.2.1, and got the same result as above (https://github.com/docker-library/official-images/issues/1095#issuecomment-256481338; git archive still unsupported).

@tianon
Copy link
Member

tianon commented Aug 14, 2019

I've figured out a flow that works (on Git LFS 2.8.0) -- we can use git lfs ls-files 'GitCommit' to determine whether any LFS files need to be fetched, and git lfs fetch 'GitRepo' 'GitCommit' to get them fetched into our local directory, after which git archive ... will include the LFS objects!

(Edit: To be clear, we have to git lfs fetch manually instead of letting git archive's smudge filter do it automatically because our bare Git repo doesn't have any remotes configured for it to use, nor would it be able to properly determine which one to use automatically.)

@tianon
Copy link
Member

tianon commented Aug 15, 2019

So far, git lfs ls-files COMMIT was the best way I found to determine whether we need to run fetch, but the performance hit is pretty significant (not to mention I'm running into a bad rash of too many open files errors thanks to issues like git-lfs/git-lfs#3752 and our repository being massive and my default ulimit -n being set to 1024 by Debian's defaults).

Maybe we need to do something clever with git show COMMIT:.gitattributes to detect whether an lfs filter is in use at all? Not sure whether that would be faster, but would have the benefit of being pure-go instead of shelling out (since we've got https://github.com/src-d/go-git, which unfortunately doesn't support LFS; yet? src-d/go-git#1110 😄 😇).

We could use that as an initial smoke test for whether we need to run ls-files to determine whether any LFS objects actually exist in a given commit?

Edit: using this as a scratchpad:

  • For context, the "significant performance hit" is such that before the change, I can gather the FROM values from every current image in ~1.348s (across all supported architectures), but after it jumps to ~11m39.467s ❗

  • To implement .gitattributes checking properly, we'd need to check all of some/sub/dir/.gitattributes, some/sub/.gitattributes, some/.gitattributes, and .gitattributes (a bit hairy, but not insurmountable and potentially faster than git lfs ls-files ... which seems to be doing a lot more work than we need to for a at least a basic "do we care about LFS for this commit+directory combination?)

  • Doing a really naive .gitattributes parsing, I've managed ~2.519s (vs ~1.348s without LFS entirely).

  • With a simple cache in place, I'm down to ~1.6s for time bashbrew from --all > /dev/null.

Edit (2022-02-16): linking to the new issue on the new go-git repo instead: go-git/go-git#381

@tianon tianon transferred this issue from docker-library/official-images May 8, 2020
tianon added a commit to docker-library/official-images that referenced this issue Jun 30, 2020
…breaking diff/test

See docker-library/bashbrew#10 for details and #8282 for the PR which failed and caused us to notice. 😅
tianon added a commit to infosiftr/bashbrew that referenced this issue Aug 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants