Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: set up git lfs for larger files #27923

Closed
wants to merge 10 commits into from
Closed

Conversation

mistercrunch
Copy link
Member

@mistercrunch mistercrunch commented Apr 5, 2024

SUMMARY

[this is DRAFT but is a bit of a request for comment]

I recently realized that even shallow clones of this repo are pretty large, with a fair amount of that coming from video and image files.

Here I setup the large video in our README to be stored in git lfs as well as image files for the docs.

This change should be 100% seamless for anyone who has git-lfs installed (and GitHub Actions have it on by default, meaning a normal clone/checkout operation will fetch the lfs assets) and shouldn't affect common workflows.

What I did to setup this PR:

git lfs track *.mp4
git lfs track docs/static/img/*.jpg
git lfs track docs/static/img/*.png

Also, to reap the benefits in CI, I'm setting up all of the actions/checkout (except for the docs-building ones) to systematically not bring those files as they aren't required for most CI workloads, with the exception of building docs.

Given the load on CI and how much we checkout the repo there, this should have a significant impact on speed and bandwidth of CI. I estimated about 1/2 the size from this handful of files

Note that this change only affects the repo moving forward, meaning only shallow clones will see a benefit. Also note that by default all action/checkout are shallow clones by default. Also note that there is a possibility of doing this retroactively, but it requires modifying git history (possibly affecting all shas/refs) and I would not recommend that at this time.

Testing

  • [] test that docs will build properly in CI
  • [] test that the mp4 video is accessible on the branch

@github-actions github-actions bot added doc Namespace | Anything related to documentation github_actions Pull requests that update GitHub Actions code preset-io labels Apr 5, 2024
@mistercrunch mistercrunch marked this pull request as ready for review April 8, 2024 16:35
@mistercrunch mistercrunch force-pushed the setup-git-lfs branch 3 times, most recently from 9e1fb3e to 2607939 Compare April 10, 2024 21:34
@rusackas
Copy link
Member

The only thing I'd ask for here is updates to the docs, particularly around testing/contrinbuting to the docs. If the process to check out and run the docs with images/video is different, we just need those steps documented. Also, we should soon be re-jiggering DB logos, and adding more release notes, so if the process for adding those files differs from what we've done traditionally, we'll want to document that and perhaps add CI safeguards.

RELEASING/**/*.jpg filter=lfs diff=lfs merge=lfs -text
RELEASING/**/*.png filter=lfs diff=lfs merge=lfs -text
RELEASING/**/*.gif filter=lfs diff=lfs merge=lfs -text
docs/static/img/**/*.jpg filter=lfs diff=lfs merge=lfs -text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, I was just going to ask this supports wildcards. Not sure we even need the file extensions here, as people might add svg, webp, mov, mp4, who knows.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I could be less specific here and include the docs/static/img/*. For RELEASING I could probably find a better glob too, but it feels like this works well enough. Goal is to get the bulk of the binaries out.

I terms of doing it for the whole repo, it can create confusion for people who don't have git-lfs installed, and it can become hard to define which process in CI need or don't need which part of the lfs. For now I was able to switch off lfs downloads for everything EXCEPT docs building and docker building (because of the geojson, we do want that when building dockers).

@rusackas
Copy link
Member

rusackas commented Apr 11, 2024

100% seamless for anyone who has git-lfs installed

This is the part that worries me... I assume most people don't?

I recently realized that even shallow clones of this repo are pretty
large, with a fair amount of that coming from video and image files.

Here I setup the large video in our README to be stored in git lfs
as well as image files for the docs.

This change should be 100% seemless for anyone who has git-lfs installed
and shouldn't affect common workflows.

Also, to reap the benefits in CI, I'm setting up the checkout actions
to more systematically not bring those files as they aren't required
for most CI workloads, with the exception of building docs.

Given the load on CI and how much we checkout the repo there, this
should have a significant impact on speed and bandwidth of CI. I
estimated about 1/2 the size from this handful of files

Note that this change only affects the repo moving forward, meaning
only shallow clones will see a benefit. There's a possibility of doing
this retroactively, but it requires modifying git history and I would
not recommend that.
@mistercrunch
Copy link
Member Author

oh wow!

  • shallow clone is 0m20.222s VS 1m4.067s (3x faster)
  • size of .git 35M VS 161M (21% of the space)

GPT-generated benchmark script

#!/bin/bash

# Repository URL
REPO_URL="https://github.com/your-repo.git"

# Prevent Git LFS from fetching files during clone
export GIT_LFS_SKIP_SMUDGE=1

# Function to clone and measure repo size
clone_and_measure() {
    local branch=$1
    local dir_name="${branch}_clone"
    
    echo "Cloning $branch branch..."
    time git clone --depth 1 --branch $branch $REPO_URL $dir_name

    echo "Calculating .git size for $branch branch..."
    du -sh $dir_name/.git

    echo "Calculating repository size excluding .git for $branch branch..."
    # Using find to list all files and directories except .git, then calculating size with du
    find $dir_name -type d -name .git -prune -o -print | du -ch | tail -n 1
}

# Clone and measure master branch
clone_and_measure "master"

# Clone and measure log_queries branch
clone_and_measure "log_queries"

@rusackas
Copy link
Member

Just noticed the geojson files are now moved to git-lfs, that means you need to have/use git-lfs to run Superset with maps now. This really ought to get covered in all the relevant points in documentation before we merge it (how to pull/run things, and how to contribute/commit things as well).

@mistercrunch
Copy link
Member Author

Ok this is a bit disruptive, decided to go with this approach instead -> #27998

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Namespace | Anything related to documentation github_actions Pull requests that update GitHub Actions code preset-io size/M
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants