| 1 | +# How to cache git repositories with a reference repo |
| 2 | + |
| 3 | +## The main idea |
| 4 | + |
| 5 | +Use `git clone --reference /repo/with/cache https://some/repo/to/clone`.
| 6 | + |
| 7 | +What does it do?
| 8 | + |
| 9 | +- When cloning a repo, you can specify a reference repo that is searched first for the commits to download.
| 10 | + (If a commit is found in the reference repo, the clone references it instead of storing the data again.
| 11 | + If it is not found, the commit is downloaded and its data saved.)
| 12 | + |
| 13 | + --reference[-if-able] <repository> |
| 14 | + If the reference repository is on the local machine, automatically setup .git/objects/info/alternates to obtain objects from the reference repository. Using an already existing repository as an alternate will require |
| 15 | + fewer objects to be copied from the repository being cloned, reducing network and local storage costs. When using the --reference-if-able, a non existing directory is skipped with a warning instead of aborting the clone. |
| 16 | + |
| 17 | +- The state of the reference repo is not important.
| 18 | + (We don't need to care about force-pushes, merges, local changes, ...)
| 19 | + Only what has been fetched counts.
| 20 | +- We can fetch commits of multiple repositories into one reference repository.
| 21 | + - Alternatively, we can use multiple cache repositories:
| 22 | + `git clone --reference-if-able /cache/some-repo-to-clone https://some/repo/to/clone`
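
To make the difference between the two flags concrete, here is a minimal sketch of a helper that assembles the clone command. The function name `build_clone_command` and its parameters are hypothetical and exist only for illustration; the git flags themselves are the ones quoted above.

```python
from typing import List, Optional


def build_clone_command(
    url: str,
    target: str,
    reference: Optional[str] = None,
    if_able: bool = False,
) -> List[str]:
    """Return the argv for `git clone`, optionally borrowing objects from `reference`."""
    command = ["git", "clone"]
    if reference is not None:
        # --reference aborts the clone when the directory does not exist;
        # --reference-if-able only prints a warning and clones normally.
        flag = "--reference-if-able" if if_able else "--reference"
        command += [flag, reference]
    return command + [url, target]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. Without a `reference`, the helper produces a plain `git clone`, so callers do not need a separate code path.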
| 23 | + |
| 24 | +## Pros and cons
| 25 | + |
| 26 | +- Cloning is much faster for repositories in the cache. |
| 27 | +- Cloning is slower for repositories not present in the cache. |
| 28 | +- Less memory is needed to clone repositories in the cache. |
| 29 | + (This makes it possible to clone the kernel, for example.)
| 30 | +- More memory is needed to clone repositories not present in the cache. |
| 31 | + - e.g. 1000m was needed to clone ogr using a cache repo with kernel-ark and systemd (800m was not enough)
| 32 | +- Less storage is needed for the cloned repo if it is in the cache. |
| 33 | + (Only the current state of the repo is saved; historical commits reference the cache repo.)
| 34 | + |
| 35 | +## Where to store this cache repository? |
| 36 | + |
| 37 | +- The cache does not need to be writable for cloning,
| 38 | + only for creating/updating.
| 39 | +- Persistent volumes can be used. |
| 40 | +- How much storage can we afford?
| 41 | + - Storage is cheap, but git repositories can get really big over time.
| 42 | + - I haven't tested the efficiency of a scenario with a lot of repositories in one cache repository.
| 43 | +- The data itself can be shared between stage and prod if we want. (Probably not wanted and hard to do in OpenShift.)
| 44 | + |
| 45 | +## How to create the cache repository? |
| 46 | + |
| 47 | +- Manually on request. Mount the volume once with more memory and fetch the needed repository. |
| 48 | +- Manually on a Sentry issue. Same as the previous option, but gather the problematic repos in Sentry.
| 49 | +- Start with the kernel manually and add new ones on the go.
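
Whichever trigger is chosen, the manual steps boil down to a few git commands. The sketch below builds them for one cache repo holding several projects. It assumes a bare cache repo with one named remote per project; both are choices made here for illustration. The helper names are hypothetical and the URLs in the usage note are placeholders.

```python
import subprocess
from typing import Dict, List


def cache_setup_commands(cache_dir: str, repos: Dict[str, str]) -> List[List[str]]:
    """Build the git commands that create `cache_dir` and fetch every repo into it."""
    # A bare repo is enough: only the object store is used by the clones.
    commands = [["git", "init", "--bare", cache_dir]]
    for name, url in repos.items():
        # One remote per cached project keeps the refs of the projects apart.
        commands.append(["git", "-C", cache_dir, "remote", "add", name, url])
        commands.append(["git", "-C", cache_dir, "fetch", name])
    return commands


def run_all(commands: List[List[str]]) -> None:
    """Run the commands in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For example, `run_all(cache_setup_commands("/repo/with/cache", {"kernel": "https://example.com/kernel.git"}))` would create the cache and fetch a single project. Adding a repository "on the go" later only needs the `remote add` and `fetch` pair.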
| 50 | + |
| 51 | +## What repositories do we want there?
| 52 | + |
| 53 | +- Just the kernel.
| 54 | +- A group of hardcoded/configured repositories. |
| 55 | +- All repositories matching some condition (a minimum number of commits, a minimum size, ...)
| 56 | +- All repositories. (Add if not present.) |
| 57 | + |
| 58 | +## When do we want to use this mechanism?
| 59 | + |
| 60 | +- Every time. It is possible, but it can be time/memory consuming to use it for new repositories.
| 61 | +- Only for the repos whose URL matches an origin stored in the cache. (Will not work for forks/renames.)
| 62 | + We can cache the list so that we don't need to read it multiple times.
| 63 | +- Use `--reference-if-able` and have one cache repo for each project.
| 64 | + (Similarly to the previous one, forks/renames will not work.)
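
The one-cache-repo-per-project variant needs a stable mapping from a clone URL to a cache directory. Below is a minimal sketch of such a mapping. The naming scheme (host and path joined with dashes, mirroring the `/cache/some-repo-to-clone` example above) and the function names are assumptions, and only URLs with a scheme such as `https://` are handled.

```python
from pathlib import PurePosixPath
from typing import List
from urllib.parse import urlparse


def cache_dir_for(url: str, cache_root: str) -> PurePosixPath:
    """Map a clone URL to its per-project cache directory (hypothetical naming scheme)."""
    parsed = urlparse(url)
    parts = [parsed.netloc] + [part for part in parsed.path.split("/") if part]
    name = "-".join(parts)
    # Treat "repo" and "repo.git" as the same project.
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return PurePosixPath(cache_root) / name


def reference_args(url: str, cache_root: str) -> List[str]:
    """Extra `git clone` arguments; a missing cache directory is only warned about."""
    return ["--reference-if-able", str(cache_dir_for(url, cache_root))]
```

Because the directory is derived from the URL alone, a fork or a renamed project maps to a different (usually missing) directory and is cloned without the cache. This is the limitation noted in the list above.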
| 65 | + |
| 66 | +## How to update the repositories? |
| 67 | + |
| 68 | +Updating some of the bigger repositories can require more memory than our workers have had so far.
| 69 | + |
| 70 | +- Manually. (`git fetch --all`) |
| 71 | +- As a cron job. (daily/hourly/weekly)
| 72 | +- As a Celery task. (As a reaction to an empty queue?)
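
Both the cron job and the task variant would run the same loop, sketched below under the assumption that the caller already knows the list of cache directories. The function names and the injectable `runner` parameter are hypothetical; the command is the `git fetch --all` mentioned above.

```python
import subprocess
from typing import Callable, Iterable, List


def fetch_command(cache_dir: str) -> List[str]:
    """The update command for one cache repo."""
    return ["git", "-C", cache_dir, "fetch", "--all"]


def update_all(cache_dirs: Iterable[str], runner: Callable = subprocess.run) -> List[str]:
    """Update every cache repo and return the directories whose fetch failed."""
    failed = []
    for cache_dir in cache_dirs:
        # Do not raise on failure: one broken or oversized repo
        # should not block the update of the others.
        result = runner(fetch_command(cache_dir), check=False)
        if result.returncode != 0:
            failed.append(cache_dir)
    return failed
```

Returning the failed directories instead of raising lets the caller decide whether to retry them with more memory or just report them.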
| 73 | + |
| 74 | +## How can this be implemented? |
| 75 | + |
| 76 | +Currently, cloning is done lazily in `LocalProject`.
| 77 | + |
| 78 | +1. For the very basic version, we can make packit agnostic to this
| 79 | + by making additional arguments for clone commands configurable:
| 80 | + - [packit] Enhance the schema of the user config.
| 81 | + - [packit] Use this value when cloning (`LocalProject`).
| 82 | + - [deployment] Add a persistent volume.
| 83 | + - [deployment] Use `--reference /persistent/volume` in the service config.
| 84 | +2. Alternatively, we can make the cache directory (optionally) configurable.
| 85 | +3. If we want more clever behaviour, we need to put the cloning logic into packit.
| 86 | +4. Or, we can pass in a method that handles the cloning.
| 87 | + (Defined in the service repo, run in packit.)
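
Option 1 can be sketched as follows. This is not packit's actual schema: the `UserConfig` class shown here, the `clone_extra_args` key and the `clone_command` helper are hypothetical names used only to show how a configured value could reach the clone command.

```python
import shlex
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class UserConfig:
    """Hypothetical slice of the user config; the field name is an assumption."""

    clone_extra_args: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "UserConfig":
        value = raw.get("clone_extra_args", [])
        # Accept both a list and a single shell-like string.
        if isinstance(value, str):
            value = shlex.split(value)
        return cls(clone_extra_args=list(value))


def clone_command(config: UserConfig, url: str, target: str) -> List[str]:
    """Insert the configured arguments between `git clone` and the URL."""
    return ["git", "clone", *config.clone_extra_args, url, target]
```

With `{"clone_extra_args": "--reference /persistent/volume"}` in the service config, every clone would use the cache, while a config without the key keeps today's behaviour. The cloning code itself would never need to know about the cache.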
| 88 | + |
| 89 | +## Is this relevant for the CLI users? |
| 90 | + |
| 91 | +It can help to avoid long cloning times in some commands when:
| 92 | +
| 93 | +- using a temporary dist-git repo (default)
| 94 | +- using a URL instead of a local git repository as input
| 95 | + |
| 96 | +To give it some value, we need to: |
| 97 | + |
| 98 | +- add new repositories automatically |
| 99 | +- provide a way to update the cache (automatically/manually) |
| 100 | + |
| 101 | +It looks like it can be useful if the implementation can be shared, |
| 102 | +but it does not make sense to spend a lot of time on CLI-only code. |