Commit 630bbbe

Describe caching of git repositories
Signed-off-by: Frantisek Lachman <[email protected]>
1 parent f12c393 commit 630bbbe

1 file changed

+102
-0
lines changed

caching_of_git_repos/README.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
1+
# How to cache git repositories with a reference repo
2+
3+
## The main idea
4+
5+
Using `git clone --reference /repo/with/cache https://some/repo/to/clone`.
6+
7+
What does it do?
8+
9+
- When cloning a repo, you can specify a reference repo that is searched first for the commits to download.
10+
(If a commit is found in the reference repo, the clone refers to it there instead of downloading and storing the data again.
11+
If it is not found, the commit is downloaded and its data is stored in the clone.)
12+
13+
--reference[-if-able] <repository>
14+
If the reference repository is on the local machine, automatically setup .git/objects/info/alternates to obtain objects from the reference repository. Using an already existing repository as an alternate will require
15+
fewer objects to be copied from the repository being cloned, reducing network and local storage costs. When using the --reference-if-able, a non existing directory is skipped with a warning instead of aborting the clone.
16+
17+
- The state of the reference repo is not important.
18+
(We don't need to care about force-pushes, merging, local changes, ...)
19+
Only what has been fetched counts.
20+
- We can fetch the commits of multiple repositories into one reference repository.
21+
- Alternatively, we can use several cache repositories, one per project:
22+
`git clone --reference-if-able /cache/some-repo-to-clone https://some/repo/to/clone`
23+
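
The idea above can be tried without any network access. The following is a minimal sketch, assuming a throwaway local repository stands in for the remote project; the directory names (`upstream`, `cache.git`, `clone`) are made up for the illustration. It builds a bare cache repo, clones with `--reference`, and prints the alternates file that git sets up.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)

# Stand-in for the remote project (a local repo instead of a real URL).
git init --quiet "$work/upstream"
echo "hello" > "$work/upstream/README"
git -C "$work/upstream" add README
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit --quiet -m "initial commit"

# The cache: a bare repo that is only ever fetched into.
git init --quiet --bare "$work/cache.git"
git -C "$work/cache.git" remote add upstream "file://$work/upstream"
git -C "$work/cache.git" fetch --quiet --all

# Clone over file:// so the normal transport is used, as with a remote URL.
git clone --quiet --reference "$work/cache.git" \
    "file://$work/upstream" "$work/clone"

# The clone records the cache's object directory as an alternate.
cat "$work/clone/.git/objects/info/alternates"
```

Running `git count-objects -v` inside the clone also lists the alternate, which is a quick way to check that the cache is actually in use.
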
24+
## Pros & Cons
25+
26+
- Cloning is much faster for repositories in the cache.
27+
- Cloning is slower for repositories not present in the cache.
28+
- Less memory is needed to clone repositories in the cache.
29+
(Which makes it possible to clone kernel for example.)
30+
- More memory is needed to clone repositories not present in the cache.
31+
- e.g. 1000m was needed to clone ogr using a cache repo with kernel-ark and systemd (800m was not enough)
32+
- Less storage is needed for the cloned repo if it is in the cache.
33+
(Only the current state of the repo is saved, historical commits reference the cache repo.)
34+
35+
## Where to store this cache repository?
36+
37+
- The cache does not need to be writable for cloning.
38+
Only for creating/updating.
39+
- Persistent volumes can be used.
40+
- How much storage can we afford?
41+
- Storage is cheap, but git repositories can get really big over time.
42+
- I haven't tested the efficiency of the scenario with a lot of repositories in one cache repository.
43+
- The data itself can be shared between stage/prod if we want. (Probably not wanted and hard to do in OpenShift.)
44+
45+
## How to create the cache repository?
46+
47+
- Manually on request. Mount the volume once with more memory and fetch the needed repository.
48+
- Manually on a Sentry issue. Same as the previous option, but the problematic repos are gathered in Sentry.
49+
- Start with kernel manually and add new ones on the go.
50+
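
Adding repositories "on the go" could look roughly like the sketch below. It assumes one bare cache repo in which each project is a separate remote; `cache_add` is a hypothetical helper, not existing packit code, and the two local repos only stand in for real projects.

```shell
#!/bin/sh
set -eu

# cache_add CACHE_DIR REMOTE_NAME REMOTE_URL (hypothetical helper):
# create the bare cache repo if needed, register the remote if it is new,
# and fetch it.
cache_add() {
    cache_dir=$1 remote_name=$2 remote_url=$3
    [ -d "$cache_dir" ] || git init --quiet --bare "$cache_dir"
    if ! git -C "$cache_dir" remote get-url "$remote_name" >/dev/null 2>&1; then
        git -C "$cache_dir" remote add "$remote_name" "$remote_url"
    fi
    git -C "$cache_dir" fetch --quiet "$remote_name"
}

# Demo with two throwaway local repos standing in for real projects.
work=$(mktemp -d)
for project in project-a project-b; do
    git init --quiet "$work/$project"
    git -C "$work/$project" -c user.name=demo -c user.email=demo@example.com \
        commit --quiet --allow-empty -m "first commit in $project"
    cache_add "$work/cache.git" "$project" "file://$work/$project"
done

# List the projects currently held in the cache.
git -C "$work/cache.git" remote
```

Calling `cache_add` again for an already known project only fetches, so the same helper can be reused for manual requests and for repos reported later.
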
51+
## What repositories do we want there?
52+
53+
- Just kernel.
54+
- A group of hardcoded/configured repositories.
55+
- All repositories matching some condition (at least some commits, some size, ...)
56+
- All repositories. (Add if not present.)
57+
58+
## When do we want to use this mechanism?
59+
60+
- Every time. It is possible, but it can be time/memory consuming to use it for new repositories.
61+
- Only on the repos matching the origin. (Will not work for forks/renames.)
62+
We can cache the list so that we don't need to read it multiple times.
63+
- Use `--reference-if-able` and have one cache repo for each project.
64+
(Similarly to the previous option, forks/renames will not work.)
65+
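
The per-project variant could be sketched as follows. The naming scheme (drop the URL scheme, replace `/` with `-`) is an assumption modelled on the `/cache/some-repo-to-clone` example earlier in this document, and both helpers are hypothetical. Because the cache path is derived from the URL, a fork or a renamed project maps to a different (probably missing) cache directory, and `--reference-if-able` then falls back to a plain clone with a warning.

```shell
#!/bin/sh
set -eu

# cache_dir_for ROOT URL (hypothetical): drop the scheme, turn "/" into "-".
cache_dir_for() {
    printf '%s/%s\n' "$1" \
        "$(printf '%s' "$2" | sed -e 's#^[a-z]*://##' -e 's#/#-#g')"
}

# clone_cached ROOT URL TARGET (hypothetical): use the project's cache
# if it exists, otherwise clone normally.
clone_cached() {
    git clone --quiet --reference-if-able "$(cache_dir_for "$1" "$2")" "$2" "$3"
}

cache_dir_for /cache https://some/repo/to/clone   # prints /cache/some-repo-to-clone

# Demo: there is no cache for this project, so the clone still succeeds.
work=$(mktemp -d)
git init --quiet "$work/upstream"
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "initial commit"
clone_cached "$work/cache" "file://$work/upstream" "$work/clone"
```
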
66+
## How to update the repositories?
67+
68+
Updating some of the bigger repositories can require more memory than we used to have for workers.
69+
70+
- Manually. (`git fetch --all`)
71+
- As a cron job. (daily vs. hourly vs. weekly)
72+
- As a celery task. (As a reaction to an empty queue?)
73+
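
Whichever trigger is chosen, the update itself is a `git fetch --all` in each cache repo. Below is a minimal sketch, assuming the caches are bare repos named `*.git` under one root directory; `update_caches` and the directory layout are hypothetical.

```shell
#!/bin/sh
set -eu

# update_caches ROOT (hypothetical helper): refresh every bare repo in ROOT.
update_caches() {
    for repo in "$1"/*.git; do
        [ -d "$repo" ] || continue   # the glob matched nothing
        git -C "$repo" fetch --quiet --all
    done
}

# Demo: a throwaway upstream and one cache repo that tracks it.
work=$(mktemp -d)
git init --quiet "$work/upstream"
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "first"
git init --quiet --bare "$work/caches/project.git"
git -C "$work/caches/project.git" remote add origin "file://$work/upstream"
update_caches "$work/caches"

# Upstream moves on; the next update brings the new commit into the cache.
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "second"
new_commit=$(git -C "$work/upstream" rev-parse HEAD)
update_caches "$work/caches"
git -C "$work/caches/project.git" cat-file -t "$new_commit"
```

The same function could be called from a cron job or wrapped in a task; only the scheduling differs.
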
74+
## How can this be implemented?
75+
76+
Currently, cloning is done lazily in LocalProject.
77+
78+
1. For the most basic version, we can keep packit agnostic to this
79+
   by making additional arguments for the clone commands configurable:
80+
   - [packit] Enhance the schema of the user config.
81+
   - [packit] Use this value when cloning (LocalProject).
82+
   - [deployment] Add a persistent volume.
83+
   - [deployment] Use `--reference /persistent/volume` in the service config.
84+
2. Alternatively, we can have the cache directory configurable (optionally).
85+
3. If we want smarter behaviour, we need to put the cloning logic into packit.
86+
4. Or we can pass in a method that handles the cloning.
87+
(Defined in the service repo, run in packit.)
88+
89+
## Is this relevant for the CLI users?
90+
91+
This can help to avoid long cloning during some commands:
92+
93+
- using a temporary distgit repo (default)
94+
- using a URL instead of a local git repository as the input
95+
96+
To give it some value, we need to:
97+
98+
- add new repositories automatically
99+
- provide a way to update the cache (automatically/manually)
100+
101+
It looks like it can be useful if the implementation can be shared,
102+
but it does not make sense to spend a lot of time on CLI-only code.
