Description
I think we need to reconsider our model for the distribution of input tarballs/distfiles into live-bootstrap.
State of play
We have three "distinct"-ish sections of the bootstrap in this area, each of which has been treated with somewhat different requirements.
- pre-networking. Before networking is available, all distfiles must be pre-loaded onto the system.
- pre-SSL. Once networking is available, we immediately build curl, so we have the option (if --external-sources is off) to download sources within the bootstrapped system. However, we cannot access HTTPS sites at this point, as we don't have SSL support. Therefore, all distfiles in this stage must be available over HTTP (non-SSL).
- post-SSL. At this point, we have curl with SSL support, so we can get distfiles over HTTPS.
And we are currently using two ways to get distfiles:
- HTTP
- HTTPS
Note: some distfiles are effectively an endpoint running on-demand, or serving a cached, git archive.
Here are some "de-facto" rules we have been using:
- HTTP and HTTPS are allowed for pre-networking and post-SSL stages.
- Only HTTP is allowed for pre-SSL stages.
- there is a non-#bootstrappable/bootstrappable.world source available for each distfile.
- this has proved particularly challenging in the pre-SSL stage, where there are often few HTTP-only sites available, and for Git snapshots, which are quite unreliable (currently, Gnulib is a problem)
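The scheme rules above can be sketched as a small check. This is only an illustration: the stage names used as dictionary keys and the function name `url_allowed` are hypothetical and are not taken from live-bootstrap itself.

```python
from urllib.parse import urlsplit

# Hypothetical stage names; the allowed schemes follow the de-facto rules above.
ALLOWED_SCHEMES = {
    "pre-networking": {"http", "https"},
    "pre-ssl": {"http"},
    "post-ssl": {"http", "https"},
}


def url_allowed(stage, url):
    """Return True if the URL's scheme is permitted for the given stage."""
    return urlsplit(url).scheme.lower() in ALLOWED_SCHEMES[stage]
```

For example, `url_allowed("pre-ssl", "https://example.org/foo.tar.gz")` is rejected, while the same URL is accepted for the other two stages.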
Ideas/Questions/Proposals
Proposal: Do not require an HTTP-only, non-#bootstrappable source for each distfile in the pre-SSL stage.
Currently: We need an upstream source, or a mirror, or archive.org, hosted on HTTP, for every distfile in the pre-SSL stage.
Suggestion: Host them ourselves on an HTTP-enabled server. This is OK, because each file will have the same checksum as the upstream copy anyway. Furthermore, once SSL is available, it is easy to check that the file from upstream also matches the checksum.
Problems:
- We control both the checksum and the distfile, so malicious changes could be easily slipped in.
- Mitigation: It is easy to check that the distfile is equivalent, using checksums.
- Mitigation: See proposal below regarding mirror network.
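As a rough sketch of the checksum mitigation: anyone can hash the self-hosted copy and the upstream copy and compare both against the recorded checksum. SHA-256 is assumed here, and the function names `sha256_of` and `matches_checksum` are made up for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 16):
    """Hash a file in chunks so large distfiles need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's SHA-256 against the recorded hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

If both the self-hosted and the upstream download pass `matches_checksum` with the same recorded digest, the two files are byte-for-byte equivalent for practical purposes.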
Proposal: Create git snapshots ourselves using git archive and distribute them ourselves, instead of using Git snapshots from cgit/gitweb/GitHub/similar.
Currently: If we need a particular Git commit, we download a snapshot of it from something like cgit, gitweb or GitHub. These tend to be unreliable or just randomly disappear (see Gnulib). Further, no-one is checking that the files are the same in the Git repository as they are in the generated snapshot.
Suggestion: git archive snapshots are created in a scripted manner, and distributed by us. Also, investigate building Git in the bootstrap process, so that we can just git clone directly.
Problems:
- We control the distfile, so malicious changes could be easily slipped in.
- Mitigation: Create it using a script, so anyone can validate the work, as git archive is reproducible.
- Mitigation: If --external-sources is used, git clone the repository instead and create the tarball as a part of rootfs.py.
- Mitigation: See proposal below regarding mirror network.
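A scripted snapshot could look something like the following. It is a sketch under assumptions: the function names are hypothetical, and an uncompressed tar is requested on the assumption that this avoids any dependence on the local compressor version.

```python
import os
import subprocess


def archive_command(repo_dir, commit, name, out_path):
    """Build the git archive invocation for one pinned commit."""
    return [
        "git", "-C", repo_dir, "archive",
        "--format=tar",          # uncompressed, see the assumption above
        f"--prefix={name}/",     # top-level directory inside the tarball
        "-o", os.path.abspath(out_path),  # absolute, since -C changes directory
        commit,
    ]


def create_snapshot(repo_dir, commit, name, out_path):
    """Run git archive; raises CalledProcessError if git fails."""
    subprocess.run(archive_command(repo_dir, commit, name, out_path), check=True)
```

Because the command is fully determined by the repository, the commit, and the prefix, anyone can rerun it against their own clone and compare the checksum of the result with the distributed tarball.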
Proposal: Begin a mirror network.
Currently: We use nearly exclusively upstream sources for distfiles.
Suggestion: Pull (somewhat?) randomly from a global mirror network for distfiles, each controlled by different people. Each mirror would not mirror a #bootstrappable controlled server, but would mirror upstream files. For the previous Git proposal, each mirror would generate its own git archive snapshots. This makes it nearly impossible for a single internal bad actor to manage to both change a distfile and its related checksum within live-bootstrap.
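One possible shape for the mirror selection, purely as a sketch: shuffle the mirror list, try each mirror in turn, and accept only a download whose checksum matches the recorded one. The function name `fetch_from_mirrors`, the injected `fetch` callable, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import random


def fetch_from_mirrors(mirrors, filename, expected_sha256, fetch, rng=random):
    """Try mirrors in random order; return (url, data) for the first good copy.

    `fetch` is any callable taking a URL and returning bytes. It is injected
    so the selection logic stays independent of the download tool in use.
    """
    candidates = list(mirrors)
    rng.shuffle(candidates)
    for base in candidates:
        url = base.rstrip("/") + "/" + filename
        try:
            data = fetch(url)
        except OSError:
            continue  # mirror unreachable, try the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return url, data
        # Checksum mismatch: treat this mirror's copy as untrusted and move on.
    raise RuntimeError(f"no mirror served a valid copy of {filename}")
```

With this shape, a tampered copy on one mirror is simply skipped, and an attacker would have to alter the recorded checksum as well for a modified distfile to be accepted.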
Questions:
- How do we bootstrap the (ever-changing) mirror list?
- Suppose that for a particular distfile, an upstream source is sufficient (e.g. we are in the post-SSL stage, and are downloading an HTTPS-hosted distfile). Do we prefer the upstream source, or mirrors?
- Benefits of upstream source: Trust? Consistency? Puts less load on the mirror network?
- Benefits of mirrors: Puts less load on the upstream source?