RFC: Source distribution #485
Note that I still plan to eliminate the "pre-SSL, post-networking" stage, and switch to exclusively HTTPS downloads - my expectation is still that ISPs won't allow non-SSL traffic to pass through their networks for too long. Expect random RSTs injected into plain HTTP streams, or straight up blocking port 80, similar to how almost all ISPs block port 25 inbound, and many also outbound.
Also, we need to support a mode where rootfs.py gathers a copy of all files locally, and then spawns its own server for the bootstrap machine to download from. This is so that the bootstrap machine can be isolated from the Internet, and not get exposed to packets sent by untrusted sources, which might try to exploit some kernel-level vulnerability to compromise the bootstrap.
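A minimal sketch of the "gather locally, then serve" mode described above, using only the Python standard library. The function name `serve_distfiles` and the idea of pointing it at a local distfiles directory are assumptions for illustration; this is not how rootfs.py is actually structured.

```python
import functools
import http.server
import threading
from pathlib import Path


def serve_distfiles(directory: Path, host: str = "127.0.0.1", port: int = 0):
    """Serve `directory` over plain HTTP in a background thread.

    Hypothetical helper: `directory` is assumed to already contain every
    distfile, downloaded and checksummed on the host beforehand.
    Passing port=0 lets the OS pick a free port; read it back from
    `server.server_address[1]`.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    # Daemon thread so the server does not keep the process alive on exit.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In a real setup the host address would be the interface facing the bootstrap machine, so that the bootstrap machine only ever talks to this one server.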
What is the benefit of this over, say, an
I guess from mirrors, so that there is less strain on upstream. It's all checksummed anyway. But this whole distributed mirror network sounds a lot like reinventing the DHT from BitTorrent. Hence the question arises, can you reuse that?
Hmm, I am not familiar with that, but it seems promising! More research required...
If we can make
My opinions, as a bootstrap enthusiast but someone who hasn't yet contributed all that much:
Mirrors are necessary
Upstream sources, particularly for older software, are not reliable. Every now and then I try turning off substitutes in my Nix and Guix configs, and I am always disappointed by files that have gone 404 without anyone noticing, or have changed hash, or whatever else. I don't think there's any reproducible bootstrap without us mirroring upstream sources.
HTTPS is inevitable
I agree with @Googulator that HTTPS is inevitable. I think we are in that world already, and it didn't even need ISP shenanigans: I just reported a bootstrap failure to #477 where
Seed files are suspect
How do we know that we've put a true copy of
This creates a tension: @Googulator proposes to minimise the HTTP-but-not-HTTPS phase of the bootstrap. This probably means shifting more source tarballs to pre-networking seed files. In addition to the hashing considerations above, I worry that relying too heavily on HTTPS introduces time bombs into the bootstrap process. If cipher suites change and the ones we can easily bootstrap into are deprecated and removed, the bootstrap is sunk. This is not theoretical: Guix has a similar issue where it currently cannot complete a bootstrap without substituters because of certificate time bombs in
As for @fosslinux's proposals:
Re: "Do not require a HTTP-only, non-#bootstrappable source"
I think this is regrettable, but necessary. Historic mirrors are simply too unreliable. Whatever scripts use to acquire sources should cross-check against upstream if available.
Re: "Create git snapshots ourselves using
Yes, for the same git version. (https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/) (git/git@4f4be00)
This is the easiest solution I think, but my greatest concern is increasing your aforementioned "risk of the #bootstrappable project becoming a single point of compromise (of itself)"
We already checksum every single tarball, in a bootstrapped fashion. (For more details, see parts.rst, but we have builder-hex0 + stage0-posix, and part of stage0-posix is mescc-tools-extra, which contains a checksumming program. Thus from that point onward we are the same in that regard.)
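For readers following along, the check being described amounts to comparing a file's digest against a pinned value. A rough host-side sketch, assuming SHA-256 digests and hypothetical helper names (`sha256_of`, `verify_distfile`); inside the bootstrap itself this job is done by the bootstrapped checksumming program, not by Python:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_distfile(path: Path, expected_sha256: str) -> bool:
    """True if the file matches the pinned digest (case-insensitive hex)."""
    return sha256_of(path) == expected_sha256.lower()
```

Because the expected digest is pinned in the repository, it does not matter which server the bytes came from, only that they hash to the recorded value.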
Yeah, agreed on both these counts. To clarify my thoughts a bit more on the second point;
That should be enough to get started as a way to snapshot upstream. It would be ideal to have a tool that tried extremely hard to be deterministic here, but that seems fine to defer until future work.
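As a starting point for scripted snapshots, something like the following could work. It is only a sketch: the helper names, the choice of an uncompressed tar, and the prefix layout are assumptions, and (per the note above) byte-identical output is only expected for the same git version.

```python
import subprocess
from pathlib import Path
from typing import List


def git_archive_command(repo: Path, commit: str, prefix: str, output: Path) -> List[str]:
    """Build the `git archive` invocation for one snapshot.

    `commit` should be a full commit hash rather than a branch name, so the
    snapshot cannot silently change. `output` should be an absolute path,
    because `git -C` changes directory before writing it.
    """
    return [
        "git", "-C", str(repo), "archive",
        "--format=tar",            # uncompressed, so no compressor is involved
        f"--prefix={prefix}/",     # top-level directory inside the tarball
        "-o", str(output),
        commit,
    ]


def snapshot(repo: Path, commit: str, prefix: str, output: Path) -> None:
    """Run `git archive`; raises CalledProcessError if git fails."""
    subprocess.run(git_archive_command(repo, commit, prefix, output), check=True)
```

The resulting tarball would then be checksummed like any other distfile, and each mirror could regenerate it independently and compare digests.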
Very cool. It seems pretty hard to slip something in before
Agree with these points, but it would be cool to strengthen the second: users could have the ability to fetch from mirrors and cross-check against (ideally) upstream or (if not) other mirrors. Enabling this by default would completely defeat the purpose of a mirror network (since every user would hit upstream), but you could request a random fraction of each file from another source to ensure they remained in sync.
Yeah, I would like to use mirror network more too (assuming we don't trust stuff from it before checksumming in some way). Perhaps a configurable option but I would prefer mirrors to be default.
Ok, fair points. We can make mirrors default. Another point of discussion: I didn't really think of this when it came up originally, but now seems like a good time to revisit it. For packages such as I'm not totally sure the tradeoff is worth it. Obviously, recompression changes the checksum, and it is a layer of indirection from upstream that is pretty unverifiable. I'd be for going back to upstream tarballs for those. At minimum, I would want our mirror network to do the recompression, rather than blindly trusting fedora/slackware there.
Unless you have very hard space requirements, I'd say that storage and RAM are both cheap enough to not worry. If you were bumping up against the addressing limits of 32-bit machines or something then we'd need to think more carefully but we already know that the bootstrap doesn't run on 2GB machines. Matching upstream is so much more important. Even if you do have stringent storage requirements, you'd need a deterministic and possibly bootstrappable compression program to crunch things down. RAM requirements would be more affected by uncompressed package sizes, anyway?
@Googulator wanted to be able to fit initial bootstrap sources into 256 MiB (basically the largest existing chips that can be programmed manually without software)... But yeah, it should probably be a separate patch on top of live-bootstrap for doing something like that...
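To make the mirror-by-default idea concrete, here is one possible shape for it. Everything in this sketch is hypothetical: the `Fetcher` callable, the function names, and the reading of "random fraction" as a random sample of files (it could equally mean byte ranges). Nothing from a mirror is trusted until its SHA-256 matches the pinned value.

```python
import hashlib
import math
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical transport: takes a URL, returns the body, raises OSError on
# failure (urllib.error.URLError is a subclass of OSError).
Fetcher = Callable[[str], bytes]


def fetch_verified(filename: str, expected_sha256: str, mirrors: Sequence[str],
                   fetch: Fetcher, rng: Optional[random.Random] = None) -> bytes:
    """Try mirrors in random order; return the first body matching the digest."""
    rng = rng or random.Random()
    candidates = list(mirrors)
    rng.shuffle(candidates)
    for base in candidates:
        try:
            data = fetch(f"{base.rstrip('/')}/{filename}")
        except OSError:
            continue  # unreachable mirror: move on to the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        # Digest mismatch: ignore this copy and keep trying other mirrors.
    raise RuntimeError(f"no mirror served a valid copy of {filename}")


def sample_for_cross_check(filenames: Sequence[str], fraction: float,
                           rng: Optional[random.Random] = None) -> List[str]:
    """Pick a random subset of files to re-fetch from a second source."""
    rng = rng or random.Random()
    count = min(len(filenames), math.ceil(fraction * len(filenames)))
    return rng.sample(list(filenames), count)
```

With a small `fraction`, upstream only sees a trickle of requests per user, while divergence between mirrors and upstream would still tend to be noticed across many users.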
I think we need to reconsider our model for the distribution of input tarballs/distfiles into live-bootstrap.
State of play
We have three "distinct"-ish sections of the bootstrap in this area, each of which has been treated with somewhat different requirements.
`--external-sources` is off), to download sources within the bootstrapped system. However, we cannot access HTTPS sites at this point, as we don't have SSL support. Therefore, all distfiles in this stage must be available over HTTP (non-SSL).
And we are currently using two ways to get distfiles:
Note: some distfiles are effectively an endpoint running on-demand, or serving a cached, `git archive`.
Here are some "de-facto" rules we have been using:
Ideas/Questions/Proposals
Proposal: Do not require a HTTP-only, non-#bootstrappable source for each distfile in the pre-SSL stage.
Currently: We need an upstream source, or a mirror, or archive.org, hosted on HTTP, for every distfile in the pre-SSL stage.
Suggestion: Host them ourselves on an HTTP-enabled server. This is OK, because the file will have the same checksum as upstream anyway. Furthermore, once SSL is available, it is easy to check that the file from upstream also matches the checksum.
Problems:
Proposal: Create git snapshots ourselves using `git archive` and distribute them ourselves, instead of using Git snapshots from cgit/gitweb/GitHub/similar.
Currently: If we need a particular Git commit, we download a snapshot of it from something like cgit, gitweb or GitHub. These tend to be unreliable or just randomly disappear (see Gnulib). Further, no-one is checking that the files are the same in the Git repository as they are in the generated snapshot.
Suggestion: `git archive`s are created in a scripted manner, and distributed by us. Also, investigate building Git in the bootstrap process, then we can just `git clone` directly.
Problems:
`git archive` is reproducible.
`--external-sources` is used, `git clone` the repository instead and create the tarball as a part of `rootfs.py`.
.Proposal: Begin a mirror network.
Currently: We use nearly exclusively upstream sources for distfiles.
Suggestion: Pull (somewhat?) randomly from a global mirror network for distfiles, with each mirror controlled by different people. Each mirror would not mirror a #bootstrappable-controlled server, but would mirror upstream files. For the previous Git proposal, each mirror would generate its own `git archive` snapshots. This makes it nearly impossible for a single internal bad actor to manage to both change a distfile and its related checksum within live-bootstrap.
Questions: