
RFC: Source distribution #485

Open
fosslinux opened this issue Dec 18, 2024 · 14 comments

@fosslinux
Owner

I think we need to reconsider our model for the distribution of input tarballs/distfiles into live-bootstrap.

State of play

We have three "distinct"-ish sections of the bootstrap in this area, each of which has been treated with somewhat different requirements.

  • pre-networking. Before networking is available, all distfiles must be pre-loaded onto the system.
  • pre-SSL. Once networking is available, we immediately build curl, so we have the option (if --external-sources is off), to download sources within the bootstrapped system. However, we cannot access HTTPS sites at this point, as we don't have SSL support. Therefore, all distfiles in this stage must be available over HTTP (non-SSL).
  • post-SSL. At this point, we have curl with SSL support, so we can get distfiles over HTTPS.

And we are currently using two ways to get distfiles:

  • HTTP
  • HTTPS

Note: some distfiles are effectively an endpoint running git archive on demand, or serving a cached git archive.

Here are some "de-facto" rules we have been using:

  • HTTP and HTTPS are allowed for the pre-networking and post-SSL stages.
  • Only HTTP is allowed for the pre-SSL stage.
  • there is a non-#bootstrappable/bootstrappable.world source available for each distfile.
    • this has proved particularly challenging in the pre-SSL stage, where there are often few HTTP-only sites available, and for Git snapshots, which are quite unreliable (currently, Gnulib is a problem)

Ideas/Questions/Proposals

Proposal: Do not require an HTTP-only, non-#bootstrappable source for each distfile in the pre-SSL stage.

Currently: We need an upstream source, or a mirror, or archive.org, hosted on HTTP, for every distfile in the pre-SSL stage.
Suggestion: Host them ourselves on an HTTP-enabled server. This is OK because it will have the same checksum as the upstream anyway. Furthermore, once SSL is available, it is easy to check that the file from upstream also matches the checksum.

Problems:

  • We control both the checksum and the distfile, so malicious changes could be easily slipped in.
    • Mitigation: It is easy to check that the distfile is equivalent, using checksums.
    • Mitigation: See proposal below regarding mirror network.
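The checksum mitigation above can be sketched as a small helper. This is a sketch only; the function names and chunk size are illustrative, not live-bootstrap code:

```python
# Sketch of the mitigation above: a mirror-hosted distfile is acceptable
# iff it hashes to the checksum already pinned in live-bootstrap, and the
# same check can be re-run against upstream once SSL is available.
import hashlib

def sha256_stream(fileobj, chunk_size=65536):
    """SHA-256 of a binary stream, read in chunks to bound memory use."""
    h = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

def matches_pinned(fileobj, pinned_hex):
    """True iff the stream's digest equals the recorded checksum."""
    return sha256_stream(fileobj) == pinned_hex
```

The same check works whether the bytes come from a #bootstrappable HTTP server (pre-SSL) or from upstream over HTTPS (post-SSL); only the transport differs.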

Proposal: Create git snapshots ourselves using git archive and distribute them ourselves, instead of using Git snapshots from cgit/gitweb/GitHub/similar.

Currently: If we need a particular Git commit, we download a snapshot of it from something like cgit, gitweb or GitHub. These tend to be unreliable or just randomly disappear (see Gnulib). Further, no-one is checking that the files are the same in the Git repository as they are in the generated snapshot.
Suggestion: git archives are created in a scripted manner, and distributed by us. Also, investigate building Git in the bootstrap process, then we can just git clone directly.

Problems:

  • We control the distfile, so malicious changes could be easily slipped in.
    • Mitigation: Create it using a script, so anyone can validate the work, as git archive is reproducible.
    • Mitigation: If --external-sources is used, git clone the repository instead and create the tarball as a part of rootfs.py.
    • Mitigation: See proposal below regarding mirror network.
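A minimal sketch of the scripted-snapshot idea, assuming a plain git archive invocation (the repo path, prefix, and helper name are hypothetical). Uncompressed tar is used because the gzip layer's output has changed between git releases:

```python
# Sketch: produce a snapshot tarball of a pinned commit with `git archive`.
# For a given git version, the tar output is byte-for-byte reproducible,
# so independent parties can regenerate the archive and cross-check it.
import subprocess

def make_snapshot(repo_dir, commit, prefix, out_path):
    """Write a <prefix>/-rooted tar of `commit` from the repo at repo_dir."""
    with open(out_path, "wb") as out:
        subprocess.run(
            ["git", "-C", repo_dir, "archive",
             "--format=tar", f"--prefix={prefix}/", commit],
            stdout=out, check=True)
```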

Proposal: Begin a mirror network.

Currently: We use nearly exclusively upstream sources for distfiles.
Suggestion: Pull (somewhat?) randomly from a global mirror network for distfiles, each mirror controlled by different people. Each mirror would not mirror a #bootstrappable-controlled server, but would mirror upstream files. For the previous Git proposal, each mirror would generate its own git archive snapshots. This makes it nearly impossible for a single internal bad actor to both change a distfile and its related checksum within live-bootstrap.

Questions:

  • How do we bootstrap the (ever-changing) mirror list?
  • Suppose that for a particular distfile, an upstream source is sufficient (e.g. we are in the post-SSL stage, and are downloading a HTTPS-hosted distfile). Do we prefer the upstream source, or mirrors?
    • Benefits of upstream source: Trust? Consistency? Puts less load on the mirror network?
    • Benefits of mirrors: Puts less load on the upstream source?
@Googulator
Collaborator

Note that I still plan to eliminate the "pre-SSL, post-networking" stage and switch to exclusively HTTPS downloads - my expectation is still that ISPs won't keep allowing non-SSL traffic through their networks for much longer. Expect random RSTs injected into plain HTTP streams, or port 80 being blocked outright, similar to how almost all ISPs block port 25 inbound, and many outbound as well.

@Googulator
Collaborator

Also, we need to support a mode where rootfs.py gathers a copy of all files locally, and then spawns its own server for the bootstrap machine to download from. This is so that the bootstrap machine can be isolated from the Internet, and not get exposed to packets sent by untrusted sources, which might try to exploit some kernel-level vulnerability to compromise the bootstrap.

@fosslinux
Owner Author

Also, we need to support a mode where rootfs.py gathers a copy of all files locally, and then spawns its own server for the bootstrap machine to download from. This is so that the bootstrap machine can be isolated from the Internet, and not get exposed to packets sent by untrusted sources, which might try to exploit some kernel-level vulnerability to compromise the bootstrap.

What is the benefit of this over, say, an --external-sources mode? Let's suppose that we had support for splitting the set of distfiles across multiple disks if that was a problem.

@stikonas
Collaborator

* How do we bootstrap the (ever-changing) mirror list?

* Suppose that for a particular distfile, an upstream source is sufficient (e.g. we are in the post-SSL stage, and are downloading a HTTPS-hosted distfile). Do we prefer the upstream source, or mirrors?

I guess from mirrors, so that there's less strain on upstream. It's all checksummed anyway.

But this whole distributed mirror network sounds a lot like reinventing the DHT from BitTorrent. Hence the question arises: can you reuse that?

@fosslinux
Owner Author

But this whole distributed mirror network sounds a lot like reinventing DHT from Bittorrent. Hence the question arises, can you reuse that?

Hmm, I am not familiar with that, but it seems promising! More research required...

@Googulator
Collaborator

What is the benefit of this over, say, an --external-sources mode? Let's suppose that we had support for splitting the set of distfiles across multiple disks if that was a problem.

If we can make --external-sources reliable in terms of picking the correct disk for running the bootstrap on, vs. reading sources from (or even do it on a single disk with partitioning), then it may be OK too.

@endgame

endgame commented Dec 22, 2024

My opinions, as a bootstrap enthusiast but someone who hasn't yet contributed all that much:

Mirrors are necessary

Upstream sources, particularly for older software, are not reliable. Every now and then I try turning off substitutes in my Nix and Guix configs, and I am always disappointed by files that have gone 404 without anyone noticing, or have changed hash, or whatever else. I don't think there's any reproducible bootstrap without us mirroring upstream sources.

HTTPS is inevitable

I agree with @Googulator that HTTPS is inevitable. I think we are in that world already, and it didn't even need ISP shenanigans: I just reported a bootstrap failure to #477 where libtool-2.4.7 failed to download under Linux because the request was unconditionally redirected to HTTPS before we had a curl that could handle it.

Seed files are suspect

How do we know that we've put a true copy of builder-hex0 or any other seed file onto the boot device, given that it's come from an untrusted machine? Ideally, we should be building a checksumming program very early in the bootstrap, using it to check all the files used in the bootstrap thus far, and then using it to check any files from the root before we touch them.

This creates a tension: @Googulator proposes to minimise the HTTP-but-not-HTTPS phase of the bootstrap. This probably means shifting more source tarballs to pre-networking seed files. In addition to the hashing considerations above, I worry that relying too heavily on HTTPS introduces time bombs into the bootstrap process. If cipher suites change and the ones we can easily bootstrap into are deprecated and removed, the bootstrap is sunk. This is not theoretical: Guix has a similar issue where it currently cannot complete a bootstrap without substituters because of certificate time bombs in openssl-1.1.1l.


As for @fosslinux's proposals:

Re: "Do not require a HTTP-only, non-#bootstrappable source"

I think this is regrettable, but necessary. Historic mirrors are simply too unreliable. Whatever scripts we use to acquire sources should cross-check against upstream if available.

Re: "Create git snapshots ourselves using git archive"

Is git archive deterministic? (As in, does git archive from a checkout of a given ref always result in byte-for-byte identical output?) It may be necessary to extract objects from a git snapshot in a particular order so that the archives are built the same way each time, and two people on different machines (and possibly even OSes) can build the same source snapshot with the same checksum.

This heuristic (requiring a deterministic process from a repo snapshot) also looks applicable to other DVCSes.

Re: "Begin a mirror network"

As above: regrettable but likely necessary. We reduce the risk of the #bootstrappable project becoming a single point of compromise (of itself) if we encourage and support the various upstreams to participate in existing mirror networks, instead of a #bootstrappable network of mirrors. Many ISPs, universities, etc, mirror FOSS projects.

If upstreams are not sympathetic, then taking on the mirroring ourselves will be necessary.


On Bootstrapping (ha) the Mirror List

Could we just keep a file up-to-date in a git repo? If the download script fetched 1/N of each file from N mirrors (using byte-range fetches, and choosing the N mirrors from the list at random), and the resultant file passed the checksum check, you have a good guarantee that the mirror is faithfully replicating the content.
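The 1/N byte-range idea could look roughly like this (a sketch, assuming mirrors honour HTTP Range requests; the mirror URLs, function names, and default of 4 mirrors are all made up):

```python
# Sketch of the 1/N fetch: split the file into n contiguous byte ranges,
# pull each range from a different randomly chosen mirror, then verify
# the reassembled file against the pinned checksum. Any mirror serving
# bad bytes makes the whole-file checksum fail.
import hashlib
import random
import urllib.request

def split_ranges(total_size, n):
    """Split [0, total_size) into n contiguous (start, end) pairs,
    with inclusive ends as HTTP Range headers expect."""
    step = total_size // n
    return [(i * step, total_size - 1 if i == n - 1 else (i + 1) * step - 1)
            for i in range(n)]

def fetch_split(mirrors, path, total_size, pinned_sha256, n=4):
    parts = []
    for (start, end), base in zip(split_ranges(total_size, n),
                                  random.sample(mirrors, n)):
        req = urllib.request.Request(
            base + path, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            parts.append(resp.read())
    data = b"".join(parts)
    if hashlib.sha256(data).hexdigest() != pinned_sha256:
        raise ValueError("mirror served bytes that break the checksum")
    return data
```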

On BitTorrent

It seems like BitTorrent will do a lot of what we want: it would let people join and leave the mirroring of bootstrap seeds, as well as let them find other sources for their bootstrap seeds. One obvious concern: we'd have to think about how often to re-issue the .torrent file.

BEP 39 (BEP = BitTorrent Enhancement Proposal) provides support for updatable torrents, which might help seeders keep up to date with the latest versions of the bootstrap seeds.

There are also two BEPs that allow torrent clients to use HTTP sources if other peers are not enough:

  • BEP 17 ("HTTP Seeding") documents an HTTP endpoint for requesting individual pieces of a torrent; and

  • BEP 19 ("Web Seeds") documents how to encode within a .torrent file that its contents are available over HTTP. Note that this essentially requires a web seed to contain the entire torrent's contents as a directory on the server; it won't let us define a torrent that identifies distinct HTTP servers for individual files.

It doesn't appear that you could use existing BT clients to verify that what's on the HTTP seeds matches what's in the swarm. That's something you'd have to build yourself.

If we do provide our own HTTP mirrors (which we probably should, even if we can get upstreams interested in mirroring), coming back later and publishing a torrent with web seeds should not be difficult. It could even be regenerated whenever there's a "major release" of the bootstrap, so that seeders are kept up-to-date.
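Verifying what an HTTP seed serves against the swarm's expectations amounts to re-hashing its bytes with the per-piece SHA-1 digests from the .torrent's info dictionary. A sketch of that check, assuming the torrent metadata has already been parsed (nothing here is an existing client API):

```python
# Sketch: check bytes from a web seed against BitTorrent-style piece
# hashes. A .torrent's "pieces" field is a concatenation of 20-byte
# SHA-1 digests, one per fixed-size piece; the final piece may be short.
import hashlib

def piece_hashes(pieces_field):
    """Split the raw 'pieces' bytes into a list of 20-byte digests."""
    return [pieces_field[i:i + 20] for i in range(0, len(pieces_field), 20)]

def verify_against_pieces(data, piece_length, digests):
    """True iff `data` hashes to the expected digest for every piece."""
    for i, expected in enumerate(digests):
        piece = data[i * piece_length:(i + 1) * piece_length]
        if hashlib.sha1(piece).digest() != expected:
            return False
    return True
```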

@fosslinux
Owner Author

fosslinux commented Dec 23, 2024

Is git archive deterministic? (As in, does git archive from a checkout of a given ref always result in byte-for-byte identical output?) It may be necessary to extract objects from a git snapshot in a particular order so that the archives are built the same way each time, and two people on different machines (and possibly even OSes) can build the same source snapshot with the same checksum.

Yes, for the same git version. (https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/) (git/git@4f4be00)

Could we just keep a file up-to-date in a git repo? If the download script fetched 1/N of each file from N mirrors (using byte-range fetches, and choosing the N mirrors from the list at random), and the resultant file passed the checksum check, you have a good guarantee that the mirror is faithfully replicating the content.

This is the easiest solution, I think, but my greatest concern is that it increases your aforementioned "risk of the #bootstrappable project becoming a single point of compromise (of itself)".

@fosslinux
Owner Author

How do we know that we've put a true copy of builder-hex0 or any other seed file onto the boot device, given that it's come from an untrusted machine? Ideally, we should be building a checksumming program very early in the bootstrap, using it to check all the files used in the bootstrap thus far, and then using it to check any files from the root before we touch them.

We already checksum every single tarball, in a bootstrapped fashion. (For more details, see parts.rst; in short, we have builder-hex0 + stage0-posix, and part of stage0-posix is mescc-tools-extra, which contains a checksumming program. Thus from that point onward we are safe in that regard.)

We reduce the risk of the #bootstrappable project becoming a single point of compromise (of itself) if we encourage and support the various upstreams to participate in existing mirror networks, instead of a #bootstrappable network of mirrors. Many ISPs, universities, etc, mirror FOSS projects.

Whatever scripts we use to acquire sources should cross-check against upstream if available.

Yeah, agreed on both these counts. To clarify my thoughts a bit more on the second point;

  • for mirrors acquiring sources: they should only get their sources from upstreams, not from other mirrors
  • for end users of live-bootstrap acquiring sources: generally, default to upstream, and use mirrors where it is infeasible/impossible
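The end-user policy in the second bullet could be expressed as an ordered list of sources gated by the pinned checksum. A sketch only; the callable-based interface is illustrative, not a real live-bootstrap API:

```python
# Sketch: try sources in preference order (upstream first, mirrors as
# fallback), returning the first result matching the pinned checksum.
# Unreachable sources and checksum mismatches both fall through to the
# next source.
import hashlib

def fetch_verified(sources, pinned_sha256):
    """`sources` is an ordered list of callables, each returning bytes."""
    for fetch in sources:
        try:
            data = fetch()
        except OSError:
            continue  # source unreachable; try the next one
        if hashlib.sha256(data).hexdigest() == pinned_sha256:
            return data
    raise RuntimeError("no source produced a distfile matching the checksum")
```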

@endgame

endgame commented Dec 23, 2024

Is git archive deterministic?

Yes, for the same git version.

That should be enough to get started as a way to snapshot upstream. It would be ideal to have a tool that tried extremely hard to be deterministic here, but that seems fine to defer until future work.

We already checksum every single tarball, in a bootstrapped fashion.

Very cool. It seems pretty hard to slip something in before mescc-tools-extra.

  • for mirrors acquiring sources: they should only get their sources from upstreams, not from other mirrors

  • for end users of live-bootstrap acquiring sources: generally, default to upstream, and use mirrors where it is infeasible/impossible

Agree with these points, but it would be cool to strengthen the second: users could have the ability to fetch from mirrors and cross-check against (ideally) upstream or (if not) other mirrors. Enabling this by default would completely defeat the purpose of a mirror network (since every user would hit upstream), but you could request a random fraction of each file from another source to ensure they remained in sync.
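The "random fraction" cross-check could be as simple as comparing one random byte range from two sources (a sketch; the callables and sample size are illustrative):

```python
# Sketch: fetch the same random byte range from two sources and confirm
# they agree. A cheap way for a mirror-default client to keep sources
# honest without downloading whole files from upstream.
import random

def spot_check(primary, secondary, size, sample_len=1024):
    """Each source is a callable taking (start, end) and returning bytes;
    `size` is the total file length in bytes."""
    start = random.randrange(max(1, size - sample_len + 1))
    end = min(size, start + sample_len) - 1
    return primary(start, end) == secondary(start, end)
```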

@stikonas
Collaborator

Yeah, I would like to use the mirror network more too (assuming we don't trust stuff from it before checksumming in some way). Perhaps make it a configurable option, but I would prefer mirrors to be the default.

@fosslinux
Owner Author

fosslinux commented Dec 24, 2024

Ok, fair points. We can make mirrors default.

Another point of discussion: I didn't really think of this when it came up originally, but now seems like a good time to revisit it. For packages such as bash-2.05b and bc-1.07.1, we have been using what appear to me to be recompressed upstream tarballs distributed by third parties, such as Fedora and Slackware, to save disk space in the early bootstrap.

I'm not totally sure the tradeoff is worth it. Obviously, recompression changes the checksum, and it adds a layer of indirection from upstream that is pretty much unverifiable.

I'd be for going back to upstream tarballs for those. At minimum, I would want our mirror network to do the recompression, rather than blindly trusting Fedora/Slackware there.

@endgame

endgame commented Dec 24, 2024

Unless you have very hard space requirements, I'd say that storage and RAM are both cheap enough not to worry. If you were bumping up against the addressing limits of 32-bit machines or something, then we'd need to think more carefully, but we already know that the bootstrap doesn't run on 2 GB machines. Matching upstream is so much more important. Even if you do have stringent storage requirements, you'd need a deterministic and possibly bootstrappable compression program to crunch things down. RAM requirements would be more affected by uncompressed package sizes, anyway?

@stikonas
Collaborator

Unless you have very hard space requirements, I'd say that storage and RAM are both cheap enough not to worry. If you were bumping up against the addressing limits of 32-bit machines or something, then we'd need to think more carefully, but we already know that the bootstrap doesn't run on 2 GB machines. Matching upstream is so much more important. Even if you do have stringent storage requirements, you'd need a deterministic and possibly bootstrappable compression program to crunch things down. RAM requirements would be more affected by uncompressed package sizes, anyway?

@Googulator wanted to be able to fit initial bootstrap sources into 256 MiB (basically the largest existing chips that can be programmed manually without software)... But yeah, it should probably be a separate patch on top of live-bootstrap for doing something like that...

@fosslinux fosslinux mentioned this issue Jan 4, 2025