- Plain old filesystem (probably won’t scale with the level of caching granularity we might want to aspire to)
-
- The sled author recommends comparing against PingCAP’s TitanDB for the large blob workload:
  > one DB that might be interesting to benchmark for your workload against sled is PingCAP’s titandb, which separates keys from values, and I think they might move values even less over time, so that could be one of the better options
  >
  > they use the WiscKey DB architecture of keeping keys in an LSM but values out of it so they can avoid moving values as often. this is the same approach used in the Badger DB for the Go ecosystem that is nice in some situations
  >
  > — Tyler Neely (@spacejam)
- TiKV for distributed stores would be interesting, though probably excessive (there’s not as much chance of conflict in a largely content‐addressed database)
- Investigate which of these options would need an accompanying large blob storage system (certainly SQLite) and then don’t bother using those options because that would be way too much work. The sled author’s remarks on the topic:
  > the general reason why most DBs will suggest punting to a FS for that kind of thing is because over time, databases will tend to rearrange items internally in order to perform defragmentation etc… sled will actually just spill over to using a file itself, but as I’m writing this it will only split leaf nodes when they hit 17 children, so if you have 17 1gb values all clustered on one leaf, sled will need to read that whole 17gb file in before serving your data
  >
  > having said that, it isn’t that much work to change how sled splits leaf nodes to increase this granularity, so each 1gb value would get its own file
  >
  > DBs usually assume they can copy values over and over pretty cheaply, and for most DB’s this is indeed sort of a nightmare workload
  >
  > but right now, sled only splits leafs when they exceed the 16 child limit. in general I want to make this more size based, but as of right now it’s up to 16 values per node, and this stuff happens per-node rather than per-key
  >
  > anyway, the bottom line is, measure it 🙂
  >
  > as the creator of the thing I’ll always be aware of ways of making it better, but maybe for you it’s already good enough
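In the spirit of “measure it”: a rough sketch (not actual project code) of the smallest version of that benchmark, storing a blob in sled keyed by its BLAKE3 hash and reading it back. It assumes the `sled` and `blake3` crates; the paths are placeholders, and the interesting part is timing this with multi‐gigabyte values to see the leaf‐splitting behaviour described above.

```rust
// Sketch of a content-addressed blob round-trip through sled, for benchmarking only.
// Assumes the `sled` and `blake3` crates; the paths here are placeholders.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let db = sled::open("/tmp/blob-bench")?; // throwaway database for the experiment
    let blob = std::fs::read("./file")?;     // ideally something in the gigabyte range

    // Content-address the blob: the key is its 32-byte BLAKE3 hash.
    let key = blake3::hash(&blob);
    db.insert(key.as_bytes(), blob.as_slice())?;

    // Reading it back is the part worth timing: with large values clustered on one
    // leaf, sled may need to pull in the whole leaf before serving a single value.
    let fetched = db.get(key.as_bytes())?.expect("just inserted");
    assert_eq!(blake3::hash(&fetched), key);
    Ok(())
}
```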
- 0x00/uu:…
  - A universally unique random identifier.
- 0x01/b3:…
  - A BLAKE3 hash.
- …
  - Extensible! Let’s hope we never need more than 255 hash functions.
Hash/UUID payload is 31 bytes so that the reference as a whole is 256 bits.
TODO: Come up with a name for uu:… identifiers that doesn’t clash with standard 128 bit UUIDs.
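A minimal sketch of that layout, purely for illustration: the type and method names below are invented, and drawing exactly 31 bytes from BLAKE3’s XOF output is just one possible way to make the whole reference 32 bytes (truncating the default 32‐byte hash would be another).

```rust
// Illustration of the 1-byte tag + 31-byte payload reference layout (names invented).
use rand::Rng;

const PAYLOAD_LEN: usize = 31;

#[derive(Clone, Copy, PartialEq, Eq)]
struct Ref {
    tag: u8,                    // 0x00 = uu:…, 0x01 = b3:…, the rest reserved
    payload: [u8; PAYLOAD_LEN], // hash or random identifier
}

impl Ref {
    // 0x01 / b3:… — BLAKE3 is an XOF, so asking for exactly 31 bytes of output is
    // one way to keep the whole reference at 256 bits.
    fn blake3_of(data: &[u8]) -> Ref {
        let mut payload = [0u8; PAYLOAD_LEN];
        blake3::Hasher::new()
            .update(data)
            .finalize_xof()
            .fill(&mut payload);
        Ref { tag: 0x01, payload }
    }

    // 0x00 / uu:… — a universally unique random identifier.
    fn random() -> Ref {
        let mut payload = [0u8; PAYLOAD_LEN];
        rand::thread_rng().fill(&mut payload[..]);
        Ref { tag: 0x00, payload }
    }

    // The full 32-byte (256-bit) reference: tag byte followed by the payload.
    fn to_bytes(self) -> [u8; 32] {
        let mut out = [0u8; 32];
        out[0] = self.tag;
        out[1..].copy_from_slice(&self.payload);
        out
    }
}
```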
# Hash a constant
$ mew ref 1234
b3:…

# Import and hash a file
$ mew ref -f ./file
b3:…

# Hash a file without importing into the forest
$ mew ref -n -f ./file
b3:…

# Download and import a file
$ mew ref -f https://example.com/example-1.0.tar.gz
b3:…

# Generate a unique random identifier
$ mew ref -u
uu:…
gRPC and Cap’n Proto are the main contenders here. Maybe figure out if gRPC could be used with Cap’n Proto payloads, or hand‐roll something based on Cap’n Proto + QUIC/HTTP 3.
Somewhere between a purely functional shell and Dhall; implementation tightly integrated with the build store interface.
Ensuring adherence to object‐capability principles is Very Important™.
It’d be nice to specify the language grammar as a hybrid parser‐pretty printer. I’ve always wanted to do that.
I’d like to think about how to reconcile Roslyn‐style 1:1 concrete‐abstract syntax mapping that preserves comments and whitespace with Wadler‐style pretty‐printing, which to me feels much more structured than just twiddling the whitespace fields according to a large library of rules.
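To make the Roslyn half of that concrete, here is a small illustrative sketch (invented names, no real parser library) of a token that carries its surrounding trivia: the concrete text round‐trips byte‐for‐byte, while a Wadler‐style printer would only consult the token text and re‐derive layout.

```rust
// Illustrative only: a token that keeps leading/trailing trivia (whitespace, comments)
// so the original source can be reproduced exactly.
#[derive(Debug, Clone)]
struct Trivia(String); // a run of whitespace or a comment, kept verbatim

#[derive(Debug, Clone)]
struct Token {
    leading: Vec<Trivia>,  // trivia appearing before the token text
    text: String,          // the token text itself, e.g. an identifier or keyword
    trailing: Vec<Trivia>, // trivia up to the end of the line
}

impl Token {
    // Lossless view: exactly what the user wrote, trivia included.
    fn to_source(&self) -> String {
        let mut out = String::new();
        for t in &self.leading {
            out.push_str(&t.0);
        }
        out.push_str(&self.text);
        for t in &self.trailing {
            out.push_str(&t.0);
        }
        out
    }
}

fn main() {
    let token = Token {
        leading: vec![Trivia("  ".into()), Trivia("# keep this comment\n".into())],
        text: "hello".into(),
        trailing: vec![Trivia(" ".into())],
    };
    print!("{}", token.to_source()); // concrete syntax, round-tripped
    println!("{}", token.text);      // what a pretty-printer would actually consult
}
```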
TreeSpec : Type
Dir : Map PathComponent TreeSpec → TreeSpec
Blob : TreeSpec
Executable : TreeSpec
Tree : TreeSpec → Type
get : (p : PathComponent) → ∀ts ⇒ Tree (Dir ts) → (e : p ∈ ts) ⇒ Tree (value e)
get? : PathComponent → ∀ts ⇒ Tree (Dir ts) → Option (∃t · Tree t)
execute : Tree Executable → …
execute? : ∀t ⇒ Tree t → Option …
— guaranteed to contain bin/hello, which you can execute (including
— from build rules), and share/doc/hello/README, a plain blob, but can
— also contain arbitrary other trees in addition
hello-tree : Tree (Dir {
"bin" = Dir {"hello" = Executable},
"share" = Dir {"doc" = Dir {"hello" = Dir {"README" = Blob}}},
})
— hello-tree ▻ get "bin" : Tree (Dir {"hello" = Executable})
— execute (hello-tree ▻ get "bin" ▻ get "hello") : …
Could probably use row types for this:
— essentially mapping from identifiers to the specified type
Row : Type → Type
⦃
bin : Dir ⦃hello : Executable⦄,
share : Dir ⦃doc : Dir ⦃hello : Dir ⦃README : Blob⦄⦄⦄,
⦄ : Row TreeSpec
Then we can get nicer bin : … syntax by punning on the fact that rows would also be used to specify record types.
Adapton seems like it should have insights that are applicable to modelling Ansible/Terraform‐style reconciliation of configuration with state in a pure system.
Shiz’s LLVM + clang + LLD + elftoolchain + compiler-rt + libc++ Linux toolchain work is probably worth referencing.
This will probably be really annoying in the early stages of hacking, depending on the latency.
It would be good to integrate the bors setup with git-test to test all the commits of a pull request rather than just the HEAD.
See above, though tapping a YubiKey a few times when pushing to the public repository isn’t too bad.
GitHub supports ICE, and it would be nice to have the root of trust for binary builds under our direct control.
This would require manually administering build and VCS machines, prevent the use of the existing bors implementation, and substantially increase the barrier to contribution, so it should be done carefully: ideally people would still be able to contribute via GitHub issues and pull requests and have them automatically mirrored to the self‐hosted infrastructure.