Commit d8ef825

chore: docs improvements
Signed-off-by: Andrew Lilley Brinker <[email protected]>
1 parent 02220bd commit d8ef825

6 files changed: +157 -191 lines changed

omnibor/src/artifact_id.rs

Lines changed: 98 additions & 3 deletions
@@ -30,10 +30,105 @@ use crate::hash_algorithm::Sha256;
 
 /// A universally reproducible software identifier.
 ///
-/// This is a content-based unique identifier for any software artifact.
+/// An Artifact ID is a Git Object Identifier (GitOID), with only a type of
+/// "blob," with SHA-256 as the hash function, and with unconditional newline
+/// normalization.
 ///
-/// It is built around, per the specification, any supported hash algorithm.
-/// Currently, only SHA-256 is supported, but others may be added in the future.
+/// If that explanation makes sense, then congrats, that's all you need to know!
+///
+/// Otherwise, to explain in more detail:
+///
+/// ## The GitOID Construction
+///
+/// The Git Version Control System identifies all objects checked into a
+/// repository by calculating a Git Object Identifier. This identifier is based
+/// around a hash function and what we'll call the "GitOID Construction" that
+/// determines what gets input into the hash function.
+///
+/// In the GitOID Construction, you first hash in a prefix string of the form:
+///
+/// ```ignore,custom
+/// <object_type> <size_of_input_in_bytes>\0
+/// ```
+///
+/// The `<object_type>` can be `blob`, `commit`, `tag`, or `tree`. The last
+/// three are used for commits, tags, and directories respectively; `blob` is
+/// used for files.
+///
+/// The `<size_of_input_in_bytes>` is what it sounds like; Git calculates the
+/// size of an input file and includes that in the hash.
+///
+/// After hashing in the prefix string, Git then hashes the contents of the
+/// file being identified. That's the GitOID Construction! Artifact IDs use
+/// this same construction, with the `<object_type>` always set to `blob`.
+///
+/// ## Choice of Hash Function
+///
+/// We also restrict the hash function to _only_ SHA-256 today,
+/// though the specification leaves open the possibility of transitioning to
+/// an alternative in the future if SHA-256 is cryptographically broken.
+///
+/// This is a difference from Git's default today. Git normally uses SHA-1,
+/// and is in the slow process of transitioning to SHA-256. So why not use
+/// SHA-1 to match Git's current default?
+///
+/// First, it's worth saying that Git can use SHA-1 _or_ a variant of SHA-1
+/// called "SHA-1CD" (sometimes spelled "SHA-1DC"). Back in 2017, researchers
+/// from Google and CWI Amsterdam announced the "SHAttered" attack against
+/// SHA-1, where they had successfully engineered a collision (two different
+/// documents which produced the same SHA-1 hash). The SHA-1CD algorithm was
+/// developed in response. It's a variant of SHA-1 which attempts to detect
+/// when the input is attempting to produce a collision like the one in the
+/// SHAttered attack, and on detection modifies the hashing algorithm to
+/// produce a different hash and stop that collision.
+///
+/// Different versions of Git will use either SHA-1 or SHA-1CD by default. This
+/// means that for Artifact IDs we had three options for the hash algorithm:
+/// SHA-1, SHA-1CD, or SHA-256.
+///
+/// The split of SHA-1 and SHA-1CD doesn't matter for most Git users, since
+/// a single repository will just use one or the other and most files will
+/// not trigger the collision detection code path that causes their outputs to
+/// diverge. For Artifact IDs though, it's a problem, since we care strongly
+/// about our IDs being universally reproducible. Thus, the split creates a
+/// challenge for our potential use of SHA-1.
+///
+/// Additionally, it's worth noting that attacks against SHA-1 continue to
+/// become more practical as computing hardware improves. In October 2024
+/// NIST, the National Institute of Standards and Technology in the United
+/// States, published an initial draft of a document "Transitioning the Use of
+/// Cryptographic Algorithms and Key Lengths." While it is not yet an official
+/// NIST recommendation, it does explicitly disallow the use of SHA-1 for
+/// digital signature generation, considers its use for digital signature
+/// verification to be a "legacy use" requiring special approval, and otherwise
+/// prepares to sunset any use of SHA-1 by 2030.
+///
+/// NIST is not a regulatory agency, but their recommendations _are_ generally
+/// incorporated into policies both in government and in private industry, and
+/// a NIST recommendation to fully transition away from SHA-1 is something we
+/// think should be taken seriously.
+///
+/// For all of the above reasons, we opted to base Artifact IDs on SHA-256,
+/// rather than SHA-1 or SHA-1CD.
+///
+/// ## Unconditional Newline Normalization
+///
+/// The final requirement of note is the unconditional newline normalization
+/// performed for Artifact IDs. This is a feature that Git offers which is
+/// configurable, permitting users of Git to decide whether checked-out files
+/// should have newlines converted to the ones for their current platform, and
+/// whether the checked-in copies should have _their_ newlines converted.
+///
+/// For our case, we care that users of Artifact IDs can produce the same ID
+/// regardless of what platform they're on. To ensure this, we always normalize
+/// newlines from `\r\n` to `\n` (CRLF to LF / Windows to Unix). We perform
+/// this regardless of the _type_ of input file, whether it's a binary or text
+/// file. Since we aren't storing files, only identifying them, we don't have
+/// to exempt binaries from newline normalization.
+///
+/// So that's it! Artifact IDs are Git Object Identifiers made with the `blob`
+/// type, SHA-256 as the hash algorithm, and unconditional newline
+/// normalization.
 pub struct ArtifactId<H: HashAlgorithm> {
     #[doc(hidden)]
     gitoid: GitOid<H, Blob>,
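
The construction described in the new doc comment can be sketched in a few lines of Rust. This is an illustration only, not the crate's implementation: the function names `normalize_newlines` and `gitoid_blob_preimage` are hypothetical, the SHA-256 step is left out so the example needs nothing beyond the standard library, and taking the size after normalization (rather than before) is an assumption of this sketch. The sketch builds the exact byte stream that would be fed to the hash function.

```rust
/// Replace every CRLF pair with a single LF.
/// Applied to any input, text or binary; a lone `\r` is left untouched.
fn normalize_newlines(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if input[i] == b'\r' && i + 1 < input.len() && input[i + 1] == b'\n' {
            // Collapse the two-byte CRLF sequence into one LF byte.
            out.push(b'\n');
            i += 2;
        } else {
            out.push(input[i]);
            i += 1;
        }
    }
    out
}

/// Build the bytes that the hash function would consume:
/// the prefix `blob <size>\0` followed by the normalized content.
/// Assumption: `<size>` is the length of the content after normalization.
fn gitoid_blob_preimage(content: &[u8]) -> Vec<u8> {
    let normalized = normalize_newlines(content);
    let mut preimage = format!("blob {}\0", normalized.len()).into_bytes();
    preimage.extend_from_slice(&normalized);
    preimage
}
```

Under these assumptions, `gitoid_blob_preimage(b"hello\r\n")` and `gitoid_blob_preimage(b"hello\n")` yield the same bytes, so hashing either with SHA-256 would give the same identifier regardless of platform line endings.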
