@@ -30,10 +30,105 @@ use crate::hash_algorithm::Sha256;
3030
3131/// A universally reproducible software identifier.
3232///
33- /// This is a content-based unique identifier for any software artifact.
33+ /// An Artifact ID is a Git Object Identifier (GitOID) restricted to the
34+ /// `blob` object type, using SHA-256 as the hash function, and applying
35+ /// unconditional newline normalization.
3436///
35- /// It is built around, per the specification, any supported hash algorithm.
36- /// Currently, only SHA-256 is supported, but others may be added in the future.
37+ /// If that explanation makes sense, then congrats, that's all you need to know!
38+ ///
39+ /// Otherwise, to explain in more detail:
40+ ///
41+ /// ## The GitOID Construction
42+ ///
43+ /// The Git Version Control System identifies all objects checked into a
44+ /// repository by calculating a Git Object Identifier. This identifier is based
45+ /// around a hash function and what we'll call the "GitOID Construction" that
46+ /// determines what gets input into the hash function.
47+ ///
48+ /// In the GitOID Construction, you first hash in a prefix string of the form:
49+ ///
50+ /// ```ignore,custom
51+ /// <object_type> <size_of_input_in_bytes>\0
52+ /// ```
53+ ///
54+ /// The `<object_type>` can be `blob`, `commit`, `tag`, or `tree`. The last
55+ /// three are used for commits, tags, and directories respectively; `blob` is
56+ /// used for files.
57+ ///
58+ /// The `<size_of_input_in_bytes>` is the length of the input in bytes,
59+ /// written as a decimal number; Git includes it in the hash.
60+ ///
61+ /// After hashing in the prefix string, Git then hashes the contents of the
62+ /// file being identified. That's the GitOID Construction! Artifact IDs use
63+ /// this same construction, with the `<object_type>` always set to `blob`.
64+ ///
65+ /// ## Choice of Hash Function
66+ ///
67+ /// We also restrict the hash function to _only_ SHA-256 today,
68+ /// though the specification leaves open the possibility of transitioning to
69+ /// an alternative in the future if SHA-256 is cryptographically broken.
70+ ///
71+ /// This is a difference from Git's default today. Git normally uses SHA-1,
72+ /// and is in the slow process of transitioning to SHA-256. So why not use
73+ /// SHA-1 to match Git's current default?
74+ ///
75+ /// First, it's worth saying that Git can use SHA-1 _or_ a variant of SHA-1
76+ /// called "SHA-1CD" (sometimes spelled "SHA-1DC"). Back in 2017, researchers
77+ /// from Google and CWI Amsterdam announced the "SHAttered" attack against
78+ /// SHA-1, where they had successfully engineered a collision (two different
79+ /// documents which produced the same SHA-1 hash). The SHA-1CD algorithm was
80+ /// developed in response. It's a variant of SHA-1 which tries to detect
81+ /// when the input is attempting to produce a collision like the one in the
82+ /// SHAttered attack, and on detection modifies the hashing algorithm to
83+ /// produce a different hash and stop that collision.
84+ ///
85+ /// Different versions of Git will use either SHA-1 or SHA-1CD by default. This
86+ /// means that for Artifact IDs we had three hash algorithms to choose
87+ /// from: SHA-1, SHA-1CD, or SHA-256.
88+ ///
89+ /// The split of SHA-1 and SHA-1CD doesn't matter for most Git users, since
90+ /// a single repository will just use one or the other and most files will
91+ /// not trigger the collision detection code path that causes their outputs to
92+ /// diverge. For Artifact IDs though, it's a problem, since we care strongly
93+ /// about our IDs being universally reproducible. Thus, the split creates a
94+ /// challenge for our potential use of SHA-1.
95+ ///
96+ /// Additionally, it's worth noting that attacks against SHA-1 continue to
97+ /// become more practical as computing hardware improves. In October 2024
98+ /// NIST, the National Institute of Standards and Technology in the United
99+ /// States, published an initial draft of a document "Transitioning the Use of
100+ /// Cryptographic Algorithms and Key Lengths." While it is not yet an official
101+ /// NIST recommendation, it does explicitly disallow the use of SHA-1 for
102+ /// digital signature generation, considers its use for digital signature
103+ /// verification to be a "legacy use" requiring special approval, and otherwise
104+ /// prepares to sunset any use of SHA-1 by 2030.
105+ ///
106+ /// NIST is not a regulatory agency, but their recommendations _are_ generally
107+ /// incorporated into policies both in government and in private industry, and
108+ /// a NIST recommendation to fully transition away from SHA-1 is something we
109+ /// think should be taken seriously.
110+ ///
111+ /// For all of the above reasons, we opted to base Artifact IDs on SHA-256,
112+ /// rather than SHA-1 or SHA-1CD.
113+ ///
114+ /// ## Unconditional Newline Normalization
115+ ///
116+ /// The final requirement of note is the unconditional newline normalization
117+ /// performed for Artifact IDs. This is a feature that Git offers which is
118+ /// configurable, permitting users of Git to decide whether checked-out files
119+ /// should have newlines converted to the ones for their current platform, and
120+ /// whether the checked-in copies should have _their_ newlines converted.
121+ ///
122+ /// For our case, we care that users of Artifact IDs can produce the same ID
123+ /// regardless of what platform they're on. To ensure this, we always normalize
124+ /// newlines from `\r\n` to `\n` (CRLF to LF / Windows to Unix). We perform
125+ /// this regardless of the _type_ of input file, whether it's a binary or text
126+ /// file. Since we aren't storing files, only identifying them, we don't have
127+ /// to worry about corrupting binary files by normalizing their newlines.
128+ ///
129+ /// So that's it! Artifact IDs are Git Object Identifiers made with the `blob`
130+ /// type, SHA-256 as the hash algorithm, and unconditional newline
131+ /// normalization.
37132pub struct ArtifactId < H : HashAlgorithm > {
38133 #[ doc( hidden) ]
39134 gitoid : GitOid < H , Blob > ,
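
The construction described in the doc comment above can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's implementation: `normalize_newlines` and `gitoid_blob_preimage` are hypothetical names, and the sketch assumes the size in the prefix is computed after newline normalization, which the doc comment does not state explicitly. It only builds the bytes that would be fed to SHA-256; the hashing step itself is left out because the Rust standard library has no SHA-256 implementation.

```rust
/// Hypothetical helper: replace every CRLF pair with a single LF.
/// A lone `\r` that is not followed by `\n` is left untouched.
fn normalize_newlines(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if input[i] == b'\r' && input.get(i + 1) == Some(&b'\n') {
            // Drop the `\r`; the following `\n` is copied on the next pass.
            i += 1;
            continue;
        }
        out.push(input[i]);
        i += 1;
    }
    out
}

/// Hypothetical helper: build the byte sequence that gets hashed.
/// Assumption: the size in the prefix is the length of the normalized content.
fn gitoid_blob_preimage(content: &[u8]) -> Vec<u8> {
    let body = normalize_newlines(content);
    // Prefix: object type, a space, the decimal byte length, then a NUL byte.
    let mut preimage = format!("blob {}\0", body.len()).into_bytes();
    preimage.extend_from_slice(&body);
    preimage
}
```

Under these assumptions, `gitoid_blob_preimage(b"a\r\nb")` and `gitoid_blob_preimage(b"a\nb")` both produce the bytes `blob 3\0a\nb`, so hashing either one with SHA-256 would yield the same identifier regardless of platform.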