-
I like this idea of "Input Role". In my opinion, it can be classified as metadata. In general, putting such metadata into the Input Manifest is not recommended, because it creates a slightly different Input Manifest file and so changes the associated bom_id of the generated artifact (the bom_id is usually embedded in the generated artifact). Instead, I suggest creating a separate database of artifact metadata, keyed by artifact identifier, whose values can hold multiple metadata fields: file_path, build_cmd, and this "Input Role". In Bomsh, this metadata database is implemented as a JSON database keyed by the gitoid of each artifact file (and its associated bom_id if the bom_id is not embedded). If the associated bom_id is not embedded in the generated artifact, then we can tag the Input Manifest with the "Input Role". In fact, we can then modify the Input Manifest at any time to create a new Input Manifest, and the change only propagates to its parents further up the graph. This also means that many variants of bom_ids can coexist for the same build, differing only in the level of metadata detail they carry. To compare these variants, we must "normalize" them first. So although users may like such flexibility, it also creates a multitude of "equivalent/similar" bom_ids, which may unnecessarily complicate OmniBOR adoption and deployment.
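As a rough illustration of the side-database idea above, here is a minimal in-memory sketch. It is not Bomsh's actual schema: the type names `ArtifactMetadata` and `MetadataDb`, the method names, and the use of a plain string as the artifact identifier are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Metadata kept outside the Input Manifest, so that adding or changing it
/// does not alter the manifest bytes (and therefore not the bom_id).
/// Field names mirror the ones mentioned in the comment; the struct itself
/// is hypothetical.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ArtifactMetadata {
    pub file_path: Option<String>,
    pub build_cmd: Option<String>,
    pub input_role: Option<String>,
}

/// Side database keyed by artifact identifier (for example a gitoid string).
#[derive(Debug, Default)]
pub struct MetadataDb {
    entries: HashMap<String, ArtifactMetadata>,
}

impl MetadataDb {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record the input role for an artifact, creating the entry if needed.
    pub fn set_role(&mut self, artifact_id: &str, role: &str) {
        self.entries
            .entry(artifact_id.to_string())
            .or_default()
            .input_role = Some(role.to_string());
    }

    /// Look up whatever metadata is known for an artifact.
    pub fn get(&self, artifact_id: &str) -> Option<&ArtifactMetadata> {
        self.entries.get(artifact_id)
    }
}
```

Because the role lives only in this map, tagging or re-tagging an artifact never touches the manifest that is hashed.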
-
Edits:
-
I like this idea, but a few questions come to mind:
Some examples
-
An example of non-authoritative inline metadata would be in the class of 'user-asserted data that is passed through the build tool'.
I'm pretty sure that's not something any build tool can determine authoritatively in most cases. If you will indulge me for this example in presuming this is set by a flag like --license=${SPDX License Identifier}, that's information that the build tool is 'non-authoritative' with regard to. I can also tell you, sadly, that the assertion of license metadata can be very error-prone (i.e., high variance).
-
Thinking about this it occurred to me that more clarity about your suggestion might be helpful. As I read it, and I may be misreading it, it sounds like you are suggesting that a 'record' or 'line' in the Manifest go from:
to
Am I reading this correctly?
-
One thing that I'm uncertain about is whether the metadata in question is about explaining the relationship between artifact inputs or about encoding the provenance for the use of those inputs to produce the terminal artifact. For the relationship, I imagine something similar to how relationships are encoded in SPDX, where there's essentially an enumeration of relationship types, and the encoding only indicates which of those enumeration items apply. For provenance, I imagine something similar to SLSA provenance attestations, in which case my question is what benefit we get from doing the attestations within OmniBOR rather than tying SLSA attestations into OmniBOR.
-
To clarify something that came up in the meeting - I'm seeing this "irrelevant input stripping" process as happening very late - something that the consumer of the OmniBOR information might do to make their own use case more efficient. The producers should tend towards including as much information as generally makes sense, since while it's possible to strip unwanted things out, it's not possible to add back missing things. (Though I guess this mechanism would allow someone to embellish the graph with additional information if it's possible to derive it later.)
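To make the consumer-side stripping concrete, here is a small sketch of what such a late filtering step might look like. The `TaggedInput` struct, the `strip_roles` function, and the role strings in the usage note are hypothetical; they assume the consumer has already parsed the manifest into records that carry an optional role.

```rust
/// One input record as a consumer might hold it after parsing a manifest.
/// The `role` field is hypothetical; untagged inputs have `None`.
#[derive(Debug, Clone, PartialEq)]
pub struct TaggedInput {
    pub oid: String,
    pub role: Option<String>,
}

/// Return a copy of `inputs` without the entries whose role the consumer
/// considers irrelevant. Untagged inputs are always kept, since nothing is
/// known about them. The producer's original manifest is left untouched.
pub fn strip_roles(inputs: &[TaggedInput], irrelevant: &[&str]) -> Vec<TaggedInput> {
    inputs
        .iter()
        .filter(|input| match input.role.as_deref() {
            Some(role) => !irrelevant.contains(&role),
            None => true,
        })
        .cloned()
        .collect()
}
```

A consumer who does not care about toolchain detail could call `strip_roles(&inputs, &["build-tool"])` and work with the reduced list, while the full manifest stays available to everyone else.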
-
Okay, having read through the discussion, I am now trying to tease out some ideas for myself. I think there are two ways I could imagine this "input roles" question; one makes sense to me, one may not, and I want to make sure I have the ideas right.
First, I could imagine input roles as an enumeration of possible relationship types, akin to package relationships in the SPDX standard, namely the items in table 68 of version 2.3 of the standard, seen here: https://spdx.github.io/spdx-spec/v2.3/relationships-between-SPDX-elements/. Not that we would use this exact list, but rather it would essentially be an enumeration which could be represented in roughly the following Rust code:

```rust
enum InputRole {
    SourceFileOf,
    GeneratorBasisOf,
    // ... more roles,
    Other(AsciiString), // where ASCII string is a string in ASCII encoding.
}
```

This could then be encoded in an input manifest with canonical string representations. This has the benefit of involving little to no domain knowledge, and being valuable largely for filtering for users who want to apply some convergence-oriented checking on top of OmniBOR's maximal-divergence design.
The second option is the metadata encoding which has been the focus of the discussion above. This seems to be about encoding things like compilation information or other provenance metadata, and more generally seems philosophically about binding provenance information directly into the input manifests, rather than having it encoded in separate provenance attestations which may reference the Artifact Identifiers. Do I have that right? I worry that encoding this information may lose out on the canonical / reproducible properties which OmniBOR otherwise strives for; at the very least I think there's a tension worth teasing out more about divergence vs. convergence here.
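One way to picture the canonical string representations for the enumeration option is a round-trip between the enum and a fixed spelling per variant. The sketch below is only an illustration: the strings `source-file-of`, `generator-basis-of`, and the `other:` prefix are made up for the example, and a plain `String` with an ASCII check stands in for an ASCII-only string type.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum InputRole {
    SourceFileOf,
    GeneratorBasisOf,
    /// Free-form role; checked to be non-empty ASCII when parsed.
    Other(String),
}

impl fmt::Display for InputRole {
    /// Write the (hypothetical) canonical spelling of the role.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InputRole::SourceFileOf => f.write_str("source-file-of"),
            InputRole::GeneratorBasisOf => f.write_str("generator-basis-of"),
            InputRole::Other(name) => write!(f, "other:{}", name),
        }
    }
}

impl FromStr for InputRole {
    type Err = String;

    /// Parse a canonical spelling back into a role, rejecting anything else.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "source-file-of" => Ok(InputRole::SourceFileOf),
            "generator-basis-of" => Ok(InputRole::GeneratorBasisOf),
            _ => match s.strip_prefix("other:") {
                Some(name) if !name.is_empty() && name.is_ascii() => {
                    Ok(InputRole::Other(name.to_string()))
                }
                _ => Err(format!("unrecognized input role: {}", s)),
            },
        }
    }
}
```

Keeping exactly one spelling per variant matters here: if two producers could write the same role differently, the manifests (and so their identifiers) would diverge for no semantic reason.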
-
@jsgf I have been thinking a bit about what I perceive to be your use cases. I think at root your use case is being able to 'filter' the tree to create a new tree that is more 'convergent'. And you are suggesting this could also be used to add more data about things like the toolchain.
Filtering the tree for convergence
For those just catching up, I highly recommend #42 to better understand 'convergence' vs 'divergence'. Filtering is always going to be somewhat bespoke. What should trigger divergence will be highly situational. Your proposal appears to be to allow adding, at the time the Input Manifest is generated, contextual 'hints' per input artifact that can be used for after-the-fact filtering. I think you are suggesting filtering by omitting zero or more lines from the Input Manifest. Could you provide an example where such omission is the desired result? If I contemplate your protobuf file whitespace example, it's unclear to me how I would use 'role' plus deletion of a line from an Input Manifest to decrease the divergence without also losing the information that the protobuf file was part of the input.
-
Summarizing today's discussion: folks generally agreed that a strong correlation between input artifacts, output manifest, output artifact, and build metadata is desirable (e.g., perhaps through a consistent naming scheme), but also that the metadata identifier should not be included in the output artifact or the output manifest.
-
Right now, pretty diagrams notwithstanding, you don't actually know what any of the artifacts in a graph are. In principle you could use the gitoid to look each one up and inspect it, then make an inference about why it appears in an input manifest, but the manifest itself isn't going to give you any help.
So what if we added that to the input manifest? What if we tagged each input with some indication of its role? I see several benefits from this:
How would these identifiers work? I'm thinking that they would be gitoids themselves, and we would have a set of "well known" oids for common cases, but allow arbitrary oids for any use-case dependent types. (For the sake of example, we could have something like the oid for the literal string `source file` as generic compiler inputs for example.)
Toolchain metadata
Related to this, there's been the ongoing discussion about whether to encode things like build tools and their invocation flags in the graph itself, or whether to have some kind of adjacent metadata store which isn't included in the hash. The former means that (possibly) irrelevant detail gets hashed, and the latter means that the metadata is only weakly associated.
I think this proposal gives us a middle ground - the details can be included in the hash, but it's also possible to strip them out if they're not relevant. For example, say we have an additional input in the input manifest:
where `<toolchain-oid>` might just be the oid of the string `build-tool` (for the sake of discussion). `<invocation-details-oid>` is a bit more interesting - what bytes is it the oid of?
- `c-compiler` - ie, generically this was built by some C compiler
- `c-compiler-ISO/IEC 9899:1990` - we really want to emphasize this is standard C code
- `gcc`/`clang` - some specific compiler
- `clang -O3 -Iincludes/ -fdetailed-optimization-flag ...` - very specific invocation details
(Obviously flat strings are not great for including structured data, but if we were to use something like json it also means a discussion about how to normalize etc, so let's leave that aside.)
The key point here is that you can either use general "well known" strings to give consumers a general sense of what you've done, or you can include an arbitrary amount of detail as needed.
Other kinds of input roles
What springs to mind:
Outputs?
We've been discussing how to handle "external input manifests", i.e., not embedding the manifest ID in the object itself. Since that embedding is the only strong link between the output and the manifest, if we're separating them we could list the output as a "related artifact" (i.e., generalize beyond just "inputs").
Actual concrete details of how it would fit into existing formats
Glossed over until we've considered the general idea. The main thing is that input tagging should apply to both leaf and derived inputs.
(But if you need something concrete to keep in mind, let's assume for the sake of discussion that it's `: <oid>` appended to the existing lines.)
It's also unclear to me whether we should have a notion of hierarchy: e.g. `c-compiler` is more general than `clang`/`gcc`, which in turn is more general than a full command line. I guess we could just keep all the details as separate "inputs" so they can be selectively removed.
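For readers who want the "`: <oid>` appended to the existing lines" placeholder in executable form, here is a small sketch. It deliberately treats the existing line as an opaque string and only assumes that the role oid is a non-empty hex string following a final `": "`; the function names and the sample line shape in the usage note are illustrative, not part of any OmniBOR format.

```rust
/// Append a role oid to an existing manifest line, using the ": <oid>"
/// placeholder syntax from the post.
pub fn append_role(line: &str, role_oid: &str) -> String {
    format!("{}: {}", line, role_oid)
}

/// Split a manifest line into its original content and an optional role oid.
/// A trailing ": <hex>" is treated as the role tag; anything else is
/// returned unchanged with `None`.
pub fn split_role(line: &str) -> (&str, Option<&str>) {
    if let Some((head, tail)) = line.rsplit_once(": ") {
        if !tail.is_empty() && tail.chars().all(|c| c.is_ascii_hexdigit()) {
            return (head, Some(tail));
        }
    }
    (line, None)
}
```

With a made-up line such as `blob 0123abcd`, `append_role` yields `blob 0123abcd: beef42`, and `split_role` recovers both halves, so a consumer can drop the tag and get back the untagged line byte for byte.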