Tagging what inputs are for ("Input roles"?) #43

jsgf · 2023-02-23T20:28:27Z

jsgf
Feb 23, 2023
Maintainer

Right now, pretty diagrams notwithstanding, you don't actually know what any of the artifacts in a graph are. In principle you could use the gitoid to look each one up and inspect it, then make an inference about why it appears in an input manifest, but the manifest itself isn't going to give you any help.

So what if we added that to the input manifest? What if we tagged each input with some indication of its role? I see several benefits from this:

You could work out the role of each input in the action, which would potentially let you make better use of the graph on its own without needing to refer to external data.
If two actions used the same inputs in different ways, you'd get distinct input manifests for them.
You could rewrite a manifest to exclude inputs which aren't relevant to your analysis. For example, you could strip out inputs which aren't relevant to your use-case in order to increase convergence.

How would these identifiers work? I'm thinking that they would be gitoids themselves, and we would have a set of "well known" oids for common cases, but allow arbitrary oids for any use-case dependent types. (For the sake of example, the oid of the literal string "source file" could serve as the tag for generic compiler inputs.)
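
As a rough illustration of what "the oid of a literal string" covers: a gitoid is computed the way git hashes a blob, over a header `blob <length>\0` followed by the content bytes. The sketch below only builds those bytes for a role tag. The SHA-1 or SHA-256 step is left out because Rust's standard library does not provide one, and the helper name `gitoid_preimage` is a hypothetical name for illustration, not part of any spec.

```rust
/// Build the byte sequence that would be hashed to obtain the gitoid of
/// `content`, following git's blob convention: "blob <len>\0<content>".
/// Hypothetical helper; the hashing step itself is omitted.
fn gitoid_preimage(content: &[u8]) -> Vec<u8> {
    // Header: object type, a space, the decimal length, then a NUL byte.
    let mut bytes = format!("blob {}\0", content.len()).into_bytes();
    bytes.extend_from_slice(content);
    bytes
}
```

Two producers tagging an input with the same literal string would arrive at the same role oid without coordinating, which is what would make a "well known" list workable.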

Toolchain metadata

Related to this, there's been the ongoing discussion about whether to encode things like build tools and their invocation flags in the graph itself, or whether to have some kind of adjacent metadata store which isn't included in the hash. The former means that (possibly) irrelevant detail gets hashed, and the latter means that the metadata is only weakly associated.

I think this proposal gives us a middle ground - the details can be included in the hash, but it's also possible to strip them out if they're not relevant. For example, say we have an additional input in the input manifest:

<invocation-details-oid> <toolchain-oid>

where <toolchain-oid> might just be the oid of the string build-tool (for the sake of discussion).

<invocation-details-oid> is a bit more interesting - what bytes is it the oid of?

it could just be c-compiler - ie, generically this was built by some C compiler
c-compiler-ISO/IEC 9899:1990 - we really want to emphasize this is standard C code
gcc/clang - some specific compiler
clang -O3 -Iincludes/ -fdetailed-optimization-flag ... - very specific invocation details

(Obviously flat strings are not great for including structured data, but if we were to use something like json it also means a discussion about how to normalize etc, so let's leave that aside.)

The key point here is that you can either use general "well known" strings to give consumers a general sense of what you've done, or you can include an arbitrary amount of detail as needed.

Other kinds of input roles

What springs to mind:

generic - ie some undifferentiated input, which is equivalent to all inputs today
source files
build tool invocations (ie, as above)
prebuilt / "external" artifacts (eg prebuilt distro package libraries which don't participate in omnibor)
build rules
non-function annotation like licenses (an "input" in that it confers a specific legal status on the output)
other descriptive metadata (eg package-url)
...?

Outputs?

We've been discussing how to handle "external input manifests" - ie, not embedding the manifest ID in the object itself. Since that embedding is the only strong link between the output and the manifest, if we're separating them we could list the output as a "related artifact" (ie, generalize beyond just "inputs").

Actual concrete details of how it would fit into existing formats

Glossed over until we've considered the general idea. The main thing is that input tagging should apply to both leaf and derived inputs.

(But if you need something concrete to keep in mind, let's assume for the sake of discussion that it's : <oid> appended to the existing lines.)

It's also unclear to me whether we should have a notion of hierarchy: eg c-compiler is more general than clang/gcc, which in turn is more general than a full command line. I guess we could just keep all the details as separate "inputs" so they can be selectively removed.
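
One way to picture the "separate inputs" option: each level of detail becomes its own tagged entry, ordered from general to specific, and a consumer drops everything more specific than it cares about. This is only a minimal sketch; the `Detail` levels, the `TaggedInput` struct, the `strip_below` helper and the placeholder oid strings are all hypothetical names made up for illustration.

```rust
/// Levels of invocation detail, ordered from most general to most specific.
/// The derived `Ord` follows declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Detail {
    Language,    // e.g. "c-compiler"
    Compiler,    // e.g. "clang"
    CommandLine, // e.g. "clang -O3 -Iincludes/ ..."
}

/// One metadata entry in a manifest: how detailed it is, and the oid of its content.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TaggedInput {
    detail: Detail,
    oid: String, // placeholder; a real entry would hold a gitoid
}

/// Keep only the entries that are no more specific than `max`.
fn strip_below(inputs: &[TaggedInput], max: Detail) -> Vec<TaggedInput> {
    inputs.iter().filter(|i| i.detail <= max).cloned().collect()
}
```

Calling `strip_below(&inputs, Detail::Compiler)` would keep the "c-compiler" and "clang" entries and drop the full command line.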

yonhan3 · 2023-02-24T01:26:02Z

yonhan3
Feb 24, 2023

I like this idea of "Input Role".

In my opinion, this "Input Role" can be classified as metadata. In general, putting such metadata into the Input Manifest is not recommended, because it creates a slightly different Input Manifest file and thus changes the associated bom_id of the generated artifact (the bom_id is usually embedded in the generated artifact). Instead, the suggestion is to create a separate database of artifact metadata, keyed by artifact identifier, where multiple pieces of metadata can be stored as the associated value: file_path, build_cmd, and this "Input Role". In Bomsh's implementation, this metadata database is a JSON database, keyed by the gitoid of each artifact file (and its associated bom_id if the bom_id is not embedded).

If the associated bom_id is not embedded into the generated artifact, then we are free to tag the Input Manifest with the "Input Role". In fact, we can then modify the Input Manifest at any time to create a new Input Manifest, which only affects (propagates to) its upper parents. This also means that many variant bom_ids can co-exist for the same build. These variants just carry different levels of detail in metadata. In order to compare these variants, we must "normalize" them before the comparison. That is, although users like such flexibility, it also seems to create a multitude of "equivalent/similar" bom_ids, which may unnecessarily complicate OmniBOR adoption/deployment.

2 replies

jsgf Feb 24, 2023
Maintainer Author

Well, this proposal is explicitly in the camp of "metadata should be hashed".

More generally, since the entire structure of OmniBOR is based on gitoids computed from content hashes, I think as a general statement of principle anything which isn't hashed into a gitoid is out of the scope of the OmniBOR spec. Of course one is free to build systems around OmniBOR which associates arbitrary metadata with any given gitoid, but I don't think we need to directly consider them (other than making sure the basics are in place to support such uses).

And specifically, I think anything which is structurally important must be hashed into the gitoids in order to be significant. As I talk about in my other discussion note, Convergence and Divergence #42, the ideal is that two things hash to the same gitoid when they're logically identical and to different gitoids when they have meaningful differences - for your particular use-case.

Since it's impossible to know in a generic way what is significant or insignificant up front, this proposal gives users a way to consume and filter/re-write the input manifest structures to exclude data that they're not interested in. This allows the input manifests to be constructed with arbitrary amounts of detail which can then be discarded if not needed.

This implies that there may be multiple input manifests for a given artifact, and they may be generated after the original artifact creation. This necessarily means we'll need to support some kind of "external manifest" to allow for this.

yonhan3 Feb 25, 2023

Yes, I totally agree with this idea of "external manifest". That is, even if the bom_id of the Input Manifest is embedded in the generated artifact file, a few "external manifests" can be created for a given artifact with different details. Thus, a given artifact can be associated with multiple OmniBOR ADG IDs. These manifests or variants of bom_id's (ADG IDs) can be compared after "normalization", so we know they correspond to the same ADG.
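
The "normalization" step mentioned here could be as simple as dropping the entries whose roles a consumer ignores, then collecting what remains into an ordered set so that ordering and duplicates don't matter. A sketch under those assumptions follows; `Entry`, `normalize`, `equivalent` and the role names are made up for the example, and oids are shown as short placeholder strings.

```rust
use std::collections::BTreeSet;

/// A manifest entry reduced to (role, artifact oid). Both are placeholder strings here.
type Entry = (String, String);

/// Normalize a manifest: keep only entries whose role is in `keep`,
/// and return them as an ordered set so order and duplicates don't matter.
fn normalize(manifest: &[Entry], keep: &[&str]) -> BTreeSet<Entry> {
    manifest
        .iter()
        .filter(|(role, _)| keep.contains(&role.as_str()))
        .cloned()
        .collect()
}

/// Two manifest variants describe the same graph (for this consumer)
/// if they are equal after normalization.
fn equivalent(a: &[Entry], b: &[Entry], keep: &[&str]) -> bool {
    normalize(a, keep) == normalize(b, keep)
}
```

Two variants that differ only in a host-specific invocation entry would compare equal when `keep` lists just the source role, and unequal when the invocation role is kept as well.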

jsgf · 2023-02-24T17:56:18Z

jsgf
Feb 24, 2023
Maintainer Author

Edits:

added "generic" input type (ie, what all inputs are today)
floated idea of "output", er, associated artifact to generalize beyond just inputs, in order to support separate/external manifests.


edwarnicke · 2023-02-28T00:10:59Z

edwarnicke
Feb 28, 2023
Maintainer

Well, this proposal is explicitly in the camp of "metadata should be hashed".

I like this idea, but a few questions come to mind:

What is the provider of the metadata that is being hashed?
Is that source authoritative about that metadata?
What is the variance on that metadata?

Some examples

C compiler and our traditional hash of source/header files:
1. Provider: compiler itself
2. Authoritative: Yes - the compiler authoritatively knows what source/header files it built into the output .o
3. Variance: Low - the same build performed at different times, on different servers, by different parties would produce the same output.
C compiler internal compiler macros (like ARM_ARCH_7):
1. Provider: compiler itself
2. Authoritative: Yes
3. Variance: Low
C compiler external compiler macros (like -D DEBUG):
1. Provider: compiler itself
2. Authoritative: Yes
3. Variance: Low
C compiler -I directories:
1. Provider: compiler itself
2. Authoritative: Yes
3. Variance: High - as it will differ greatly across otherwise completely identical builds.
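
These three questions could be recorded per kind of metadata and used to decide what a consumer keeps. Below is a small sketch of that idea; the `Variance` enum, the `MetadataKind` struct and the `stable_kinds` helper are hypothetical names, and the table values simply restate the four examples above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Variance {
    Low,
    High,
}

/// Describes one kind of hashed metadata in terms of the three questions.
#[derive(Debug, Clone, PartialEq, Eq)]
struct MetadataKind {
    name: &'static str,
    provider: &'static str,
    authoritative: bool,
    variance: Variance,
}

/// The four C compiler examples, restated as data.
fn c_compiler_examples() -> Vec<MetadataKind> {
    let kind = |name: &'static str, variance: Variance| MetadataKind {
        name,
        provider: "compiler",
        authoritative: true,
        variance,
    };
    vec![
        kind("source/header files", Variance::Low),
        kind("internal macros", Variance::Low),
        kind("external macros (-D)", Variance::Low),
        kind("-I directories", Variance::High),
    ]
}

/// Names of the kinds a convergence-minded consumer might keep:
/// authoritative and low variance.
fn stable_kinds(kinds: &[MetadataKind]) -> Vec<&'static str> {
    kinds
        .iter()
        .filter(|k| k.authoritative && k.variance == Variance::Low)
        .map(|k| k.name)
        .collect()
}
```

With the examples above, `stable_kinds` would return everything except the `-I` directories.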

1 reply

jsgf Feb 28, 2023
Maintainer Author

What is the provider of the metadata that is being hashed?

My thought was that for "standard" metadata items there would be a published set of well-known gitoids to tag it, which you would use if you want maximal interop.

But if you want to define in-house or custom metadata you just make up your own designations and compute the gitoids for them. Of course it would be nice to have some namespacing scheme to avoid inadvertent convergence if two people independently choose the same tag for different meanings.

Is that source authoritative about that metadata?

All your examples are for "Authoritative: Yes". What's a non-authoritative example? Something like "build timestamp"?

What is the variance on that metadata?

Yeah, the variance will depend on what the metadata itself is. As such this proposal would allow a consumer of the (input) manifest to strip out metadata with unwanted variance if they don't need it. But I'm not sure what you're asking here precisely?

edwarnicke · 2023-03-01T14:12:48Z

edwarnicke
Mar 1, 2023
Maintainer

All your examples are for "Authoritative: Yes". What's a non-authoritative example? Something like "build timestamp"?

An example of non-authoritative inline metadata would be in the class of 'user asserted data that is passed through the build tool'.
From your example of:

non-function annotation like licenses (an "input" in that it confers a specific legal status on the output)

I'm pretty sure that's not something any build tool can determine authoritatively in most cases. If you will indulge me for this example in presuming this is set by a flag like --license=${SPDX License Identifier}, that's information that the build tool is 'non-authoritative' with regard to.

I can also tell you, sadly, that assertions of license metadata can be very error prone (ie: high variance).


edwarnicke · 2023-03-01T14:17:53Z

edwarnicke
Mar 1, 2023
Maintainer

Thinking about this it occurred to me that more clarity about your suggestion might be helpful. As I read it, and I may be misreading it, it sounds like you are suggesting that a 'record' or 'line' in the Manifest go from:

blob ${gitoid of artifact} bom ${gitoid of artifact's input manifest}

to

blob ${gitoid of artifact} bom ${gitoid of artifact's input manifest} role ${gitoid of role information}

Am I reading this correctly?
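
To make the two shapes concrete, here is a sketch of a parser that accepts either form, treating the trailing `role <oid>` as optional. The `Record` struct and `parse_record` function are hypothetical names, the oids are handled as opaque strings without validation, and the leaf-input case (no `bom` part) is left out for brevity.

```rust
/// One record of an input manifest for a derived artifact.
#[derive(Debug, PartialEq, Eq)]
struct Record {
    artifact: String,     // gitoid of the artifact
    manifest: String,     // gitoid of the artifact's input manifest
    role: Option<String>, // gitoid of the role information, if tagged
}

/// Parse `blob <oid> bom <oid>` with an optional trailing `role <oid>`.
/// Returns `None` for any other shape.
fn parse_record(line: &str) -> Option<Record> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    match fields.as_slice() {
        ["blob", artifact, "bom", manifest] => Some(Record {
            artifact: artifact.to_string(),
            manifest: manifest.to_string(),
            role: None,
        }),
        ["blob", artifact, "bom", manifest, "role", role] => Some(Record {
            artifact: artifact.to_string(),
            manifest: manifest.to_string(),
            role: Some(role.to_string()),
        }),
        _ => None,
    }
}
```

Keeping `role` optional means existing untagged manifests would still parse, which matches the "generic" role being equivalent to all inputs today.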

1 reply

jsgf Mar 1, 2023
Maintainer Author

More or less, with an analogous change for leaf input artifacts.

If we also want to extend manifests to include outputs, then we might want a more distinct syntax for that.

alilleybrinker · 2023-03-01T16:07:24Z

alilleybrinker
Mar 1, 2023
Maintainer

One thing that I'm uncertain about is whether the metadata in question is about explaining the relationship between artifact inputs or about encoding the provenance for the use of those inputs to produce the terminal artifact. For the relationship, I imagine something similar to how relationships are encoded in SPDX, where there's essentially an enumeration of types of relationships, and the encoding is only indicating which of those enumeration items apply. For provenance, I imagine something similar to SLSA provenance attestations, in which case I guess my question is what benefit we have doing the attestations within OmniBOR rather than perhaps tying SLSA attestations into OmniBOR.


jsgf · 2023-03-01T17:02:43Z

jsgf
Mar 1, 2023
Maintainer Author

To clarify something that came up in the meeting - I'm seeing this "irrelevant input stripping" process as being very late - something that the consumer of the omnibor information might do to make their own use-case more efficient.

The producers should tend towards including as much information as generally makes sense, since while it's possible to strip unwanted things out, it's not possible to add back missing things. (Though I guess this mechanism would allow someone to embellish the graph with additional information if it's possible to derive it later.)


alilleybrinker · 2023-03-06T20:22:39Z

alilleybrinker
Mar 6, 2023
Maintainer

Okay, having read through the discussion, I am now trying to tease out some ideas for myself. I think there are two ways I could imagine this "input roles" question, and I think one makes sense to me, and one may not, and I want to make sure I have the ideas right.

First, I could imagine input roles as an enumeration of possible relationship types, akin to package relationships in the SPDX standard, namely the items in table 68 of version 2.3 of the standard, seen here: https://spdx.github.io/spdx-spec/v2.3/relationships-between-SPDX-elements/. Not that we would use this exact list, but rather it would essentially be an enumeration which could be represented in roughly the following Rust code:

enum InputRole {
    SourceFileOf,
    GeneratorBasisOf,
    // ... more roles,
    Other(AsciiString), // where ASCII string is a string in ASCII encoding.
}

This could then be encoded in an input manifest with canonical string representations, like <gitoid of object> bom <gitoid of manifest> role SOURCE_FILE_OF, or something along those lines.
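
A sketch of what that canonical string encoding might look like, using a plain `String` plus an ASCII check in place of `AsciiString` (which is not a standard library type). The `OTHER:` prefix for custom roles and the exact spellings are assumptions made for the example, not a proposed format.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, PartialEq, Eq)]
enum InputRole {
    SourceFileOf,
    GeneratorBasisOf,
    Other(String), // must be ASCII; checked when parsing
}

impl fmt::Display for InputRole {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InputRole::SourceFileOf => write!(f, "SOURCE_FILE_OF"),
            InputRole::GeneratorBasisOf => write!(f, "GENERATOR_BASIS_OF"),
            InputRole::Other(name) => write!(f, "OTHER:{}", name),
        }
    }
}

impl FromStr for InputRole {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "SOURCE_FILE_OF" => Ok(InputRole::SourceFileOf),
            "GENERATOR_BASIS_OF" => Ok(InputRole::GeneratorBasisOf),
            // Custom roles carry an explicit prefix and must be non-empty ASCII.
            _ => match s.strip_prefix("OTHER:") {
                Some(name) if !name.is_empty() && name.is_ascii() => {
                    Ok(InputRole::Other(name.to_string()))
                }
                _ => Err(format!("unrecognized role: {}", s)),
            },
        }
    }
}
```

Rejecting unknown strings rather than silently mapping them to `Other` keeps a typo in a well-known role from turning into a new custom role.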

This has the benefit of involving little to no domain knowledge, and being valuable largely for filtering for users who want to apply some convergence-oriented checking on top of OmniBOR's maximal-divergence design.

The second option is the metadata encoding which has been the focus of the discussion above. This seems to be about encoding things like compilation information or other provenance metadata, and more generally seems philosophically about binding provenance information directly into the input manifests, rather than having it encoded in separate provenance attestations which may reference the Artifact Identifiers. Do I have that right?

I worry that encoding this information may lose out on the canonical / reproducible properties which OmniBOR otherwise strives for; at the very least I think there's a tension worth teasing out more about divergence vs. convergence here.

2 replies

jsgf Mar 8, 2023
Maintainer Author

akin to package relationships in the SPDX standard

Yeah, I think many of the relationships I had in mind are covered there.

I had been thinking that the actual tags could themselves be gitoids, referring to some "document" describing the relationship. For things like the SPDX relationships we'd have a well-known list of such documents and their gitoids, but it would allow people to define custom relationships as needed.

I guess what I had in mind is that those documents would have some kind of machine-consumable schema description, so the relationships could be given some formal semantics (no real idea of what that would look like in practice). If it's referenced by gitoid then it would be effectively immutable. On the one hand that means we can't accidentally go back and retroactively change the meaning of already existing relationships - but it also means we can't fix bugs either.

The "slug of text" approach is more flexible in that respect. One would still need to be very disciplined about evolving the spec in order to not invalidate existing uses of a relationship. In practice it might mean that we'd need to introduce a new subtly different kind of relationship instead of changing the definition of an existing one, at the risk of having a confusing proliferation of similar terms.

In either case, if we want to allow locally defined relationships, they need to be appropriately scoped - for example, make sure some formal name (either in the document or the string) always has a well-defined namespace (with OmniBOR either having an explicit namespace or claiming the empty namespace).

The second option is the metadata encoding which has been the focus of the discussion above. This seems to be about encoding things like compilation information or other provenance metadata, and more generally seems philosophically about binding provenance information directly into the input manifests, rather than having it encoded in separate provenance attestations which may reference the Artifact Identifiers. Do I have that right?

Well in my mind, the key points are:

"artifacts" are the atom of OmniBOR. They're encoded as byte strings and hashed into gitoids. There's no unit that's hashed, and larger units are composed of artifacts referencing each other via embedded gitoids
While many of the artifacts in the dependency graph have a direct relationship with artifacts used in the actual build process, there's no requirement that it's always a pure 1:1 relationship. If you want to have some particularly significant piece of information pulled out into an identifiable artifact then you can do so, and give it its own gitoid
There's no real constraint on what those extra pieces of info could be. They could be straightforward "this was a source file" style provenance information. It could be "this was that source's filename" (since we can't otherwise distinguish identical sources with different names). It could be "this was compiled by clang". It could be "here's a command line with paths normalized". It could be "here's the hostname, username, timestamp the build started". This latter is probably noise for almost all consumers, but if it's properly tagged it can be stripped out and ignored.

I worry that encoding this information may lose out on the canonical / reproducible properties which OmniBOR otherwise strives for; at the very least I think there's a tension worth teasing out more about divergence vs. convergence here.

I think exactly what OmniBOR strives for is still a bit of an open question. If you don't include any invocation information then you can use it to tell whether a build was reproduced, but it gives you no information about how to actually do the reproduction. If you include all those details then you could reproduce the build, but lose the ability to recognize other logical reproductions that differ in insignificant ways.

edwarnicke Mar 22, 2023
Maintainer

@jsgf

Well, this proposal is explicitly in the camp of "metadata should be hashed".

Would it be fair to say this proposal is in the camp of 'certain metadata should be hashed into the ADG'? Or am I misunderstanding?

edwarnicke · 2023-03-22T14:52:51Z

edwarnicke
Mar 22, 2023
Maintainer

@jsgf I have been thinking a bit on what I perceive to be your use cases.

I think at root your use case is being able to 'filter' the tree to create a new tree that is more 'convergent'.

And you are suggesting this could be used to add more data about things like toolchain data etc.

Filtering the tree for convergence

For those just catching up, I highly recommend #42 to better understand 'convergence' vs 'divergence'.

Filtering is always going to be somewhat bespoke. What should trigger divergence will be highly situational. Your proposal appears to be to allow adding at the time of generation of the Input Manifest contextual 'hints' per input artifact that can be used for after the fact filtering. I think you are suggesting filtering by omitting zero or more lines from the Input Manifest. Could you provide an example where such omission is the desired result?

If I contemplate your protobuf file whitespace example, it's unclear to me how I would use 'role' plus deletion of a line from an Input Manifest to decrease the divergence without also losing information about the fact that the protobuf file was part of the input.

1 reply

jsgf Mar 22, 2023
Maintainer Author

I think at root your use case is being able to 'filter' the tree to create a new tree that is more 'convergent'.

Yes, exactly - you can create a new derived dependency graph with unnecessary (for your specific use-case) detail removed. The input tags let you identify which inputs are the ones you do/do not care about.

And you are suggesting this could be used to add more data about things like toolchain data etc.

Yes, if a consumer of the graph can filter out unwanted detail, then there's basically no downside to including as much extra detail as you want at graph construction time. So you can include all the "real" input artifacts like source files, semi-input artifacts like toolchains, and completely synthetic artifacts with any arbitrary metadata you think would be useful for a broad range of consumer use-cases. Then any specific consumer can filter it down to match their actual requirements.

One key assumption here is that we do not embed manifest ids into output artifacts. If we do that then the output artifact contents become dependent on the very specific dependency graph + metadata generated at build time. This means that the data is much less useful for, say, people who want reproducible builds.

Could you provide an example where such omission is the desired result?

Well, take the reproducible build use-case. If we want to precisely reproduce a build and (ideally) get bit for bit identical outputs we need:

all the input sources
their dependency relationships
the precise toolchain (eg version details, but maybe its complete gitoid)
how the build tools were invoked to generate each output artifact, including
1. the core set of options
2. options which may include build-host specific detail (eg full paths)
3. normalized options (eg paths relative to the top-level source dir)

If one is careful, I think you can package all that up as metadata within each manifest involved in the build in such a way that irrelevant detail (like full include paths) can be stripped out without affecting "relevant" aspects and the generated artifacts.

Of course it all gets awkward if the build machine's full paths are actually baked into the output artifacts. In that case one might still want the full paths, so that if you have the complete generated artifacts you're trying to replicate, you can verify your local artifacts are the same except for the paths.

On the other hand, if you're trying to tell whether a source file has an effect on the generated binary, you strip out everything except the input sources and the generated artifacts. If two builds differ in one or more source files but converge on the same output binary, then you can tell that those files were not relevant to the output.
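
That last check can be sketched directly: reduce each build to its set of source oids plus its output oid, and when two builds agree on the output, the sources that differ between them were not relevant to it. `Build`, `irrelevant_sources` and the placeholder oid strings are hypothetical, and the sketch assumes all non-source inputs have already been stripped.

```rust
use std::collections::BTreeSet;

/// A build reduced to what matters for this question.
struct Build {
    sources: BTreeSet<String>, // oids of the input sources
    output: String,            // oid of the generated artifact
}

impl Build {
    /// Convenience constructor taking placeholder oid strings.
    fn new(sources: &[&str], output: &str) -> Self {
        Build {
            sources: sources.iter().map(|s| s.to_string()).collect(),
            output: output.to_string(),
        }
    }
}

/// If both builds converge on the same output, return the source oids that
/// appear in only one of them; those differences did not affect the output.
/// Returns `None` when the outputs differ, since nothing can be concluded.
fn irrelevant_sources(a: &Build, b: &Build) -> Option<BTreeSet<String>> {
    if a.output != b.output {
        return None;
    }
    Some(a.sources.symmetric_difference(&b.sources).cloned().collect())
}
```

Note that this only says the differing files were irrelevant to this particular output; they could still matter to some other step in the build.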

If I contemplate your protobuf file whitespace example, it's unclear to me how I would use 'role' plus deletion of a line from an Input Manifest to decrease the divergence without also losing information about the fact that the protobuf file was part of the input.

If you have two .proto files which when fed through protoc generate (converge on) the same output code, then you can conclude that the differences in the input are not semantically important for the code generation. The differences might still be significant to some other process - eg they could be changes to doc comments which affect the output of a documentation generator step.

So in this case, as long as your manifests aren't cluttered with detail that doesn't relate to this specific "are these two .proto files semantically identical WRT codegen" question, then you can answer it directly from the OmniBOR data. (And again, it only works if you don't embed the manifest id in the output artifacts.)

AevaOnline · 2023-09-06T18:42:20Z

AevaOnline
Sep 6, 2023
Maintainer

Summarizing today's discussion: folks generally agreed that a strong correlation between input artifacts, output manifest, output artifact, and build metadata is desirable (e.g., perhaps through a consistent naming scheme), but also that the metadata identifier should not be included in the output artifact or the output manifest.

