Commit db3231f

feat!: introduce confidence scores for check facts (#620)
This PR changes the data model to allow specifying confidence scores for check results, which is especially useful when a check reports multiple candidate results. All of these confidence scores are stored in the check tables in the database, but only the fact with the highest confidence is shown in the HTML/JSON report.

Justifications no longer need to be added manually to the CheckResultData. Instead, they are curated directly from the results in the table. If a column specifies a JustificationType in the column mapping, it is picked up automatically and rendered as plain text or an href, depending on the specified type. If a check fails or is skipped, a default "Not Available." justification is shown. This makes it possible to create HTML/JSON reports from the database reproducibly.

Signed-off-by: behnazh-w <[email protected]>
1 parent 31f1c87 commit db3231f

34 files changed: 1225 additions & 807 deletions

docs/source/assets/er-diagram.svg

Lines changed: 385 additions & 264 deletions

docs/source/pages/developers_guide/index.rst

Lines changed: 208 additions & 1 deletion
@@ -1,4 +1,4 @@
1-
.. Copyright (c) 2023 - 2023, Oracle and/or its affiliates. All rights reserved.
1+
.. Copyright (c) 2023 - 2024, Oracle and/or its affiliates. All rights reserved.
22
.. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
33
44
=========================
@@ -11,6 +11,213 @@ To follow the project's code style, see the :doc:`Macaron Style Guide </pages/de
1111

1212
For API reference, see the :doc:`API Reference </pages/developers_guide/apidoc/index>` page.
1313

14+
-------------------
15+
Writing a New Check
16+
-------------------
17+
18+
Contributors to Macaron are very likely to need to write a new check or modify an existing one at some point. In this
19+
section, we will explain how Macaron checks work. We will also show how to develop a new check.
20+
21+
+++++++++++++++++
22+
High-level Design
23+
+++++++++++++++++
24+
25+
Before jumping into coding, it is useful to understand how Macaron as a framework works. Macaron is an extensible
26+
framework designed to make writing new supply chain security analyses easy. It provides an interface
27+
that you can leverage to access existing models and abstractions instead of implementing everything from scratch. For
28+
instance, many security checks require traversing through the code in GitHub Actions configurations. Normally,
29+
you would need to find the right repository and commit, clone it, find the workflows, and parse them. With Macaron,
30+
you don't need to do any of that and can simply write your security check by using the parsed shell scripts that are
31+
triggered in the CI.
32+
33+
Another important aspect of our design is that all the check results are automatically mapped and stored in a local database.
34+
By performing this mapping, we make it possible to enforce use case-specific policies on the results of the checks. While storing
35+
the check results in the database happens automatically in Macaron's backend, the developer needs to add a brief specification
36+
to make that possible as we will see later.
37+
38+
Once you get familiar with writing a basic check, you can explore the check dependency feature in Macaron. The checks
39+
in our framework can be customized to only run if another check has run and returned a specific
40+
:class:`result type <macaron.slsa_analyzer.checks.check_result.CheckResultType>`. This feature can be used when checks
41+
have an ordering and a parent-child relationship, i.e., one check implements a weaker or stronger version of a
42+
security property in a parent check. Therefore, it might make sense to skip running the check and report a
43+
:class:`result type <macaron.slsa_analyzer.checks.check_result.CheckResultType>` based on the result of the parent check.
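
To make the dependency idea concrete, here is a minimal, self-contained sketch. The ``CheckResultType`` enum, the ``should_run`` helper, and the scheduling logic below are stand-ins invented for illustration and are not Macaron's implementation; only the ``(check_id, CheckResultType)`` tuple shape of ``depends_on`` is taken from the example later in this section.

```python
from enum import Enum


class CheckResultType(Enum):
    """Stand-in for Macaron's CheckResultType (members assumed for this sketch)."""

    PASSED = "PASSED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"


def should_run(
    depends_on: list[tuple[str, CheckResultType]],
    finished: dict[str, CheckResultType],
) -> bool:
    """Return True only if every parent has run and returned the expected result type."""
    return all(finished.get(parent_id) is expected for parent_id, expected in depends_on)


# Hypothetical child check that should run only when its parent has passed.
depends_on = [("mcn_repo_exists_1", CheckResultType.PASSED)]

print(should_run(depends_on, {"mcn_repo_exists_1": CheckResultType.PASSED}))
print(should_run(depends_on, {"mcn_repo_exists_1": CheckResultType.FAILED}))
print(should_run(depends_on, {}))  # The parent has not run yet.
```

A check with an empty ``depends_on`` list is always eligible to run under this sketch, because ``all`` over an empty sequence is true.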
44+
45+
+++++++++++++++++++
46+
The Check Interface
47+
+++++++++++++++++++
48+
49+
Each check needs to be implemented as a Python class in a Python module under ``src/macaron/slsa_analyzer/checks``.
50+
A check class should subclass the :class:`BaseCheck <macaron.slsa_analyzer.checks.base_check.BaseCheck>` class.
51+
52+
The main logic of a check should be implemented in the :func:`run_check <macaron.slsa_analyzer.checks.base_check.BaseCheck.run_check>` abstract method. It is important to understand the input
53+
parameters and output objects computed by this method.
54+
55+
.. code-block:: python

    def run_check(self, ctx: AnalyzeContext) -> CheckResultData:
57+
58+
''''''''''''''''
59+
Input Parameters
60+
''''''''''''''''
61+
62+
The :func:`run_check <macaron.slsa_analyzer.checks.base_check.BaseCheck.run_check>` method is a callback called by our checker framework. The framework pre-computes a context object,
63+
:class:`ctx: AnalyzeContext <macaron.slsa_analyzer.analyze_context.AnalyzeContext>`, and makes it available as the input
64+
parameter to the function. The ``ctx`` object contains various intermediate representations and models.
65+
Most likely, you will need to use the following properties:
66+
67+
* :attr:`component <macaron.slsa_analyzer.analyze_context.AnalyzeContext.component>`
68+
* :attr:`dynamic_data <macaron.slsa_analyzer.analyze_context.AnalyzeContext.dynamic_data>`
69+
70+
The :attr:`component <macaron.slsa_analyzer.analyze_context.AnalyzeContext.component>`
71+
object acts as a representation of a software component and contains data, such as its
72+
corresponding :class:`Repository <macaron.database.table_definitions.Repository>` and
73+
:data:`dependencies <macaron.database.table_definitions.components_association_table>`.
74+
Note that :attr:`component <macaron.slsa_analyzer.analyze_context.AnalyzeContext.component>` will also be stored
75+
in the database and its attributes, such as :attr:`repository <macaron.database.table_definitions.Component.repository>`
76+
are established as database relationships. You can see the existing tables and their relationships
77+
in our :mod:`data model <macaron.database.table_definitions>`.
78+
79+
The :attr:`dynamic_data <macaron.slsa_analyzer.analyze_context.AnalyzeContext.dynamic_data>` property is particularly useful because it contains
80+
data about the CI service, artifact registry, and build tool used for building the software component.
81+
Note that this object is a shared state among checks. If a check runs before another check, it can
82+
make changes to this object, which will be accessible to the checks run subsequently.
83+
84+
''''''
85+
Output
86+
''''''
87+
88+
The :func:`run_check <macaron.slsa_analyzer.checks.base_check.BaseCheck.run_check>` method returns a :class:`CheckResultData <macaron.slsa_analyzer.checks.check_result.CheckResultData>` object.
89+
This object consists of :attr:`result_tables <macaron.slsa_analyzer.checks.check_result.CheckResultData.result_tables>` and
90+
:attr:`result_type <macaron.slsa_analyzer.checks.check_result.CheckResultData.result_type>`.
91+
The :attr:`result_tables <macaron.slsa_analyzer.checks.check_result.CheckResultData.result_tables>` object is the list of facts generated from the check. The :attr:`result_type <macaron.slsa_analyzer.checks.check_result.CheckResultData.result_type>`
92+
value shows the final result type of the check.
93+
94+
+++++++
95+
Example
96+
+++++++
97+
98+
In this example, we show how to add a check to determine if a software component has a source-code repository.
99+
Note that this is a simple example that only demonstrates how to add a check from scratch.
100+
Feel free to explore other existing checks under ``src/macaron/slsa_analyzer/checks`` for more examples.
101+
102+
As discussed earlier, each check needs to be implemented as a Python class in a Python module under ``src/macaron/slsa_analyzer/checks``.
103+
A check class should subclass the :class:`BaseCheck <macaron.slsa_analyzer.checks.base_check.BaseCheck>` class.
104+
105+
'''''''''''''''
106+
Create a module
107+
'''''''''''''''
108+
First, create a module called ``repo_check.py`` under ``src/macaron/slsa_analyzer/checks``.
109+
110+
111+
''''''''''''''''''''''''''''
112+
Add a class for the database
113+
''''''''''''''''''''''''''''
114+
115+
* Add a class that subclasses :class:`CheckFacts <macaron.database.table_definitions.CheckFacts>` to map your outputs to a table in the database. The class name should follow the ``<MyCheck>Facts`` pattern.
116+
* Specify the table name in the ``__tablename__`` class variable. Note that the table name should start with ``_`` and it should not have been used by other checks.
117+
* Add the ``id`` column as the primary key where the foreign key is ``_check_facts.id``.
118+
* Add columns for the check outputs that you would like to store in the database. If a column needs to appear as a justification in the HTML/JSON report, pass ``info={"justification": JustificationType.<TEXT or HREF>}`` to the column mapper.
119+
* Add the ``__mapper_args__`` class variable and set its ``"polymorphic_identity"`` key to the table name.
120+
121+
.. code-block:: python

    # Add this line at the top of the file to create the logger object if you plan to use it.
    logger: logging.Logger = logging.getLogger(__name__)


    class RepoCheckFacts(CheckFacts):
        """The ORM mapping for justifications in the check repository check."""

        __tablename__ = "_repo_check"

        #: The primary key.
        id: Mapped[int] = mapped_column(ForeignKey("_check_facts.id"), primary_key=True)

        #: The Git repository path.
        git_repo: Mapped[str] = mapped_column(String, nullable=True, info={"justification": JustificationType.HREF})

        __mapper_args__ = {
            "polymorphic_identity": "_repo_check",
        }
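
The ``info={"justification": ...}`` entry is what lets a report generator decide how to display a column. The following is only a rough sketch of that idea and not Macaron's actual rendering code: the ``JustificationType`` enum here is a stand-in (only the ``TEXT`` and ``HREF`` member names come from the text above), and ``render_justification`` is a hypothetical helper.

```python
from enum import Enum
from html import escape


class JustificationType(Enum):
    """Stand-in enum; the member values are assumptions made for this sketch."""

    TEXT = "text"
    HREF = "href"


def render_justification(label: str, value: str, kind: JustificationType) -> str:
    """Render one column value as a link or as plain text (hypothetical helper)."""
    if kind is JustificationType.HREF:
        # Escape the value so that it is safe inside an HTML attribute.
        return f'<a href="{escape(value, quote=True)}">{escape(label)}</a>'
    return f"{escape(label)}: {escape(value)}"


print(render_justification("git_repo", "https://github.com/oracle/macaron", JustificationType.HREF))
print(render_justification("build_tool", "maven", JustificationType.TEXT))
```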
141+
142+
'''''''''''''''''''
143+
Add the check class
144+
'''''''''''''''''''
145+
146+
Add a class for your check that subclasses :class:`BaseCheck <macaron.slsa_analyzer.checks.base_check.BaseCheck>`,
147+
provide the check details in the initializer method, and implement the logic of the check in
148+
:func:`run_check <macaron.slsa_analyzer.checks.base_check.BaseCheck.run_check>`.
149+
150+
A ``check_id`` should match the ``^mcn_([a-z]+_)+([0-9]+)$`` regular expression, which means it should meet the following requirements:
151+
152+
- The general format: ``mcn_<name>_<digits>``.
153+
- Use lowercase alphabetical letters in ``name``. If ``name`` contains multiple words, they must be separated by underscores.
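
As a quick sanity check, the pattern can be tried against a few candidate IDs with Python's ``re`` module. The ``is_valid_check_id`` helper is hypothetical and exists only for this sketch; the regular expression itself is the one quoted above.

```python
import re

# The pattern quoted in the text above.
CHECK_ID_PATTERN = re.compile(r"^mcn_([a-z]+_)+([0-9]+)$")


def is_valid_check_id(check_id: str) -> bool:
    """Return True if check_id has the form mcn_<name>_<digits>."""
    return CHECK_ID_PATTERN.match(check_id) is not None


for candidate in ("mcn_repo_exists_1", "mcn_RepoExists_1", "repo_exists_1", "mcn_repo_exists"):
    print(candidate, is_valid_check_id(candidate))
```

Only the first candidate is accepted: the second uses uppercase letters, the third lacks the ``mcn_`` prefix, and the fourth has no trailing digits.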
154+
155+
You can set the ``depends_on`` attribute in the initializer method to declare dependencies on other checks. In this example, we leave this list empty.
156+
157+
.. code-block:: python

    class RepoCheck(BaseCheck):
        """This Check checks whether the target software component has a source-code repository."""

        def __init__(self) -> None:
            """Initialize instance."""
            check_id = "mcn_repo_exists_1"
            description = "Check whether the target software component has a source-code repository."
            depends_on: list[tuple[str, CheckResultType]] = []  # This check doesn't depend on any other checks.
            eval_reqs = [
                ReqName.VCS
            ]  # Choose a SLSA requirement that roughly matches this check from the ReqName enum class.
            super().__init__(check_id=check_id, description=description, depends_on=depends_on, eval_reqs=eval_reqs)

        def run_check(self, ctx: AnalyzeContext) -> CheckResultData:
            """Implement the check in this method.

            Parameters
            ----------
            ctx : AnalyzeContext
                The object containing processed data for the target software component.

            Returns
            -------
            CheckResultData
                The result of the check.
            """
            if not ctx.component.repository:
                logger.info("Unable to find a Git repository for %s", ctx.component.purl)
                # We do not store any results in the database if a check fails. So, just leave result_tables empty.
                return CheckResultData(result_tables=[], result_type=CheckResultType.FAILED)

            return CheckResultData(
                result_tables=[RepoCheckFacts(git_repo=ctx.component.repository.remote_path, confidence=Confidence.HIGH)],
                result_type=CheckResultType.PASSED,
            )
194+
195+
As you can see, the result of the check is returned via the :class:`CheckResultData <macaron.slsa_analyzer.checks.check_result.CheckResultData>` object.
196+
You should specify a :class:`Confidence <macaron.slsa_analyzer.checks.check_result.Confidence>`
197+
score by choosing one of the :class:`Confidence <macaron.slsa_analyzer.checks.check_result.Confidence>` enum values,
198+
e.g., :class:`Confidence.HIGH <macaron.slsa_analyzer.checks.check_result.Confidence.HIGH>`, and pass it via the keyword
199+
argument :attr:`confidence <macaron.database.table_definitions.CheckFacts.confidence>`. You should choose a suitable
200+
confidence score based on the accuracy of your check analysis.
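
The commit message for this change states that all facts are stored, but only the fact with the highest confidence is shown in the HTML/JSON report, and the ``confidence`` column is documented as a value in the range [0.0, 1.0]. The sketch below illustrates both points with plain Python. The ``Fact`` dataclass, the ``pick_reported_fact`` helper, and the numeric scores are assumptions made for illustration; they are not Macaron classes or values.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Stand-in for a check fact row; not Macaron's CheckFacts class."""

    git_repo: str
    confidence: float

    def __post_init__(self) -> None:
        # Mirror the documented range of the confidence score.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence must be in [0.0, 1.0], got {self.confidence}")


def pick_reported_fact(facts: list[Fact]) -> Fact | None:
    """Return the fact with the highest confidence, or None if there are no facts."""
    return max(facts, key=lambda fact: fact.confidence, default=None)


# Two hypothetical candidate results reported by the same check.
candidates = [
    Fact(git_repo="https://example.org/candidate-a.git", confidence=0.4),
    Fact(git_repo="https://example.org/candidate-b.git", confidence=1.0),
]
best = pick_reported_fact(candidates)
print(best.git_repo if best else "Not Available.")
```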
201+
202+
'''''''''''''''''''
203+
Register your check
204+
'''''''''''''''''''
205+
206+
Next, register your check with the :mod:`registry module <macaron.slsa_analyzer.registry>` at the end of your check module:
207+
208+
.. code-block:: python

    registry.register(RepoCheck())
211+
212+
213+
'''''''''''''''
214+
Test your check
215+
'''''''''''''''
216+
217+
Finally, you can add tests for your check in a new ``tests/slsa_analyzer/checks/test_repo_check.py`` module. Macaron
218+
uses `pytest <https://docs.pytest.org>`_ and `hypothesis <https://hypothesis.readthedocs.io>`_ for testing. Take a look
219+
at other tests for inspiration!
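
As a starting point, a test module could follow the shape sketched below. To keep the sketch self-contained, ``repo_check_result_type`` is a stand-in for calling ``RepoCheck().run_check(ctx).result_type``, the ``CheckResultType`` enum is a local stand-in, and the contexts are built with ``types.SimpleNamespace`` rather than a real ``AnalyzeContext``. In a real test you would import the check and use the project's own fixtures instead.

```python
from enum import Enum
from types import SimpleNamespace


class CheckResultType(Enum):
    """Stand-in for Macaron's CheckResultType; only the members used here."""

    PASSED = "PASSED"
    FAILED = "FAILED"


def repo_check_result_type(ctx: SimpleNamespace) -> CheckResultType:
    """Stand-in for the result type produced by the example check above."""
    return CheckResultType.PASSED if ctx.component.repository else CheckResultType.FAILED


def test_repo_check_fails_without_repository() -> None:
    """A component without a repository should fail the check."""
    ctx = SimpleNamespace(component=SimpleNamespace(repository=None, purl="pkg:pypi/example@1.0.0"))
    assert repo_check_result_type(ctx) is CheckResultType.FAILED


def test_repo_check_passes_with_repository() -> None:
    """A component with a repository should pass the check."""
    repo = SimpleNamespace(remote_path="https://example.org/example.git")
    ctx = SimpleNamespace(component=SimpleNamespace(repository=repo, purl="pkg:pypi/example@1.0.0"))
    assert repo_check_result_type(ctx) is CheckResultType.PASSED
```

pytest collects functions whose names start with ``test_``, so both functions would run when the module is placed under ``tests/``.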
220+
14221
.. toctree::
15222
:maxdepth: 1
16223

src/macaron/database/table_definitions.py

Lines changed: 25 additions & 63 deletions
@@ -1,4 +1,4 @@
1-
# Copyright (c) 2023 - 2023, Oracle and/or its affiliates. All rights reserved.
1+
# Copyright (c) 2023 - 2024, Oracle and/or its affiliates. All rights reserved.
22
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
33

44
"""
@@ -10,23 +10,31 @@
1010
1111
For table associated with a check see the check module.
1212
"""
13-
import hashlib
1413
import logging
1514
import os
1615
import string
1716
from datetime import datetime
1817
from pathlib import Path
19-
from typing import Any, Self
18+
from typing import Any
2019

2120
from packageurl import PackageURL
22-
from sqlalchemy import Boolean, Column, Enum, ForeignKey, Integer, String, Table, UniqueConstraint
21+
from sqlalchemy import (
22+
Boolean,
23+
CheckConstraint,
24+
Column,
25+
Enum,
26+
Float,
27+
ForeignKey,
28+
Integer,
29+
String,
30+
Table,
31+
UniqueConstraint,
32+
)
2333
from sqlalchemy.orm import Mapped, mapped_column, relationship
2434

2535
from macaron.database.database_manager import ORMBase
2636
from macaron.database.rfc3339_datetime import RFC3339DateTime
27-
from macaron.errors import CUEExpectationError, CUERuntimeError, InvalidPURLError
28-
from macaron.slsa_analyzer.provenance.expectations.cue import cue_validator
29-
from macaron.slsa_analyzer.provenance.expectations.expectation import Expectation
37+
from macaron.errors import InvalidPURLError
3038
from macaron.slsa_analyzer.slsa_req import ReqName
3139

3240
logger: logging.Logger = logging.getLogger(__name__)
@@ -415,6 +423,16 @@ class CheckFacts(ORMBase):
415423
#: The primary key.
416424
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True) # noqa: A003
417425

426+
#: The confidence score to estimate the accuracy of the check fact. This value should be in the range [0.0, 1.0] with
427+
#: a lower value depicting a lower confidence. Because some analyses used in checks may use
428+
#: heuristics, the results can be inaccurate in certain cases.
429+
#: We use the confidence score to enable the check designer to assign a confidence estimate.
430+
#: This confidence is stored in the database to be used by the policy. This confidence score is
431+
#: also used to decide which evidence should be shown to the user in the HTML/JSON report.
432+
confidence: Mapped[float] = mapped_column(
433+
Float, CheckConstraint("confidence>=0.0 AND confidence<=1.0"), nullable=False
434+
)
435+
418436
#: The foreign key to the software component.
419437
component_id: Mapped[int] = mapped_column(Integer, ForeignKey("_component.id"), nullable=False)
420438

@@ -437,62 +455,6 @@ class CheckFacts(ORMBase):
437455
}
438456

439457

440-
class CUEExpectation(Expectation, CheckFacts):
441-
"""ORM Class for an expectation."""
442-
443-
# TODO: provenance content check should store the expectation, its evaluation result,
444-
# and which PROVENANCE it was applied to rather than only linking to the repository.
445-
446-
__tablename__ = "_expectation"
447-
448-
#: The primary key, which is also a foreign key to the base check table.
449-
id: Mapped[int] = mapped_column(ForeignKey("_check_facts.id"), primary_key=True) # noqa: A003
450-
451-
#: The polymorphic inheritance configuration.
452-
__mapper_args__ = {
453-
"polymorphic_identity": "_expectation",
454-
}
455-
456-
@classmethod
457-
def make_expectation(cls, expectation_path: str) -> Self | None:
458-
"""Construct a CUE expectation from a CUE file.
459-
460-
Note: we require the CUE expectation file to have a "target" field.
461-
462-
Parameters
463-
----------
464-
expectation_path: str
465-
The path to the expectation file.
466-
467-
Returns
468-
-------
469-
Self
470-
The instantiated expectation object.
471-
"""
472-
logger.info("Generating an expectation from file %s", expectation_path)
473-
expectation: CUEExpectation = CUEExpectation(
474-
description="CUE expectation",
475-
path=expectation_path,
476-
target="",
477-
expectation_type="CUE",
478-
)
479-
480-
try:
481-
with open(expectation_path, encoding="utf-8") as expectation_file:
482-
expectation.text = expectation_file.read()
483-
expectation.sha = str(hashlib.sha256(expectation.text.encode("utf-8")).hexdigest())
484-
expectation.target = cue_validator.get_target(expectation.text)
485-
expectation._validator = ( # pylint: disable=protected-access
486-
lambda provenance: cue_validator.validate_expectation(expectation.text, provenance)
487-
)
488-
except (OSError, CUERuntimeError, CUEExpectationError) as error:
489-
logger.error("CUE expectation error: %s", error)
490-
return None
491-
492-
# TODO remove type ignore once mypy adds support for Self.
493-
return expectation # type: ignore
494-
495-
496458
class Provenance(ORMBase):
497459
"""ORM class for a provenance document."""
498460

src/macaron/policy_engine/souffle_code_generator.py

Lines changed: 4 additions & 2 deletions
@@ -1,12 +1,12 @@
1-
# Copyright (c) 2023 - 2023, Oracle and/or its affiliates. All rights reserved.
1+
# Copyright (c) 2023 - 2024, Oracle and/or its affiliates. All rights reserved.
22
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
33

44
"""Generate souffle datalog for policy prelude."""
55

66
import logging
77
import os
88

9-
from sqlalchemy import Column, MetaData, Table
9+
from sqlalchemy import Column, Float, MetaData, Table
1010
from sqlalchemy.sql.sqltypes import Boolean, Integer, String, Text
1111

1212
logger: logging.Logger = logging.getLogger(__name__)
@@ -81,6 +81,8 @@ def column_to_souffle_type(column: Column) -> str:
8181
souffle_type = "symbol"
8282
elif isinstance(sql_type, Integer):
8383
souffle_type = "number"
84+
elif isinstance(sql_type, Float):
85+
souffle_type = "number"
8486
elif isinstance(sql_type, Text):
8587
souffle_type = "symbol"
8688
elif isinstance(sql_type, Boolean):
