Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

License scanning for VMR #17442

Merged
merged 21 commits into from
Oct 12, 2023
Merged

Conversation

mthalman
Copy link
Member

@mthalman mthalman commented Sep 28, 2023

Updates the source-build smoke tests to include a test which scans the sub-repos of the VMR, searching for license references. The intent is to be aware of any non-open-source license being introduced. The tool used for scanning for licenses is https://github.com/nexB/scancode-toolkit.

Each sub-repo of the VMR is scanned separately because of the amount of time it takes. If the entire VMR were to be scanned with one invocation of the tool, it would take ~12 hrs. By splitting it into separate jobs it takes ~6hrs.

When scanning is run, the test provides a list of files for the scanner to ignore. These include binary file types. It also includes .il/.ildump file types which are massive, causing the scanner to choke and don't include license references anyway.

Once the scanner returns the results, a filtering process occurs. First, any detected license that is in the allowed list of licenses is filtered out. The test defines a list of such licenses that all represent open-source licenses. Next, a license exclusions file is applied to the filtering. This file contains a set of paths for which certain detected licenses are to be ignored. Such a path can be defined to ignore all detected licenses or specific ones. These exclusions are useful for ignoring false positives where the scanning tool has detected something in the file that makes it think it's a license reference when that's not actually the intent. Other cases that are excluded are when the license is meant as configuration or test data and not actually applying to the code. These exclusions further filter down the set of the detected licenses for each file. Everything that's left at this point is reported. It gets compared to a baseline file (which is defined for each sub-repo). If the filtered results differ from what's defined in the baseline, the test fails.

Each job that runs this test will store pipeline artifacts to be used for logging purposes. It consists of two files:

  • UpdatedLicenses.<repo-name>.json: This is the output of that gets compared to the stored baseline. If they're the same, the test passes; if not, it fails. By comparing this file to the baseline, one can determine which new license references have been introduced. If everything is deemed to be acceptable, the developer can either update the allowed licenses, update the exclusions file, update the baseline, or any combination.
  • scancode-results.json: This is the raw output that comes from scancode. This file is useful for diagnostic purposes because it tells you the exact line number of where a license has been detected in a file.

The test suite was also refactored as part of these changes. Those changes have been included as a separate commit. See the commit description for more details on that: 109c3d8.

Fixes dotnet/source-build#3070

/cc @omajid

There are assumptions in the test infrastructure that all tests are
based around testing the SDK output. But that's not always the case.
This refactors the test suite to have a new TestBase class which can be
used for tests that don't target the SDK and SmokeTest class has been
renamed to SdkTests for those tests which do test the SDK. Also factored
out the logic for parsing exclusion files into the Utilities class.
@mthalman mthalman requested a review from a team as a code owner September 28, 2023 16:44
@mthalman
Copy link
Member Author

Here's an example of the output files from the scan for the installer repo: logs.zip

@omajid
Copy link
Member

omajid commented Sep 29, 2023

cc @dviererbe @mateusrodrigues @mirespace

@MichaelSimons
Copy link
Member

Thanks for working on this Matt. Do you see this detection as always living in the VMR or do you think this is something that can/should be pushed to the product repos? This feels like a type of compliance that repo owners should be responsible for. I worry about the burden this add to the source-build folks to monitor and push for resolution on license violations from individual repos. This may be something we would continue to run on the VMR but can this be backported to arcade and run on the repo level in addition? The idea is then the exclusions/exceptions would be owned by the individual repos.

@omajid
Copy link
Member

omajid commented Sep 29, 2023

Thanks for working on this Matt.

I couldn't agree more. This is amazing! Thanks Matt!

do you think this is something that can/should be pushed to the product repos? .. The idea is then the exclusions/exceptions would be owned by the individual repos.

That's a great point. A few reasons that make me less excited about product repos owning this:

  • We want this to be done for everything in the VMR (only), while product repos may have additional code (which may be proprietary). Maybe we can work around this by tying the what-sources-to-scan-for-licenses to what-product-repo-code-to-sync-with-vmr?
  • We have seen Product repos allow prebuilts so they can get fixes in. The prebuilts were caught due to build failures in the VMR. I wonder product repos are also likely to allow-list licenses and/or exclude code from being scanned, just to get past failures.
  • Product repos are likely further away from the distro maintainers, and this area benefits from close collaboration between distro-maintainers and repo owners. Though maybe the solution is to get distro-maintainers involved with the product repos more?

@MichaelSimons
Copy link
Member

That's a great point. A few reasons that make me less excited about product repos owning this:

We want this to be done for everything in the VMR (only), while product repos may have additional code (which may be proprietary). Maybe we can work around this by tying the what-sources-to-scan-for-licenses to what-product-repo-code-to-sync-with-vmr?
We have seen Product repos allow prebuilts so they can get fixes in. The prebuilts were caught due to build failures in the VMR. I wonder product repos are also likely to allow-list licenses and/or exclude code from being scanned, just to get past failures.
Product repos are likely further away from the distro maintainers, and this area benefits from close collaboration between distro-maintainers and repo owners. Though maybe the solution is to get distro-maintainers involved with the product repos more?

I share your concern about ensuring correctness and yes it feels similar to prebuilts. It is just that this feels like a compliance item better owned by the repo owners with the appropriate checks/sign-offs in place.

@MichaelSimons
Copy link
Member

Please capture some guidance documentation for how devs can run the license scanning test for a particular repo. This will be helpful in the future when logging repo issues about licensing issues.

@mthalman
Copy link
Member Author

mthalman commented Oct 3, 2023

Please capture some guidance documentation for how devs can run the license scanning test for a particular repo. This will be helpful in the future when logging repo issues about licensing issues.

I've added documentation in this commit: bfe0446

@mthalman
Copy link
Member Author

mthalman commented Oct 3, 2023

I'm going to be updating the test to have some additional validation for the exclusion file. One of my concerns is with the maintenance of that file and it getting filled with obsolete content. While it's not going to be the perfect answer, I want to have some validation to check for whether paths listed in there actually exist in the VMR. It can only do this up to a point, considering that those paths can contain wildcards.

Copy link
Member

@omajid omajid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -60,6 +86,11 @@ src/fsharp/setup/resources/eula/*.rtf
# False positive
src/installer/src/core-sdk-tasks/BuildFPMToolPreReqs.cs|json
src/installer/src/redist/targets/packaging/osx/clisdk/resources/en.lproj/welcome.html|cecill-c
src/installer/THIRD-PARTY-NOTICES|proprietary-license
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/arcade/src/Microsoft.DotNet.Build.Tasks.Installers/src/BuildFPMToolPreReqs.cs|json
src/arcade/src/Microsoft.DotNet.Build.Tasks.Installers/build/rpm_templates/copyright|cecill-c
src/arcade/src/SignCheck/SignCheck/THIRD-PARTY-NOTICES.TXT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying out a quick fix to see if we can delete this file: dotnet/arcade#14090

@mthalman mthalman merged commit 2e06985 into dotnet:release/8.0.1xx Oct 12, 2023
18 checks passed
@mthalman mthalman deleted the sb3070-working branch October 12, 2023 14:28
@mthalman
Copy link
Member Author

/backport to main

@github-actions
Copy link

Started backporting to main: https://github.com/dotnet/installer/actions/runs/6497054013

@github-actions
Copy link

@mthalman backporting to main failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: Test suite refactoring
.git/rebase-apply/patch:358: trailing whitespace.
    
warning: 1 line adds whitespace errors.
Using index info to reconstruct a base tree...
M	src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/BaselineHelper.cs
M	src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/BasicScenarioTests.cs
M	src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/DotNetFormatTests.cs
M	src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/DotNetHelper.cs
M	src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/DotNetWatchTests.cs
M	src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/OmniSharpTests.cs
A	src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/SourcelinkTests.cs
M	src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/Utilities.cs
M	src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/WebScenarioTests.cs
Falling back to patching base and 3-way merge...
Auto-merging src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/WebScenarioTests.cs
Auto-merging src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/Utilities.cs
CONFLICT (content): Merge conflict in src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/Utilities.cs
CONFLICT (modify/delete): src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/SourcelinkTests.cs deleted in HEAD and modified in Test suite refactoring. Version Test suite refactoring of src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/SourcelinkTests.cs left in tree.
Auto-merging src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/OmniSharpTests.cs
Auto-merging src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/DotNetWatchTests.cs
Auto-merging src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/DotNetHelper.cs
Auto-merging src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/DotNetFormatTests.cs
Auto-merging src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/BasicScenarioTests.cs
Auto-merging src/SourceBuild/content/test/Microsoft.DotNet.SourceBuild.SmokeTests/BaselineHelper.cs
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Test suite refactoring
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

@github-actions
Copy link

@mthalman an error occurred while backporting to main, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants