feat(license): Support for deep license scanning for finding concluded licenses #7344

Open
wants to merge 2 commits into
base: main
from

Conversation

@hrithik-777 hrithik-777 commented Aug 15, 2024

Description

Adds deep license scanning to find concluded licenses, as an extension of the existing license scanning support. The current implementation only checks manifest files and reports the declared licenses found for each package. With deep license scanning, the scanner also reads all other files in the package directory, runs the Google license classifier on each file's content, and reports the resulting license findings. This is useful for a license compliance persona, whose main concern is every license that can be found in the given image or filesystem.
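
The scan described above could be sketched roughly as a bounded worker pool that walks a package directory and classifies each file. This is only an illustration, not the code in this PR: `Finding`, `classify`, and `scanDir` are hypothetical names, and `classify` is a stub that matches one well-known phrase instead of calling the Google license classifier.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
	"sync"
)

// Finding is a hypothetical result type, not the PR's actual struct.
type Finding struct {
	FilePath string
	License  string
}

// classify stands in for the license classifier (assumption). The stub only
// looks for one phrase from the MIT license so the sketch stays self-contained.
func classify(content []byte) (string, bool) {
	if strings.Contains(string(content), "Permission is hereby granted, free of charge") {
		return "MIT", true
	}
	return "", false
}

// scanDir walks root and classifies every regular file using `workers` goroutines.
func scanDir(root string, workers int) ([]Finding, error) {
	paths := make(chan string)
	var (
		mu       sync.Mutex
		findings []Finding
		wg       sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range paths {
				content, err := os.ReadFile(p)
				if err != nil {
					continue // unreadable files are skipped in this sketch
				}
				if name, ok := classify(content); ok {
					mu.Lock()
					findings = append(findings, Finding{FilePath: p, License: name})
					mu.Unlock()
				}
			}
		}()
	}
	// Feed file paths to the workers, then signal that no more work is coming.
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			paths <- p
		}
		return nil
	})
	close(paths)
	wg.Wait()
	return findings, err
}

func main() {
	findings, err := scanDir(".", 5) // 5 mirrors the default worker count mentioned in this PR
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
		os.Exit(1)
	}
	for _, f := range findings {
		fmt.Printf("%s: %s\n", f.FilePath, f.License)
	}
}
```

The order of findings is not deterministic here because workers finish independently; a caller that needs stable output would sort the result.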

This PR adds support for Node.js and .NET. The same approach can be extended to other languages.

  • Tests updated
  • Added to documentation for docs site

Features:

  • When the license scanner is enabled together with the --license-full flag, deep license scanning is performed on the given image/filesystem. It finds every license it can and adds them to the report.
  • Since deep license scanning is slow, the --license-scan-workers flag can be passed to increase the number of threads used for processing (default: 5 threads).
  • The license text found by the Google license classifier is captured. A checksum is generated for this text, and the text is persisted as a .txt file in the directory given by the --license-text-cacheDir flag. If the flag is not given, the text is stored in the default cache dir.
  • The copyright text present in the license text is extracted using a regex match.
  • Declared licenses are sometimes specified as combined licenses, e.g. MIT OR BSD-3. In these cases the combined license is split and each part is validated as an SPDX license. If valid SPDX identifiers are found, they are added to the final report. If none are found, the original combined license string is added to the output unchanged.
  • Added an IsSpdxClassified field to licenses to indicate whether the license identifier falls under the SPDX classification.
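
The combined-license handling in the list above could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the PR's implementation: `knownSPDX` is a tiny hand-picked subset of the SPDX list, the `License` struct and `splitCombined` are hypothetical, and matching is case-sensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// knownSPDX is a small illustrative subset of SPDX identifiers (assumption);
// a real implementation would consult the full SPDX license list.
var knownSPDX = map[string]bool{
	"MIT": true, "Apache-2.0": true, "BSD-3-Clause": true, "MPL-1.1": true,
}

// License is a hypothetical struct that only mirrors the idea of the
// IsSpdxClassified field described in this PR.
type License struct {
	Name             string
	IsSpdxClassified bool
}

// splitCombined splits an expression such as "(MIT OR Apache-2.0)" into its
// operands and keeps the ones that are known SPDX identifiers. If none are
// valid, the original string is returned as a single unclassified license.
func splitCombined(expr string) []License {
	var valid []License
	for _, tok := range strings.Fields(expr) {
		tok = strings.Trim(tok, "()")
		switch strings.ToUpper(tok) {
		case "", "OR", "AND":
			continue // skip operators and empty tokens
		}
		if knownSPDX[tok] {
			valid = append(valid, License{Name: tok, IsSpdxClassified: true})
		}
	}
	if len(valid) == 0 {
		return []License{{Name: expr, IsSpdxClassified: false}}
	}
	return valid
}

func main() {
	for _, expr := range []string{"MIT OR BSD-3", "(MIT OR Apache-2.0)", "Custom License"} {
		fmt.Printf("%q -> %+v\n", expr, splitCombined(expr))
	}
}
```

In this sketch "MIT OR BSD-3" yields only MIT, because BSD-3 is not a valid SPDX identifier (the SPDX form is BSD-3-Clause).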

Limitations:

  • In general, the number of concluded licenses should be greater than or equal to the number of declared licenses. However, there are scenarios where the Google license classifier cannot identify the license text in a file, so the number of concluded licenses can be lower than the number of declared licenses.
  • The Google license classifier identifies a known license by comparing the given text with the full license text it holds. If only partial license text is present in the source code (a header or code comments), it fails to classify that partial text, so only full license texts can be taken into consideration.

Before:
{
  "ID": "harmony-reflect@1.6.2",
  "Name": "harmony-reflect",
  "Identifier": {
    "PURL": "pkg:npm/harmony-reflect@1.6.2",
    "UID": "1e2a0faf8c98eb2e"
  },
  "Version": "1.6.2",
  "Licenses": [
    "Apache-2.0",
    "MPL-1.1"
  ],
  "Indirect": true,
  "Relationship": "indirect",
  "Layer": {},
  "Locations": [
    {
      "StartLine": 12321,
      "EndLine": 12326
    }
  ]
},

After:
{
  "ID": "harmony-reflect@1.6.2",
  "Name": "harmony-reflect",
  "Identifier": {
    "PURL": "pkg:npm/harmony-reflect@1.6.2",
    "UID": "1e2a0faf8c98eb2e"
  },
  "Version": "1.6.2",
  "Licenses": [
    "Apache-2.0",
    "MPL-1.1"
  ],
  "ConcludedLicenses": [
    {
      "Name": "Apache-2.0",
      "Type": "header",
      "IsDeclared": false,
      "FilePath": "node_modules/harmony-reflect/reflect.js",
      "LicenseTextChecksum": "ebc83a1acf7dd9ea52c518980ecd82e2e0688f4aa1aa27cb3350a722d86c6380"
    }
  ],
  "Indirect": true,
  "Relationship": "indirect",
  "Layer": {},
  "Locations": [
    {
      "StartLine": 12321,
      "EndLine": 12326
    }
  ]
},
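
For consumers of the report, the new ConcludedLicenses entry could be decoded along the lines of the sketch below. The struct field names follow the JSON keys shown in the example, but the Go types themselves, `decodePackage`, `textChecksum`, and `cachePath` are hypothetical. The 64-character hex checksum is consistent with SHA-256, which is used here, but the actual hash algorithm and cache file naming are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"path/filepath"
)

// ConcludedLicense mirrors the JSON keys in the example output (hypothetical type).
type ConcludedLicense struct {
	Name                string
	Type                string
	IsDeclared          bool
	FilePath            string
	LicenseTextChecksum string
}

// Package holds only the report fields this sketch needs (hypothetical type).
type Package struct {
	Name              string
	Version           string
	Licenses          []string
	ConcludedLicenses []ConcludedLicense
}

// decodePackage parses one package object from the report.
func decodePackage(data []byte) (Package, error) {
	var p Package
	err := json.Unmarshal(data, &p)
	return p, err
}

// textChecksum hashes a license text. SHA-256 is an assumption based on the
// 64 hex characters in the example; the PR may use a different algorithm.
func textChecksum(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

// cachePath shows one possible layout for the persisted text: <cacheDir>/<checksum>.txt.
func cachePath(cacheDir, text string) string {
	return filepath.Join(cacheDir, textChecksum(text)+".txt")
}

func main() {
	sample := []byte(`{
		"Name": "harmony-reflect",
		"Version": "1.6.2",
		"Licenses": ["Apache-2.0", "MPL-1.1"],
		"ConcludedLicenses": [{
			"Name": "Apache-2.0",
			"Type": "header",
			"IsDeclared": false,
			"FilePath": "node_modules/harmony-reflect/reflect.js",
			"LicenseTextChecksum": "ebc83a1acf7dd9ea52c518980ecd82e2e0688f4aa1aa27cb3350a722d86c6380"
		}]
	}`)
	pkg, err := decodePackage(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, cl := range pkg.ConcludedLicenses {
		fmt.Printf("%s %s: %s found in %s\n", pkg.Name, pkg.Version, cl.Name, cl.FilePath)
	}
	fmt.Println(cachePath("license-texts", "example license text"))
}
```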

Checklist

  • I've read the guidelines for contributing to this repository.
  • I've followed the conventions in the PR title.
  • I've added tests that prove my fix is effective or that my feature works.
  • I've updated the documentation with the relevant information (if needed).
  • I've added usage information (if the PR introduces new options)
  • I've included a "before" and "after" example to the description (if the PR is a user interface change).

CLAassistant commented Aug 15, 2024

CLA assistant check
All committers have signed the CLA.

@hrithik-777 hrithik-777 force-pushed the hrithiky-deep-license-scanning branch from 6d83985 to ce2624c Compare August 15, 2024 14:34
@hrithik-777 hrithik-777 force-pushed the hrithiky-deep-license-scanning branch from ce2624c to 681a35f Compare August 16, 2024 05:55