fix(license): stop spliting a long license text #7336

Merged: 18 commits, Sep 5, 2024

Conversation

Contributor

@afdesk afdesk commented Aug 13, 2024

Description

When looking for licenses, Trivy tries to split the license information with a regex,
but in some cases the License field contains a long descriptive text.

This PR adds detection of a long license text and keeps it in a new field, LicenseText, as Dmitriy suggested.

The LicenseText field is available in the JSON format only. For the TABLE format, Trivy shows the CUSTOM License name instead of the long text.

For testing I use the following image:

$ trivy i -d --cache-backend memory --scanners license saisk026/conda-test:v1

Before:
image
image

After:
image
image

JSON output:
image

Related issues

Checklist

  • I've read the guidelines for contributing to this repository.
  • I've followed the conventions in the PR title.
  • I've added tests that prove my fix is effective or that my feature works.
  • I've updated the documentation with the relevant information (if needed).
  • I've added usage information (if the PR introduces new options)
  • I've included a "before" and "after" example to the description (if the PR is a user interface change).

@afdesk afdesk changed the title from "fix(license): handling unknown licenses" to "fix(license): combining unknown licenses" Aug 13, 2024
@afdesk afdesk marked this pull request as ready for review August 13, 2024 12:30
@afdesk
Contributor Author

afdesk commented Aug 13, 2024

@knqyf263 I think it's ready for review.
Please take a look when you have time. Thanks!

@knqyf263
Collaborator

Is there a way to distinguish between the license name and the license text?

@afdesk
Contributor Author

afdesk commented Aug 14, 2024

Is there a way to distinguish between the license name and the license text?

Right now, I'm not sure,
but I'll take a look and show the cases.

@knqyf263
Collaborator

I want to show "unknown" for the license text.

@afdesk
Contributor Author

afdesk commented Aug 14, 2024

My concern is the following.
Python deb packages can contain the license text inside the license field:

cat /usr/share/doc/python3.9-minimal/copyright

image

But if I understand correctly, that's a mistake.
The docs say: First line (synopsis): an abbreviated name for the license.

@afdesk
Contributor Author

afdesk commented Aug 14, 2024

My concern is that there is no reliable way to distinguish an incorrect synopsis such as "Permission to use, copy, modify, and distribute this software and its" from a correct license synopsis such as "TinySCHEME" or "permissive" in the gpgv package.

@afdesk
Contributor Author

afdesk commented Aug 14, 2024

The situation is the same with the License field inside *.dist-info/METADATA.

Some packages contain a correct license name (e.g. Pympler): Apache License, Version 2.0.
Some contain an undefined license (e.g. zope): ZPL 2.1.
Some packages contain the license text (e.g. menuinst):
image

@afdesk
Contributor Author

afdesk commented Aug 14, 2024

@knqyf263 I have an idea. trying

@afdesk
Contributor Author

afdesk commented Aug 14, 2024

I want to show "unknown" for the license text.

@knqyf263 Could you confirm that I understand this requirement correctly? Thanks
image

@knqyf263
Collaborator

I know it's not ideal, but what about checking the length and the number of newlines?

func isLicenseName(license string) bool {
	// Check text length
	if len(license) < 100 {
		return true
	}

	// Count newlines
	if strings.Count(license, "\n") > 3 {
		return false
	}
        ...

@knqyf263
Collaborator

I want to show "unknown" for the license text.

@knqyf263 Could you confirm that I understand this requirement correctly? Thanks image

I don't want to show the license text there as it's too long. I thought we would show "UNKNOWN", but we know a license text. We just don't know the short name. How about "Custom"? Then, the license text can be stored in another field.

@afdesk
Contributor Author

afdesk commented Aug 19, 2024

I know it's not ideal, but what about checking the length and the number of newlines?

func isLicenseName(license string) bool {
	// Check text length
	if len(license) < 100 {
		return true
	}

	// Count newlines
	if strings.Count(license, "\n") > 3 {
		return false
	}
        ...

@knqyf263 I tried this approach and you're right, it's not ideal.

Counting newlines doesn't affect the output, because Trivy reads only one line of the license from dpkg copyright files, and Python packages can also contain a long single-line license, so the newline check never flags anything.

As for the text length check: it works for long licenses in Python,
but dpkg licenses are still shown incorrectly.

I thought it was one long text, but it's actually the first lines of several licenses:
image

License: Redistribution and use in source and binary forms, with or without
License: By obtaining, using, and/or copying this software and/or its
License: Permission to use, copy, modify, and distribute this software and
License: Redistribution and use in source and binary forms, with or without
License: This software is provided 'as-is', without any express or implied
License: Permission to use, copy, modify, and distribute this software and
License: Permission  is  hereby granted,  free  of charge,  to  any person
License: This software is provided 'as-is', without any express or implied
License: Permission is hereby granted, free of charge, to any person obtaining
   under the terms of the GNU General Public License as published by the
   section entitled ``GNU General Public License''.
License: Permission to use, copy, modify, and distribute this software and its
License: Permission to use, copy, modify, and distribute this software and its
License: This software is provided 'as-is', without any express or implied
License: Permission is hereby granted, free of charge, to any person obtaining
License: Redistribution and use in source and binary forms, with or without
License: This software is provided as-is, without express or implied
License: Permission to use, copy, modify, and distribute this software for any
License: Permission to use, copy, modify, and distribute this software and its
License:
License:  * Permission to use this software in any way is granted without
License: Permission to use, copy, modify, and distribute this software and its

@afdesk
Contributor Author

afdesk commented Aug 19, 2024

Right now, I can't see a good solution, but there are several options:

  1. We just check the text length and don't change the output for dpkg licenses, because the mistake is on the Python package's side.

  2. We keep a list of known licenses and check every split part.
    If there is no match, we print the license text.

  3. We keep some keywords, and if we find any of them inside a string, it means the string is a license text (or part of one).
    For example: as-is, redistribution, permission, using, use, @, http, etc.
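
Option 3 could look roughly like the sketch below. This is only an illustration: the function name `looksLikeLicenseText` and the keyword list are assumptions (a subset of the examples above, with the very broad `use` left out), not the list Trivy ships.

```go
package main

import (
	"fmt"
	"strings"
)

// licenseTextKeywords is a hypothetical keyword list based on the examples
// in option 3; it is not Trivy's actual list.
var licenseTextKeywords = []string{
	"as-is",
	"redistribution",
	"permission",
	"using",
	"http",
	"@",
}

// looksLikeLicenseText reports whether s contains any keyword that is
// expected to appear in license texts but not in license names.
func looksLikeLicenseText(s string) bool {
	lower := strings.ToLower(s)
	for _, kw := range licenseTextKeywords {
		if strings.Contains(lower, kw) {
			return true
		}
	}
	return false
}

func main() {
	for _, s := range []string{
		"Permission to use, copy, modify, and distribute this software and its",
		"This software is provided 'as-is', without any express or implied",
		"GPL-1+ or Artistic or Artistic-dist",
		"Common Development and Distribution License 1.0 (CDDL-1.0)",
	} {
		fmt.Printf("%-5v %s\n", looksLikeLicenseText(s), s)
	}
}
```

The first two strings are reported as text and the last two as names. The trade-off is that the result depends entirely on how well the keyword list is chosen.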

@knqyf263 wdyt?

@afdesk
Contributor Author

afdesk commented Aug 19, 2024

I want to show "unknown" for the license text.
I don't want to show the license text there as it's too long. I thought we would show "UNKNOWN", but we know a license text. We just don't know the short name. How about "Custom"? Then, the license text can be stored in another field.

You mean we should print CUSTOM license instead of a long license text for table output, and keep another field (ex licenseText) for JSON, right?
IMHO, it makes sense.
I'd also add a few words from the license )

@DmitriyLewen
Contributor

i thought it's a long text, but actually it's a few first rows of several licenses:

What if we add one more check for copyright files:
if there is more than one License: * field, split each line:

  • if 2 or more licenses are found in a field, it is a license text
  • if only 1 license is found in a field, it is a license name

e.g.
line 1 is a text, line 2 is a license name:

License: Redistribution and use in source and binary forms, with or without
License: BSD-3-Clause

My logic is as follows:
if you use multiple licenses separated by and/or, you will use only one License: * field.
But if you use multiple License: * fields:

  • you will use one license for each License: * field.
    e.g.:
    License: MIT
    License: BSD-3-Clause
    
  • in other cases it is a license text.
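
The rule described above can be sketched as follows. The function name `isTextPerField` and the separator regex are assumptions made for illustration; the regex is a simplified stand-in for whatever Trivy actually uses to split license expressions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// separator is a simplified, assumed pattern that splits a value on
// "and", "or", and commas.
var separator = regexp.MustCompile(`,?\s+(?:and|or)\s+|,\s*`)

// isTextPerField applies the proposed rule to the values of the "License:"
// fields of one copyright file. It returns true for each value that looks
// like license text rather than a license name.
func isTextPerField(fields []string) []bool {
	result := make([]bool, len(fields))
	if len(fields) < 2 {
		// A single field may legitimately combine several names with and/or.
		return result
	}
	for i, field := range fields {
		parts := separator.Split(strings.TrimSpace(field), -1)
		// With several fields, each one is expected to hold exactly one name,
		// so a value that splits into two or more parts is treated as text.
		result[i] = len(parts) >= 2
	}
	return result
}

func main() {
	fields := []string{
		"Redistribution and use in source and binary forms, with or without",
		"BSD-3-Clause",
	}
	fmt.Println(isTextPerField(fields)) // [true false]

	single := []string{"GPL-1+ or Artistic or Artistic-dist"}
	fmt.Println(isTextPerField(single)) // [false]
}
```

A file with a single text-only `License:` field is not covered by this rule, so it would still need a separate check.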

@afdesk
Contributor Author

afdesk commented Aug 19, 2024

@DmitriyLewen that's an interesting idea.

There are the following cases for the perl and python packages:

perl:

License: GPL-1+ or Artistic or Artistic-dist

python3.9:

License: This software is provided 'as-is', without any express or implied

I'm not sure we can separate these cases, but maybe if we also check the string length...

A long string that splits into several licenses is a text.
What is long? I think about 50 chars.
wdyt?

@DmitriyLewen
Contributor

but maybe if we also check the string length...

Yeah, that's what I thought:
if there's only one field, we check the length and the number of lines.

I think about 50 chars.

What if we use 30 characters + no saved licenses found?

@knqyf263
Collaborator

You mean we should print CUSTOM license instead of a long license text for table output, and keep another field (ex licenseText) for JSON, right?

Exactly

@afdesk afdesk marked this pull request as draft August 23, 2024 03:46
@afdesk
Contributor Author

afdesk commented Aug 23, 2024

@knqyf263 @DmitriyLewen
I tried several ways to separate a long license text from a license name, and the best result for me is detection by keywords.

I selected a few obvious words that can appear only inside license texts.

Please, take a look at this suggestion when you have free time. thanks!

@afdesk afdesk marked this pull request as ready for review August 23, 2024 04:39
Contributor

@DmitriyLewen DmitriyLewen left a comment

@afdesk I left comments. Take a look, please.

I tried several ways to separate a long license text from a license name, and the best result for me is detection by keywords.

Can you give examples of problems for other methods to save time if we need to come back to this question?

Severity: dbTypes.SeverityUnknown.String(),
Category: ftypes.CategoryUnknown,
PkgName: pkg.Name,
Name: "CUSTOM License",
Contributor

We need to add info about when we use CUSTOM License to the docs and to the license name/log message.

Contributor

We might want to add the first few words of the license text to the name.

Contributor Author

We might want to add the first few words of the license text to the name.

if @knqyf263 agrees too, I'll add it.
I thought about filtering by CUSTOM License, but maybe it doesn't matter

Contributor

I thought about CUSTOM License again. This may not be clear to users.
What if we use the name Incomparable License/Unmatched License?
This would mean that we can't match the license text against the licenses known to Trivy and therefore store the license as text.
cc. @knqyf263

Collaborator

I saw some places calling them "custom licenses", so I don't think it's that bad.

Custom licenses are strongly disfavored and should be used only in extraordinary circumstances. An Agency ought to adopt a license approved by the Open Source Initiative (OSI) at opensource.org.

https://opensource.org/authority

If you are using a license that hasn't been assigned an SPDX identifier, or if you are using a custom license, use a string value like this one:

https://docs.npmjs.com/cli/v10/configuring-npm/package-json#license

Alternatively, you can use a LicenseRef- custom license identifier to refer to a license that is not on the SPDX License List, such as the following:

https://spdx.github.io/spdx-spec/v2.3/using-SPDX-short-identifiers-in-source-files/

Or "non-standard" or something like that.

With Incomparable License/Unmatched License, it is also unclear what the license did not match.

pkg/types/license.go (outdated)
func SplitLicenses(str string) []string {
	if str == "" {
		return nil
	}
	if isLicenseText(strings.ToLower(str)) {
		return []string{
			"text://" + str,
Contributor

We use file:// prefix in Python.
What if we create constants for these prefixes?

Contributor Author

done
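
For reference, the constants suggested in this thread might look something like the sketch below. The names `licenseTextPrefix`, `licenseFilePrefix`, `nameSeparator`, and `splitLicenses` are assumptions for illustration, and the text detector is passed in as a parameter so the sketch stays independent of the real `isLicenseText`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical constants replacing the string literals used as prefixes.
const (
	licenseTextPrefix = "text://" // the value is the license text itself
	licenseFilePrefix = "file://" // the value is a path to a license file
)

// nameSeparator is a simplified, assumed pattern for splitting license names.
var nameSeparator = regexp.MustCompile(`,?\s+(?:and|or)\s+|,\s*`)

// splitLicenses returns the license names found in str. If isText reports
// that str is a license text, the whole string is kept as a single entry
// marked with licenseTextPrefix instead of being split.
func splitLicenses(str string, isText func(string) bool) []string {
	if str == "" {
		return nil
	}
	if isText(str) {
		return []string{licenseTextPrefix + str}
	}
	return nameSeparator.Split(str, -1)
}

func main() {
	// A toy detector used only for this example.
	hasHereby := func(s string) bool {
		return strings.Contains(strings.ToLower(s), "hereby")
	}
	fmt.Println(splitLicenses("MIT or BSD-3-Clause", hasHereby))
	fmt.Println(splitLicenses("Permission is hereby granted, free of charge", hasHereby))
}
```

With shared constants, the code that later strips the prefix cannot drift from the code that adds it.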

@afdesk
Contributor Author

afdesk commented Aug 23, 2024

I tried several ways to separate a long license text from a license name, and the best result for me is detection by keywords.

Can you give examples of problems for other methods to save time if we need to come back to this question?

The main problem is separating a license name from a license text.
Some developers put the full license text inside the license field, where a license name (or a license ID) should be.

Checking the length doesn't work because there are valid long license names (one is already added to the test cases):

Common Development and Distribution License 1.0 (CDDL-1.0)

The number of newlines doesn't work either, because we read only one line from copyright or license files.

The number of spaces doesn't work, because some valid license names are long too (e.g. CDDL-1.0).
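
To make the length problem concrete, here is a small sketch using the thresholds mentioned earlier in this thread (30, 50, and 100 characters). The helper `isNameByLength` is hypothetical.

```go
package main

import "fmt"

// isNameByLength is a hypothetical length-only check: anything shorter
// than limit is treated as a license name.
func isNameByLength(license string, limit int) bool {
	return len(license) < limit
}

func main() {
	// A valid license name that is longer than 50 characters.
	name := "Common Development and Distribution License 1.0 (CDDL-1.0)"
	// The first line of a license text that is shorter than 100 characters.
	text := "Permission to use, copy, modify, and distribute this software and its"

	for _, limit := range []int{30, 50, 100} {
		fmt.Printf("limit=%d name accepted=%v text rejected=%v\n",
			limit, isNameByLength(name, limit), !isNameByLength(text, limit))
	}
}
```

None of the three thresholds both accepts the valid name and rejects the truncated text.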

saisk026/conda-test:v1 is a good image for license testing.

@afdesk afdesk changed the title from "fix(license): combining unknown licenses" to "fix(license): stop spliting a long license text" Aug 26, 2024
Contributor

@DmitriyLewen DmitriyLewen left a comment

LGTM

@knqyf263 wdyt about this way?

@@ -1,6 +1,6 @@
 // Code generated by protoc-gen-go. DO NOT EDIT.
 // versions:
-// protoc-gen-go v1.27.1
+// protoc-gen-go v1.34.0
Contributor

looks like we can skip this change

Contributor Author

This change brings the protoc-gen-go versions for cache/ and common to the same version. It was generated automatically.

Contributor Author

I removed this update from this PR.

@knqyf263
Collaborator

knqyf263 commented Sep 3, 2024

I tried several ways to separate a long license text from a license name, and the best result for me is detection by keywords.

OK, we don't have an easy way. Let's see how it goes.

@afdesk
Contributor Author

afdesk commented Sep 3, 2024

I tried several ways to separate a long license text from a license name, and the best result for me is detection by keywords.

OK, we don't have an easy way. Let's see how it goes.

Should I fix anything to get this PR into 0.55?

@knqyf263
Collaborator

knqyf263 commented Sep 3, 2024

Should I fix anything to get this PR into 0.55?

I'm reviewing the changes now. I'll update you soon.
If we don't make it for v0.55.0, we can include it in v0.55.1.

Signed-off-by: knqyf263 <[email protected]>
Collaborator

@knqyf263 knqyf263 left a comment

I refactored but realized there was no test. I'm not sure my changes work as expected.
@afdesk Could you add a test for license texts?

@@ -22,6 +22,9 @@ type DetectedLicense struct {
	// Name holds a detected license name
	Name string

	// LicenseText holds a long license text if Trivy detects a license name as a license text
	LicenseText string
Collaborator

nit: Since this is "DetectedLicense", Text seems sufficient.

Suggested change
- LicenseText string
+ Text string

Contributor Author

done
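
After the rename, a consumer of the prefixed value could fill the two fields roughly as sketched below. This is an illustration only: the struct is reduced to two fields, and the `omitempty` tag, the constant names, and the helper `toDetectedLicense` are assumptions, not the code in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// DetectedLicense is a reduced version of the struct from the diff above,
// using the field name Text from the suggested change.
type DetectedLicense struct {
	Name string
	Text string `json:",omitempty"` // assumed tag: omit the field when there is no text
}

const (
	textPrefix        = "text://"        // assumed constant for the prefix
	customLicenseName = "CUSTOM License" // the name shown instead of a long text
)

// toDetectedLicense maps one split value to a DetectedLicense. Values marked
// with the text prefix get the custom name and keep the text separately.
func toDetectedLicense(value string) DetectedLicense {
	if strings.HasPrefix(value, textPrefix) {
		return DetectedLicense{
			Name: customLicenseName,
			Text: strings.TrimPrefix(value, textPrefix),
		}
	}
	return DetectedLicense{Name: value}
}

func main() {
	values := []string{"MIT", "text://Permission is hereby granted, free of charge"}
	for _, v := range values {
		out, err := json.Marshal(toDetectedLicense(v))
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

A table renderer would print only `Name`, while JSON output carries both fields, which matches the behaviour described in the PR description.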

string name = 5;
float confidence = 6;
string link = 7;
string license_text = 8;
Collaborator

ditto

Contributor Author

done

@afdesk
Contributor Author

afdesk commented Sep 4, 2024

but realized there was no test. I'm not sure my changes work as expected. @afdesk Could you add a test for license texts?

yeh, sure. the test is added

@afdesk afdesk requested a review from knqyf263 September 4, 2024 10:48
@knqyf263 knqyf263 added this pull request to the merge queue Sep 5, 2024
Merged via the queue into aquasecurity:main with commit 4926da7 Sep 5, 2024
12 checks passed
Development

Successfully merging this pull request may close these issues.

don't split licenses from License field from python packaging
3 participants