Add support for package url fields #7

Open
wants to merge 12 commits into
base: master

sjn

@sjn sjn commented May 13, 2023

Implements #6

sjn added 2 commits May 13, 2023 20:29
Package URLs are useful for referring to (and being referred to by)
packages in other ecosystem namespaces. With this, we can introduce
a packaging/ecosystem-agnostic way to refer to CPAN packages.
@stigtsp

stigtsp commented May 13, 2023

Is the ext=tar.gz parameter useful? I'm wondering if there are cases where a distribution is packaged with multiple extensions, or if Author-Distribution-Version identifies one downloadable artifact.

@sjn
Author

sjn commented May 14, 2023

> Is the ext=tar.gz parameter useful?

I added it so that clients/tooling in other package ecosystems (without knowledge of the inner workings of CPAN) can put together a valid download URL from the supplied package URL.

@sjn
Author

sjn commented Aug 2, 2023

In this commit, I still get a few failing tests:

$ make test
Skip blib/lib/CPAN/DistnameInfo.pm (unchanged)
PERL_DL_NONLAZY=1 "/usr/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/ext.t ... ok       
t/path.t .. 1/299 
    #   Failed test 'hash matches'
    #   at t/path.t line 21.
    #     Structures begin differing at:
    #          $got->{pkgurl} = 'pkg:cpan/LDS/[email protected]?ext=tar.gz'
    #     $expected->{pkgurl} = 'pkg:cpan/LDS/[email protected]?ext=tar.gz'
    # Looks like you failed 1 test of 10.

#   Failed test 'CGI.pm-2.34.tar.gz'
#   at t/path.t line 24.

    #   Failed test 'hash matches'
    #   at t/path.t line 21.
    #     Structures begin differing at:
    #          $got->{pkgurl} = 'pkg:cpan/RJBS/[email protected]?ext=tar.gz'
    #     $expected->{pkgurl} = 'pkg:cpan/RJBS/[email protected]?ext=tar.gz'
    # Looks like you failed 1 test of 10.

#   Failed test 'Dist-Zilla-2.100860-TRIAL.tar.gz'
#   at t/path.t line 24.

    #   Failed test 'hash matches'
    #   at t/path.t line 21.
    #     Structures begin differing at:
    #          $got->{pkgurl} = 'pkg:cpan/MINGYILIU/[email protected]?ext=tar.gz'
    #     $expected->{pkgurl} = 'pkg:cpan/MINGYILIU/[email protected]?ext=tar.gz'
    # Looks like you failed 1 test of 10.

#   Failed test 'Bio-ASN1-EntrezGene-1.10-withoutworldwriteables.tar.gz'
#   at t/path.t line 24.
# Looks like you planned 299 tests but ran 30.
# Looks like you failed 3 tests of 30 run.
t/path.t .. Dubious, test returned 3 (wstat 768, 0x300)
Failed 272/299 subtests 

Test Summary Report
-------------------
t/path.t (Wstat: 768 (exited 3) Tests: 30 Failed: 3)
  Failed tests:  26, 29-30
  Non-zero exit status: 3
  Parse errors: Bad plan.  You planned 299 tests but ran 30.
Files=2, Tests=593,  1 wallclock secs ( 0.07 usr  0.01 sys +  0.23 cusr  0.04 csys =  0.35 CPU)
Result: FAIL
Failed 1/2 test programs. 3/593 subtests failed.
make: *** [Makefile:826: test_dynamic] Error 3

@stigtsp

stigtsp commented Aug 2, 2023

> > Is the ext=tar.gz parameter useful?
>
> I added it so that clients/tooling in other package ecosystems (without knowledge of the inner workings of CPAN) can put together a valid download URL from the supplied package URL.

I think there are other concerns as well for generating URLs to artifacts directly from purls, like:

  • Dist file names that have a v prefix or not in front of the version number
  • Mapping AUTHOR/ to A/AU/AUTHOR for the URL
  • Some authors have modules in sub directories
  • Other cases that are indexed but can't be directly derived from the purl

If AUTHOR, DISTRIBUTION, VERSION map to one unique downloadable artifact, then lookups could be done against some index like 02packages or MetaCPAN that knows the correct artifact URL.

Some examples that might need to be considered:

pkg:cpan/ANDK/CPAN@2.36 -> https://cpan.org/authors/id/A/AN/ANDK/CPAN-2.36.tar.gz

pkg:cpan/ANDK/[email protected] -> https://cpan.org/authors/id/A/AN/ANDK/CPAN-2.36-TRIAL.tar.gz

pkg:cpan/ILYAZ/Term-Gnuplot@0.90380906 -> https://cpan.org/authors/id/I/IL/ILYAZ/modules/Term-Gnuplot-0.90380906.zip (note the /modules subdirectory)

pkg:cpan/SANDEEPV/GuiBuilder@0.03 -> https://cpan.org/authors/id/S/SA/SANDEEPV/GuiBuilder_v0_3.zip (note v0_3 in the filename)
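
To make the concerns above concrete, here is a minimal sketch (in Python, purely for illustration) of deriving a download URL from the purl fields alone. The helper names `author_dir` and `naive_download_url` are hypothetical and not part of any CPAN tooling. The naive guess works for the plain CPAN-2.36.tar.gz case but misses the /modules subdirectory in the Term-Gnuplot example.

```python
def author_dir(pause_id: str) -> str:
    """Map a PAUSE ID such as 'ANDK' to the 'authors/id/A/AN/ANDK' layout."""
    pause_id = pause_id.upper()
    return f"authors/id/{pause_id[0]}/{pause_id[:2]}/{pause_id}"


def naive_download_url(author, dist, version, ext="tar.gz", base="https://cpan.org"):
    """Guess a download URL from purl fields only (hypothetical helper).

    Assumes the file is named '<dist>-<version>.<ext>' and sits directly in
    the author's directory, which is exactly what the odd cases violate.
    """
    return f"{base}/{author_dir(author)}/{dist}-{version}.{ext}"


# Plain case: the guess matches the real location.
print(naive_download_url("ANDK", "CPAN", "2.36"))

# Subdirectory case: the guess lacks the '/modules' path segment.
print(naive_download_url("ILYAZ", "Term-Gnuplot", "0.90380906", ext="zip"))
```

An index lookup is needed precisely for the cases where the second kind of guess goes wrong.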

@stigtsp

stigtsp commented Aug 2, 2023

I'd also love support for checksum qualifiers identifying the artifact a purl resolves to:

Like: pkg:cpan/ANDK/[email protected]?checksum=sha256:1d72a5eb40e588e3c10...

https://github.com/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#known-qualifiers-keyvalue-pairs
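
A rough sketch of what a consumer could do with such a qualifier, assuming the purl spec's format of a comma-separated list of `algorithm:hex` entries. The function name `checksum_matches` is made up for this illustration, and Python is used only to keep the example self-contained.

```python
import hashlib
from urllib.parse import parse_qs


def checksum_matches(purl: str, data: bytes) -> bool:
    """Check artifact bytes against every entry in the purl's checksum qualifier."""
    without_subpath = purl.split("#", 1)[0]
    _, _, query = without_subpath.partition("?")
    values = parse_qs(query).get("checksum", [])
    if not values:
        raise ValueError("purl has no checksum qualifier")
    for entry in values[0].split(","):
        algo, _, expected = entry.partition(":")
        # hashlib.new raises ValueError for algorithm names it does not know.
        actual = hashlib.new(algo, data).hexdigest()
        if actual != expected.lower():
            return False
    return True


# Usage with made-up artifact bytes; a real client would read the downloaded file.
artifact = b"example tarball bytes"
purl = "pkg:cpan/ANDK/CPAN@2.36?checksum=sha256:" + hashlib.sha256(artifact).hexdigest()
print(checksum_matches(purl, artifact))
```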

@sjn
Author

sjn commented Aug 2, 2023

CPAN::DistnameInfo is (afaik) only a producer of PURLs, and hence is limited to the fields it can extract from a module's full path on CPAN (e.g. authors/id/G/GB/GBARR/CPAN-DistnameInfo-0.02.tar.gz).

A discussion around a PURL checksum field probably belongs in the https://github.com/giterlizzi/perl-URI-PackageURL issue tracker?

@sjn
Author

sjn commented Aug 2, 2023

> If AUTHOR, DISTRIBUTION, VERSION map to one unique downloadable artifact, then lookups could be done against some index like 02packages or MetaCPAN that knows the correct artifact URL.

I don't think fetching things online is an option available for this module. As I see it, the purpose of this module is to pick out all necessary bits of the distro path so these can be used for something else. My adding a purl method is mostly for convenience, so that software using this module can easily start using PURLs in its own output.

In this regard, I'm thinking our task is to make sure we can extract everything we can from the provided path (however weird it is), and then provide the necessary fields so we can reproduce it (with the purl method conveniently doing this in a standardized manner).
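
As an illustration of the "path in, fields out, purl as a convenience" idea, here is a simplified sketch in Python. It is not the module's actual regex, which handles far more odd cases; the pattern and the name `path_to_purl` are assumptions made for this example. It omits the ext qualifier when the suffix is tar.gz, as a later commit in this PR does.

```python
import re

# Deliberately simplified pattern (an assumption for this sketch): an optional
# 'authors/id/X/XY/' prefix, the author ID, then '<dist>-<version>.<ext>'
# directly inside the author's directory.
_PATH_RE = re.compile(
    r"^(?:authors/id/)?(?:[A-Z]/[A-Z]{2}/)?(?P<author>[A-Z][A-Z0-9-]*)/"
    r"(?P<dist>[^/]+?)-(?P<version>v?\d[^/]*?)\.(?P<ext>tar\.gz|tgz|tar\.bz2|zip)$"
)


def path_to_purl(path):
    """Return a pkg:cpan purl for a distro path, or None if the path does not fit."""
    m = _PATH_RE.match(path)
    if m is None:
        return None
    purl = f"pkg:cpan/{m['author']}/{m['dist']}@{m['version']}"
    if m["ext"] != "tar.gz":
        purl += f"?ext={m['ext']}"
    return purl


print(path_to_purl("authors/id/G/GB/GBARR/CPAN-DistnameInfo-0.02.tar.gz"))
# Paths with an extra subdirectory do not fit this simplified pattern.
print(path_to_purl("authors/id/I/IL/ILYAZ/modules/Term-Gnuplot-0.90380906.zip"))
```

The second call returns None, which is the kind of gap the odd paths discussed in this thread expose in any purely pattern-based approach.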

> Some examples that might need to be considered:
>
> pkg:cpan/ANDK/CPAN@2.36 -> https://cpan.org/authors/id/A/AN/ANDK/CPAN-2.36.tar.gz

The full URL cannot be created from the information available to CPAN::DistnameInfo as it is. The best we can do is produce the path (which can later be consumed by e.g. cpan(1)) and let that consumer pick a server.

The spec itself needs to support a hostname though, but that's not a conversation for this module.

Other than the server part, I think your CPAN-2.36.tar.gz example above should be supported fine.

> pkg:cpan/ILYAZ/Term-Gnuplot@0.90380906 -> https://cpan.org/authors/id/I/IL/ILYAZ/modules/Term-Gnuplot-0.90380906.zip (note the /modules subdirectory)

This isn't tested for, it seems. I'll add one and see what happens. I don't think the /modules subdir is supported (though I might have missed something - the regex picking this apart is pretty gnarly).

> pkg:cpan/SANDEEPV/GuiBuilder@0.03 -> https://cpan.org/authors/id/S/SA/SANDEEPV/GuiBuilder_v0_3.zip (note v0_3 in the filename)

This isn't tested in t/path.t, but it looks like it's supported in the code. I'll add a test to check. I think this isn't a path pattern that can be supported without some ugly hacking. 😞

(Strictly speaking, I'd argue it's time for crazy filenames to be stopped in PAUSE 😠 )

@stigtsp

stigtsp commented Aug 4, 2023

> (Strictly speaking, I'd argue it's time for crazy filenames to be stopped in PAUSE 😠 )

... or, maybe CPAN or MetaCPAN could serve assets directly from URLs that are more friendly to PURLs. This rests on the assumption that the AUTHOR, DISTRO, VERSION triple always resolves to the same artifact, of course.

A benefit of this would be that a URL to the asset could be derived directly from the purl.

I'm imagining something like:

pkg:cpan/SANDEEPV/GuiBuilder@0.03 -> https://cpan.org/SANDEEPV/GuiBuilder/0.03

Which could return the artifact directly like:

200 OK
Content-Disposition: attachment; filename="GuiBuilder_v0_3.zip"
Content-Type: application/zip

...data...

I'm thinking this could be possible to implement with some map in nginx, for instance.
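
The lookup behind such a URL could be sketched roughly as follows. This is Python standing in for whatever map the server would use; `ARTIFACT_INDEX`, `CONTENT_TYPES` and `resolve` are hypothetical names, and the single index entry is just the example above.

```python
# Hypothetical index from (AUTHOR, DISTRO, VERSION) to the real file path.
# A real service would build this from an index such as 02packages or MetaCPAN.
ARTIFACT_INDEX = {
    ("SANDEEPV", "GuiBuilder", "0.03"): "authors/id/S/SA/SANDEEPV/GuiBuilder_v0_3.zip",
}

# Small hard-coded suffix table, enough for this sketch.
CONTENT_TYPES = {
    ".zip": "application/zip",
    ".tar.gz": "application/gzip",
    ".tgz": "application/gzip",
}


def resolve(request_path):
    """Return (status, headers, artifact_path) for '/AUTHOR/DISTRO/VERSION'."""
    parts = tuple(request_path.strip("/").split("/"))
    artifact = ARTIFACT_INDEX.get(parts) if len(parts) == 3 else None
    if artifact is None:
        return 404, {}, None
    filename = artifact.rsplit("/", 1)[-1]
    content_type = next(
        (ctype for suffix, ctype in CONTENT_TYPES.items() if filename.endswith(suffix)),
        "application/octet-stream",
    )
    headers = {
        "Content-Disposition": f'attachment; filename="{filename}"',
        "Content-Type": content_type,
    }
    return 200, headers, artifact


print(resolve("/SANDEEPV/GuiBuilder/0.03"))
```

This only works if the triple really does identify exactly one artifact, which is the assumption stated above.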

@haarg
Member

haarg commented Aug 4, 2023

For www.cpan.org, that would involve a lot of work. It is currently a fully static site, and supporting URLs like that would require some kind of index lookup. Doing a lookup is further complicated by the fact that PAUSE doesn't store any data connected to a distribution.

It's something MetaCPAN could provide though.

@sjn
Author

sjn commented Aug 11, 2023

The main point of package URLs (as far as I understand) is to make it possible to refer to packages in one ecosystem from another. How these URLs are resolved is entirely up to that package system's tooling.

e.g. pkg:cpan/SANDEEPV/GuiBuilder@0.03 would ideally be passed along to something that is capable of interacting with CPAN correctly, leaving it to that program (e.g. cpan, cpanm, cpanp, cpm) to install or download or verify or whatever.

I guess these are capable of downloading 02packages and doing whatever is necessary to figure out which release the pkg URL refers to.

If we keep a strict separation of concerns in mind, then I suggest our task here to be this:

CPAN Distro -> PackageURL

  1. When producing a PackageURL for consumption by a CPAN client, the produced PURL MUST contain all information necessary for a CPAN client to correctly disambiguate and identify a package in any likely scenario, including when interacting with plain filesystem mirrors, App::opan, CPAN::Mini, Pinto, MetaCPAN, legacy CPAN mirrors or whatever else the CPAN client supports.
    1. CPAN protocol (http:, https:, ftp:, ftps:, file:)
    2. CPAN server hostname or IP address
    3. CPAN ID of the publisher, including App::opan's custom MY
    4. Publisher subpath (e.g. ILYAZ/modules/Term-Gnuplot-0.90380906.zip)
    5. Distribution name, both original and normalized (e.g. both CPAN and CPAN.pm)
    6. Distro release version
    7. Distro release version variation (e.g. both MINGYILIU/Bio-ASN1-EntrezGene-1.10-withoutworldwriteables.tar.gz and MINGYILIU/Bio-ASN1-EntrezGene-1.10.tar.gz)
    8. Distro file suffix (e.g. .tar.gz, .tgz, .zip or whatever)
  2. Package publisher ID, name and version SHOULD be normalized (note: unsure about this?)
  3. The URL MUST offer enough information for the CPAN client to correctly find the package on BackPAN, CPAN, company-internal DarkPAN, and filesystem mirrors that the client supports.
  4. When fields are not specified, we MUST leave it to the CPAN client to disambiguate, and expect it to report clearly the URL it produced based on the package URL provided.
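
One way to picture the field list above is as a record where every field is optional and the client is told which ones it has to disambiguate itself (point 4). This is only a sketch in Python; the class and field names are invented for this example and are not an existing API.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class CpanDistRef:
    """Hypothetical record mirroring items 1.1 to 1.8 of the list above."""
    protocol: Optional[str] = None               # 1.1 http, https, ftp, ftps, file
    host: Optional[str] = None                   # 1.2 server hostname or IP address
    cpan_id: Optional[str] = None                # 1.3 publisher's CPAN ID
    subpath: Optional[str] = None                # 1.4 e.g. 'modules'
    dist_name: Optional[str] = None              # 1.5 original name, e.g. 'CPAN.pm'
    dist_name_normalized: Optional[str] = None   # 1.5 normalized name, e.g. 'CPAN'
    version: Optional[str] = None                # 1.6 release version
    version_variation: Optional[str] = None      # 1.7 e.g. 'withoutworldwriteables'
    suffix: Optional[str] = None                 # 1.8 e.g. 'tar.gz'

    def unspecified(self) -> List[str]:
        """Names of fields left for the CPAN client to disambiguate."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


ref = CpanDistRef(
    cpan_id="MINGYILIU",
    dist_name="Bio-ASN1-EntrezGene",
    dist_name_normalized="Bio-ASN1-EntrezGene",
    version="1.10",
    version_variation="withoutworldwriteables",
    suffix="tar.gz",
)
print(ref.unspecified())
```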

To me, it seems the "hard" bits are what to do with the crazy stuff in 1.4 and 1.7, and I guess this can be resolved by taking a quick look at how the clients do their disambiguation and then adding some exceptions to the CPAN::DistnameInfo code. All the problematic distros seem to be in BackPAN, so I guess we can safely assume no more funky filenames are likely to be uploaded to CPAN?

This leaves only the situation where a dev produces crazy filenames on their own DarkPAN mirror... I'm thinking this can be somewhat avoided by adding some sanity checks to whatever tooling manages these?

Also, I'm starting to think some of this is relevant to package-url/purl-spec#155 and maybe giterlizzi/perl-URI-PackageURL#2 ?

PackageURL -> CPAN Distro

Finally, there's the question about verification that a given PackageURL actually identifies the correct CPAN distro. This is especially important when producing SBOM objects where identifying the source of the software is critical to do correctly.

This is probably not a task for CPAN::DistnameInfo though.

@giterlizzi

This is a very interesting discussion that cuts across different areas such as packaging, SBOM, security, etc.

For URI::PackageURL in the latest version I added initial support to derive the repository and package download URLs based on the information in the "purl" string.

Example:

purl-tool pkg:cpan/GDT/URI-PackageURL@2.00?checksum=sha1:5959ce5ce45c86f7c04539cbbd46e2084487632e --env
PURL="pkg:cpan/GDT/URI-PackageURL@2.00?checksum=sha1:5959ce5ce45c86f7c04539cbbd46e2084487632e"
PURL_TYPE="cpan"
PURL_NAMESPACE="GDT"
PURL_NAME="URI-PackageURL"
PURL_VERSION="2.00"
PURL_SUBPATH=""
PURL_QUALIFIERS="checksum"
PURL_QUALIFIER_checksum="sha1:5959ce5ce45c86f7c04539cbbd46e2084487632e"
PURL_DOWNLOAD_URL="http://www.cpan.org/authors/id/G/GD/GDT/URI-PackageURL-2.00.tar.gz"
PURL_REPOSITORY_URL="https://metacpan.org/release/GDT/URI-PackageURL-2.00"

Using the --download-url option you can use it in combination with cURL, wget and also with cpanm:

cpanm $(purl-tool pkg:cpan/GDT/URI-PackageURL@2.00 --download-url)

For CPAN, the only supported qualifier so far is ext (default is tar.gz), but I can work on adding other "known" PURL qualifiers such as repository_url (1.2) and download_url (which will take priority over everything).
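
The precedence described here could look roughly like this. It is a Python sketch of the idea only, not how purl-tool or URI::PackageURL is implemented; the function name `download_url` and the hard-coded base URL (taken from the PURL_DOWNLOAD_URL output above) are assumptions for the illustration.

```python
from urllib.parse import parse_qs

# Base taken from the PURL_DOWNLOAD_URL line in the example output above.
CPAN_BASE = "http://www.cpan.org/authors/id"


def download_url(purl: str) -> str:
    """Resolve a pkg:cpan purl to a download URL (illustrative sketch)."""
    without_subpath = purl.split("#", 1)[0]
    body, _, query = without_subpath.partition("?")
    qualifiers = {key: values[0] for key, values in parse_qs(query).items()}

    # An explicit download_url qualifier takes priority over everything else.
    if "download_url" in qualifiers:
        return qualifiers["download_url"]

    type_part, author, rest = body.split("/", 2)
    if type_part != "pkg:cpan":
        raise ValueError("not a pkg:cpan purl")
    name, _, version = rest.partition("@")
    ext = qualifiers.get("ext", "tar.gz")  # default extension when ext is absent
    return f"{CPAN_BASE}/{author[0]}/{author[:2]}/{author}/{name}-{version}.{ext}"


print(download_url("pkg:cpan/GDT/URI-PackageURL@2.00"))
```

With no qualifiers, this reproduces the PURL_DOWNLOAD_URL value shown in the output above.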

@sjn
Author

sjn commented Aug 12, 2023

@giterlizzi good you're working on this! 😄

One thought – I'd like to suggest that we (the Perl Toolchain Gang + the CPAN Security WG + yourself, if you're up for it) make an effort to add purl support to as many of the CPAN clients we can.

Since SBOMs are here to stay due to the CRA and NIS2 coming to the EU in the coming year, and PURLs are a central component of these, I'm thinking we might as well make the necessary changes to make them a first-class citizen in the CPAN/Perl world. :-)

Would you be up for that? 😁

@sjn
Author

sjn commented Aug 12, 2023

> purl-tool pkg:cpan/GDT/URI-PackageURL@2.00?checksum=sha1:5959ce5ce45c86f7c04539cbbd46e2084487632e --env

Would you mind using sha256 checksums by default, btw? sha1 has been considered completely unsafe since 2017.

@demerphq
Member

Even better, use sha3. Sha256 is yet another MD-based narrow-pipe hash whose predecessors are all broken. Might as well switch to a wide-pipe hash function instead.

@giterlizzi

giterlizzi commented Aug 12, 2023

> One thought – I'd like to suggest that we (the Perl Toolchain Gang + the CPAN Security WG + yourself, if you're up for it) make an effort to add purl support to as many of the CPAN clients we can.

Totally agree ;)

- Add "fullversion" parameter
- Don't add ext field to the pkgurl unless it's different than tar.gz
- Add support for specifying t/path.t tests as TODO or SKIP
- Use fullversion to produce pkgurl versions
- Skip CGI.pm test, as it's only available on BackPAN
- Skip Bio-ASN1-EntrezGene-1.10-withoutworldwriteables test, as it's
  only available on BackPAN
@stigtsp
Copy link

stigtsp commented Sep 21, 2023

> Even better use sha3. [..]

Some reasons I can think of to use SHA2 (at the moment):

  • SHA3 is not included in core afaik, whereas SHA2 is, via Digest::SHA.
  • CHECKSUMS files contain SHA2 checksums
  • MetaCPAN /v1/release and /v1/download_url endpoints contain SHA2 checksums


CPAN/authors/id/M/MI/MINGYILIU/Bio-ASN1-EntrezGene-1.10-withoutworldwriteables.tar.gz
filename Bio-ASN1-EntrezGene-1.10-withoutworldwriteables.tar.gz
dist Bio-ASN1-EntrezGene
maturity released
distvname Bio-ASN1-EntrezGene-1.10-withoutworldwriteables
version 1.10
fullversion 1.10-withoutworldwritables


Typo in "fullversion" (1.10-withoutworldwritables --> 1.10-withoutworldwriteables)

5 participants