From 4c09ec264a26f6fb8f9ae2acc6aa19ad02015189 Mon Sep 17 00:00:00 2001
From: Donald Stufft
Date: Sun, 11 Jun 2023 21:57:14 -0400
Subject: [PATCH 1/4] PEP 716: Normalization of Project Names in Metadata and Filenames

---
 pep-0716.rst | 314 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 314 insertions(+)
 create mode 100644 pep-0716.rst

diff --git a/pep-0716.rst b/pep-0716.rst
new file mode 100644
index 00000000000..52346842000
--- /dev/null
+++ b/pep-0716.rst
@@ -0,0 +1,314 @@
+PEP: 716
+Title: Normalization of Project Names in Metadata and Filenames
+Author: Donald Stufft
+PEP-Delegate: Paul Moore
+Discussions-To:
+Status: Draft
+Type: Standards Track
+Topic: Packaging
+Content-Type: text/x-rst
+Created: 11-Jun-2023
+Post-History:
+
+
+Abstract
+========
+
+This PEP standardizes on where and when project names should and should not be
+normalized in the Packaging toolchain.
+
+
+Motivation
+==========
+
+Historically there was effectively little to no requirements on the valid values
+of names in the Packaging ecosystem. Projects that wanted to interpret those
+names had to cope with a wide range of values, and had to each implement their
+own normalization schemes to try and detect names that were the same, but
+"spelled" differently.
+
+Over the intervening years, various PEPs have ratcheted down various pieces of
+metadata such as version numbers (:pep:`440`), filenames in bdists (:pep:`427`),
+names in the Simple API (:pep:`503`), filenames in sdists (:pep:`625`).
+
+Unfortunately, a complex interaction between these various standards *and*
+changes made to the specifications without an associated PEP, have created a
+situation where the ecosystem is in an inconsistent and broken state with
+regards to normalization of names.
+
+The brokenness is currently around ``.``, but the underlying issue actually
+affects any unnormalized name that is being emitted.
+
+The path to getting to where we are today was roughly:
+
+1. :pep:`427` was accepted with two different requirements on what was a valid
+   filename. One requirement, specified in prose which if read strictly did not
+   actually make sense, and another requirement implemented in code that did.
+
+   Rather than normalization, :pep:`427` focused on what was *valid* and
+   provided a way to escape characters that were not valid in the filename but
+   were valid in the "source" metadata (Name and Version).
+
+   All tools at this time, implemented themselves using the second of those
+   requirements, and escaped as expected.
+2. :pep:`440` was accepted, which put strict requirements on what was a valid
+   version and defined a normalization procedure for valid but differently
+   specified versions.
+
+   This normalization used ``-``, which was an invalid value for :pep:`427` and
+   required escaping to ``_``, so :pep:`440` was extended to allow that as an
+   optional spelling of ``_``, which would normalize to ``-``.
+3. :pep:`503` was accepted, which specified a normalization of the project name
+   *when* querying the Simple API for a project.
+4. The spec for ``.dist-info`` required normalization ofthe name (using the
+   :pep:`503` rules) but did not specify a requirement on version. The :pep:`503`
+   normalization uses ``-``, but tools that locate ``.dist-info`` use the ``-``
+   character to split between name and version, so in practice nobody was
+   following this requirement.
+
+   Thus, the ``.dist-info`` spec was `updated `__,
+   without a PEP, to make the spec more closely align with common practice. The
+   result of that being that the spec states that name must be normalizes as
+   per :pep:`503` and versions must be normalized as per :pep:`440`, but
+   escaping ``-`` with ``_``.
+
+   This change recognized that there are many existing ``.dist-info`` directories
+   that are not normalized, and thus instructs tools to expect ``.dist-info``
+   directories with unnormalized values, but that all tools must write normalized
+   values going forward.
+5. It was `noted `__
+   that :pep:`427` required that the segments of the filename contain only
+   alphanumeric characters, ``_``, and ``.``, and that all other characters must
+   be escaped with ``_``. However, :pep:`440` allows the use of ``!`` and ``+``,
+   which meant that those characters got escaped to ``_``, which could then not
+   be parsed back into their original versions.
+6. As a result of that discussion, the Wheel specs were `updated `__
+   and then `updated again `__,
+   without a PEP, to require that versions were normalized using :pep:`440` and
+   then ``-`` was escaped with ``_``.
+
+   That change also required that runs of ``-_.`` should be replaced with ``_``
+   as well as lowercasing everything. It noted that it was equivalent to :pep:`503`
+   normalization followed by replacing ``-`` with ``_``.
+
+   It was `noted 9 months later `__, that
+   there wasn't much discussion on the change to name normalization, but that it
+   landed anyways.
+7. :pep:`621` was accepted, which provided a way to specify project metadata in
+   ``pyproject.toml``. This PEP was careful to make the distinction between
+   static metadata, where tools could trust the values in ``pyproject.toml`` and
+   dynamic metadata where they could not.
+
+   However, this PEP doesn't clarify whether static values must be identical
+   values or equivalent values. To make matters worse, it includes the statement
+   that tools should normalize the name, using :pep:`503` rules, as soon as it
+   is read for internal consistency.
+8. :pep:`625` was accepted, standardizing on a format for filenames for sdists.
+   This PEP requires that the project names are normalized "as described in the
+ wheel spec", which at the time meant full :pep:`503` normalization, and
+ versions normalized as per :pep:`440`.
+
+
+Independently to all of the above, and prior to (4), PyPI had implemented a
+check that ensured that the filename being uploaded matched the current project
+name. This check did not correctly take into account normalization, but did take
+into account filename escaping. It also implements renames by allowing projects
+to rename themselves by changing their project name in their metadata.
+
+The effect of all of the above, is that we're now in a situation where:
+
+* Some tools will normalize the filename before writing them, either to the
+  filesystem or to PyPI.
+* Some tools will normalize the project name before emitting them to either
+  ``METADATA`` or to PyPI.
+* Some tools (PyPI) require that the filename and the project name match, without
+  taking normalization into account.
+* Some tools (Artifactory) require that the filenames are not normalized.
+* The above sets of tools do not perfectly overlap in any direction.
+
+We've essentially created a mess where nobody is emitting filenames in quite the
+same way and the normalization rules, first defined in :pep:`503` are being used
+in contexts where it is not appropiate to do so.
+
+
+Rationale
+=========
+
+This PEP follows two guiding principals:
+
+1. Names are provided by people and should be used as is where possible. The
+   name of the project, and how it appears, is a fundamental property of the
+   project.
+2. When interpreting names, tooling should normalize values as much as
+   possible to reduce confusion.
+
+The follows the original intent behind the normalization in :pep:`503`, which
+was designed to be a normalization applied when two computers spoke to each
+other, not as something that would "leak" out into the human facing areas.
+
+
+Specification
+=============
+
+The project name that is specified by an author ends up flowing through several
+parts of the ecosystem, and each part needs to be considered on it's own what
+kind of name (normalized or not) makes sense in that part.
+
+In general, we follow the guiding principals, use the unnormalized name as
+provided by the author wherever possible, and normalize strictly where not.
+
+In some cases, we are simply repeating the status quo, this is done to provide
+clarification and to be explicit which uses were considered as part of this
+PEP.
+
+
+Core Metadata
+-------------
+
+The ``Name`` field **MUST NOT** be normalized when emitting into ``METADATA``
+or ``PKG-INFO``.
+
+The ``Name`` field **MUST NOT** be normalized when uploading to a repository.
+
+The ``Name`` field **SHOULD NOT** be normalized when being presented for display
+to a user.
+
+The ``Name`` field **MUST** be normalized during comparison.
+
+Tools that read the ``Name`` field from a core metadata file **MUST** be prepared
+to accept unnormalized names.
+
+
+pyproject.toml
+--------------
+
+The ``project.name`` key **MUST** be preserved exactly as the author chose to
+represent it, and **MUST** be emitted in this way into ``METADATA`` or
+``PKG-INFO``.
+
+The ``project.name`` field **MUST** be normalized during comparison.
+
+
+.dist-info directories
+----------------------
+
+The directory name follows the pattern of ``{name}-{version}.dist-info``.
+
+The ``name`` field **MUST** be normalized, with any resulting ``-`` escaped to ``_``.
+
+Tools that read an arbitrary ``.dist-info`` directory **MUST** be prepared to
+accept unnormalized values, however tools that work only on *new* ``.dist-info``
+directories **SHOULD** validate that all values are normalized.
+
+
+Source and Binary Distributions
+-------------------------------
+
+Both the sdist and bdist specifications incorporate the project name in their
+filenames (``{name}-{version}.tar.gz`` and
+``{distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl``
+respectively).
+
+The ``name`` field **MUST** be non-normalized, with the exception that any ``-``
+**MUST** be escaped to be ``_``.
+
+Tools that accept an arbitrary distribution **MUST** be prepared to accept both
+non-normalized and normalized filenames. However, tools that only work on *new*
+distributions **SHOULD** validate that the distribution filenames are not
+normalizing ``name``.
+
+
+Simple Repository API
+---------------------
+
+The project name, when returned in the "index" URL (e.g. ``/simple/``)
+**MUST** be non-normalized.
+
+The project name when used in the URL (e.g. ``/simple/$project/``) **MUST** be
+normalized.
+
+The project name, when used on the Project detail page
+(e.g. ``/simple/$project/``), **MUST** be non-normalized.
+
+Tools that read values for filenames and names from the Simple Repository API
+**MUST** be prepared to handle both normalized and non-normalized names.
+
+
+Backwards Compatibility
+=======================
+
+This PEP breaks compatibility in a few ways:
+
+* Tools that are currently emiting filenames where ``name`` has been normalized
+  in accordance with the current spec are immediately no longer compliant and
+  must be updated to emit non-normalized names.
+  * This is mitigated by the fact that all tools are required to continue to
+    accept both normalized and non-normalized filenames unless they *know* that
+    they only work on *new* distributions (PyPI uploads, ``pyproject-build``, etc).
+* Tools that emit normalized names into ``METADATA``, ``PKG-INFO``, or when
+  uploading to a repository are immediately no longer compliant and must be
+  updated to emit non-normalized names.
+  * It's unclear in the current spec whether names were intended to be normalized
+    in this case or not, but the practice of normalization here has caused a
+    number of people to be confused why their names are different from what
+    they've entered.
+* Tools that are currently emiting the names in the simple API (outside of the URL
+  itself) as normalized, which is either allowed or required by the spec
+  currently are immediately not longer complaint and must be updated to emit
+  non-normalized names.
+  * Like for filenames, this is mitigated by the fact that all tools are required
+    to continue to accept both normalized and non-normalized values.
+
+
+Tools that validate *new* values should ideally start warning on now invalid
+options for some period of time, before starting to hard fail when encountering
+them.
+
+
+Rejected Ideas
+==============
+
+Require Normalization Everywhere
+--------------------------------
+
+One other possible idea is to simply require normalization everywhere, however
+this PEP rejects that.
+
+The primary reason we reject it is that the name of a project is not an internal
+identifier, but is central to that project's identity. Projects often have
+strong opinions on the way that their project's name should look, and
+normalization removes that from them.
+
+There are situations where we need a normalized value, so this PEP does use
+them, but attempts to use them sparingly, only when they're actually required.
+It treats normalization as something that is done when software is talking to
+software about a project, and not when humans are talking about it.
+
+
+Require Normalization in Filenames
+----------------------------------
+
+Filenames sit in a weird place, in most cases they are produced for by software
+and are consumed by software, so in theory it should be fine to normalize them
+which has some nice properties.
+
+However, this PEP rejects doing that.
+
+Although they are often a software to software identifier, they are also used by
+humans when sharing and manually downloading the software. They appear in places
+like the PyPI UI, GitHub Releases, downstream Linux repositories, etc. In some
+cases the only incanation of the project's name someone might see is the name
+embedded into the filename.
+
+Further, historically filenames were not normalized, and a change to the spec
+that did not go through the PEP process is what required it. However, prior to
+that change, people have created systems that rely on encoding information into
+the project name, such as namespaces using the ``.`` character, which a
+requirement to normalize would break.
+
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.

From e83314c7f7dc596e0b3552078793c9bf0462793e Mon Sep 17 00:00:00 2001
From: Donald Stufft
Date: Sun, 11 Jun 2023 21:58:43 -0400
Subject: [PATCH 2/4] codeowners

---
 .github/CODEOWNERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 300a1da34b4..797861bc4a7 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -596,6 +596,7 @@ pep-0712.rst @ericvsmith
 pep-0713.rst @ambv
 pep-0714.rst @dstufft
 pep-0715.rst @dstufft
+pep-0716.rst @dstufft
 # ...
 # pep-0754.txt
 # ...

From adc8e178ab82eecb6e014598d90c22d3a2d983da Mon Sep 17 00:00:00 2001
From: Donald Stufft
Date: Sun, 11 Jun 2023 22:04:26 -0400
Subject: [PATCH 3/4] fix

---
 pep-0716.rst | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/pep-0716.rst b/pep-0716.rst
index 52346842000..aca4a310087 100644
--- a/pep-0716.rst
+++ b/pep-0716.rst
@@ -105,8 +105,8 @@ The path to getting to where we are today was roughly:
    is read for internal consistency.
 8. :pep:`625` was accepted, standardizing on a format for filenames for sdists.
    This PEP requires that the project names are normalized "as described in the
- wheel spec", which at the time meant full :pep:`503` normalization, and
- versions normalized as per :pep:`440`.
+   wheel spec", which at the time meant full :pep:`503` normalization, and
+   versions normalized as per :pep:`440`.
 
 
 Independently to all of the above, and prior to (4), PyPI had implemented a
@@ -242,20 +242,25 @@ This PEP breaks compatibility in a few ways:
 * Tools that are currently emiting filenames where ``name`` has been normalized
   in accordance with the current spec are immediately no longer compliant and
   must be updated to emit non-normalized names.
+
   * This is mitigated by the fact that all tools are required to continue to
     accept both normalized and non-normalized filenames unless they *know* that
     they only work on *new* distributions (PyPI uploads, ``pyproject-build``, etc).
+
 * Tools that emit normalized names into ``METADATA``, ``PKG-INFO``, or when
   uploading to a repository are immediately no longer compliant and must be
   updated to emit non-normalized names.
+
   * It's unclear in the current spec whether names were intended to be normalized
     in this case or not, but the practice of normalization here has caused a
     number of people to be confused why their names are different from what
     they've entered.
+
 * Tools that are currently emiting the names in the simple API (outside of the URL
   itself) as normalized, which is either allowed or required by the spec
   currently are immediately not longer complaint and must be updated to emit
   non-normalized names.
+
   * Like for filenames, this is mitigated by the fact that all tools are required
     to continue to accept both normalized and non-normalized values.
 
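
The mitigation in the hunk above relies on consumers normalizing names before comparing them while leaving the authored spelling untouched. A minimal sketch of that idea follows, assuming the :pep:`503` rule (runs of ``-``, ``_`` and ``.`` collapse to a single ``-``, then lowercase) and the escaping rules stated in the Specification section of the draft. The function names are hypothetical; they are not part of this patch series or of any packaging library.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 normalization: collapse runs of '-', '_', '.' to '-' and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def same_project(a: str, b: str) -> bool:
    """Compare two spellings of a project name after normalizing both sides."""
    return normalize(a) == normalize(b)


def dist_info_name(name: str) -> str:
    """Name segment of a .dist-info directory: normalized, then '-' escaped to '_'."""
    return normalize(name).replace("-", "_")


def dist_filename_name(name: str) -> str:
    """Name segment of an sdist or wheel filename under the draft:
    kept as authored, with only '-' escaped to '_'."""
    return name.replace("-", "_")
```

With these helpers, ``same_project("Zope.Interface", "zope_interface")`` is true, while the value written to ``METADATA`` would stay ``Zope.Interface`` as the author spelled it.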
From f55d7dd894b41a5efc904792271e7264c2887b6a Mon Sep 17 00:00:00 2001
From: Donald Stufft
Date: Mon, 12 Jun 2023 13:30:12 -0400
Subject: [PATCH 4/4] Apply suggestions from code review

Co-authored-by: Hugo van Kemenade
---
 pep-0716.rst | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/pep-0716.rst b/pep-0716.rst
index aca4a310087..e21b8d21cbd 100644
--- a/pep-0716.rst
+++ b/pep-0716.rst
@@ -15,21 +15,21 @@ Abstract
 ========
 
 This PEP standardizes on where and when project names should and should not be
-normalized in the Packaging toolchain.
+normalized in the packaging toolchain.
 
 
 Motivation
 ==========
 
 Historically there was effectively little to no requirements on the valid values
-of names in the Packaging ecosystem. Projects that wanted to interpret those
+of names in the packaging ecosystem. Projects that wanted to interpret those
 names had to cope with a wide range of values, and had to each implement their
 own normalization schemes to try and detect names that were the same, but
 "spelled" differently.
 
 Over the intervening years, various PEPs have ratcheted down various pieces of
 metadata such as version numbers (:pep:`440`), filenames in bdists (:pep:`427`),
-names in the Simple API (:pep:`503`), filenames in sdists (:pep:`625`).
+names in the Simple API (:pep:`503`), and filenames in sdists (:pep:`625`).
 
 Unfortunately, a complex interaction between these various standards *and*
 changes made to the specifications without an associated PEP, have created a
@@ -60,7 +60,7 @@ The path to getting to where we are today was roughly:
    optional spelling of ``_``, which would normalize to ``-``.
 3. :pep:`503` was accepted, which specified a normalization of the project name
    *when* querying the Simple API for a project.
-4. The spec for ``.dist-info`` required normalization ofthe name (using the
+4. The spec for ``.dist-info`` required normalization of the name (using the
    :pep:`503` rules) but did not specify a requirement on version. The :pep:`503`
    normalization uses ``-``, but tools that locate ``.dist-info`` use the ``-``
    character to split between name and version, so in practice nobody was
@@ -68,7 +68,7 @@ The path to getting to where we are today was roughly:
 
    Thus, the ``.dist-info`` spec was `updated `__,
    without a PEP, to make the spec more closely align with common practice. The
-   result of that being that the spec states that name must be normalizes as
+   result of that being that the spec states that name must be normalized as
    per :pep:`503` and versions must be normalized as per :pep:`440`, but
    escaping ``-`` with ``_``.
 
@@ -128,7 +128,7 @@ The effect of all of the above, is that we're now in a situation where:
 
 We've essentially created a mess where nobody is emitting filenames in quite the
 same way and the normalization rules, first defined in :pep:`503` are being used
-in contexts where it is not appropiate to do so.
+in contexts where it is not appropriate to do so.
 
 
 Rationale
@@ -142,16 +142,16 @@ This PEP follows two guiding principals:
 2. When interpreting names, tooling should normalize values as much as
    possible to reduce confusion.
 
-The follows the original intent behind the normalization in :pep:`503`, which
+This follows the original intent behind the normalization in :pep:`503`, which
 was designed to be a normalization applied when two computers spoke to each
-other, not as something that would "leak" out into the human facing areas.
+other, not as something that would "leak" out into the human-facing areas.
 
 
 Specification
 =============
 
 The project name that is specified by an author ends up flowing through several
-parts of the ecosystem, and each part needs to be considered on it's own what
+parts of the ecosystem, and each part needs to be considered on its own to determine what
 kind of name (normalized or not) makes sense in that part.
 
 In general, we follow the guiding principals, use the unnormalized name as
@@ -239,7 +239,7 @@ Backwards Compatibility
 
 This PEP breaks compatibility in a few ways:
 
-* Tools that are currently emiting filenames where ``name`` has been normalized
+* Tools that are currently emitting filenames where ``name`` has been normalized
   in accordance with the current spec are immediately no longer compliant and
   must be updated to emit non-normalized names.
 
@@ -256,7 +256,7 @@ This PEP breaks compatibility in a few ways:
     number of people to be confused why their names are different from what
     they've entered.
 
-* Tools that are currently emiting the names in the simple API (outside of the URL
+* Tools that are currently emitting the names in the simple API (outside of the URL
   itself) as normalized, which is either allowed or required by the spec
   currently are immediately not longer complaint and must be updated to emit
   non-normalized names.
@@ -265,7 +265,7 @@ This PEP breaks compatibility in a few ways:
     to continue to accept both normalized and non-normalized values.
 
 
-Tools that validate *new* values should ideally start warning on now invalid
+Tools that validate *new* values should ideally start warning on now-invalid
 options for some period of time, before starting to hard fail when encountering
 them.
 
@@ -293,13 +293,13 @@ software about a project, and not when humans are talking about it.
 Require Normalization in Filenames
 ----------------------------------
 
-Filenames sit in a weird place, in most cases they are produced for by software
+Filenames sit in a weird place, in most cases they are produced by software
 and are consumed by software, so in theory it should be fine to normalize them
 which has some nice properties.
 
 However, this PEP rejects doing that.
 
-Although they are often a software to software identifier, they are also used by
+Although they are often a software-to-software identifier, they are also used by
 humans when sharing and manually downloading the software. They appear in places
 like the PyPI UI, GitHub Releases, downstream Linux repositories, etc. In some
 cases the only incanation of the project's name someone might see is the name