Determine how different sorts of file names should be normalized #28

Open
nightlark opened this issue Dec 22, 2024 · 2 comments
@nightlark
Collaborator

File names need to be normalized to have good odds of finding a match in our datasets. Generally, the normalization centers on a few things:

  • Removing identifiers for a specific version from file names
  • Removing platform/architecture specific information from file/folder names
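
A minimal sketch of what these two steps might look like for a single file name. The regular expressions and the `normalize_file_name` helper are assumptions made up for illustration, not existing code from this project, and the architecture list is deliberately incomplete:

```python
import re

# Hypothetical patterns; a real list would need tuning against the datasets.
# Trailing numeric soname version: libfoo.so.1.2.3 -> libfoo.so
_SO_VERSION = re.compile(r"(\.so)(\.\d+)+$")
# Version embedded before the extension: libfoo-1.2.so -> libfoo.so
_NAME_VERSION = re.compile(r"-\d+(\.\d+)*(?=\.so$)")
# Architecture tag set off by '-', '_' or '.': libfoo-x86_64.so -> libfoo.so
_ARCH = re.compile(r"[-_.](x86_64|amd64|aarch64|arm64|i[3-6]86)(?=[-_.]|$)")


def normalize_file_name(name: str) -> str:
    """Strip version and architecture identifiers from a file name."""
    name = _SO_VERSION.sub(r"\1", name)
    name = _NAME_VERSION.sub("", name)
    name = _ARCH.sub("", name)
    return name


print(normalize_file_name("libssl.so.1.1"))     # libssl.so
print(normalize_file_name("libfoo-1.2.so"))     # libfoo.so
print(normalize_file_name("libfoo-x86_64.so"))  # libfoo.so
```

Whether an embedded version such as the one in `libfoo-1.2.so` should always be dropped is an open question; in some libraries that number is part of the name rather than a release version.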

What needs to get done may vary based on the type of file:

  • Linux shared libraries (file name needs normalizing to remove architecture and version identifiers)
  • C/C++ headers, pkgconfig/CMake Config files (need to recognize common include file paths, including ones that specify a multiarch triplet or GNU triplet)
  • Linux binaries (need to recognize common bin folder paths, including ones that specify a multiarch triplet and maybe a GNU triplet)
  • Language/ecosystem specific: different languages or ecosystems such as Python, Windows, NuGet, and macOS could require handling the above differently than on Linux
@nightlark nightlark self-assigned this Dec 22, 2024
@nightlark
Collaborator Author

nightlark commented Dec 30, 2024

To help decide which other Linux distributions' file names to look at, here are counts of how many packages each one has (a snapshot of most of the counts from https://pkgs.org/ on Dec 30, 2024):
  • Adélie
    • System: 729 (x86_64), 728 (aarch64)
    • User: 5617 (x86_64), 5598 (aarch64)
  • AlmaLinux 9
    • AppStream: 6526 (x86_64), 5188 (aarch64)
    • BaseOS: 1480 (x86_64), 1201 (aarch64)
    • CRB: 2102 (x86_64), 1463 (aarch64)
    • Devel: 1974 (x86_64), 2135 (aarch64)
    • Extras: 44 (x86_64), 44 (aarch64)
    • HighAvailability: 93 (x86_64), 79 (aarch64)
    • NFV: 39 (x86_64), - (aarch64)
    • Plus: 0 (x86_64), 0 (aarch64)
    • RT: 29 (x86_64), - (aarch64)
    • ResilientStorage: 96 (x86_64), 82 (aarch64)
    • SAP: 11 (x86_64), 8 (aarch64)
    • SAPHANA: 10 (x86_64), 7 (aarch64)
  • AlmaLinux 8
    • AppStream: 9700 (x86_64), 7691 (aarch64)
    • BaseOS: 3098 (x86_64), 2482 (aarch64)
    • Devel: 2900 (x86_64), 3227 (aarch64)
    • Extras: 9 (x86_64), 6 (aarch64)
    • HighAvailability: 123 (x86_64), 94 (aarch64)
    • NFV: 70 (x86_64), - (aarch64)
    • Plus: 0 (x86_64), 0 (aarch64)
    • PowerTools: 2865 (x86_64), 1870 (aarch64)
    • RT: 51 (x86_64), - (aarch64)
    • ResilientStorage: 125 (x86_64), 99 (aarch64)
    • SAP: 12 (x86_64), 8 (aarch64)
    • SAPHANA: 10 (x86_64), 6 (aarch64)
  • Alpine 3.20
    • Community: 18478 (x86_64), 18339 (aarch64)
    • Main: 5438 (x86_64), 5455 (aarch64)
  • Alpine 3.19
    • Community: 17436 (x86_64), 17281 (aarch64)
    • Main: 5359 (x86_64), 5376 (aarch64)
  • Alpine 3.18
    • Community: 14791 (x86_64), 14654 (aarch64)
    • Main: 5051 (x86_64), 5063 (aarch64)
  • Alpine Edge
    • Community: 19806 (x86_64), 19666 (aarch64)
    • Main: 5464 (x86_64), 5474 (aarch64)
  • ALT Linux P11
    • Autoimports: 0 (x86_64), 0 (noarch)
    • Classic: 20559 (x86_64), 20056 (aarch64), 20460 (noarch)
  • ALT Linux P10
    • Autoimports: 3493 (x86_64), 36615 (noarch)
    • Classic: 19768 (x86_64), 18894 (aarch64), 20359 (noarch)
  • ALT Linux Sisyphus
    • Autoimports: 3468 (x86_64), 38296 (noarch)
    • Classic: 21130 (x86_64), 20607 (aarch64), 20697 (noarch)
  • Amazon Linux 2023
    • 14588 (x86_64)
  • Amazon Linux 2
    • 27702 (x86_64)
  • Arch Linux
    • Core: 263 (x86_64), 280 (aarch64)
    • Core Testing: 1 (x86_64), - (aarch64)
    • Extra: 14056 (x86_64), 12030 (aarch64)
    • Extra Testing: 147 (x86_64), - (aarch64)
    • Multilib: 285 (x86_64), - (aarch64)
    • Multilib Testing: 1 (x86_64), - (aarch64)
  • CentOS 9 Stream
    • AppStream: 17433 (x86_64), 13681 (aarch64)
    • BaseOS: 4525 (x86_64), 3598 (aarch64)
    • CRB: 4826 (x86_64), 3271 (aarch64)
    • HighAvailability: 443 (x86_64), 371 (aarch64)
    • NFV: 81 (x86_64), - (aarch64)
    • RT: 61 (x86_64), - (aarch64)
    • ResilientStorage: 457 (x86_64), - (aarch64)
    • EPEL Next: 147 (x86_64), 147 (aarch64)
    • EPEL Next Testing: 15 (x86_64), 15 (aarch64)
  • Enterprise Linux 9 (RHEL 9, Rocky Linux 9, AlmaLinux 9, CentOS 9 Stream)
    • EPEL: 22736 (x86_64), 22614 (aarch64)
    • EPEL Testing: 345 (x86_64), 339 (aarch64)
  • Enterprise Linux 8 (RHEL 8, Rocky Linux 8, AlmaLinux 8)
    • EPEL: 10304 (x86_64), 10135 (aarch64)
    • EPEL Modular: 290 (x86_64), 290 (aarch64)
    • EPEL Testing: 62 (x86_64), 62 (aarch64)
  • Debian 12 (Bookworm)
    • Contrib: 303 (x86_64), 256 (arm64)
    • Main: 63192 (x86_64), 62442 (arm64)
    • Nonfree: 718 (x86_64), 540 (arm64)
    • Nonfree Firmware: 38 (x86_64), 36 (arm64)
  • Debian 11 (Bullseye)
    • Contrib: 293 (x86_64), 236 (arm64)
    • Main: 58261 (x86_64), 57426 (arm64)
    • Nonfree: 699 (x86_64), 512 (arm64)
  • Debian Sid
    • Contrib: 351 (x86_64), 293 (arm64)
    • Main: 71388 (x86_64), 70590 (arm64)
    • Nonfree: 938 (x86_64), 636 (arm64)
    • Nonfree Firmware: 44 (x86_64), 42 (arm64)
  • Fedora 41
    • 76560 (x86_64), 66978 (aarch64)
    • Updates: 13685 (x86_64), 11737 (aarch64)
    • Updates Testing: 12499 (x86_64), 10777 (aarch64)
  • Fedora 40
    • 74816 (x86_64), 65446 (aarch64)
    • Updates: 28141 (x86_64), 24658 (aarch64)
    • Updates Testing: 5304 (x86_64), 4689 (aarch64)
  • Fedora Rawhide
    • 77025 (x86_64), 67479 (aarch64)
  • FreeBSD 14
    • 35895 (amd64), 34777 (aarch64)
  • FreeBSD 13
    • 35874 (amd64), 34834 (aarch64)
  • KaOS
    • Apps: 959 (x86_64)
    • Build: 25 (x86_64)
    • Core: 249 (x86_64)
    • KDE Next: 57 (x86_64)
    • Main: 966 (x86_64)
  • Mageia 9
    • Core: 30364 (x86_64), 30129 (aarch64)
    • Core Backports: 447 (x86_64), 544 (aarch64)
    • Core Backports Testing: 164 (x86_64), 68 (aarch64)
    • Core Updates: 11171 (x86_64), 10808 (aarch64)
    • Core Updates Testing: 294 (x86_64), 249 (aarch64)
    • Nonfree: 108 (x86_64), 72 (aarch64)
    • Nonfree Backports: 0 (x86_64), 0 (aarch64)
    • Nonfree Backports Testing: 0 (x86_64), 0 (aarch64)
    • Nonfree Updates: 164 (x86_64), 23 (aarch64)
    • Nonfree Updates Testing: 33 (x86_64), 2 (aarch64)
    • Tainted: 288 (x86_64), 278 (aarch64)
    • Tainted Backports: 0 (x86_64), 0 (aarch64)
    • Tainted Backports Testing: 0 (x86_64), 0 (aarch64)
    • Tainted Updates: 800 (x86_64), 756 (aarch64)
    • Tainted Updates Testing: 0 (x86_64), 0 (aarch64)
  • Mageia Cauldron
    • Core: 36209 (x86_64), 35967 (aarch64)
    • Core Updates Testing: 754 (x86_64), 753 (aarch64)
    • Nonfree: 124 (x86_64), 73 (aarch64)
    • Nonfree Updates Testing: 11 (x86_64), 1 (aarch64)
    • Tainted: 320 (x86_64), 309 (aarch64)
    • Tainted Updates Testing: 2 (x86_64), 0 (aarch64)
  • Mint 22 (Wilma)
    • Backport: 81 (amd64)
    • Import: 16 (amd64)
    • Main: 100 (amd64)
    • Upstream: 322 (amd64)
  • Mint 21.3 (Virginia)
    • Backport: 179 (amd64)
    • Import: 25 (amd64)
    • Main: 96 (amd64)
    • Upstream: 235 (amd64)
  • Mint 21.2 (Victoria)
    • Backport: 179 (amd64)
    • Import: 23 (amd64)
    • Main: 95 (amd64)
    • Upstream: 233 (amd64)
  • NetBSD 10.0
    • 24986 (amd64), 24381 (aarch64)
  • NetBSD 9.4
    • 25006 (amd64), 19935 (aarch64)
  • OpenMandriva Lx 5.0
  • OpenMandriva Rolling
  • OpenMandriva Cooker
  • openSUSE Leap 15.6
    • Nvidia drivers: 251 (x86_64)
    • Non-Oss: 61 (x86_64)
    • Oss: 65612 (x86_64)
    • Updates Non-Oss: 18 (x86_64)
    • Update Oss: 88 (x86_64)
    • Update Test: 20 (x86_64)
  • openSUSE Leap 15.5
    • Nvidia drivers: 300 (x86_64)
    • Non-Oss: 54 (x86_64)
    • Oss: 63106 (x86_64)
    • Updates Non-Oss: 33 (x86_64)
    • Update Oss: 702 (x86_64)
    • Update Test: 24 (x86_64)
  • openSUSE Tumbleweed
    • Non-Oss: 41 (x86_64), 70 (aarch64)
    • Oss: 57952 (x86_64), 56386 (aarch64)
  • OpenWrt 23.05
    • Base: 676 (x86_64), 671 (aarch64)
    • Luci: 2889 (x86_64), 2889 (aarch64)
    • Packages: 4435 (x86_64), 4410 (aarch64)
    • Routing: 88 (x86_64), 88 (aarch64)
    • Telephony: 887 (x86_64), 874 (aarch64)
  • Oracle Linux 9
    • Addons: 735 (x86_64), 683 (aarch64)
    • AppStream: 22285 (x86_64), 17586 (aarch64)
    • BaseOS Latest: 9134 (x86_64), 7867 (aarch64)
    • CodeReady Builder: 5571 (x86_64), 3806 (aarch64)
    • Distro Builder: 985 (x86_64), 1004 (aarch64)
    • KVM Utilities: 243 (x86_64), 243 (aarch64)
    • RDMA: 59 (x86_64), - (aarch64)
    • RHCK: 501 (x86_64), - (aarch64)
    • UEK Release 7: 520 (x86_64), 21 (aarch64)
  • Oracle Linux 8
    • Addons: 575 (x86_64), 331 (aarch64)
    • AppStream: 42776 (x86_64), 33949 (aarch64)
    • BaseOS Latest: 20737 (x86_64), 16681 (aarch64)
    • CodeReady Builder: 8468 (x86_64), 4902 (aarch64)
    • Distro Builder: 1426 (x86_64), 1331 (aarch64)
    • KVM AppStream: 3996 (x86_64), 3619 (aarch64)
    • Leapp: 8 (x86_64), 8 (aarch64)
    • RHCK: 150 (x86_64), - (aarch64)
    • UEK Release 7: 528 (x86_64), 450 (aarch64)
    • UEK Release 7 RDMA: 59 (x86_64), - (aarch64)
  • PCLinuxOS
    • KDE5: 924 (x86_64)
    • Retro: 21 (x86_64)
    • Test: 0 (x86_64)
    • Xfce4: 103 (x86_64)
    • x86_64: 16208 (x86_64)
  • Rocky Linux 9
    • AppStream: 5926 (x86_64), 4671 (aarch64)
    • BaseOS: 1162 (x86_64), 893 (aarch64)
    • CRB: 1998 (x86_64), 1399 (aarch64)
    • Devel: 9160 (x86_64), 9035 (aarch64)
    • Extras: 52 (x86_64), 54 (aarch64)
    • High Availability: 91 (x86_64), 77 (aarch64)
    • NFV: 14 (x86_64), 0 (aarch64)
    • Plus: 1 (x86_64), 1 (aarch64)
    • Realtime: 10 (x86_64), - (aarch64)
    • Resilient Storage: 94 (x86_64), 0 (aarch64)
    • SAP: 10 (x86_64), 0 (aarch64)
  • Rocky Linux 8
    • AppStream: 9851 (x86_64), 7767 (aarch64)
    • BaseOS: 2847 (x86_64), 2246 (aarch64)
    • Devel: 15538 (x86_64), 12066 (aarch64)
    • Extras: 57 (x86_64), 636 (aarch64)
    • HighAvailability: 123 (x86_64), 91 (aarch64)
    • NFV: 70 (x86_64), 14 (aarch64)
    • Plus: 0 (x86_64), 3 (aarch64)
    • PowerTools: 2899 (x86_64), 1925 (aarch64)
    • Realtime: 60 (x86_64), - (aarch64)
    • Resilient Storage: 125 (x86_64), 93 (aarch64)
  • Slackware 15.0
    • 1590 (x86_64)
    • Extra: 111 (x86_64)
    • Patches: 162 (x86_64)
    • Testing: 0 (x86_64)
  • Slackware Current
    • 1695 (x86_64)
    • Extra: 104 (x86_64)
    • Patches: 0 (x86_64)
    • Testing: 2 (x86_64)
  • Solus
    • Shannon: 7762 (x86_64)
    • Unstable: 7772 (x86_64)
  • Ubuntu 24.10 (Oracular Oriole)
    • Kubuntu Backports: 2 (amd64), 2 (arm64)
    • Main: 6163 (amd64), 6095 (arm64)
    • Multiverse: 1044 (amd64), 838 (arm64)
    • Proposed Main: 180 (amd64), 176 (arm64)
    • Proposed Multiverse: 0 (amd64), 0 (arm64)
    • Proposed Universe: 146 (amd64), 153 (arm64)
    • Restricted: 269 (amd64), 369 (arm64)
    • Universe: 64671 (amd64), 63875 (arm64)
    • Updates Main: 819 (amd64), 826 (arm64)
    • Updates Multiverse: 36 (amd64), 36 (arm64)
    • Updates Restricted: 409 (amd64), 620 (arm64)
    • Updates Universe: 471 (amd64), 448 (arm64)
  • Ubuntu 24.04 LTS (Noble Numbat)
    • Kubuntu Backports: 0 (amd64), 0 (arm64)
    • Main: 6050 (amd64), 5937 (arm64)
    • Multiverse: 1153 (amd64), 942 (arm64)
    • Proposed Main: 618 (amd64), 563 (arm64)
    • Proposed Multiverse: 1 (amd64), 1 (arm64)
    • Proposed Universe: 2415 (amd64), 2164 (arm64)
    • Restricted: 492 (amd64), 485 (arm64)
    • Universe: 64342 (amd64), 63273 (arm64)
    • Updates Main: 3696 (amd64), 3669 (arm64)
    • Updates Multiverse: 62 (amd64), 46 (arm64)
    • Updates Restricted: 3147 (amd64), 3400 (arm64)
    • Updates Universe: 4314 (amd64), 4195 (arm64)
  • Void Linux
    • Main: 14230 (x86_64), 13615 (aarch64)
    • Multilib: 5942 (x86_64), - (aarch64)
    • Multilib Nonfree: 20 (x86_64), - (aarch64)
    • Nonfree: 59 (x86_64), 27 (aarch64)
  • Wolfi
    • Base: 114622 (x86_64), 113982 (aarch64)

Interesting Linux distros whose packages could also be examined: Wolfi (which has the most packages), Debian vs. Ubuntu, Alpine, Arch, Mageia, openSUSE, OpenWrt, Fedora, and Oracle vs. RHEL.

@nightlark
Collaborator Author

As mentioned in #5 (comment), in some cases a folder name that a particular package creates could be used to identify that package. This would need looking into to make sure it is fairly accurate and doesn't introduce false positives (for example, a lot of packages have a "plugins" subfolder).
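
One possible way to guard against those false positives, sketched under the assumption that we have a mapping from package name to the folder names it creates. The `SAMPLE_PACKAGE_FOLDERS` data below is made up for illustration, and both helper functions are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set

# Made-up example data: package name -> folder names it creates.
SAMPLE_PACKAGE_FOLDERS = {
    "pkg-a": ["pkg-a-data", "plugins"],
    "pkg-b": ["pkg-b-modules", "plugins"],
}


def build_folder_index(package_folders: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each folder name to the set of packages known to create it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for package, folders in package_folders.items():
        for folder in folders:
            index[folder].add(package)
    return index


def identify_by_folder(folder: str, index: Dict[str, Set[str]]) -> Optional[str]:
    """Return a package only when exactly one package creates the folder."""
    owners = index.get(folder, set())
    if len(owners) == 1:
        return next(iter(owners))
    return None  # unknown or ambiguous folder name, e.g. "plugins"


index = build_folder_index(SAMPLE_PACKAGE_FOLDERS)
print(identify_by_folder("pkg-a-data", index))  # pkg-a
print(identify_by_folder("plugins", index))     # None
```

A folder name shared by several packages is simply treated as non-identifying here; how accurate the unique names are in practice would still have to be checked against real package data.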
