Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Folder cases #37

Open
fredrikekre opened this issue Jun 17, 2020 · 15 comments · May be fixed by #38
Open

Folder cases #37

fredrikekre opened this issue Jun 17, 2020 · 15 comments · May be fixed by #38

Comments

@fredrikekre
Copy link
Member

If there is a package Aa in the registry, and you register a package AA, then the folders needs to be e.g. AA2 and not `AA, see JuliaRegistries/General#16341.

@pfitzseb
Copy link
Contributor

Are we handling the case for packages with the exact same name?

In general it might make sense to change the whole directory structure from /A/ACME/... to either /A/59NoK/... or A/acme/59NoK/... (with a canonicalized package name).

@GunnarFarneback
Copy link
Collaborator

The folder name 'AA2' is not that great if someone later wants to register a package called 'AA2'. If we want to disambiguate, the obvious way would be to involve the uuid.

For this particular case I think it's more a question about registry policy. Registration of "SAMtools" when "SAMTools" already exists is an extreme case of typo-squatting and should just not be allowed. Assuming that it was intended to be a name change, that should have been done with a custom PR that changed folder names rather than a new registration, if we want to allow name changes in the first place.

@fredrikekre
Copy link
Member Author

It was actually two different packages with different authors, but yea, we need a policy for that. But RegistryTools still need to make this work regardless.

@GunnarFarneback
Copy link
Collaborator

There is also the option to flat out refuse case insensitivity collisions, which is probably the quickest place to start.

@fredrikekre
Copy link
Member Author

Right, but that seems more of a registry-specific policy. This issue is about the fact that the tooling should do something better.

@StefanKarpinski
Copy link
Contributor

StefanKarpinski commented Jun 17, 2020

Is the concern that this can't be faithfully unpacked on a case-insensitive filesystem? If so, this is just a generalization of the problem of name collisions where two packages with the same name aside from case are a collision on some systems. We may as well figure out how to support exact name collisions while we're at it. My first guess would be to use some or all of the UUID as a slug value and do A/APackage/$slug/ as the path but only if there's a collision. Possible slugs:

  • the whole UUID
  • the first n bytes of the UUID
  • enough of the UUID to be unique
  • the slug computation we use for package versions
  • a number based on UUID ordering so the first UUID is 1, second is 2, etc.

I kind of like the last one.

@StefanKarpinski
Copy link
Contributor

StefanKarpinski commented Jun 17, 2020

Eh, sorry. My original version of this had the slug first then the name, but then I realized that a lot of the slug choices don't work for that since the slug directory is top-level, so I changed the ordering, which works for exact name collisions but not for inexact case collisions—which version of the name to you choose? For case collisions, you want the slug first and the name after so that each one can have it's own folder. Alternately: don't use a / between the two and do A/APackage.$slug/ instead since then each can have its own capitalization.

More musing:

  • Re "number based on the ordering of UUIDs" option, we probably don't want the same directory to potentially change which package is held in it. It would be much better if adding a new package can move an existing package but not lead to a new package being in the same place.

  • The whole UUID option I think takes up too much path length—older Windows versions really do not like long paths; newer ones are fine with it, but there are still tools that get upset. But maybe this is ok, after all we control the tooling that will access these files and we'd like to switch to not unpacking the registry at all, especially on Windows for just these kinds of reasons.

  • A fixed number of UUID bytes still has the potential for collisions.

  • The slug computation has the same issue although collisions are quite unlikely—but unfortunately much less so on a case insensitive file system! In that case there are 36^5 unique slug values and the probability of collision on the order of one in sqrt(36^5) = 7776 which is not nearly safe enough for my taste. It's actually worse than that because we're choosing between case-distinguishable slugs uniformly, but case-indistinguishable ones non-uniformly, so it may be about 1/5000 chance of collision on a case-insensitive file system.

That only leaves two viable options:

  • All of the UUID (despite concerns about path length)

  • Enough of the UUID to be unique. I would suggest using 8 digits as a minimum (1/65k chance of collision) but adding more of the UUID if necessary to avoid a collision.

If we went with the second option, the total path would be A/APackage.$(string(uuid)[1:n])/ where n is chosen to be enough of the UUID to be unique and at least 8.

@pfitzseb
Copy link
Contributor

pfitzseb commented Jun 17, 2020

Well, /A/apackage.$(N) with

  • N being auto-incremented when a collision is detected
  • the package name being in lowercase to prevent case collisions

would of course also work and be stable (as already kinda suggested by Fredrik above).

That said, I think I'd prefer /A/$(uuid), despite possible concerns about the path length.

@StefanKarpinski
Copy link
Contributor

StefanKarpinski commented Jun 17, 2020

I think the name of the folder should match the name of the package and not be made lowercase (we don't lowercase everything in the case of non-collisions). Auto-incrementing when a collision is detected makes this stateful: it matters what order packages were added. That seems like it's not entirely ideal. If we do /A/$uuid/APackage then we lose a bit of easy navigation by package name. The /A/APackage.$uuid_prefix/ scheme is:

  • still easy to navigate by name
  • each package name will appear correctly since even case-insensitive filesystems are case-preserving

@fredrikekre
Copy link
Member Author

Is the concern that this can't be faithfully unpacked on a case-insensitive filesystem?

Yes, some users got errors this morning which triggered this issue: https://discourse.julialang.org/t/persistently-dirty-registry-case-insensitive-file-collision-when-cloning-juliaregistries-general-from-github/41577/7, JuliaRegistries/General#16501

@StefanKarpinski
Copy link
Contributor

I saw that after asking. Thanks for clarifying.

@GunnarFarneback
Copy link
Collaborator

I'm not at all concerned about statefulness but see a significant value in easy navigation. I also don't think collisions will be very frequent and that the unique case should be as unencumbered and space-saving as possible. I think the A/APackage.$(string(uuid)[1:n])/ proposal is fine except that I'd prefer n to be zero whenever possible, which I expect is 99.9% of the cases or so. Optionally I would be fine with an n >= M rule whenever n = 0 isn't unique, but don't think M needs to be more than 4.

@StefanKarpinski
Copy link
Contributor

To be clear, when there are no name collisions, I think we should keep the same layout as we currently have, i.e. just A/APackage/. I would only use A/APackage.$(string(uuid)[1:n])/ when there is some other package such that lowercase(pkg1) == lowercase(pkg2), in which case we would use $(uppercase(first(pkg)))/$pkg.$(string(uuid))[1:n]/ as the directory for pkg in [pkg1, pkg2] instead of the simpler normal layout. You're right that using 4 hex UUID characters is probably generally enough (there's a ~1/256 chance of collision); the reason I thought of 8 is because that initial leading 8 chars is used as an abbreviation in some other places as well.

@GunnarFarneback
Copy link
Collaborator

How much interest is there in this functionality? I could rebase #38 if someone is sufficiently interested and willing to review it.

@StefanKarpinski
Copy link
Contributor

I think we will inevitably need this at some point, so 👍🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants