-
-
Notifications
You must be signed in to change notification settings - Fork 15.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix release-cuda
eval errors
#379768
base: master
Are you sure you want to change the base?
fix release-cuda
eval errors
#379768
Conversation
Requires proper scoping of commit messages, if you don't intend them to get squashed. |
@mweinelt my bad, I totally forgot about the commit messages. I am fine with either squashing or not. Although, I guess, the changes are big enough to keep the separate commits. The "commit conventions for P.S.: looks like this requires a manual rebase anyway |
d707bff
to
c65ed11
Compare
I rebased and updated the commit names to be more in line with the commit conventions. Let me know if I missed anything else. |
Triggered an eval with these changes: https://hydra.nix-community.org/eval/265502 |
@zowoq thanks! Would I be correct in assuming that you've rebased this PR on top of 58edd1e (264446) before running it? Unfortunately, it's a bit hard to verify whether the eval errors decreased as expected since the Also, the |
Yes. I ran it again to get the errors:
|
👍 Looks correct to me. Here is the same list of errors, but without the stack traces, one line per eval error.
As mentioned previously, I'll probably work on the remaining errors in separate PRs. |
|
Rename the jobs from cudaPackages*.cudaPackages*.blah to just cudaPackages*.blah so that they match the nixpkgs attribute paths.
The theano package itself was originally removed in NixOS#313148.
Both `pytorch` and `torch` refer to the same package, but only the second one is the canonical name. The canonical `torch` name is already mentioned once a bit lower in the file.
Both `Keras` and `keras` refer to the same package, but only the second one is the canonical name.
Both `scikitimage` and `scikit-image` refer to the same package, but only the second one is the canonical name.
Using primary/canonical names for packages makes it easier to keep track of stuff. Additionally, this stops the recursion into cudaPackages* from including fake "aliases" for old (unsupported) cuda versions.
c65ed11
to
cf624a8
Compare
Rebased to fix merge conflict. |
cf624a8
to
04edda0
Compare
04edda0
to
98e9f51
Compare
As discussed in one of the threads and on matrix. Commits in the latest force-push remove the overengineered This should be ready to merge unless someone has any further concerns. |
packageSets = builtins.filter (lib.strings.hasPrefix "cudaPackages") (builtins.attrNames pkgs); | ||
evalPackageSet = pset: mapTestOn { ${pset} = packagePlatforms pkgs.${pset}; }; | ||
evalPackageSetPlatforms = lib.genAttrs packageSets (pset: packagePlatforms pkgs.${pset}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC eval
in evalPackageSet
was meant as a verb (it was a function); the new variable is an attrset? packageSets
? platformPackageSets
?
Does the new code generate identical tree structure? AFAIU python3Packages.tensorflow.x86_64-linux
in https://hydra.nix-community.org/job/nixpkgs/cuda/python3Packages.tensorflow.x86_64-linux
corresponds to a path in this attrset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This refactoring is technically no longer needed. We can drop commit 7d9676c, if you want (or we can keep the refactoring, but bikeshed the variable name a bit). This commit does not change any behavior (it just inlines the function call that used to be done on line 162).
I just left this refactor in for now, because I found the original structure a bit unintuitive.
evalPackageSetPlatforms
is a nested attrSet where each leaf contains a list of Platforms on which each Package should be evaluated. I admit that the name could use some improvements.
So for example evalPackageSetPlatforms.cudaPackages.cudnn == [ "x86_64-linux" ]
, but evalPackageSetPlatforms.cudaPackages_12_3.cudnn_8_0 == [ ]
(because of the badPlatformConditions
).
mapTestOn
then converts this attrset into an attrset of actual jobs that Hydra is going to evaluate/build by looking up those attribute paths in pkgs
. The only change to the job names were done in a8e9088 - 0c12821 (and those changes should have made the job names match the attr paths better). So yes, the job name should match the attribute path in nixpkgs.
I find the current setup a bit easier to follow, because evalPackageSetPlatforms
has the exact same structure as the explicit/literal attrset defined later in the file (with all the somepackage = linux;
lines).
Perhaps it should be called something along the lines of packageSetPlatformsToEval
or packagePlatformsToEval
or packageSetPlatforms
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS I played with
❯ nix repl -I nixpkgsRuro=https://github.com/ruro/nixpkgs/archive/fix-release-cuda-eval-errors.tar.gz
nix-repl> orig = import <nixpkgs/pkgs/top-level/release-cuda.nix> { }
nix-repl> new = import <nixpkgsRuro/pkgs/top-level/release-cuda.nix> { }
...and everything seems good
Currently, the nix-community CUDA builder has a bunch of Eval Errors. This PR fixes some of the "lower hanging fruit" among those eval errors.
You can see the list of current eval errors by opening the "Evaluation Errors" tab for the
nixpkgs:cuda
jobset. Note that the "Evaluation Errors" tab for individual runs is currently broken, so only the jobset-level "Evaluation Errors" tab is populated (which contains evaluation errors for the latest job).You can reproduce the evaluation errors locally by running
(
hydra-eval-jobs
is provided by thehydra
package). When reviewing this PR, I recommend running the above command on each commit starting with the merge base and thendiff
ing the json data printed onstdout
to verify that each commit does what I claim it does.Things done
cudaPackages*
prefix twiceThis is mostly a cosmetic change so that the job name matches the nixpkgs attribute path.
The next bunch of commits remove the accidental use of non-standard aliases for package names and then sets
allowAliases = false
to ensure that aliases aren't used in the future.Theano
from the list of evaluated packages (Theano was removed from nixpkgs)pytorch
which is a deprecated alias fortorch
which is already being evaluatedKeras
(deprecated alias) tokeras
scikitimage
(deprecated alias) toscikit-image
allowAliases = false
(this ensures that aliases aren't used accidentally in the future)Notably, the last commit also stops trying to evaluate the deprecated
cudaPackages_10_*
package sets and should also automatically stop evaluating CUDA versions prior to 12.0 after 25.05 gets released.release-cuda.nix
The next few commits disable the evaluation of "known" broken derivations by moving some
brokenConditions
tobadPlatformConditions
:cudnn
nsight_systems
nvidia_driver
Additionally, the
cudaPackages.tensorrt
package is marked withhydraPlatforms = none
, because this package is fundamentally impossible to build on Hydra due torequireFile
d sources and its non-redistributable LICENSE.tensorrt
Applying the above changes should fix 238 out of 259 eval errors that are currently happening in the
nixpkgs:cuda
jobset on the nix-community builders. The remaining 21 errors are slightly more tricky, so I am leaving them for a later PR.BuiltEvaluated on platform(s)nix.conf
? (See Nix manual)sandbox = relaxed
sandbox = true
nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"
. Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/
)Add a 👍 reaction to pull requests you find important.