
make nvptx-tools robust against cuda bugs #14

Open
vries opened this issue Mar 15, 2017 · 0 comments
vries commented Mar 15, 2017

The purpose of the nvptx-tools is to test a gcc toolchain generating code for a single-threaded ptx interpreter. The interpreter is provided by the cuda platform. Part of this platform is ptxas.

This is no ordinary assembler, but one with different optimization levels (-O0 ... -O4), different code generators (-ori/-noori), and the task of inserting the missing instructions and annotations handling the convergence stack (ssy, .s postfix). Consequently, ptxas can be less stable than you'd like an assembler to be.

nvptx-as.c calls ptxas with -O0. The purpose here is to do a minimal verification of the ptx's validity. The output of ptxas is thrown away (so there's no great value in spending time optimizing the generated code), and the ptx is compiled again in nvptx-run.c (there using the cuda runtime functions rather than ptxas). The default optimization setting in nvptx-run.c, though, is -O4; presumably the intention here is to achieve the fastest execution possible.

My observation is that while ptxas sigsegvs (and equivalent failures in nvptx-run) are of interest to nvidia, they are usually uninteresting from the point of view of gcc code generation.

Typically, when encountering such a sigsegv in the gcc test suite, we manually:

  • try different ptxas optimization levels
  • try different code generators
  • try different cuda versions

and if we find that the sigsegv goes away when changing any of those parameters, we conclude it's a cuda bug, and move on (perhaps by xfailing the testcase or some such).
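The manual triage above can be sketched as a small shell loop. This is only an illustration, not part of nvptx-tools: the option sets and the `try_ptxas` name are made up here, and the first argument is whatever assembler command you want to probe.

```shell
# Try a ptxas invocation with several option sets and report the first
# that succeeds.  Usage (hypothetical): try_ptxas ptxas test.ptx -o /dev/null
try_ptxas () {
  for opts in "-O4" "-O4 -ori" "-O3" "-O3 -ori" "-O0"; do
    # $opts is intentionally unquoted so "-O4 -ori" splits into two flags.
    if "$@" $opts >/dev/null 2>&1; then
      echo "succeeded with: $opts"
      return 0
    fi
  done
  echo "all option sets failed"
  return 1
}
```

If the command only succeeds with some of the option sets, that points at a cuda bug rather than invalid ptx.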

I wonder if it makes sense to automate this:

  • let nvptx-as.c first try -O0, then -O0 -ori
  • let nvptx-run.c first try -O4, then -O4 -ori, then -O3, then -O3 -ori, etc (or some such)
    (assuming we can achieve the -ori equivalent using CUjit_option CU_JIT_NEW_SM3X_OPT)

This would reduce testsuite noise, save time, and reduce the number of xfails (and xpasses when running with different ptxas flags or cuda versions).

This could be implemented as the new default behaviour, or we could add a --fallback switch to nvptx-as.c and nvptx-run.c.

Eventually we could warn against using cuda versions which are known to be buggy to the point that this fallback scenario doesn't help.
