Auto-generate tests using mutation testing #4299

Open

ambergorzynski opened this issue Mar 19, 2025 · 3 comments

Comments

@ambergorzynski

Hello! @afd and I have been experimenting with a technique to auto-generate tests for the CTS using mutation testing.

In short: we deliberately mutate (i.e. mess with) some part of a WebGPU implementation or downstream driver and run the CTS on the mutated version. If no tests fail,[1] this indicates a gap in the CTS's ability to fully exercise the implementation. We then use a WGSL fuzzer to create a test that does fail when run on the mutated code. Our idea is that adding such tests to the CTS will detect future bugs that creep into that part of the code.
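
To make the loop concrete, here is a rough sketch of the workflow in TypeScript. Everything in it is illustrative: the Mutant and TestResult types and the runCts/baseline inputs are assumptions made for the sketch, not part of any existing tool.

// Illustrative sketch only: these types and helpers are hypothetical.
interface Mutant {
  file: string; // implementation source file that was mutated
  line: number; // location of the mutation
}

interface TestResult {
  name: string;
  passed: boolean;
}

// A mutant "survives" if every previously-passing CTS test still passes when
// run against the mutated implementation. Survivors indicate gaps, and are the
// cases for which we generate a new mutant-killing test with the fuzzer.
function findSurvivingMutants(
  mutants: Mutant[],
  baseline: TestResult[],
  runCts: (mutant: Mutant) => TestResult[]
): Mutant[] {
  const previouslyPassing = new Set(
    baseline.filter(r => r.passed).map(r => r.name)
  );
  return mutants.filter(mutant =>
    runCts(mutant)
      .filter(r => previouslyPassing.has(r.name))
      .every(r => r.passed)
  );
}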

Below is an example to make the idea more concrete. Execution of the shader covers a part of Mesa's Lavapipe driver code, relating to this statement, that the CTS does not currently cover. The test passes when run using Dawn and the unmutated Lavapipe, but fails when the statement of interest is altered. The shaders we generate are all equipped with an expected output buffer value; any deviation from this value is a failure. In this case, the shader should output 1i. When the mutation is in place, the shader outputs -400i and the test fails.[2]

We have a couple of other initial examples, and a workflow set up to generate a large number (hundreds to thousands) of similar tests that exercise code that is not currently exercised by the CTS. I say 'exercise' rather than 'cover' because in some cases the CTS may cover code but not actually detect a problem when this code is altered.

What do you think? It would be great to get any general thoughts, plus answers to a couple of specific questions:

  • I'm aware that these tests are somewhat unusual - is my explanation of why these tests are useful clear? If not, I can try to explain again!
  • Is the CTS the right home for these kinds of tests? If not, do you have other ideas about where they could live?

Example

export const description = `Example mutant test`;

import { makeTestGroup } from '../../common/framework/test_group.js';
import { GPUTest } from '../gpu_test.js';
import { checkElementsEqual } from '../util/check_contents.js';

export const g = makeTestGroup(GPUTest);

g.test('mutant_killing_test')
  .desc(`Test that exercises SPIR-V structured control flow`)
  .fn(async t => {
    const code = `
    struct StorageBuffer {
        a: i32,
    }
    @group(0)
    @binding(1)
    var<storage, read_write> s_output: StorageBuffer;
    fn f() -> i32 {
        switch (1i) {
            case 2i: {
                switch (1i) {
                    case 1i: {
                        return -400i;
                    }
                    default: {
                    }
                }
                let v = 1i;
            }
            default: {
            }
        }
        return 1i;
    }
    @compute
    @workgroup_size(1)
    fn main() {
        s_output = StorageBuffer(f());
    }
    `;
    // The expected output is 1i, i.e. the bytes [1, 0, 0, 0] in little-endian i32 representation.
    const expectedArray = new Uint8Array([1, 0, 0, 0]);

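    // Build a compute pipeline from the WGSL shader above.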
    const pipeline = t.device.createComputePipeline({
      layout: 'auto',
      compute: {
        module: t.device.createShaderModule({
          code,
        }),
        entryPoint: 'main',
      },
    });

    // Initialize the output buffer with zeros so the test cannot pass unless
    // the shader actually writes the expected value.
    const outputBuffer = t.makeBufferWithContents(
      new Uint8Array(expectedArray.length),
      GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST
    );

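    // Bind the output buffer at @binding(1), matching the shader's declaration.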
    const bg = t.device.createBindGroup({
      layout: pipeline.getBindGroupLayout(0),
      entries: [
        {
          binding: 1,
          resource: {
            buffer: outputBuffer,
          },
        },
      ],
    });

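    // Encode a single dispatch of the compute shader and submit it.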
    const encoder = t.device.createCommandEncoder();
    const pass = encoder.beginComputePass();
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, bg);
    pass.dispatchWorkgroups(1, 1, 1);
    pass.end();
    t.queue.submit([encoder.finish()]);

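    // Read the output buffer back and check that it holds the expected value.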
    const bufferReadback = await t.readGPUBufferRangeTyped(outputBuffer, {
      srcByteOffset: 0,
      type: Uint8Array,
      typedLength: expectedArray.length,
      method: 'copy',
    });
    const got: Uint8Array = bufferReadback.data;

    t.expectOK(checkElementsEqual(got, expectedArray));
  });

Footnotes

  1. By 'no tests fail' I actually mean 'no previously-passing tests fail', since some tests fail on the current implementations, which I believe is a known issue.

  2. We have a couple detailed questions about the best way to handle different number representations in our expected output buffer, but we can get into that later :)

@afd

afd commented Mar 20, 2025

@dneto0 and @alan-baker for info.

@dneto0
Contributor

dneto0 commented Mar 20, 2025

Thanks for the offer to contribute!

  • I would be happy to accept the test you've given.
    • It's not excessively full of boilerplate code. It's readable and to the point (heh, I guess WebGPU is a nice API!).
    • It's easy to see that failing the test definitely indicates incorrect behaviour.
  • These kinds of tests have value. They increase the ability of the suite to find potential defects in implementations.
    • They are of most value to the maintainers of the implementations that were mutated to find the test; in this case, Dawn and Mesa.
    • As a maintainer of a different implementation I would have less interest in running these frequently, but more than zero interest.
    • As an idealistic user, I'd want all implementations to pass all such tests.
  • There is also a tension with how long it takes to run the test suite (e.g. the Dawn team runs the full suite across multiple implementations for every code change).

This suggests to me that it's appropriate to land such tests somewhere in the tree, but in a segregated bucket, possibly organized according to which backends were mutated to find these separating tests.

I'd advise that each test come with (machine-readable) labeling to indicate which implementation (and version?) was mutated to find the separating bug, and where that mutation occurred. E.g. in this case it might be something like:

  mutations: [
    { package: 'dawn', version: '<git-hash>' },
    { package: 'mesa', version: '<git-hash>', mutated: { file: '<path>', line: <number> } }
  ]

Also, if a test was found by mutating Mesa, then put it under a 'mesa' tree, e.g.

 webgpu-mutation:mesa,*

I don't know whether it's preferable to have a subdirectory for mutation tests (webgpu:mutation,*) or a separate top-level tree (webgpu-mutation:*). That depends on how consumers are configured to run "all" or "all plus mutations".

Tagging @kainino0x for further advice.

@ambergorzynski
Author

Great! I attach a trio of examples (including the one above) so that you can see the degree of similarity and difference between a couple of tests. These examples do not feature input buffers, but other tests may.

On your point about organisation according to the mutation subject: I agree that keeping track of this information is useful, and that maintainers are more likely to be interested in tests created from mutations of their own implementation. But there is likely also value in encouraging these tests to be run across all implementations (perhaps only occasionally, due to the time-budget issue). I understand from @afd that the GraphicsFuzz tests created based on driver-specific coverage turned out to be effective at exposing problems in a range of drivers. So I wonder whether there is a way to balance this potential value against CTS running time?
