# nvFuser Python Benchmarks
The Python benchmarks use `pytest-benchmark` and `torch.profiler`. Most of the CPP benchmarks have been ported to Python. The key differences compared to the CPP interface are:
- Validation: Python benchmarks validate the nvFuser output against the torch output to verify correctness.
- PyTorch baselines (`torch.compile` and `eager`): Python benchmarks support benchmarking other executors such as `torch.compile` and `eager`.
- Python benchmarks use CUPTI (through `torch.profiler`) for accurate and low-overhead kernel measurements.
Running the benchmarks:

- Running a benchmark file: `NVFUSER_DISABLE=kernel_reuse pytest [options] <benchmark-file>`
- Running the complete benchmark suite: `NVFUSER_DISABLE=kernel_reuse pytest [options] python_benchmarks/`
- Sharding: Pytest is memory-intensive, which can result in CPU OOMs when running a large number of tests. Sharding is recommended when running the complete benchmark suite. We use `pytest-shard` in our CI. To execute a specific shard out of `n` total shards: `NVFUSER_DISABLE=kernel_reuse pytest --shard-id=i --num-shards=n [options]`, where `i = {0..n-1}`.
- Running a subset of the inputs for any benchmark: `NVFUSER_DISABLE=kernel_reuse pytest <benchmark-file> --benchmark-num-inputs=10`. This randomly samples 10 input sizes for the given benchmark.
Note: It is recommended to disable kernel reuse to get reliable performance measurements in all benchmarks.
Pytest/pytest-benchmark options:

- Filtering benchmarks: `-k <filter>`
- Saving benchmarks: `--benchmark-save=NAME`, `--benchmark-autosave`, `--benchmark-json=PATH`
- Debugging: `--benchmark-verbose`
Custom command-line options:

- Disable output validation: `--disable-validation` skips the output validation in the nvFuser benchmarks.
- Disable benchmarking: `--disable-benchmarking` skips the nvFuser benchmarking; useful for testing only the correctness of fusion definitions without benchmarking the fusions.
- Run eager mode benchmarks: `--benchmark-eager`
- Run torch.compile mode benchmarks: `--benchmark-torchcompile`
- Setting custom rounds / warmup rounds: `--benchmark-rounds` and `--benchmark-warmup-rounds` can be used to override the default values (`rounds=10`, `warmup_rounds=1`).
- Running a subset of input sizes: `--benchmark-num-inputs=n` randomly samples `n` input sizes out of the complete input set to run the benchmark. This is useful for testing new changes.
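These flags are registered through pytest's standard `pytest_addoption` hook and surface inside the tests as plain fixture arguments (e.g., `disable_validation` in the skeletons below). The snippet is only a minimal sketch of that wiring; the actual `python_benchmarks/conftest.py` may register and expose the options differently.

```python
# conftest.py -- a minimal sketch; the real python_benchmarks/conftest.py may differ.
import pytest


def pytest_addoption(parser):
    # Register the custom flags so pytest accepts them on the command line.
    parser.addoption("--disable-validation", action="store_true", default=False)
    parser.addoption("--disable-benchmarking", action="store_true", default=False)


@pytest.fixture
def disable_validation(request):
    # Tests receive the flag value as a plain boolean fixture argument.
    return request.config.getoption("--disable-validation")


@pytest.fixture
def disable_benchmarking(request):
    return request.config.getoption("--disable-benchmarking")
```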
To benchmark any target function, use `run_benchmark` (defined in `python_benchmarks/core.py`):

`run_benchmark(benchmark, target_function, function_inputs, iobytes=None)`
Arguments:

- `benchmark`: pytest-benchmark fixture passed to every function intended to be run as a benchmark by pytest.
- `target_function`: Function to benchmark.
- `function_inputs`: List of inputs to the `target_function`.
- `iobytes` (optional): Use this for any executor other than nvFuser if its inputs/outputs differ from nvFuser's. See PR #1725. By default, the IO bytes are computed automatically from the inputs/outputs of the target function.
Example:

```python
# Parametrize over any number of arguments (e.g., input sizes, dtypes)
@pytest.mark.parametrize("param1", [...])
@pytest.mark.parametrize("param2", [...])
def test_example_benchmark(benchmark, param1, param2):
    # Set up function inputs
    run_benchmark(benchmark, target_function, function_inputs)
```
The benchmark name should start with `test_` so that it is automatically discovered by pytest.
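As a concrete instance of this template, here is a minimal sketch benchmarking a plain PyTorch function. The function name, sizes, and dtypes are illustrative, the import path for `run_benchmark` is an assumption, and the sketch assumes the target function receives the whole input list (the same convention as `fd.execute` in the nvFuser benchmarks below).

```python
import pytest
import torch

from .core import run_benchmark  # assumed import path within python_benchmarks/


def gelu_fwd_fn(inputs):
    # Hypothetical target function; it receives the full input list.
    return torch.nn.functional.gelu(inputs[0])


@pytest.mark.parametrize("size", [(2048, 2048), (4096, 4096)])
@pytest.mark.parametrize("dtype", [torch.float16, torch.float32])
def test_gelu_benchmark(benchmark, size, dtype):
    # Set up function inputs on the GPU.
    inputs = [torch.randn(*size, device="cuda", dtype=dtype)]
    run_benchmark(benchmark, gelu_fwd_fn, inputs)
```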
- nvFuser benchmarks: For benchmarking nvFuser fusion definitions.
  - Benchmark name: `test_{benchmark_name}_nvf_benchmark`
  - It is recommended to validate the nvFuser fusion outputs.
  - The `disable_validation` argument is used to skip validation through CLI arguments.
  - The `disable_benchmarking` argument is used to skip the nvFuser benchmarks, which run by default.
```python
def test_{benchmark}_nvf_benchmark(
    benchmark, ..., disable_validation, disable_benchmarking
):
    # Set up function inputs and initialize the fusion definition
    if not disable_validation:
        # Validate the fusion output against a reference torch implementation
        ...
    if not disable_benchmarking:
        run_benchmark(benchmark, target_function, function_inputs)
```
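For a concrete illustration of this skeleton, here is a hedged sketch that benchmarks a trivial pointwise fusion. The `FusionDefinition` calls follow the public nvFuser Python API, but the fusion, tensor sizes, and validation shown here are made up for illustration and are not one of the suite's benchmarks.

```python
import pytest
import torch
from nvfuser import FusionDefinition, DataType


def relu_fusion(fd: FusionDefinition) -> None:
    # A trivial pointwise fusion, used purely for illustration.
    t0 = fd.define_tensor(shape=[-1, -1], dtype=DataType.Float)
    t1 = fd.ops.relu(t0)
    fd.add_output(t1)


@pytest.mark.parametrize("size", [(2048, 2048)])
def test_relu_nvf_benchmark(
    benchmark, size, disable_validation, disable_benchmarking
):
    inputs = [torch.randn(*size, device="cuda", dtype=torch.float32)]
    with FusionDefinition() as fd:
        relu_fusion(fd)
    if not disable_validation:
        # Compare the fusion output against the eager torch reference.
        nvf_out = fd.execute(inputs)
        torch.testing.assert_close(nvf_out[0], torch.relu(inputs[0]))
    if not disable_benchmarking:
        run_benchmark(benchmark, fd.execute, inputs)
```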
- Torch/Thunder benchmarks: For benchmarking `eager`, `torch.compile`, and `thunder.jit` (with the `nvfuser`/`torchcompile` executors).

```python
@pytest.mark.parametrize("executor", [...])  # Supported executors are 'eager', 'torchcompile', 'thunder', 'thunder-torchcompile'
def test_{benchmark}_baseline_benchmark(
    benchmark, executor, ..., disable_validation  # disable_validation is optional
):
    benchmark_fn = with_executor(fn, executor, **kwargs)
    run_benchmark(benchmark, benchmark_fn, function_inputs)
```
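Below is a hedged, concrete version of this skeleton for a simple pointwise forward function. The operation, sizes, and dtype are illustrative; `with_executor` is called with the argument order shown in the skeleton above, and the target function receives the whole input list as in the earlier sketches.

```python
import pytest
import torch


def silu_mul_fwd_fn(inputs):
    # Illustrative forward function; receives the full input list.
    x, y = inputs
    return torch.nn.functional.silu(x) * y


@pytest.mark.parametrize("executor", ["eager", "torchcompile"])
def test_silu_mul_baseline_benchmark(benchmark, executor):
    x = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
    y = torch.randn_like(x)
    # Wrap the forward function for the requested executor (argument order
    # follows the skeleton above; the actual helper lives in core.py).
    benchmark_fn = with_executor(silu_mul_fwd_fn, executor)
    run_benchmark(benchmark, benchmark_fn, [x, y])
```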
- For backward-pass benchmarks, the forward function is compiled, and only the backward call is profiled. The utility function `unary_bwd_torch` is used; its inputs are `[output, grad, fwd_inputs]`. The inputs to the forward function are used to clear gradients. These benchmarks should also set `iobytes`, since not all of the inputs/outputs of the backward pass are explicit (for instance, intermediate tensors saved from forward for backward, or the computed gradients).
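The sketch below illustrates that pattern for a softmax backward baseline. The helper names (`with_executor`, `unary_bwd_torch`, `run_benchmark`) follow this README, the input-list layout follows the `[output, grad, fwd_inputs]` description above, and the `iobytes` value is a rough, illustrative estimate rather than the suite's actual accounting.

```python
import pytest
import torch


def softmax_fwd_fn(inputs):
    # Illustrative forward function; receives the full input list.
    return torch.nn.functional.softmax(inputs[0], dim=-1)


@pytest.mark.parametrize("executor", ["eager", "torchcompile"])
def test_softmax_bwd_baseline_benchmark(benchmark, executor):
    x = torch.randn(
        2048, 2048, device="cuda", dtype=torch.bfloat16, requires_grad=True
    )
    fwd_fn = with_executor(softmax_fwd_fn, executor)
    # Run the compiled forward pass once; only the backward call is measured.
    output = fwd_fn([x])
    grad = torch.randn_like(output)
    # Rough estimate: read the output and the incoming grad, write grad of x.
    iobytes = 3 * x.numel() * x.element_size()
    run_benchmark(
        benchmark,
        unary_bwd_torch,
        [output, grad, x],  # [output, grad, fwd_inputs] as described above
        iobytes=iobytes,
    )
```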
References:

- Pytest: https://docs.pytest.org/en/stable/
- Pytest-benchmark: https://pytest-benchmark.readthedocs.io/en/latest/index.html
- Pytest-shard: https://pypi.org/project/pytest-shard/