Benchmarking framework: Updated #703

Merged: 24 commits merged into main on Apr 25, 2025

Conversation

@kotsaloscv kotsaloscv commented Mar 25, 2025

Update the benchmarking infrastructure and extend it for the granule tests.

The helpers.run_verify_and_benchmark function runs, verifies, and (if enabled) benchmarks the function under investigation, whether a stencil test or a granule run. A benchmark run is therefore always preceded by verification. In addition, the infrastructure wraps running, verification, and benchmarking in a single, compact call.
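For orientation, a minimal sketch of what such a helper could look like, based only on the behavior described in this thread (the real icon4py implementation may differ in its details):

from typing import Callable, Optional


def run_verify_and_benchmark(
    test_func: Callable[[], None],
    verification_func: Callable[[], None],
    benchmark_fixture: Optional[object] = None,
) -> None:
    """Run the investigated function, always verify it, and benchmark it only
    when a pytest-benchmark fixture is passed and enabled."""
    test_func()  # run the stencil test or granule once
    verification_func()  # verification is unconditional
    if benchmark_fixture is not None and benchmark_fixture.enabled:
        benchmark_fixture(test_func)  # the fixture re-runs the function for timing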

@kotsaloscv kotsaloscv self-assigned this Mar 25, 2025
@kotsaloscv kotsaloscv requested review from havogt and egparedes March 26, 2025 08:57
@kotsaloscv kotsaloscv marked this pull request as ready for review April 8, 2025 06:18
@kotsaloscv (Collaborator Author)

@egparedes any feedback on these changes?
Thanks!

@egparedes egparedes (Contributor) left a comment

I like the general cleanup of the test logic, but there are a couple of issues to fix.

):
    if orchestration and not helpers.is_dace(backend):
        pytest.skip("Orchestration test requires a dace backend.")

    if benchmark.enabled and experiment == dt_utils.REGIONAL_EXPERIMENT:
Contributor

This will skip the validation test for the regional experiment if the benchmark is enabled, which I don't think is the right behavior. See next comment below.

Contributor

This issue hasn't been addressed yet. One solution could be to make the benchmark argument optional in the run_verify_and_benchmark helper and then do something like:

Suggested change:
- if benchmark.enabled and experiment == dt_utils.REGIONAL_EXPERIMENT:
+ if experiment == dt_utils.REGIONAL_EXPERIMENT:
+     # Skip benchmarks for this experiment
+     benchmark = None

In this case the helper would always verify the function but would only run the benchmark if the benchmark fixture is passed and enabled.
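To illustrate the caller-side pattern, a minimal sketch (run_program, verify_result, and is_regional are hypothetical stand-ins; the helpers import and the benchmark_fixture keyword match what appears later in this thread):

import functools

import numpy as np
import pytest

from icon4py.model.testing import helpers


def run_program(field: np.ndarray) -> None:
    field += 1.0  # stands in for the real stencil or granule run


def verify_result(field: np.ndarray) -> None:
    np.testing.assert_allclose(field, 1.0)


@pytest.mark.parametrize("is_regional", [True, False])
def test_example(benchmark, is_regional: bool) -> None:
    if is_regional:
        benchmark = None  # skip benchmarking for this experiment; verification still runs
    field = np.zeros(10)
    helpers.run_verify_and_benchmark(
        functools.partial(run_program, field=field),
        functools.partial(verify_result, field=field),
        benchmark_fixture=benchmark,
    )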

@kotsaloscv kotsaloscv requested a review from egparedes April 9, 2025 11:13
@egparedes egparedes (Contributor) left a comment

I like the current approach and have a couple of suggestions to improve it further.

@kotsaloscv kotsaloscv requested a review from egparedes April 10, 2025 06:45
@egparedes egparedes (Contributor) left a comment

Minor style-related comments. Additionally, it would be good to add some tests for the most critical/complex functions here.

Comment on lines 130 to 131
input_data: dict,
reference_outputs,
Contributor

Missing and too generic type hints.

Comment on lines 152 to 155
connectivities_as_numpy: dict,
input_data: dict,
benchmark, # benchmark fixture
):
Contributor

Missing and too generic type hints.
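A sketch of the kind of annotations being asked for, using the concrete types that show up in a later revision in this thread (gtx is assumed to be the usual gt4py.next import alias; the function name and parameter list are illustrative only):

import gt4py.next as gtx
import numpy as np
import pytest


def _test_and_benchmark(
    connectivities_as_numpy: dict[str, np.ndarray],
    input_data: dict[str, gtx.Field],
    benchmark: pytest.FixtureRequest,
) -> None:
    ...  # body omitted; only the annotations matter here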

@kotsaloscv (Collaborator Author)

> Minor style-related comments. Additionally, it would be good to add some tests for the most critical/complex functions here.

FYI, @egparedes

Once you are happy with the verification/benchmarking infrastructure, I will add more granule tests (I just wanted to stabilize it first and then simply apply it to other tests; that's the easy part ;) ).

@kotsaloscv kotsaloscv requested a review from egparedes April 10, 2025 14:24
@egparedes (Contributor)

> FYI, @egparedes
>
> Once you are happy with the verification/benchmarking infrastructure, I will add more granule tests (I just wanted to stabilize it first and then simply apply it to other tests; that's the easy part ;) ).

I'm sorry if my comment about adding tests was misleading, but I meant to add unit tests for the benchmarking infrastructure enhanced in this PR.

@kotsaloscv (Collaborator Author)

> FYI, @egparedes
> Once you are happy with the verification/benchmarking infrastructure, I will add more granule tests (I just wanted to stabilize it first and then simply apply it to other tests; that's the easy part ;) ).
>
> I'm sorry if my comment about adding tests was misleading, but I meant to add unit tests for the benchmarking infrastructure enhanced in this PR.

How about now?

@egparedes egparedes (Contributor) left a comment

More comments.

Comment on lines 157 to 158
benchmark, # benchmark fixture
):
Contributor

Still missing type hints

):
    if orchestration and not helpers.is_dace(backend):
        pytest.skip("Orchestration test requires a dace backend.")

    if benchmark.enabled and experiment == dt_utils.REGIONAL_EXPERIMENT:
Contributor

This issue hasn't been addressed yet. One solution could be to make the benchmark argument optional in the run_verify_and_benchmark helper and then do something like:

Suggested change:
- if benchmark.enabled and experiment == dt_utils.REGIONAL_EXPERIMENT:
+ if experiment == dt_utils.REGIONAL_EXPERIMENT:
+     # Skip benchmarks for this experiment
+     benchmark = None

In this case the helper would always verify the function but would only run the benchmark if the benchmark fixture is passed and enabled.

Comment on lines 29 to 32
all_correct = np.all(field == (base_value + increment))

assert type(all_correct) is np.bool_
assert all_correct, f"Field verification failed"
Contributor

Suggested change:
- all_correct = np.all(field == (base_value + increment))
- assert type(all_correct) is np.bool_
- assert all_correct, f"Field verification failed"
+ np.testing.assert_allclose(field, base_value + increment)


def test_verification_benchmarking_infrastructure(benchmark):
    base_value = 1
    field = base_value*np.ones((1000, 1000), dtype=base_type)
Contributor

No need to use an array, a simple 0d field should be ok for this function

Suggested change:
- field = base_value*np.ones((1000, 1000), dtype=base_type)
+ field = base_value*np.ones((), dtype=base_type)

Comment on lines 50 to 51
incr_func(field, increment)
verify_field(field, increment, current_base_value)
Contributor

What is this testing?

Collaborator Author

Just to make sure that Python passes correctly through the run_verify_and_benchmark function and increments the field; the second time we call these functions, the base value has already been altered.

        benchmark,
    )

    current_base_value = np.random.choice(field.flat)
Contributor

For 0d fields:

Suggested change:
- current_base_value = np.random.choice(field.flat)
+ current_base_value = field[()]


from icon4py.model.testing import helpers

base_type = np.int64
Contributor

If this is a constant, use the proper naming convention (BASE_TYPE), but I don't see the need for it in such a small example. Additionally, in the NumPy world, this is almost always called dtype, not type.
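For example, a one-line illustration of the naming nit (later snippets in this thread indeed use this spelling):

import numpy as np

BASE_DTYPE = np.int64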

Comment on lines 41 to 45
    helpers.run_verify_and_benchmark(
        functools.partial(incr_func, field=field, increment=increment),
        functools.partial(verify_field, field=field, increment=increment, base_value=base_value),
        benchmark,
    )
Contributor

This is only testing the happy path where everything works, but it's not testing that the verification fails appropriately when needed. Also, we shouldn't request and use the actual benchmark fixture here, to avoid messing up the benchmark results. Just use a mock, setting the enabled attribute to both True and False, to check that the dispatching logic is ok and only calls the benchmarking function when needed.

Collaborator Author

FYI @egparedes
I now fixed the testing of the non-happy path as well.
Let me know if that's ok.

@kotsaloscv kotsaloscv requested a review from egparedes April 14, 2025 14:42
@kotsaloscv kotsaloscv requested a review from halungge April 16, 2025 08:25
@kotsaloscv (Collaborator Author)

@halungge do you think that we can add more granule tests for benchmarking, other than the ones that I have already included? Thanks!

@egparedes egparedes (Contributor) left a comment

Only small things from previous rounds which have not yet been addressed.

np.testing.assert_allclose(field, base_value + increment)


def test_verification_benchmarking_infrastructure(benchmark: Optional[pytest.FixtureRequest]):
Contributor

The benchmark fixture doesn't need to be used here, so it shouldn't be requested. Please remove it.

Comment on lines 42 to 46
    helpers.run_verify_and_benchmark(
        functools.partial(incr_func, field=field, increment=increment),
        functools.partial(verify_field, field=field, increment=increment, base_value=base_value),
        benchmark_fixture=None,  # no need to benchmark this test
    )
Contributor

As I said in a previous review, instead of passing the real benchmark fixture you should pass a mock (please read the documentation at https://devdocs.io/python~3.13/library/unittest.mock#unittest.mock.Mock) here, with the enabled attribute set to True/False, and then assert whether it has been called or not. Something like:

Suggested change:
- helpers.run_verify_and_benchmark(
-     functools.partial(incr_func, field=field, increment=increment),
-     functools.partial(verify_field, field=field, increment=increment, base_value=base_value),
-     benchmark_fixture=None,  # no need to benchmark this test
- )
+ benchmark = mock.Mock(enabled=False)
+ helpers.run_verify_and_benchmark(
+     functools.partial(incr_func, field=field, increment=increment),
+     functools.partial(verify_field, field=field, increment=increment, base_value=base_value),
+     benchmark_fixture=benchmark,
+ )
+ assert not benchmark.called
+ benchmark.enabled = True
+ helpers.run_verify_and_benchmark(
+     functools.partial(incr_func, field=field, increment=increment),
+     functools.partial(verify_field, field=field, increment=increment, base_value=base_value),
+     benchmark_fixture=benchmark,
+ )
+ assert benchmark.called

This is only an example to show how to use it; don't take it as a literal suggestion. Try to fit something like that into the test in a way that makes sense and covers all relevant code paths.

connectivities_as_numpy: dict[str, np.ndarray],
input_data: dict[str, gtx.Field],
benchmark: pytest.FixtureRequest,
):
Contributor

Missing return type hint

@@ -199,8 +211,7 @@ def __init_subclass__(cls, **kwargs):
         # reflect the name of the test we do this dynamically here instead of using regular
         # inheritance.
         super().__init_subclass__(**kwargs)
-        setattr(cls, f"test_{cls.__name__}", _test_validation)
-        setattr(cls, f"test_{cls.__name__}_benchmark", _test_execution_benchmark)
+        setattr(cls, f"test_{cls.__name__}", _test_and_benchmark)


def reshape(arr: np.ndarray, shape: tuple[int, ...]):
Contributor

Not related to your PR, but since you're already touching this file, maybe you could add the proper return type hint here.
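For instance, the requested annotation would look something like the following sketch (only the signature appears in this thread; the body is an assumption based on the function name):

import numpy as np


def reshape(arr: np.ndarray, shape: tuple[int, ...]) -> np.ndarray:
    # Assumed body: a thin wrapper around NumPy reshaping.
    return arr.reshape(shape)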

@kotsaloscv kotsaloscv requested a review from egparedes April 16, 2025 13:32
@@ -103,6 +106,11 @@ def test_advection_run_single_step(
pytest.xfail(
"This test is skipped until the cause of nonzero horizontal advection if revealed."
)

if ntracer == 1:
# Skip benchmarks for this experiment
Contributor

Isn't this currently always the case?

Collaborator Author

Since we pass the benchmark fixture by default, if we do not set this condition, it will benchmark the test for multiple parametrizations. By setting this, it benchmarks only the case with 4 tracers. But this is just a suggestion from my side; if you want otherwise, please let me know.

Contributor

I would call this file test_helpers.py. It is not always like that in icon4py, but the tests for something in a module foo.py should go into test_foo.py for an easier overview.

), "Base values should not be equal. Otherwise, the test did not go through incr_func/ verify_field functions."

# Expect AssertionError
with pytest.raises(AssertionError):
Contributor

I don't really understand this test. Why should it raise? The function under test, run_verify_and_benchmark, does not raise, does it? And shouldn't there be a test for the case that the benchmark is called, and one for the case that it isn't?

Collaborator Author

It raises an error because of this:

            functools.partial(
                verify_field, field=field, increment=increment, base_value=base_value
            ),  # base_value should be current_base_value

i.e. to be correct we would need base_value=current_base_value. It is there just to test whether run_verify_and_benchmark raises errors correctly.

Collaborator Author

Regarding your comment about testing the benchmark branch, we implicitly test it with the mock object benchmark = mock.Mock(enabled=False), i.e. we pass the benchmark fixture (not None), but with the enabled attribute set to False.

@kotsaloscv (Collaborator Author)

FYI @egparedes @halungge
is it ok now?
If yes, we can merge it.

@halungge halungge (Contributor) commented Apr 23, 2025

It is still hard to understand what you want to test here. You can use mocks not only for the benchmark but also for the functions, since all you want to check is whether they are called or not. I would do something like this:

@pytest.mark.parametrize("benchmark_enabled", [True, False])
def test_verification_benchmarking_infrastructure(benchmark_enabled):
    base_value = 1
    field = np.array((base_value * np.ones((), dtype=BASE_DTYPE)))

    increment = 6

    benchmark = mock.Mock(enabled=benchmark_enabled)
    test_func = mock.Mock(wraps=incr_func)
    verification_func = mock.Mock(wraps=verify_field)

    helpers.run_verify_and_benchmark(
        functools.partial(test_func, field=field, increment=increment),
        functools.partial(verification_func, field=field, increment=increment, base_value=base_value),
        benchmark_fixture=benchmark,
    )
    test_func.assert_called_once()
    verification_func.assert_called_once()
    benchmark.assert_called() if benchmark_enabled else benchmark.assert_not_called()

(using your functions, though you could even use empty functions). I think that tests all variants except for benchmark = None, which I'm not sure you can mock.

Collaborator Author

The advantage of the current solution is that it tests all branches of the helper function and provides a concrete example of how the helper function is used (with no overhead compared to the mock objects).
I would prefer to leave it as is, but if you prefer otherwise I can change it.
@egparedes any comment on the test?

Contributor

I'm not convinced the current version tests all branches, and if it does, it does so in a very obfuscated way imho. It took me way too long to understand what you were trying to test...

Collaborator Author

ok fixed @halungge

@egparedes egparedes (Contributor) left a comment

Final changes. After these fixes it should be fine for me, so I don't need another round; I'll leave the final approval to @halungge.


input_data = allocate_data(backend, input_data)
Note:
- test_func and verification_func should be provided with binded arguments, i.e. with functools.partial.
Contributor

Suggested change (in the docstring note):
-     - test_func and verification_func should be provided with binded arguments, i.e. with functools.partial.
+     - test_func and verification_func should be provided with bound arguments, i.e. with functools.partial.



@pytest.mark.parametrize("benchmark_enabled", [True, False])
def test_verification_benchmarking_infrastructure(benchmark_enabled):
Contributor

Suggested change:
- def test_verification_benchmarking_infrastructure(benchmark_enabled):
+ def test_run_verify_and_benchmark(benchmark_enabled):

Comment on lines 39 to 45
    failing_verification_func = mock.Mock(side_effect=AssertionError("Verification failed."))
    with pytest.raises(AssertionError):
        helpers.run_verify_and_benchmark(
            functools.partial(test_func, req_arg=mock.Mock()),
            failing_verification_func,
            benchmark_fixture=None,
        )
Contributor

This is not really testing anything new because we already tested earlier that verification_func is always called.

However, there is still one test case missing: calling the function with benchmark_fixture=None, which should still call the verification function.

It should be very simple to move this part to a new function with fresh mocks so we can simply assert:

Suggested change:
- failing_verification_func = mock.Mock(side_effect=AssertionError("Verification failed."))
- with pytest.raises(AssertionError):
-     helpers.run_verify_and_benchmark(
-         functools.partial(test_func, req_arg=mock.Mock()),
-         failing_verification_func,
-         benchmark_fixture=None,
-     )
+ helpers.run_verify_and_benchmark(
+     functools.partial(test_func, req_arg=mock.Mock()),
+     verification_func,
+     benchmark_fixture=None,
+ )
+ test_func.assert_called_once()
+ verification_func.assert_called_once()

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • launch jenkins spack

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

In case your change might affect downstream icon-exclaim, please consider running

  • launch jenkins icon

For more detailed information please look at CI in the EXCLAIM universe.

@halungge halungge (Contributor) left a comment

LGTM. Can you add a PR description? It will be used as the merge commit message.

@kotsaloscv kotsaloscv merged commit 2f97bc1 into main Apr 25, 2025
2 checks passed
@kotsaloscv kotsaloscv deleted the benchmark branch April 25, 2025 11:47
@egparedes (Contributor)

@kotsaloscv you forgot to add the PR description before merging. Could you add it now, so that it is at least visible when following the PR number in the commit log?

@kotsaloscv kotsaloscv changed the title from "Add more benchmark tests" to "Benchmarking framework: Updated" on Apr 30, 2025
@kotsaloscv (Collaborator Author)

> @kotsaloscv you forgot to add the PR description before merging. Could you add it now, so that it is at least visible when following the PR number in the commit log?

Indeed, my bad ;) Thanks for reminding me.

jcanton added a commit that referenced this pull request May 9, 2025
* main: (39 commits)
  Combining solve nonhydro stencils 1 to 13 (#682)
  Update gt4py to icon4py_staging_20250507 (#736)
  add test level marker (#722)
  Fix refactor diffusion metrics (#715)
  Re-add conftest again (#734)
  Separate datatests (#714)
  Rename tests (#733)
  Update to icon4py_staging_20250502 (#732)
  Simplified velocity advection stencil (#725)
  add doc string on the combined stencils on velocity advection (#707)
  Updated Benchmarking Infrastructure (#703)
  Vectorize compute vwind impl wgt (#721)
  Combining solve nonhydro stencils 14 to 28 (#292)
  Cleanup metrics fields and factory tests (cont) (#690)
  Disable tracer-advection tests (#720)
  remove unused arguments (concat_where) (#716)
  Update GT4Py version to icon4py_staging_20250417 (#717)
  enable interpolation_factory tests on GPU (#681)
  Fix fix dataset -> datatest typo and strict checking (#713)
  Reenable graupel tests (#712)
  ...