
[experimental] Run crosshair in CI #4034

Draft · wants to merge 10 commits into master
Conversation

@Zac-HD (Member) commented Jul 7, 2024

See #3914

To reproduce this locally, you can run make check-crosshair-cover/nocover/niche for the same command as in CI, but I'd recommend pytest --hypothesis-profile=crosshair hypothesis-python/tests/{cover,nocover,datetime} -m xf_crosshair --runxfail to select and run only the xfailed tests.

Hypothesis' problems

  • The vast majority of failures are Flaky: Inconsistent results from replaying a failing test... - mostly backend-specific failures; we've both
    • improved reporting in this case to show the crosshair-specific traceback
    • got most of the affected tests passing
  • Invalid internal boolean probability, e.g. assert p > 2 ** (-64) in draw_boolean at hypothesis/internal/conjecture/data.py line 2277; fixed in 1f845e0 (#4049)
  • many of our test helpers involved nested use of @given, fixed in 3315be6
  • symbolic outside context
  • avoid uninstalling typing_extensions when crosshair depends on it
  • tests which are not really expected to pass on other backends. I'm slowly applying a backend-specific xfail decorator to them, @xfail_on_crosshair(...).
    • tests which expect to raise a healthcheck, and fail because our crosshair profile disables healthchecks. Disable only .too_slow and .filter_too_much, and skip remaining affected tests under crosshair.
    • undo some over-broad skips (e.g. various xfail decorators, pytestmarks, -k 'not decimal') once we're closer
  • provide a special exception type for when running the test or realizing values would hit a PathTimeout; see Rare PathTimeout errors in provider.realize(...) pschanely/hypothesis-crosshair#21 and Stable support for symbolic execution #3914 (comment)
    • and something to signal that we've exhausted Crosshair's ability to explore the test. If this is sound, we've verified the function and can stop! (and should record that in the stop_reason). If unsound, we can continue testing with Hypothesis' default backend - so it's important to distinguish.
      Add BackendCannotProceed to improve integration #4092
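The backend-specific xfail decorator mentioned above could be sketched roughly as follows. This is a hedged illustration, not the decorator from the Hypothesis test suite: the `current_backend` helper is hypothetical (the real suite would consult the active settings profile), and only the general shape of `@xfail_on_crosshair(...)` is implied by the discussion.

```python
import pytest

def current_backend():
    # Hypothetical helper: the real test suite would read the active
    # Hypothesis settings profile; hardcoded here purely for illustration.
    return "crosshair"

def xfail_on_crosshair(reason="not expected to pass on the crosshair backend"):
    # Expected-failure mark that only applies when crosshair is active;
    # strict=False because crosshair may still pass on some runs.
    return pytest.mark.xfail(
        condition=current_backend() == "crosshair",
        reason=reason,
        strict=False,
    )

@xfail_on_crosshair(reason="relies on Hypothesis-internal buffer details")
def test_example():
    assert 1 + 1 == 2
```

Because the condition is evaluated per-run, the same test file works unchanged under the default backend, where the mark simply doesn't apply.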

Probably Crosshair's problems

Error in operator.eq(Decimal('sNaN'), an_int)

____ test_rewriting_does_not_compare_decimal_snan ____
  File "hypothesis/strategies/_internal/strategies.py", line 1017, in do_filtered_draw
    if self.condition(value):
TypeError: argument must be an integer
while generating 's' from integers(min_value=1, max_value=5).filter(functools.partial(eq, Decimal('sNaN')))

Cases where crosshair doesn't find a failing example but Hypothesis does

Seems fine, there are plenty of cases in the other direction. Tracked with @xfail_on_crosshair(Why.undiscovered) in case we want to dig in later.

Nested use of the Hypothesis engine (e.g. given-inside-given)

This is just explicitly unsupported for now. Hypothesis should probably offer some way for backends to declare that they don't support this, and then raise a helpful error message if you try anyway.
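One way the "backends declare they don't support nesting" idea could look is a capability flag on the provider class, checked before entering a nested given. This is a hypothetical sketch - `supports_nested_given`, `ExampleProvider`, and `check_nesting` are all illustrative names, not Hypothesis' real API.

```python
# Hypothetical sketch (not Hypothesis' real API): a capability flag on the
# provider class, checked by the engine before entering a nested given.
class ExampleProvider:
    supports_nested_given = False  # this backend opts out of nesting

def check_nesting(provider_cls, inside_given):
    # Raise a helpful error instead of failing obscurely deep in the engine.
    if inside_given and not getattr(provider_cls, "supports_nested_given", True):
        raise RuntimeError(
            f"backend {provider_cls.__name__!r} does not support nested use "
            "of @given; flatten the test or switch backends"
        )

check_nesting(ExampleProvider, inside_given=False)  # top-level use: fine
```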

@Zac-HD Zac-HD added tests/build/CI about testing or deployment *of* Hypothesis interop how to play nicely with other packages labels Jul 7, 2024

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 3 times, most recently from 175b347 to 424943f Compare July 7, 2024 20:26
@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 424943f to b2d11c7 Compare July 7, 2024 20:56
@pschanely (Contributor)

@Zac-HD your triage above is SO great. I am investigating.

@pschanely (Contributor) commented Jul 8, 2024

Knocked out a few of these in 0.0.60.
I think that means current status on my end is:

  • TypeError: conversion from SymbolicInt to Decimal is not supported
  • Unsupported operand type(s) for -: 'float' and 'SymbolicFloat' in test_float_clamper
  • TypeError: descriptor 'keys' for 'dict' objects doesn't apply to a 'ShellMutableMap' object (or 'values' or 'items').
  • TypeError: _int() got an unexpected keyword argument 'base'
  • Symbolic not realized (in e.g. test_suppressing_filtering_health_check)
  • Error in operator.eq(Decimal('sNaN'), an_int)
  • Zac's cursed example below!

More soon.

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from b2d11c7 to 98ccf44 Compare July 11, 2024 07:23
@Zac-HD (Member, Author) commented Jul 12, 2024

Ah - the Flaky failures are of course because we had some failure under the Crosshair backend which did not reproduce under the Hypothesis backend. This presumably points to a range of integration bugs, but it's also something we'll want to explain clearly to users: integration bugs will certainly happen in future, and users will need to respond (e.g. by using a different backend, or ignoring the problem).

  • improve the reporting around Flaky failures where the differing or missing errors are related to a change of backend while shrinking. See also Change Flaky to be an ExceptionGroup #4040.
  • triage all the current failures so we can fix them


@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 98ccf44 to 4bd7e45 Compare July 12, 2024 07:48
@tybug (Member) commented Jul 12, 2024

Most/all of the "expected x, got symbolic" errors are symptoms of an underlying error in my experience (often an operation on a symbolic while not tracing). In this case, running with export HYPOTHESIS_NO_TRACEBACK_TRIM=1 reveals that limited_category_index_cache in cm.query is at fault.

@Zac-HD (Member, Author) commented Jul 12, 2024

Ah-ha - seems like we might want some #4029-style "don't cache on backends with avoid_realize=True" logic.
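The cache-bypass idea could be sketched like this. It's a hedged illustration only - `cached_query` and the `avoid_realization` attribute spelling are assumptions for the sake of the example, not Hypothesis' actual internals.

```python
# Hedged sketch of the "don't cache on avoid_realization backends" idea;
# names (cached_query, avoid_realization) are illustrative.
_cache = {}

def cached_query(key, compute, provider):
    # Symbolic values must not be realized just to serve as cache keys,
    # so bypass the cache entirely for backends that avoid realization.
    if getattr(provider, "avoid_realization", False):
        return compute(key)
    if key not in _cache:
        _cache[key] = compute(key)
    return _cache[key]
```

Under this scheme a crosshair-backed run recomputes every query, trading a little speed for never touching symbolic values outside the tracing context.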

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from 1d2345d to 7bf8983 Compare July 12, 2024 20:15
@pschanely (Contributor)

Still here and excited about this! I am on a detour of doing a real symbolic implementation of the decimal module - should get that out this weekend.

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from cc07927 to 018ccab Compare July 13, 2024 07:23
@Zac-HD (Member, Author) commented Jul 13, 2024

Triaging a pile of the Flaky errors, most were due to getting a RecursionError under crosshair and then passing under Hypothesis - and it looks like most of those were in turn caused by all our nested-@given() test helpers.

So I've tried de-nesting those, which seems to work nicely and even makes things a bit faster by default; and when CI finishes we'll see how much it helps on crosshair 🤞
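The de-nesting pattern can be illustrated as below. This is a hypothetical helper for the sake of the example, not one taken from the Hypothesis test suite; the point is structural: move the @given from inside the helper to the calling test, so only one engine runs.

```python
from hypothesis import given, strategies as st

# Before: the helper wraps its own @given, so calling it from a @given test
# nests two Hypothesis engines - which crosshair does not support.
def assert_repr_roundtrips_nested(strategy):
    @given(strategy)
    def inner(value):
        assert eval(repr(value)) == value
    inner()

# After: the helper takes a concrete value, and the caller's single @given
# supplies it - no nested engine, and one fewer engine startup per call.
def assert_repr_roundtrips(value):
    assert eval(repr(value)) == value

@given(st.integers())
def test_int_repr_roundtrip(x):
    assert_repr_roundtrips(x)
```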


@tybug (Member) commented Jan 13, 2025

To follow up on the premature realization in ir_size: I'm waiting to complete #3921 (we are very close!) before fixing this. The notion of "size" and overruns is one of the last things to fully hash out on the typed choice sequence, which may change the resolution here.

@tybug (Member) commented Jan 21, 2025

@pschanely OK, after #4247 we should not be prematurely realizing crosshair values anywhere. Please let me know if that's not the case!

The semantics after that pull request are that we do not overrun (abort too-large test cases) for avoid_realization backends, as this would require realizing the values to get their size. So CrosshairProvider should be somewhat careful not to generate extremely large inputs. The size limit here is pretty generous, and I think crosshair already has internal heuristics to this effect, so I don't expect this to be a problem; we've just removed the guardrails for now.
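The tradeoff described here can be sketched as a size check that is simply skipped for realization-avoiding backends. Everything below - the names, the size metric, and the limit - is illustrative, not Hypothesis' actual overrun logic.

```python
# Illustrative sketch: measuring a test case's size requires concrete
# values, so the size-based overrun check is skipped entirely for
# avoid_realization backends. Names and the limit are hypothetical.
SIZE_LIMIT = 8 * 1024

def should_overrun(choices, provider):
    if getattr(provider, "avoid_realization", False):
        # Sizing the choices would force realizing symbolic values,
        # so never report an overrun for these backends.
        return False
    return sum(len(repr(c)) for c in choices) > SIZE_LIMIT
```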

@tybug (Member) commented Jan 22, 2025

Hmm, lots of nondeterministic errors in CI. We're now calling provider.draw_* from multiple distinct stacktraces. Is this something the crosshair provider could ignore? I think we'd like to only make guarantees that the (ir_type, kwargs) of each draw is deterministic across iterations, not necessarily the stacktrace at which it was drawn.
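The weaker guarantee proposed here - determinism of each draw's (ir_type, kwargs), but not of the stack it was drawn at - could be checked like this. The tuple layout is an illustrative assumption, not Hypothesis' real draw record.

```python
# Sketch: compare two iterations' draws only by (ir_type, kwargs),
# ignoring the stack trace at which each draw happened.
# The (ir_type, kwargs, stack) tuple layout is hypothetical.
def draws_consistent(prev_draws, new_draws):
    def strip(draws):
        return [(ir_type, kwargs) for ir_type, kwargs, _stack in draws]
    return strip(prev_draws) == strip(new_draws)

# The same draw from two different call sites should still count as
# deterministic under this weaker guarantee:
same_draw_diff_stack = draws_consistent(
    [("integer", (("min_value", 0),), "stack_a")],
    [("integer", (("min_value", 0),), "stack_b")],
)
```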

@pschanely (Contributor)

> The semantics after that pull is that we do not overrun (abort too-large test cases) from avoid_realization backends, as this would require realizing the values to get their size. So CrosshairProvider should be somewhat careful that it's not generating extremely large inputs. The size limit here is pretty generous and I think crosshair already has internal heuristics to this effect, so I don't expect this to be a problem. We've just removed the guardrails for now.

Sounds good; I expect that CrossHair will time itself out well before this becomes a problem.

> Hmm, lots of nondeterministic errors in CI. We're now calling provider.draw_* from multiple distinct stacktraces. Is this something the crosshair provider could ignore? I think we'd like to only make guarantees that the (ir_type, kwargs) of each draw is deterministic across iterations, not necessarily the stacktrace at which it was drawn.

Yes; the stack-based check is pretty useful when debugging nondeterminism, but can be overly conservative in complex cases. The stacks are also used by some CrossHair heuristics, so ignoring them may impact performance slightly (e.g. it'll try to figure out how to get to a prior trace and never be able to get there). At any rate, I am not FULLY sure I want to just disable the stack-based check in the long run, but I've disabled it in CrossHair 0.0.82 for now. (There are also some, hopefully net-positive, changes in v0.0.19 of the hypothesis-crosshair plugin.)

At any rate, I am still working on trying to get a full clean run on my side. You might be interested to take a look at these hacks I've made for race-patching random and fresh_data().

@tybug (Member) commented Jan 27, 2025

Nice! I think we can take the random patch more or less as is, and I'll look into the fresh_data (possibly we should skip those tests on crosshair) and zero_data pin (initial reaction is that these choices should already have been realized by the time we try to cache them, so something is going on) fixes. Both look like reasonable local hacks for enabling forward progress 🙂

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 648bbe9 to 30fe0b8 Compare March 2, 2025 01:20
@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 30fe0b8 to d51ad9f Compare March 2, 2025 09:17
@Zac-HD (Member, Author) commented Mar 2, 2025

OK, we have very many crosshair.util.CrossHairInternal: Not in a statespace context errors, which I think must have some systematic underlying cause. Not looking further at this time of day, but I'd like to get this merged in the next week or two, so we can show it off at BugBash as a stable feature 😁 (and maybe also integrate into HypoFuzz!)

@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from 102058b to 262dba3 Compare March 2, 2025 23:53
@Zac-HD Zac-HD force-pushed the crosshair-in-ci branch from 262dba3 to 95e734c Compare March 3, 2025 00:05
@tybug tybug force-pushed the crosshair-in-ci branch 2 times, most recently from 0e4d99d to 95e734c Compare March 3, 2025 07:39
@tybug (Member) commented Mar 3, 2025

(sorry, my local was out of date and force pushes above were me rebasing and subsequently realizing that had already been done in the up-to-date branch. nothing should have changed)

@Zac-HD (Member, Author) commented Mar 3, 2025

HYPOTHESIS_NO_TRACEBACK_TRIM=1 pytest --hypothesis-profile=crosshair hypothesis-python/tests/cover/test_deadline.py -x points here, which I consider pretty suspicious from a "we seem to have a lifetime error" perspective. I'm not actively chasing this, but let's stick to non-force pushes?

@tybug (Member) commented Mar 3, 2025

Yup, Phillip beat us to it 😅 pschanely@fd6958f

Underlying reason is crosshair raising discard_test_case for the initial zero_data call. We then pin it and check for health checks - but it was discarded, and therefore not realized. Maybe we should just not do any of those things if a backend raises discard_test_case? I'm hesitant to disable this for all alternative backends, but we don't currently have a way to check the BackendCannotProceed status of a data, though we could add that status to the data before freezing it.

This is also in part because we decided to give alternative backends control over NodeTemplate(type="simplest") draws in #4247.

@tybug (Member) commented Mar 17, 2025

I've addressed the zero_pin issue discussed above by tracking the cannot_proceed reason on ConjectureData, and not pinning if the backend could not proceed, assuming that if the test case cannot proceed then the values cannot be realized either.
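The shape of that fix can be sketched as follows. Class, attribute, and function names here are illustrative stand-ins, not Hypothesis' real ConjectureData internals.

```python
# Hedged sketch of the fix described above: record why the backend could
# not proceed, and skip pinning for such test cases, since their values
# were never realized. All names are hypothetical.
class Data:
    def __init__(self):
        # Set when the backend raised BackendCannotProceed,
        # e.g. "discard_test_case"; None means the run was normal.
        self.cannot_proceed_scope = None

def maybe_pin_zero_data(data, pin):
    if data.cannot_proceed_scope is not None:
        # The test case was discarded and never realized: don't pin it
        # or run health checks against it.
        return None
    return pin(data)
```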

@pschanely after doing this the CI run shows crosshair returning values of the wrong type from CrosshairPrimitiveProvider.realize on the second test case, after the first test case raises BackendCannotProceed (search hypothesis.errors.HypothesisException: expected <class 'int'>). I wonder if the provider is not removing the choices made during cannot-proceed iterations from the list of values to return while realizing?

Unfortunately this is flaky locally, possibly due to speed-related time outs. HYPOTHESIS_NO_TRACEBACK_TRIM=1 pytest --hypothesis-profile=crosshair -k test_adds_note_showing_which_strategy_stateful should reproduce in theory. Run in cd hypothesis-python, not the root dir.

@pschanely (Contributor)

> I've addressed the zero_pin issue discussed above by tracking the cannot_proceed reason on ConjectureData, and not pinning if the backend could not proceed, assuming that if the test case cannot proceed then the values cannot be realized either.

Thank you!

> @pschanely after doing this the CI run shows crosshair returning values of the wrong type from CrosshairPrimitiveProvider.realize on the second test case, after the first test case raises BackendCannotProceed (search hypothesis.errors.HypothesisException: expected <class 'int'>). I wonder if the provider is not removing the choices made during cannot-proceed iterations from the list of values to return while realizing?

SGTM; I will investigate tomorrow! One change we made a while back is possibly relevant here: to avoid false positives from approximations like real-based floats, we don't actually report the error right away; instead we abort the failing pass and try to recreate it next time with concrete draws. This can go haywire, though, if the requested draw types aren't the same on the next pass.

3 participants