
Tagging of test suites #350

Merged
merged 40 commits into from
Mar 13, 2025

Conversation

@katrinafyi (Member) commented Mar 6, 2025

In the GitHub Actions workflow, test cases are grouped into a number of coarse categories (unit tests, standard system tests, and analysis system tests). This PR aims to make these groups more concrete by defining them within the test files, instead of in the GitHub Actions workflow. This will also allow you to run a particular group on your local machine while excluding the slower-running tests.

These tags are defined as annotations within the test_util.tags package, and an annotation can be attached to any test suite class:

@test_util.tags.UnitTest
class IRTest extends AnyFunSuite { ... }

When running tests, the scalatest Runner's -n flag includes only tests with a given tag, and -l excludes them. With sbt, these arguments can be appended to the testOnly command:

sbt "testOnly * -- -n test_util.tags.UnitTest"  
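Conversely, the -l flag excludes a tag. For example, a sketch of excluding rather than including the unit tests (same Runner flags as above, tag name from this PR's test_util.tags package):

```shell
# Run every suite EXCEPT those tagged as unit tests.
sbt "testOnly * -- -l test_util.tags.UnitTest"
```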

With mill, it is more complicated. Adding -n or -l to ./mill test.testOnly will not work :( (all test suites will still be run). I think there is a problem with how Mill invokes the scalatest runner. The workaround is to manually invoke the runner:

./mill test.runMain org.scalatest.tools.Runner -o -R out/test/compile.dest/classes -n test_util.tags.UnitTest

After the -R .../classes, the usual test.testOnly arguments can be added. Note that the -o flag is important - without it, scalatest will try (and likely fail) to open a GUI. The -R argument should point to the output directory containing compiled class files.

The GitHub CI jobs are changed to use the ./mill test.runMain command. Also, there is a new CI job which reports the test suites which do not have any tag. It is my hope that every test case will eventually have at least one tag; I will need some help with this. This will make sure that every test written is being exercised by the CI. Currently, it is too easy for tests to fall through the cracks if they are not run regularly.


This PR is part of an effort to improve the testing in Basil and force us to be more honest about the tests. This means that we should have a clear understanding of what each test is testing, whether it currently passes and, if not, why it is currently not working. I will point out these scalatest features to help this goal:

  • pendingUntilFixed should be used to mark a block of code (typically containing an assertion) as one which is currently expected to fail. This should be used to record test cases which fail due to not-yet-implemented features or known bugs. If a change is made and the code no longer fails, this will cause the test to fail until the pendingUntilFixed is removed. It should be used within a test() { ... } block like so:

    test("1 == 2 sometime soon?") {
      assert(1 == 1, "obviously")
      pendingUntilFixed {
        assert(1 == 2, "todo")
      }
    }
  • Tests can be marked as ignored by replacing test() with ignore(). An entire suite can be marked as ignored with the @org.scalatest.Ignore annotation.

  • We also add a trait TestCustomisation to help with customising particular dynamically-generated tests based on their test case name (which includes file path and compiler/lifter options).

    Individual test cases can be marked as retry, disabled, or expect failure. This modifies the behaviour of the test case and prints helpful output when running the test. For example:

- correct/malloc_with_local3/clang:BAP (pending)
  + NOTE: Test case is customised with: "ExpectFailure(previous failure was: Expected verification success, but got failure. Failing assertion is: assert (load37_1 == R30_in))"
 
  + Failing assertion ./src/test/correct/malloc_with_local3/clang/malloc_with_local3_bap.bpl:264 
  + 261 |     load36_1, Gamma_load36_1 := memory_load64_le(mem_9, bvadd64(R31_in, 18446744073709551600bv64)), (gamma_load64(Gamma_mem_9, bvadd64(R31_in, 18446744073709551600bv64)) || L(bvadd64(R31_in, 18446744073709551600bv64)));
    262 |     call rely();
    263 |     load37_1, Gamma_load37_1 := memory_load64_le(mem_10, bvadd64(R31_in, 18446744073709551608bv64)), (gamma_load64(Gamma_mem_10, bvadd64(R31_in, 18446744073709551608bv64)) || L(bvadd64(R31_in, 18446744073709551608bv64)));
 >  264 |     assert (load37_1 == R30_in); //is returning to caller-set R30
    265 |     goto printCharValue_2260_basil_return;
    266 |   printCharValue_2260_basil_return:
    267 |     assume {:captureState "printCharValue_2260_basil_return"} true; 
- analysis_differential:malloc_with_local/gcc_O2:GTIRB (pending)
  + NOTE: Test case is customised with: "ExpectFailure(needs printf_chk)"
 
  + STDOUT: "" 

Note: The tag annotations must be defined in Java (apparently) to make them visible to the test runner at runtime.
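For illustration, here is a minimal sketch of why runtime visibility forces this: scalatest discovers tags reflectively, so the annotation needs RUNTIME retention. The names below are hypothetical stand-ins, and a real scalatest tag would additionally carry @org.scalatest.TagAnnotation (omitted here so the sketch compiles with the JDK alone):

```java
import java.lang.annotation.*;

// Hypothetical stand-in for an annotation like test_util.tags.UnitTest.
// A real scalatest tag must also be annotated with @org.scalatest.TagAnnotation.
@Retention(RetentionPolicy.RUNTIME) // without RUNTIME retention, reflection cannot see it
@Target({ElementType.TYPE, ElementType.METHOD})
@interface UnitTest {}

@UnitTest
class ExampleSuite {}

public class TagDemo {
    public static void main(String[] args) {
        // This reflective check is essentially what a test runner does
        // when filtering suites by tag at runtime.
        System.out.println(ExampleSuite.class.isAnnotationPresent(UnitTest.class));
    }
}
```

With the default CLASS retention (or a Scala annotation extending scala.annotation.Annotation only), the same check would return false and the runner would silently skip the tag filter.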


TODO

  • Maybe some of your explanation in the PR comment could be copied to a documentation page? - ailrst

@katrinafyi (Member Author):

SATTest fails. I tried with both z3 4.8.17 and 4.13.3. It exits non-zero with this output:

(push)
(declare-const R0 (_ BitVec 64))
(assert (= (not (= R0 (_ bv0 64))) (= (_ bv0 64) R0)))
(set-option :smt.timeout 1)
(check-sat)
(pop)
(error "line 4 column 25: unknown parameter 'timeout' at module 'smt'
Legal parameters are:
  arith.auto_config_simplex (bool) (default: false)
  arith.bprop_on_pivoted_rows (bool) (default: true)
  arith.branch_cut_ratio (unsigned int) (default: 2)
  arith.dump_lemmas (bool) (default: false)
[...]

@ailrst (Contributor) commented Mar 7, 2025

Yeah, I've seen this before; I thought I removed the option, sorry.

diff --git a/src/main/scala/translating/IRExpToSMT2.scala b/src/main/scala/translating/IRExpToSMT2.scala
index 88ae2336c..279fe5693 100644
--- a/src/main/scala/translating/IRExpToSMT2.scala
+++ b/src/main/scala/translating/IRExpToSMT2.scala
@@ -205,7 +205,7 @@ object BasilIRToSMT2 extends BasilIRExpWithVis[Sexp] {
     }
 
     val terms = list(sym("push")) :: BasilIRToSMT2.extractDecls(e)
-      ++ List(assert, list(sym("set-option"), sym(":smt.timeout"), sym("1")), list(sym("check-sat")))
+      ++ List(assert, list(sym("check-sat")))
       ++ (if (getModel) then
             List(list(sym("echo"), sym("\"" + name.getOrElse("") + "  ::  " + e + "\"")), list(sym("get-model")))
           else List())

We set timeouts through a CLI flag to z3; not sure where I got the smt.timeout setting from, but it doesn't seem to be a real setting.
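For reference, a sketch of where the timeout likely belongs (flag names as documented by z3's -h output; worth double-checking against the z3 version in use):

```shell
# Soft timeout per (check-sat) query, in milliseconds:
z3 -t:1000 query.smt2

# Hard wall-clock limit for the whole process, in seconds:
z3 -T:10 query.smt2

# The equivalent SMT-LIB option is the global ":timeout" (milliseconds),
# not ":smt.timeout":
#   (set-option :timeout 1000)
```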

@katrinafyi (Member Author):

With this change, the wall time of GitHub actions increases from 20 minutes to 28 minutes (this is dominated by AnalysisSystemTests). Unit tests are still fast at about 30 seconds.

@katrinafyi (Member Author) commented Mar 10, 2025

todo:

  • remove parallel commits (separate PR)
  • add back Failed(_) case

@ailrst (Contributor) commented Mar 10, 2025

This is really great, thanks for doing this. Looking at the output of CheckTestTagging https://github.com/UQ-PAC/BASIL/actions/runs/13734137233/job/38415722041#step:5:52 your categorisations look reasonable.

Maybe some of your explanation in the PR comment could be copied to a documentation page?

@katrinafyi katrinafyi mentioned this pull request Mar 10, 2025
Comment on lines 37 to 38
case "jumptable3/clang:GTIRB" | "switch2/clang:GTIRB" =>
Mode.ExpectFailure("Unable to evaluate expr: bitvector size mismatch")
@l-kent (Contributor) commented Mar 11, 2025

How long have these tests been failing for this reason? They should not be failing for this reason, was this caused by the simplification merge?

To be clear, these tests were not passing, but should run fine without bitvector size issues.

@ailrst (Contributor) commented Mar 11, 2025

I think so; the function unwrapPaddingAndSlicing can produce ill-typed expressions, and the greater expression variety produced by the simplification pass makes it more likely to hit those cases.

The crash here is at LocalPhase.scala:291

        val indexUnwrapped = unwrapPaddingAndSlicing(index)
        val lhsCell = graph.adjust(graph.varToCell(n)(lhs))
        assert(size % 8 == 0)
        val byteSize = size / 8
        lhsCell.node.get.flags.read = true
        val global = isGlobal(indexUnwrapped, n, byteSize) /* calls eval , crash */

Contributor:

I just realised that simplification isn't actually run here, so I'm not sure what introduced the problem. There's a fix in #357, but I'm not sure it makes any more tests pass here due to other missing precision/functionality.

Comment on lines 28 to 36
case "functionpointer/clang:BAP" | "functionpointer/clang_O2:BAP" | "functionpointer/clang_O2:GTIRB" |
"functionpointer/clang_pic:BAP" | "functionpointer/gcc:BAP" | "functionpointer/gcc_O2:BAP" |
"functionpointer/gcc_pic:BAP" | "indirect_call/clang:BAP" | "indirect_call/clang_pic:BAP" |
"indirect_call/gcc:BAP" | "indirect_call/gcc_pic:BAP" | "indirect_call_outparam/clang:BAP" |
"indirect_call_outparam/clang:GTIRB" | "indirect_call_outparam/gcc:BAP" | "indirect_call_outparam/gcc:GTIRB" |
"jumptable/clang:BAP" | "jumptable/clang:GTIRB" | "jumptable/gcc:BAP" | "jumptable/gcc:GTIRB" |
"jumptable2/clang:BAP" | "jumptable2/clang_O2:BAP" | "jumptable2/clang_pic:BAP" | "jumptable2/gcc:BAP" |
"jumptable2/gcc_O2:BAP" | "jumptable2/gcc_pic:BAP" | "jumptable3/clang_O2:GTIRB" =>
Mode.ExpectFailure("appear to be genuine failures in call resolution")
Contributor:

Indirect call resolution appears to have been completely broken by the simplification pass merge. Many of these tests were working previously.

@l-kent (Contributor) commented Mar 11, 2025

These are not all failures in call resolution; a significant number are failing due to IndirectCallTests not being updated for the simplification pass. This is fixed in #359.

The following are all expected to pass:

  • functionpointer/clang:BAP
  • functionpointer/clang_O2:BAP
  • functionpointer/clang_O2:GTIRB
  • functionpointer/clang_pic:BAP
  • functionpointer/gcc:BAP
  • functionpointer/gcc_O2:BAP
  • functionpointer/gcc_pic:BAP
  • indirect_call/clang:BAP
  • indirect_call/clang_pic:BAP
  • indirect_call/gcc:BAP
  • indirect_call/gcc_pic:BAP
  • jumptable2/clang:BAP
  • jumptable2/clang_O2:BAP
  • jumptable2/clang_pic:BAP
  • jumptable2/gcc:BAP
  • jumptable2/gcc_O2:BAP
  • jumptable2/gcc_pic:BAP
  • syscall/clang_O2:BAP

The rest are not currently expected to pass because they require additional functionality that has not been implemented yet.

@@ -43,6 +79,12 @@ trait SystemTests extends AnyFunSuite, BASILTest {

private val testPath = "./src/test/"

override def customiseTestsByName(name: String) = name match {
case "procedure_summaries/procedure_summary3/gcc_O2:BAP" | "procedure_summaries/procedure_summary3/gcc_O2:GTIRB" =>
Mode.Disabled("this procedure summaries test is unpredictably flaky")
Contributor:

What does 'unpredictably flaky' mean? It only verifies sometimes due to non-determinism issues, or something else? Previously, it was verifying despite the procedure summary analysis giving incorrect results, due to BASIL's lack of proper support for non-returning calls, so it makes sense to disable the test anyway.

Member Author:

Yes, flaky means that the test non-deterministically passes or fails. Unpredictable means that I don't know what causes the non-determinism. On my computer at least, this behaviour can be seen by:

# run only these two test cases, passes sometimes and fails sometimes.
./mill  test.runMain org.scalatest.tools.Runner -R out/test/compile.dest/classes/ -oID -z 'procedure_summaries/procedure_summary3/gcc_O2'

# run full suite of ProcedureSummaryTests, almost always fails.
./mill  test.runMain org.scalatest.tools.Runner -R out/test/compile.dest/classes/ -oID -s ProcedureSummaryTests

Also observe that sometimes the BAP and GTIRB variants take turns passing/failing. This is deeply troubling.

Contributor:

The source of this appears to be non-determinism in stronglyConnectedComponents in IRCursor.scala. This non-determinism seems to have been around for a while, but the changes made to the IR by the simplification pass seem to have changed how it interacts with the procedure summary generation in this test case, so the non-determinism now directly leads to differing results (it didn't previously in any of our test cases, but that's probably just luck).

Comment on lines 403 to 366
case "correct/malloc_with_local3/clang:BAP" =>
Mode.ExpectFailure(
"previous failure was: Expected verification success, but got failure. Failing assertion is: assert (load37_1 == R30_in)"
)
@l-kent (Contributor) commented Mar 11, 2025

Why are we expecting a failure here? This is failing due to a bug in the simplification, not because the test itself is broken or because it requires some new feature; it is a regression.

@l-kent (Contributor) commented Mar 11, 2025

I think with Mode.ExpectFailure we want to distinguish between tests that fail because they require as-yet unimplemented features to pass, and tests that fail due to recent regressions introduced by the simplification pass. It is fine to expect failure for tests that will only pass when new functionality is implemented, but we don't want to ignore the recent regressions which we need to work on fixing.

@katrinafyi (Member Author):

I've tried to clarify the separate roles of ExpectFailure in this commit: 15e8801. I've also merged in the fixes for the indirect call tests and re-classified the tests.

Also, while this PR does reveal some issues with the existing unit tests, I don't think it's necessary to interrogate and fix all the tests before merging. As long as they are all marked with the right Mode, I think this PR should be mergeable; any later fixes can simply adjust the Mode.

Comment on lines 27 to 36
override def customiseTestsByName(name: String) = name match {
case "indirect_call_outparam/clang:BAP" | "indirect_call_outparam/clang:GTIRB" | "indirect_call_outparam/gcc:BAP" |
"indirect_call_outparam/gcc:GTIRB" | "jumptable/clang:BAP" | "jumptable/clang:GTIRB" =>
Mode.TempFailure("indirect call not resolved to correct target")

case "jumptable3/clang:GTIRB" | "jumptable3/clang_O2:GTIRB" | "switch2/clang:GTIRB" =>
Mode.TempFailure("indirect call not resolved to goto")

case "jumptable/gcc:BAP" | "jumptable/gcc:GTIRB" =>
Mode.TempFailure("Expected verification success, but got failure")
Contributor:

These should all be NotImplemented, they are not temporary recent regressions

@katrinafyi (Member Author) commented Mar 11, 2025

Is that right? Even when it "resolves" to an incorrect function? I would consider that a bug in the indirect call resolution.

Contributor:

When does it do that?

@l-kent (Contributor) commented Mar 11, 2025

Oh, it was only saying that it was resolving to incorrect functions (it wasn't actually; the tests just needed some minor updates) due to issues that were fixed in #357.

Contributor:

If you mean jumptable/clang:BAP and jumptable/clang:GTIRB, it produces sound but imprecise results as-is for those, but this lack of precision is a long-standing issue that would require expanding the functionality of the indirect call resolution - it isn't a recent regression.

@katrinafyi (Member Author) commented Mar 11, 2025

That makes sense, thanks. What about jumptable/gcc:BAP and jumptable/gcc:GTIRB, which fail verification?

Contributor:

Those correctly resolve the indirect calls but fail because they need specifications about the security level of the jumptable in the binary's data section, which isn't really possible to easily express with the current specification system.

Comment on lines 280 to 285
case "extraspec_correct/malloc_memcpy_strlen_memset_free/gcc_O2:GTIRB" |
"extraspec_correct/malloc_memcpy_strlen_memset_free/clang_O2:GTIRB" =>
Mode.TempFailure("Expected verification success, but got failure -- cause unknown")
case "extraspec_incorrect/malloc_memcpy_strlen_memset_free/gcc_O2:BAP" =>
Mode.TempFailure("Expected verification failure, but got success -- cause unknown")
case _ => Mode.Normal
Contributor:

ExtraSpecTests should probably be disabled or something. They're in their own test suite because we specifically don't want to run them with all the other tests: they have timeout issues, take a long time to run, and have been partially broken for a long time (for fairly inscrutable reasons, seemingly quirks of Z3).

It's definitely not a recent regression anyway.

Contributor:

Kirsten is keen to use this as a demo example, so we should leave the suite enabled; only 3/8 of the cases fail, and this PR is already disabling them.

Member Author:

I guess we can mark these cases as NotImplemented if they're not worth focusing on yet.

Contributor:

I've just fixed them lol #362

Contributor:

They should at least have their own tag because they take forever to run and someone wanting to run all the standard system tests would not be wanting to run those.

Member Author:

I've added a supplementary Slow tag for the ExtraSpec suite 25a25ca. You can use this to exclude tests when running them locally:

./scripts/scalatest.sh -oID -n test_util.tags.StandardSystemTest -l test_util.tags.Slow  

I don't know if there's a way to have a "disabled by default" behaviour for these slow tests.

@katrinafyi (Member Author) commented Mar 12, 2025

@sadrabt just checking, are you aware of the failing tests in IntervalDSATest? I've just marked them as pendingUntilFixed for now.

IntervalDSATest:
- jumptable main *** FAILED ***
  java.util.NoSuchElementException: key not found: Procedure main_1936 at 1936 with 5 blocks and 23 in and 5 out parameters
  at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:243)
  at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:239)
  at IntervalDSATest.testFun$proxy1$1(IntervalDSATest.scala:77)
  at IntervalDSATest.$init$$$anonfun$1(IntervalDSATest.scala:75)
- Global Dereference *** FAILED ***
  Set() had size 0 instead of expected size 1 (IntervalDSATest.scala:119)
1 targets failed
test.testOnly 2 tests failed: 
  IntervalDSATest jumptable main
  IntervalDSATest Global Dereference

Edit: fixed now

@ailrst ailrst merged commit b45b98f into main Mar 13, 2025
6 checks passed
@katrinafyi katrinafyi deleted the scalatest-grouping branch March 13, 2025 04:32