-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tagging of test suites #350
Conversation
Conflicts: .github/workflows/run-examples.yml
SATTest fails. I tried with both z3 4.8.17 and 4.13.3. it exits non-zero with this output: (push)
(declare-const R0 (_ BitVec 64))
(assert (= (not (= R0 (_ bv0 64))) (= (_ bv0 64) R0)))
(set-option :smt.timeout 1)
(check-sat)
(pop)
(error "line 4 column 25: unknown parameter 'timeout' at module 'smt'
Legal parameters are:
arith.auto_config_simplex (bool) (default: false)
arith.bprop_on_pivoted_rows (bool) (default: true)
arith.branch_cut_ratio (unsigned int) (default: 2)
arith.dump_lemmas (bool) (default: false)
[...] |
Yeah I've seen this before I thought I removed the option sorry diff --git a/src/main/scala/translating/IRExpToSMT2.scala b/src/main/scala/translating/IRExpToSMT2.scala
index 88ae2336c..279fe5693 100644
--- a/src/main/scala/translating/IRExpToSMT2.scala
+++ b/src/main/scala/translating/IRExpToSMT2.scala
@@ -205,7 +205,7 @@ object BasilIRToSMT2 extends BasilIRExpWithVis[Sexp] {
}
val terms = list(sym("push")) :: BasilIRToSMT2.extractDecls(e)
- ++ List(assert, list(sym("set-option"), sym(":smt.timeout"), sym("1")), list(sym("check-sat")))
+ ++ List(assert, list(sym("check-sat")))
++ (if (getModel) then
List(list(sym("echo"), sym("\"" + name.getOrElse("") + " :: " + e + "\"")), list(sym("get-model")))
else List())
We have timeouts through a cli flag to z3, not sure where I got the smt.timeout setting from but it doesn't seem to be a real setting |
Conflicts: src/test/scala/InterpretTestConstProp.scala
With this change, the wall time of GitHub actions increases from 20 minutes to 28 minutes (this is dominated by AnalysisSystemTests). Unit tests are still fast at about 30 seconds. |
todo:
|
This is really great, thanks for doing this. Looking at the output of CheckTestTagging https://github.com/UQ-PAC/BASIL/actions/runs/13734137233/job/38415722041#step:5:52 your categorisations look reasonable. Maybe some of your explanation in the PR comment could be copied to a documentation page? |
case "jumptable3/clang:GTIRB" | "switch2/clang:GTIRB" => | ||
Mode.ExpectFailure("Unable to evaluate expr: bitvector size mismatch") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long have these tests been failing for this reason? They should not be failing for this reason, was this caused by the simplification merge?
To be clear, these tests were not passing, but should run fine without bitvector size issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, the function unwrapPaddingAndSlicing
in can produce ill-typed expressions and the greater expression variety produced by the simplification pass makes it more likely to hit those cases.
The crash here is at LocalPhase.scala:291
val indexUnwrapped = unwrapPaddingAndSlicing(index)
val lhsCell = graph.adjust(graph.varToCell(n)(lhs))
assert(size % 8 == 0)
val byteSize = size / 8
lhsCell.node.get.flags.read = true
val global = isGlobal(indexUnwrapped, n, byteSize) /* calls eval , crash */
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realised obviously simplification isn't actually run so I'm not sure what introduced the problem, there's a fix here #357 but I'm not sure it makes any more tests pass here due to other missing precision/functionality.
case "functionpointer/clang:BAP" | "functionpointer/clang_O2:BAP" | "functionpointer/clang_O2:GTIRB" | | ||
"functionpointer/clang_pic:BAP" | "functionpointer/gcc:BAP" | "functionpointer/gcc_O2:BAP" | | ||
"functionpointer/gcc_pic:BAP" | "indirect_call/clang:BAP" | "indirect_call/clang_pic:BAP" | | ||
"indirect_call/gcc:BAP" | "indirect_call/gcc_pic:BAP" | "indirect_call_outparam/clang:BAP" | | ||
"indirect_call_outparam/clang:GTIRB" | "indirect_call_outparam/gcc:BAP" | "indirect_call_outparam/gcc:GTIRB" | | ||
"jumptable/clang:BAP" | "jumptable/clang:GTIRB" | "jumptable/gcc:BAP" | "jumptable/gcc:GTIRB" | | ||
"jumptable2/clang:BAP" | "jumptable2/clang_O2:BAP" | "jumptable2/clang_pic:BAP" | "jumptable2/gcc:BAP" | | ||
"jumptable2/gcc_O2:BAP" | "jumptable2/gcc_pic:BAP" | "jumptable3/clang_O2:GTIRB" => | ||
Mode.ExpectFailure("appear to be genuine failures in call resolution") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indirect call resolution appears to have been completely broken by the simplification pass merge. Many of these tests were working previously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not all failures in call resolution, a significant number are failing due to IndirectCallTests not being updated for the simplification pass. This is fixed in #359.
The following are all expected to pass:
- functionpointer/clang:BAP
- functionpointer/clang_O2:BAP
- functionpointer/clang_O2:GTIRB
- functionpointer/clang_pic:BAP
- functionpointer/gcc:BAP
- functionpointer/gcc_O2:BAP
- functionpointer/gcc_pic:BAP
- indirect_call/clang:BAP
- indirect_call/clang_pic:BAP
- indirect_call/gcc:BAP
- indirect_call/gcc_pic:BAP
- jumptable2/clang:BAP
- jumptable2/clang_O2:BAP
- jumptable2/clang_pic:BAP
- jumptable2/gcc:BAP
- jumptable2/gcc_O2:BAP
- jumptable2/gcc_pic:BAP
- syscall/clang_O2:BAP
The rest are not currently expected to pass because they require additional functionality that has not been implemented yet.
src/test/scala/SystemTests.scala
Outdated
@@ -43,6 +79,12 @@ trait SystemTests extends AnyFunSuite, BASILTest { | |||
|
|||
private val testPath = "./src/test/" | |||
|
|||
override def customiseTestsByName(name: String) = name match { | |||
case "procedure_summaries/procedure_summary3/gcc_O2:BAP" | "procedure_summaries/procedure_summary3/gcc_O2:GTIRB" => | |||
Mode.Disabled("this procedure summaries test is unpredictably flaky") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does 'unpredictably flaky' mean? It only verifies sometimes due to non-determinism issues, or something else? Previously, it was verifying despite the procedure summary analysis giving incorrect results, due to BASIL's lack of proper support for non-returning calls, so it makes sense to disable the test anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, flaky means that the test non-deterministically passes or fails. Unpredictable means that I don't know what causes the non-determinism. On my computer at least, this behaviour can be seen by:
# run only these two test cases, passes sometimes and fails sometimes.
./mill test.runMain org.scalatest.tools.Runner -R out/test/compile.dest/classes/ -oID -z 'procedure_summaries/procedure_summary3/gcc_O2'
# run full suite of ProcedureSummaryTests, almost always fails.
./mill test.runMain org.scalatest.tools.Runner -R out/test/compile.dest/classes/ -oID -s ProcedureSummaryTests
Also observe that sometimes the BAP and GTIRB variants take turns passing/failing. This is deeply troubling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The source of this appears to be non-determinism in stronglyConnectedComponents in IRCursor.scala. This non-determinism seems to have been around for a while, but the changes made to the IR with the simplification pass seem to have changed how it interacts with the procedure summary generation in this test case so the non-determinism now directly leads to differing results (it didn't previously in any of our test cases but that's probably just luck).
src/test/scala/SystemTests.scala
Outdated
case "correct/malloc_with_local3/clang:BAP" => | ||
Mode.ExpectFailure( | ||
"previous failure was: Expected verification success, but got failure. Failing assertion is: assert (load37_1 == R30_in)" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we expecting a failure here? This failing due to a bug in the simplification, not because the test itself is broken, or because it requires some new feature - it is a regression.
I think with |
6086767
to
efa63f4
Compare
this saves having to use the ./mill test.runMain ... ... command
I've tried to clarify to separate roles of ExpectFailure in this commit: 15e8801. i've also merged in the fixes for indirect call tests and re-classified the tests. Also, while this PR does reveal some issues with the existing unit tests, I don't think it's necessary to interrogate and fix all the tests before merging. As long as they are all marked with the right Mode, I think this PR should be mergable - any later fixes can simply adjust the Mode. |
override def customiseTestsByName(name: String) = name match { | ||
case "indirect_call_outparam/clang:BAP" | "indirect_call_outparam/clang:GTIRB" | "indirect_call_outparam/gcc:BAP" | | ||
"indirect_call_outparam/gcc:GTIRB" | "jumptable/clang:BAP" | "jumptable/clang:GTIRB" => | ||
Mode.TempFailure("indirect call not resolved to correct target") | ||
|
||
case "jumptable3/clang:GTIRB" | "jumptable3/clang_O2:GTIRB" | "switch2/clang:GTIRB" => | ||
Mode.TempFailure("indirect call not resolved to goto") | ||
|
||
case "jumptable/gcc:BAP" | "jumptable/gcc:GTIRB" => | ||
Mode.TempFailure("Expected verification success, but got failure") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should all be NotImplemented, they are not temporary recent regressions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that right? even when it "resolves" to an incorrect function? I would consider that a bug in the indirect call resolution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does it do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh it was only doing saying that it's resolving to incorrect functions (it wasn't actually, the tests just needed some minor updates) due to issues that were fixed in #357.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you mean jumptable/clang:BAP and jumptable/clang:GTIRB, it produces sound but imprecise results as-is for those, but this lack of precision is a long-standing issue that would require expanding the functionality of the indirect call resolution - it isn't a recent regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, thanks. What about jumptable/gcc:BAP jumptable/gcc:GTIRB which fail verification?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those correctly resolve the indirect calls but fail because they need specifications about the security level of the jumptable in the binary's data section, which isn't really possible to easily express with the current specification system.
src/test/scala/SystemTests.scala
Outdated
case "extraspec_correct/malloc_memcpy_strlen_memset_free/gcc_O2:GTIRB" | | ||
"extraspec_correct/malloc_memcpy_strlen_memset_free/clang_O2:GTIRB" => | ||
Mode.TempFailure("Expected verification success, but got failure -- cause unknown") | ||
case "extraspec_incorrect/malloc_memcpy_strlen_memset_free/gcc_O2:BAP" => | ||
Mode.TempFailure("Expected verification failure, but got success -- cause unknown") | ||
case _ => Mode.Normal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExtraSpecTests should probably be disabled or something. They're in their own test suite because we don't want to run them with all the other tests specifically because they have timeout issues, take a long time to run, and have been partially broken for a long time (for fairly inscrutable reasons, seemingly quirks of Z3)?
It's definitely not a recent regression anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kirsten is keen to use this as a demo example so we should leave the suite enabled, only 3/8 of the cases fail and this pr is already disabling them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ig we can NotImplemented these cases if they're not worth focusing on yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just fixed them lol #362
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They should at least have their own tag because they take forever to run and someone wanting to run all the standard system tests would not be wanting to run those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a supplementary Slow tag for the ExtraSpec suite 25a25ca. You can use this to exclude tests when running them locally:
./scripts/scalatest.sh -oID -n test_util.tags.StandardSystemTest -l test_util.tags.Slow
I don't know if there's a way to have a "disabled by default" behaviour for these slow tests.
Conflicts: src/test/scala/IndirectCallTests.scala
@sadrabt just checking, are you aware of the failing tests in IntervalDSATest? i've just marked them as pendingUntilFixed for now. IntervalDSATest:
- jumptable main *** FAILED ***
java.util.NoSuchElementException: key not found: Procedure main_1936 at 1936 with 5 blocks and 23 in and 5 out parameters
at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:243)
at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:239)
at IntervalDSATest.testFun$proxy1$1(IntervalDSATest.scala:77)
at IntervalDSATest.$init$$$anonfun$1(IntervalDSATest.scala:75)
- Global Dereference *** FAILED ***
Set() had size 0 instead of expected size 1 (IntervalDSATest.scala:119)
1 targets failed
test.testOnly 2 tests failed:
IntervalDSATest jumptable main
IntervalDSATest Global Dereference Edit: fixed now |
10c35f6
to
9e65f20
Compare
In the Github actions, test cases are grouped into a number of coarse categories (unit tests, standard system tests, and analysis system tests). This PR aims to make these groups more concrete by defining them within the test files, instead of in the Github actions workflow. This will also allow you to run a particular group on your local machine while excluding the slower-running tests.
These tags are defined as annotations within the test_util.tags package, and an annotation can be attached to any test suite class:
When running tests, one can include or exclude a particular tag by using the
-n
or-l
flags of the scalatest Runner. With SBT, these arguments can be affixed to the testOnly command:sbt "testOnly * -- -n test_util.tags.UnitTest"
With mill, it is more complicated. Adding
-n
or-l
to./mill test.testOnly
will not work :( (all test suites will still be run). I think there is a problem with how Mill invokes the scalatest runner. The workaround is to manually invoke the runner:After the
-R .../classes
, the usual test.testOnly arguments can be added. Note that the-o
flag is important - without it, scalatest will try (and likely fail) to open a GUI. The-R
argument should point to the output directory containing compiled class files.The Github CI jobs are changed to use the
./mill test.runMain
command. Also, there is a new CI job which reports the test suites which do not have any tag. It is my hope that every test case will eventually have at least one tag - I will need some help with this. This will make sure that every test written is being exercised by the CI. Currently, it is too easy for tests to fall through the craks if they are not run regularly.This PR is part of an effort to improve the testing in Basil and force us to be more honest about the tests. This means that we should have a clear understanding of what each test is testing, whether it currently passes and, if not, why it is currently not working. I will point out these scalatest features to help this goal:
pendingUntilFixed
should be used to mark a block of code (typically containing an assertion) as one which is currently expected to fail. This should be used to record test cases which fail due to not-yet-implemented features or known bugs. If a change is made and the code no longer fails, this will cause the test to fail until the pendingUntilFixed is removed. It should be used within atest() { ... }
block like so:Tests can be marked as ignored by replacing
test()
withignore()
. An entire suite can be marked as ignored with the@org.scalatest.Ignore
annotation.We also add a trait
trait TestCustomisation
to help with customising particular dynamically-generated tests based on their test case name (which includes file path and compiler/lifter options).Individual test cases can be marked as retry, disabled, or expect failure. This modified the behaviour of the test case and prints helpful output when running the test. For example:
Note: The tag annotations must be defined in Java (apparently) to make them visible to the test runner at runtime.
TODO