Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: thread-safe test suites #353

Merged
merged 15 commits into from
Mar 13, 2025
Merged

tests: thread-safe test suites #353

merged 15 commits into from
Mar 13, 2025

Conversation

katrinafyi
Copy link
Member

@katrinafyi katrinafyi commented Mar 8, 2025

This makes steps towards making test suites thread-safe. That is, multiple suites can be executed in parallel, but test cases within the same suite are still run sequentially.

This is done with two changes:

  • Test cases within SystemTests are changed to acquire a lock based on the source file name. This prevents interference where multiple tests attempt to read or write from the same expected files.
    SystemTests are changed so their Boogie and text file outputs have different names per suite, allowing suites to run in parallel.
  • In DataStructureAnalysis, the global NodeCounter is now scoped to a particular instance of a DataStructureAnalysis class.

Testing

These changes are somewhat difficult to test.

The first can be tested by modifying the end of LockManager's withLock function to be:

lock.synchronized {
  println("in " + key)                                                                                                                                                           
  body
  println("out " + key)
}

Then, run all test cases which reference a single source file (in parallel with -P):

./mill  test.runMain org.scalatest.tools.Runner -R out/test/compile.dest/classes/ -oI  -P -z 'correct/basic_function_call_caller/clang_O2:GTIRB'

You should see sequences of in out in out lines, demonstrating mutual exclusion. If you remove the synchronized call, these lines should now be interleaved like in in out out. Maybe this can be made into a unit test.

The second (DataStructureAnalysis) change is exercised by running all the DSA tests in parallel and observing that they pass.

todo:

  • -PS2 -T1000000 for sorting

katrinafyi and others added 8 commits March 8, 2025 13:54
this replaces the global singleton NodeCounter with
a new instance per DSA execution. this allows multiple
DSA runs to run in parallel.

the NodeCounter is constructed by the DataStructureAnalysis class,
then passed through an implicit "using" parameter to the classes
which require it.
LockManager now uses a single lock around the map and this is shared
for reads and writes. this will increase lock contention. however,
the locked duration will be extremely small since it only wraps
a single hashmap operation. this should have no discernible performance
impact.
@ailrst
Copy link
Contributor

ailrst commented Mar 10, 2025

Clearly something is wrong with our test system if we're having to wrap test cases in synchronisation code. This is good though at lest to allow the tests to run a bit faster, and also shake out some of the thread-unsafety within Basil.

Obviously the 'correct' approach here is to have actual sandboxing like dune's test runner, but we can look at later. The only real data contention in the test suites should be the boogie output file which is the same for each test case in every systemtest suite. Ideally we generate a temporary file unique to that test run, and then promote the temporary if the test passes (but you would still want a distinct expected file for each suite which we don't currently have). The test summary should be inherently synchronous within a single suite, and the 'summarytest' can be a test teardown function rather than another test.

@katrinafyi
Copy link
Member Author

I agree that this is symptomatic of bigger problems. Re more robust test sandboxing, that would be nice to have but I think it can be lower urgency.

At the moment, the remaining bottleneck for test runtime is Z3 which is itself multi-threaded. I found that increasing the scalatest parallelism beyond 2 actually l increases total runtime.

@ailrst
Copy link
Contributor

ailrst commented Mar 10, 2025

Yeah, I guess that's a fundamental constraint. Hopefully we can eventually reduce the number of combinations of flags we are testing with the SystemTests suite, and rely more on unit tests for the individual components. For example the AnalysisTests could be ditched as the DSA memory tests also require the analysis pass to run, but it potentially makes it harder to identify where errors are occurring.

I wonder whether its worth adding a 'slow' tag for larger aggregate tests or those less important combinations of flags that we may want to just run locally?

Also I gather the nonsense with mill ./mill test.runMain org.scalatest.tools.Runner -R out/test/compile.dest/classes/ is also required for -P to work?

@katrinafyi
Copy link
Member Author

katrinafyi commented Mar 10, 2025

More tags could definitely be helpful. The tags in #350 partition the test suites, but there's no need that they do that. We can and should add more tags for any interesting features.

The ./mill nonsense is definitely needed for the -n argument to choose tags. I don't remember if i tried -P with the default mill testOnly.

@ailrst
Copy link
Contributor

ailrst commented Mar 10, 2025

At the moment, the remaining bottleneck for test runtime is Z3 which is itself multi-threaded. I found that increasing the scalatest parallelism beyond 2 actually l increases total runtime.

It doesn't look like I consistently get two instances of boogie running, so it might help to shuffle the test case order to reduce contention.

diff --git a/src/test/scala/SystemTests.scala b/src/test/scala/SystemTests.scala
index 74c7a3698..2a1d962c0 100644
--- a/src/test/scala/SystemTests.scala
+++ b/src/test/scala/SystemTests.scala
@@ -52,7 +52,7 @@ trait SystemTests extends AnyFunSuite, BASILTest {
 
   def runTests(folder: String, conf: TestConfig): Unit = {
     val path = testPath + folder
-    val programs = getSubdirectories(path)
+    val programs = scala.util.Random.shuffle(getSubdirectories(path))
 
     // get all variations of each program
     val testSuffix = if conf.useBAPFrontend then ":BAP" else ":GTIRB"

@l-kent
Copy link
Contributor

l-kent commented Mar 11, 2025

I agree that a more straightforward and efficient approach to solve this problem would be to just set up the tests so that they are using different files for each test suite, which would avoid the need for locks entirely.

honestly, i think this might be annoying, especially if you try
to run the tests locally to reproduce an issue and the order keeps
changing.
this should avoid clashes where test suites try to write
to the same output file.

SystemTestsBAP and SystemTestsGTIRB are overriden to have
no suffix, to maintain compatibility with the update-expected
sbt/mill jobs.
@katrinafyi
Copy link
Member Author

I've made the system tests use different output file names per suite 8ed2980. Double-checking this commit would be appreciated, especially the behaviour with updateExpected.

@l-kent
Copy link
Contributor

l-kent commented Mar 13, 2025

It seems like it's working correctly with updateExpected, in that it only creates .expected files for the default SystemTests, not any of the other variations with extra flags, which is the only thing we were using the .expected files for.

This should mean that the locking isn't needed - no tests write to the .expected files, only the default SystemTests read from them, so now that the different suites write to their own output files there shouldn't be any conflicts.

@katrinafyi
Copy link
Member Author

Thanks! In the same commit, I'd also removed the locking code from SystemTests - this is a little buried in the other changes, sorry!

Conflicts:
src/test/scala/SystemTests.scala
-T is used to re-sort the test cases into the original order
before printing them in batches for each suite to avoid interleaving
individual cases from multiple suites running in parallel.
@katrinafyi
Copy link
Member Author

We have some numbers now. This is a pleasant improvement over the baseline.

Before both test tagging and this PR:
image

After test tagging which enabled many more tests but still with one thread:
image

By running tests in parallel (this PR, do not mind the failure):
image

Quickly merge this PR before tests start failing again :")

@ailrst ailrst merged commit 266a28b into main Mar 13, 2025
6 checks passed
@katrinafyi katrinafyi deleted the scalatest-parallel branch March 13, 2025 06:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants