
Begin to generalize integration tests to other queue systems #160

Open · ml-evs wants to merge 46 commits into base: develop from ml-evs/generalize-integration-tests

Conversation

@ml-evs (Member) commented Aug 2, 2024

Pending e.g. #159, I have started to generalise the integration tests so that we can eventually build a combined Dockerfile that runs Slurm, SGE and potentially other queuing systems in the same container for testing purposes. This PR is the first step in that direction. We could probably test remote shell execution first too. Depending on how awkward it is to set up multiple queues together, it may be that each one ends up running in a different container (roughly as sketched below).
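To make the "one container per queue system" option concrete, here is a rough sketch of how such test images might be built and run; the image names, Dockerfile paths and SSH ports are purely illustrative assumptions, not what this PR actually ships:

```bash
# Illustrative sketch only: one container per queue system, each exposing SSH
# so the test suite can submit jobs remotely. Names, paths and ports are assumptions.
docker build -t jfr-test-slurm -f Dockerfile.slurm .
docker build -t jfr-test-sge   -f Dockerfile.sge .
docker run -d --name jfr-slurm -p 2222:22 jfr-test-slurm   # SSH in and submit Slurm jobs
docker run -d --name jfr-sge   -p 2223:22 jfr-test-sge     # same pattern for SGE
```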

@ml-evs ml-evs force-pushed the ml-evs/generalize-integration-tests branch from 78f6054 to 35d0e40 Compare August 3, 2024 14:07
@codecov-commenter commented Aug 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 47.92%. Comparing base (8df1979) to head (4b2c416).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #160   +/-   ##
========================================
  Coverage    47.92%   47.92%           
========================================
  Files           43       43           
  Lines         5156     5156           
  Branches      1118     1118           
========================================
  Hits          2471     2471           
  Misses        2424     2424           
  Partials       261      261           

@gpetretto (Contributor) commented

Hi @ml-evs, thanks for setting this up. Since the dependence on the queue system is only in the submission and checking of the jobs, I wonder whether it is worth running all the tests for all the systems.
Could it be more effective to have these tests on all the queue systems directly in qtoolkit?

@ml-evs (Member, Author) commented Aug 22, 2024

> Hi @ml-evs, thanks for setting this up. Since the dependence on the queue system is only in the submission and checking of the jobs, I wonder whether it is worth running all the tests for all the systems. Could it be more effective to have these tests on all the queue systems directly in qtoolkit?

Hi @gpetretto, sorry I missed this. Happy to have this migrated wherever you see fit, though I think end-to-end tests for jobflow-remote are still appropriate here.

A status update on this PR in particular: SGE is proving pretty nasty to set up. I have been attempting to follow the few guides I can find online, but sadly the main init_cluster configuration script segfaults on what should be a straightforward operation (setting the manager's username). I'll keep probing with more recent versions and see how far I can get.

@gpetretto (Contributor) commented

> Hi @gpetretto, sorry I missed this. Happy to have this migrated wherever you see fit, though I think end-to-end tests for jobflow-remote are still appropriate here.
>
> A status update on this PR in particular: SGE is proving pretty nasty to set up. I have been attempting to follow the few guides I can find online, but sadly the main init_cluster configuration script segfaults on what should be a straightforward operation (setting the manager's username). I'll keep probing with more recent versions and see how far I can get.

Indeed, it might be good to have a few different queue managers to test, especially if it is confirmed that, for example, SGE does not support querying by a list of job ids.

@ml-evs ml-evs force-pushed the ml-evs/generalize-integration-tests branch 2 times, most recently from 0e54d6b to ae798fd Compare September 1, 2024 13:31
@ml-evs ml-evs force-pushed the ml-evs/generalize-integration-tests branch from 7577c6c to 4062fdf Compare September 9, 2024 10:22
@ml-evs ml-evs force-pushed the ml-evs/generalize-integration-tests branch from e311fb0 to b1ee827 Compare September 19, 2024 14:15
@ml-evs ml-evs force-pushed the ml-evs/generalize-integration-tests branch from 645d9cb to 8637ef5 Compare September 25, 2024 09:32
@ml-evs ml-evs force-pushed the ml-evs/generalize-integration-tests branch from 4a2d843 to 05acc12 Compare October 11, 2024 13:42
@ml-evs ml-evs force-pushed the ml-evs/generalize-integration-tests branch from 4c7c7ea to e3cd1a3 Compare October 11, 2024 14:06
@ml-evs ml-evs marked this pull request as ready for review October 11, 2024 14:23
@ml-evs ml-evs requested a review from gpetretto October 11, 2024 14:23
@ml-evs (Member, Author) commented Oct 11, 2024

Hi @gpetretto, I think this is now 99% of the way there... at least locally, I can now SSH into the SGE container and run jobs. Any remaining issues might be related to @QuantumChemist's qtoolkit PR, but it is hard to say at this point. This was really painful to implement, but I think the pain was specific to SGE itself.

The main issue was needing to reverse-engineer how to set up user permissions for the SGE queue (see cc613ba) without being able to use the bundled qmon GUI, and without any online documentation. With a lot of rubber-ducking with Claude (hours), I was able to get it working so that the jobflow user can submit jobs.
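For illustration, a minimal sketch of the kind of qconf commands involved in granting submit rights from the command line (rather than via qmon) is shown below; the user, access-list and queue names are assumptions, not necessarily what cc613ba actually does:

```bash
# Hypothetical sketch, assuming a "jobflow" user already exists in the container.
qconf -ao jobflow                             # register jobflow as an SGE operator
qconf -au jobflow arusers                     # add jobflow to an access list
qconf -mattr queue user_lists arusers all.q   # let that list submit to the all.q queue
```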

Locally, I see the jobs being submitted and the usual JFR processes "working", but for some reason no output is being written or copied back to the runner at the moment.

As a general point, we might consider splitting the integration tests and unit tests entirely into separate CI jobs. I'm not sure how much the integration tests really contribute to the codecov numbers, which is the only real reason to run them together, but since the integration tests are slow (both building the Docker containers and actually running them), it is probably beneficial to split them and figure out a way to combine the coverage stats down the line (e.g., each job could create a GitHub Actions artifact containing its coverage data, then a separate job could combine them and upload the result to codecov, roughly as sketched below).
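As a rough sketch of that "combine later" idea, assuming coverage.py data files are passed between jobs as artifacts (the paths here are illustrative):

```bash
# Hypothetical final CI job: download the per-job coverage artifacts, merge the
# coverage.py data files, and produce a single report for the codecov upload step.
coverage combine artifacts/unit/.coverage artifacts/integration/.coverage
coverage xml -o coverage.xml
```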

@gpetretto (Contributor) left a review comment

Thanks @ml-evs for all the work! The PR looks good to me.

As for the coverage, indeed most of it will come from the standard tests, but there are definitely some functionalities that are only exercised by the integration tests (e.g. the batch submission). If the coverage files are merged as in #180, it should be fine to run the tests separately.

@davidwaroquiers (Member) commented

Following the discussion in #201, should we add Python 3.12 here as part of the GitHub CI testing workflow?
