support optional continue on failure #20

cijohnson · 2025-11-19T06:24:04Z

Motivation

Overnight burn-in tests (AGHFC, RVS) should not fail because of one bad node. They should continue testing the other nodes so that eligible nodes can still be qualified.

Technical Details

stop_on_errors is an optional argument supported by the parallel-ssh library. It defaults to True.

This change lets callers of a Pssh instance pass an optional stop_on_errors value, which is forwarded to the run_command API in the exec and exec_cmd_list methods.
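As a rough illustration of the pass-through described above, here is a minimal sketch. The Pssh class below is a simplified stand-in, not the code in lib/parallel_ssh_lib.py, and FakeClient / FakeHostOutput are hypothetical test doubles that only mimic the parts of the parallel-ssh client the sketch needs (run_command with stop_on_errors and host_args, and per-host output carrying an exception).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeHostOutput:
    """Hypothetical stand-in for the per-host output object."""
    host: str
    stdout: List[str] = field(default_factory=list)
    exception: Optional[Exception] = None


class FakeClient:
    """Hypothetical stand-in for the parallel SSH client (assumption)."""

    def __init__(self, hosts, bad_hosts=()):
        self.hosts = list(hosts)
        self.bad_hosts = set(bad_hosts)

    def run_command(self, command, stop_on_errors=True, host_args=None):
        results = []
        for index, host in enumerate(self.hosts):
            if host in self.bad_hosts:
                error = ConnectionError(f"cannot reach {host}")
                if stop_on_errors:
                    raise error  # abort the whole run on the first bad host
                results.append(FakeHostOutput(host, exception=error))
                continue
            text = command % host_args[index] if host_args else command
            results.append(FakeHostOutput(host, stdout=[f"ran: {text}"]))
        return results


class Pssh:
    """Simplified wrapper that forwards stop_on_errors to run_command."""

    def __init__(self, client, stop_on_errors=True):
        self.client = client
        self.stop_on_errors = stop_on_errors

    def exec(self, cmd) -> Dict[str, str]:
        output = self.client.run_command(cmd, stop_on_errors=self.stop_on_errors)
        return self._collect(output)

    def exec_cmd_list(self, cmd_list) -> Dict[str, str]:
        # One command per host, substituted into the "%s" template.
        output = self.client.run_command(
            "%s", host_args=cmd_list, stop_on_errors=self.stop_on_errors
        )
        return self._collect(output)

    @staticmethod
    def _collect(output) -> Dict[str, str]:
        cmd_output = {}
        for item in output:
            text = "\n".join(item.stdout)
            if item.exception is not None:
                text += f"ERROR: {item.exception}"
            cmd_output[item.host] = text
        return cmd_output
```

With stop_on_errors=False, a call such as Pssh(FakeClient(["node1", "node2"], bad_hosts=["node2"]), stop_on_errors=False).exec("hostname") returns output for node1 and an error string for node2 instead of raising.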

Test Plan

Run the AGHFC and RVS tests with SSH disabled on one node in the cluster and verify that the tests continue on the remaining nodes.

Test Result

TO BE EXECUTED

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

in case of a pssh.exceptions.Timeout exception when the node is unreachable. Unreachability is confirmed by creating an SSH session to the specific set of nodes that raised Timeout. Added unit tests to cover these cases. Signed-off-by: Ignatious Johnson <[email protected]>
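The pruning step described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: Timeout is a local stand-in for pssh.exceptions.Timeout, HostResult is a hypothetical per-host result type, and the default probe only checks that TCP port 22 accepts a connection, which is a simplification of opening a real SSH session.

```python
import socket
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


class Timeout(Exception):
    """Local stand-in for pssh.exceptions.Timeout (assumption)."""


@dataclass
class HostResult:
    """Hypothetical per-host result: host name plus any raised exception."""
    host: str
    exception: Optional[Exception] = None


def ssh_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Simplified reachability probe: can a TCP connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def prune_unreachable_hosts(
    hosts: List[str],
    output: List[HostResult],
    probe: Callable[[str], bool] = ssh_port_open,
) -> Tuple[List[str], List[str]]:
    """Return (remaining_hosts, unreachable_hosts).

    Only hosts whose result carries a Timeout are probed, so a slow
    command on a healthy node does not get the node pruned.
    """
    timed_out = [item.host for item in output if isinstance(item.exception, Timeout)]
    unreachable = [host for host in timed_out if not probe(host)]
    remaining = [host for host in hosts if host not in unreachable]
    return remaining, unreachable
```

Probing only the hosts that timed out keeps the extra connection attempts proportional to the number of failures rather than the cluster size.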

solaiys · 2025-11-20T11:56:56Z

lib/parallel_ssh_lib.py

+        This ensures that the output dictionary reflects the status of pruned hosts.
+        """
+        for host in self.unreachable_hosts:
+            cmd_output[host] = "Host Unreachable"


If the node became unreachable after partial test execution, or because of a fatal error from the tests, cmd_output[host] would already hold valid output. This line would overwrite that data, right?

Maybe just appending "\n\n\n !!!! Host Unreachable !!!! \n\n\n" to cmd_output[host] will help the test function that receives cmd_output.

Sure, good catch. I will append the host-unreachable notice at the end.
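The append-instead-of-overwrite behaviour agreed on here can be sketched as follows. The function takes the host list as a parameter for the sake of a self-contained example; in the PR it is a method reading self.unreachable_hosts, and the final implementation may differ.

```python
UNREACHABLE_NOTE = "\n\n\n !!!! Host Unreachable !!!! \n\n\n"


def inform_unreachability(cmd_output, unreachable_hosts):
    """Append the unreachable notice, keeping any output already captured."""
    for host in unreachable_hosts:
        # .get() covers hosts that never produced any output at all.
        cmd_output[host] = cmd_output.get(host, "") + UNREACHABLE_NOTE
    return cmd_output
```

A host that failed midway keeps its partial log, with the notice at the end where the test function can find it.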

solaiys · 2025-11-20T12:02:08Z

Makefile

@@ -0,0 +1,43 @@
+VENV_DIR = test_venv
+PYTHON = python


In some cases the interpreter could be python3.
Maybe handle this by checking "which python" or "which python3".

Will take care of it, thanks.

solaiys · 2025-11-20T13:55:39Z

lib/utils_lib.py


 import pytest
-import globals
+from . import globals


I was getting this error:
==================================== ERRORS ====================================
___________________ ERROR collecting tests/health/rvs_cvs.py ___________________
ImportError while importing test module '/home/ssolaiya/work/11_CVS/cvs-CIgna/cvs/tests/health/rvs_cvs.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
test_venv/lib/python3.12/site-packages/_pytest/python.py:507: in importtestmodule
    mod = import_path(
test_venv/lib/python3.12/site-packages/_pytest/pathlib.py:587: in import_path
    importlib.import_module(module_name)
/usr/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
<frozen importlib._bootstrap>:1387: in _gcd_import
    ???
<frozen importlib._bootstrap>:1360: in _find_and_load
    ???
<frozen importlib._bootstrap>:1331: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:935: in _load_unlocked
    ???
test_venv/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:197: in exec_module
    exec(co, module.__dict__)
tests/health/rvs_cvs.py:22: in <module>
    from utils_lib import *
lib/utils_lib.py:14: in <module>
    from . import globals
E   ImportError: attempted relative import with no known parent package

Changing back to
import globals
works fine.

Will debug this today.
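The traceback shows utils_lib being imported as a top-level module (from utils_lib import *), so it has no parent package and "from . import globals" cannot resolve. One common workaround, shown here only as a sketch and not necessarily the fix this PR adopts, is to try the relative import first and fall back to the top-level one. The demo builds a throwaway package named demo_lib (a hypothetical name) in a temporary directory and imports the same file both ways.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Body of a stand-in utils_lib.py: package-relative import first, then the
# top-level import that works when the lib directory itself is on sys.path.
UTILS_SOURCE = """\
try:
    from . import globals as cvs_globals
except ImportError:
    import globals as cvs_globals

VALUE = cvs_globals.VALUE
"""

DEMO_MODULES = (
    "demo_lib",
    "demo_lib.globals",
    "demo_lib.utils_lib",
    "globals",
    "utils_lib",
)


def import_both_ways():
    """Import the stand-in module as demo_lib.utils_lib and as utils_lib."""
    with tempfile.TemporaryDirectory() as tmp:
        package_dir = Path(tmp) / "demo_lib"  # hypothetical package name
        package_dir.mkdir()
        (package_dir / "__init__.py").write_text("")
        (package_dir / "globals.py").write_text("VALUE = 42\n")
        (package_dir / "utils_lib.py").write_text(UTILS_SOURCE)

        added_paths = [tmp, str(package_dir)]
        sys.path[:0] = added_paths
        importlib.invalidate_caches()
        try:
            as_package = importlib.import_module("demo_lib.utils_lib")
            as_top_level = importlib.import_module("utils_lib")
            return as_package.VALUE, as_top_level.VALUE
        finally:
            # Undo the path and module-cache changes made for the demo.
            for path in added_paths:
                sys.path.remove(path)
            for name in DEMO_MODULES:
                sys.modules.pop(name, None)
```

The alternative is to keep a single import style and make the test runner import the library consistently, either always as a package or always with the lib directory on the path.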

solaiys · 2025-11-20T15:29:38Z

lib/parallel_ssh_lib.py

+            self.prune_unreachable_hosts(output)
+            self.inform_unreachability(cmd_output)
+
        return cmd_output


Do the test files need to check this cmd_output for the "Host Unreachable" string and take some action on their phdl object to remove the bad host?

Right now, it executes the commands on the bad node as well and returns with ERROR.
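One possible shape for the check raised in this question is sketched below. It assumes cmd_output maps host names to strings and that unreachable hosts are tagged with text containing "Host Unreachable". The helper name is hypothetical, and how the phdl object would drop the bad hosts is left open because the thread does not settle it.

```python
UNREACHABLE_MARKER = "Host Unreachable"


def split_by_reachability(cmd_output):
    """Separate per-host output into reachable results and unreachable hosts."""
    reachable = {}
    unreachable = []
    for host, text in cmd_output.items():
        if UNREACHABLE_MARKER in text:
            unreachable.append(host)
        else:
            reachable[host] = text
    return reachable, unreachable
```

A test could then evaluate only the reachable results and report the unreachable hosts separately, rather than re-running commands against them.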

cijohnson requested review from solaiys and venksrin09 November 19, 2025 06:25

cijohnson added 6 commits November 20, 2025 00:28

Pass stop_on_errors=False to aghfc and rvs

5b40529

tests, so that these tests will continue to run overnight even if one of the nodes is unresponsive. Signed-off-by: Ignatious Johnson <[email protected]>

Adding sample Unittests for lib module

082da6a

Signed-off-by: Ignatious Johnson <[email protected]>

Coming up with makefile to execute UT

d31c829

in virtual env conveniently. Signed-off-by: Ignatious Johnson <[email protected]>

Unittest for parallel ssh lib

cdb24ff

covers exec and exec_cmd_list methods Signed-off-by: Ignatious Johnson <[email protected]>

cijohnson force-pushed the ichristo/support_optional_continue_on_failure branch from 1feb198 to 1e068cd Compare November 20, 2025 00:35

solaiys reviewed Nov 20, 2025
