issue:4173694 adding Standby node health check #283

boazhaim · 2024-11-20T13:19:27Z

What

Adding a standalone script that, when executed, checks whether the standby node is configured correctly.

Why ?

A client request

How ?

More info can be found in the README

Testing ?

Tested each step with positive and negative cases.
Tested the whole script as is.

Special triggers

Use the following phrases as comments to trigger different runs:

bot:retest - rerun Jenkins CI (to rerun GitHub CI, use the "Checks" tab on the PR page and rerun all jobs)
bot:upgrade - run additional update tests

samerd

Great work, Boaz.
Please see my comments.

kedeme · 2024-11-24T15:56:51Z

scripts/standby_node_health_check/standby_node_health_check.py

+                )
+                result = False
+        if result:
+            print("All given IB interfaces are active")


If all given IB interfaces are down, we need a critical error message.

kedeme

Please see my comments and fix. In general, I think we should have a severity (Error / Warning / Info) for each message we print to the console.

.ci/cidemo-init.sh

kobibar · 2024-12-02T14:55:53Z

scripts/standby_node_health_check/standby_node_health_check.py

+        help="Specify one or more fabric interfaces (at least one is required), eg ib0, mlx3_0",
+    )
+
+    parser.add_argument(


We only support one management interface for monitoring.

kobibar · 2024-12-02T15:12:50Z

scripts/standby_node_health_check/standby_node_health_check.py

+        if ret_code != 0:
+            print("Failed to run ibdev2netdev")
+            return {}
+        line_regex = re.compile(r"^([\w\d_]+) port \d ==> ([\w\d]+)")


It is more efficient to make the line_regex a class variable rather than a method variable and compile it once.

I'm not sure it is relevant since all the functions are used only once.
But I agree that it will be more readable to have them in one location.

Fixed :)
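
For readers following the thread, a minimal sketch of the suggestion above: the pattern is compiled once as a class attribute and reused by every call. The class name `IbdevMapper`, the method name `parse_mapping`, and the sample output lines are assumptions made for illustration; only the regex itself is taken from the diff.

```python
import re


class IbdevMapper:
    # Compiled once, when the class is defined, and shared by every call.
    LINE_REGEX = re.compile(r"^([\w\d_]+) port \d ==> ([\w\d]+)")

    @classmethod
    def parse_mapping(cls, ibdev2netdev_output: str) -> dict:
        """Map each network interface name to its mlx device name."""
        mapping = {}
        for line in ibdev2netdev_output.splitlines():
            match = cls.LINE_REGEX.match(line.strip())
            if match:
                mlx_device, net_interface = match.groups()
                mapping[net_interface] = mlx_device
        return mapping


# Hypothetical sample output, for illustration only.
sample = "mlx5_0 port 1 ==> ib0 (Up)\nmlx5_1 port 1 ==> ib1 (Down)"
mapping = IbdevMapper.parse_mapping(sample)
```

As noted in the reply, the performance gain is negligible when the method runs once; the main benefit is keeping the patterns in one place.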

kobibar · 2024-12-02T15:16:20Z

scripts/standby_node_health_check/standby_node_health_check.py

+            result = subprocess.run(
+                command,
+                shell=True,
+                stdout=subprocess.PIPE,


I recommend redirecting stderr to stdout: "stderr=subprocess.STDOUT"

For the analysis we don't care about the stderr
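
For reference, the reviewer's suggestion would look roughly like the sketch below. The helper name `run_command` and the child process are assumptions for illustration, not the script's actual code; `stderr=subprocess.STDOUT` is the standard-library option that folds stderr into the captured stdout.

```python
import subprocess
import sys


def run_command(command):
    """Run a command (list of arguments) and return (return code, combined output)."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # stderr is folded into the captured stdout
        text=True,
        check=False,
    )
    return result.returncode, result.stdout


# Illustrative child process that writes to both streams.
child = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]
ret_code, output = run_command(child)
```

With this option, error text from a failing tool ends up in the same string as its normal output, which may or may not be wanted when that string is parsed afterwards.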

kobibar · 2024-12-02T15:17:43Z

scripts/standby_node_health_check/standby_node_health_check.py

+        try:
+            result = subprocess.run(
+                command,
+                shell=True,


I recommend setting "shell=False" by default; you can add it as an argument to the function.

Changed to False. I don't think it should be an argument, as it also changes the command we need to pass to the run command (one string or a list of strings).
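
To illustrate the point about the command format: with `shell=False`, `subprocess.run` expects a list of arguments rather than one string. The sketch below, which assumes a hypothetical helper named `run_command`, uses `shlex.split` so callers can keep passing a single string.

```python
import shlex
import subprocess
import sys


def run_command(command: str):
    """Split a command string into arguments and run it without a shell."""
    args = shlex.split(command)  # "corosync-cfgtool -s" -> ["corosync-cfgtool", "-s"]
    result = subprocess.run(
        args, shell=False, stdout=subprocess.PIPE, text=True, check=False
    )
    return result.returncode, result.stdout


split_example = shlex.split("corosync-cfgtool -s")

# Illustrative call that runs the current Python interpreter.
ret_code, output = run_command(shlex.quote(sys.executable) + " -c 'print(42)'")
```

Because no shell is involved, pipes, redirections, and globbing in the command string are not interpreted.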

kobibar · 2024-12-02T15:21:10Z

scripts/standby_node_health_check/standby_node_health_check.py

+    def _get_ib_to_mlx_port_mapping(cls):
+        ret_code, stdout = cls._run_command("ibdev2netdev")
+        if ret_code != 0:
+            print("Failed to run ibdev2netdev")


I recommend sending the error message to syslog for debugging.
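
One possible way to do this with the standard library is `logging.handlers.SysLogHandler`. The sketch below is an assumption about how it could be wired, not the script's implementation: the function name `build_syslog_logger`, the logger name, and the `/dev/log` socket path are illustrative, and the code falls back to stderr when no syslog socket is available.

```python
import logging
import logging.handlers


def build_syslog_logger(name="standby_node_health_check", syslog_address="/dev/log"):
    """Return a logger that writes to syslog, or to stderr if syslog is unavailable."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if logger.handlers:  # already configured, do not add a second handler
        return logger
    try:
        handler = logging.handlers.SysLogHandler(address=syslog_address)
    except OSError:
        # No syslog socket, for example inside a minimal container.
        handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


logger = build_syslog_logger()
logger.error("Failed to run ibdev2netdev")
```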

kobibar · 2024-12-02T15:52:35Z

scripts/standby_node_health_check/standby_node_health_check.py

+            ):
+                print(
+                    f"IB interface {ib_interface} is not active "
+                    f"{ib_interfaces_status[ib_interface_to_validate]}"


I'm not sure printing the dictionary makes for a user-friendly message:
"IB interface ib3 is not active {'State': 'down', 'Physical_state': 'polling'}"
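
A small sketch of a friendlier message, assuming the status dictionary has the `State` and `Physical_state` keys shown in the example above. The function name `describe_interface` is hypothetical.

```python
def describe_interface(name: str, status: dict) -> str:
    """Build a readable message from an interface status dictionary."""
    state = status.get("State", "unknown")
    physical_state = status.get("Physical_state", "unknown")
    return (
        f"IB interface {name} is not active "
        f"(state: {state}, physical state: {physical_state})"
    )


message = describe_interface("ib3", {"State": "down", "Physical_state": "polling"})
```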

kobibar · 2024-12-04T08:46:59Z

scripts/standby_node_health_check/standby_node_health_check.py

+    def _check_corosync_rings_status(cls):
+        ret_code, corosync_output = cls._run_command("corosync-cfgtool -s")
+        if ret_code != 0:
+            print("Failed to run corosync-cfgtool -s")


Please show the user more general messages about the HA interfaces rather than corosync/RING/...
You may write more informative messages to the log for our debugging.
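
One way to separate the two audiences is a logger with two handlers at different levels: the console shows only general messages, while a debug destination receives the tool-specific detail. This is a sketch under assumptions: the function name, logger name, and message texts are invented for illustration, and in-memory streams stand in for the console and a log file.

```python
import io
import logging


def build_two_level_logger(console_stream, debug_stream):
    """Console gets INFO and above; the debug stream gets everything."""
    logger = logging.getLogger("ha_check_sketch")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False

    console = logging.StreamHandler(console_stream)
    console.setLevel(logging.INFO)
    debug = logging.StreamHandler(debug_stream)
    debug.setLevel(logging.DEBUG)

    for handler in (console, debug):
        handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
        logger.addHandler(handler)
    return logger


console_out, debug_out = io.StringIO(), io.StringIO()
logger = build_two_level_logger(console_out, debug_out)
logger.debug("corosync-cfgtool -s exited with a non-zero return code")
logger.warning("Could not read the status of the HA interfaces")
```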

kobibar · 2024-12-04T08:47:48Z

scripts/standby_node_health_check/standby_node_health_check.py

+
+    @classmethod
+    def _check_pcs_status(cls):
+        command = "pcs status"


I recommend defining all the commands as constants.
It will be easier to maintain.

kobibar · 2024-12-04T08:53:19Z

scripts/standby_node_health_check/standby_node_health_check.py

+
+
+def main(args):
+    standby_node_checker = StandbyNodeHealthChecker(


I recommend wrapping main in try...except for unexpected errors, and allowing the user to press CTRL+C to quit the test.
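
A minimal sketch of such a wrapper. The name `run_safely` and the exit codes are assumptions for illustration (130 is the conventional shell exit status after SIGINT); the script's real `main` is not reproduced here.

```python
import logging

logger = logging.getLogger("standby_node_health_check")


def run_safely(main_func, args):
    """Run main_func(args) and translate failures into an exit code."""
    try:
        main_func(args)
    except KeyboardInterrupt:
        # CTRL+C: stop quietly instead of printing a traceback.
        logger.warning("Interrupted by the user, stopping the health check")
        return 130
    except Exception:
        # Anything unexpected is logged with its traceback for debugging.
        logger.exception("Unexpected error while running the health check")
        return 1
    return 0


exit_code = run_safely(lambda args: None, None)
```

The entry point could then call `sys.exit(run_safely(main, args))`. `KeyboardInterrupt` needs its own clause because it does not derive from `Exception`.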

kedeme · 2024-12-09T08:46:23Z

scripts/standby_node_health_check/README.md

+
+## What the script is checking
+1. checking if all given fabric interface are up.
+2. Checking if all given management interface are up.


Suggested change

2. Checking if all given management interface are up.

2. Checking if all given management interfaces are up.

No, per Kobi, we are checking only one management interface. I updated a few lines above to reflect that as well.

kedeme · 2024-12-09T08:54:17Z

scripts/standby_node_health_check/standby_node_health_check.py

+    HA_STATUS_COMMAND = UFM_HA_CLUSTER_COMMAND.format("status")
+    IBDEV2NETDEV_COMMAND = "ibdev2netdev"
+    IBSTAT_COMMAND = "ibstat"
+    IP_LINK_SHOW_COMMAND = "ip --json link show"


Are you sure the --json option works on all platforms? On my RHEL7 it doesn't work.

kedeme · 2024-12-09T10:15:27Z

scripts/standby_node_health_check/standby_node_health_check.py

+                ib_interface_to_validate, ib_interfaces_status
+            ):
+                logger.warning(
+                    "IB interface %s is not active",


What happens if all IB interfaces are not active?
Shouldn't we log an error message and exit?

We decided to run all the checks even if one is failing.
A general warning will be in the summary if any interface is down.
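
The "run everything, then summarize" approach described in the reply can be sketched as follows. The function names, check names, and message texts are hypothetical; the actual script keeps its own list of summary actions.

```python
def run_all_checks(checks):
    """Run every (name, callable) check; return the names of the checks that failed."""
    failed = []
    for name, check in checks:
        try:
            passed = check()
        except Exception:
            # A crashing check counts as a failure but does not stop the run.
            passed = False
        if not passed:
            failed.append(name)
    return failed


def print_summary(failed):
    """A non-empty list means at least one check failed."""
    if failed:
        print("Warning: the following checks failed: " + ", ".join(failed))
    else:
        print("All checks passed")


# Illustrative checks with fixed results.
checks = [
    ("fabric interfaces", lambda: False),
    ("management interface", lambda: True),
    ("HA status", lambda: True),
]
failed = run_all_checks(checks)
print_summary(failed)
```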

kedeme · 2024-12-09T10:39:59Z

scripts/standby_node_health_check/standby_node_health_check.py

+    def _parse_ip_link_output(cls, ip_link_output: str):
+        interfaces = {}
+        try:
+            link_data = json.loads(ip_link_output)


We need to verify that the iblinkinfo JSON output format is available on all OSs;
otherwise you will need to use another parser.
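
As a sketch of what a second parser could look like, the code below tries the JSON format first and falls back to parsing the one-line text format of `ip -o link show`. This is an illustration under assumptions, not the script's code: the function names and sample strings are invented, the JSON path relies on the `ifname` and `operstate` fields, and on a platform where `--json` is rejected the caller would still have to rerun `ip` without that flag before using the text parser.

```python
import json
import re

# Matches lines such as "2: eth0: <...> mtu 1500 ... state UP ..." from `ip -o link show`.
LINK_LINE_REGEX = re.compile(r"^\d+:\s+([^:@\s]+)(?:@[^:\s]+)?:\s.*?\bstate\s+(\S+)")


def parse_ip_link_json(output: str) -> dict:
    """Parse `ip --json link show` output into {interface: state}."""
    return {
        entry["ifname"]: entry.get("operstate", "unknown").lower()
        for entry in json.loads(output)
    }


def parse_ip_link_text(output: str) -> dict:
    """Parse `ip -o link show` output into {interface: state}."""
    interfaces = {}
    for line in output.splitlines():
        match = LINK_LINE_REGEX.match(line)
        if match:
            interfaces[match.group(1)] = match.group(2).lower()
    return interfaces


def parse_ip_link(output: str) -> dict:
    """Try JSON first; fall back to the text format if the output is not JSON."""
    try:
        return parse_ip_link_json(output)
    except (ValueError, KeyError, TypeError):
        return parse_ip_link_text(output)


# Abbreviated, hypothetical samples for illustration only.
text_sample = (
    "2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT\n"
    "3: ib0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT"
)
json_sample = '[{"ifname": "eth0", "operstate": "UP"}]'
```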

kedeme · 2024-12-09T10:41:33Z

scripts/standby_node_health_check/standby_node_health_check.py

+            )
+            result = False
+        elif interfaces_status[self._mgmt_interface] != "up":
+            logger.warning(


What happens if all management interfaces are "down"? Shouldn't we get an error log message and exit?

Same answer as above: we decided to run all the tests no matter what. For now, we only give a result for each input interface.

kedeme · 2024-12-09T11:38:22Z

scripts/standby_node_health_check/standby_node_health_check.py

+    def print_summary_information(self):
+        logger.info("")
+        logger.info("Executive summary:")
+        if len(self._summary_actions) > 0:


Please add a comment: if we have something in the summary, it means we had some failures.

kedeme

see my comments and fix.

kobibar

Hi Boaz,
I think some messages still need to be modified.
As agreed, let's give it to QA/VER for testing.
Regards,
Kobi

boazhaim added 6 commits November 18, 2024 21:31

Not tested yet

4ec4a8a

Working ib and eth logic

ee5b6ff

Working until rings check

4298c10

for now

bf5f623

Working with happy flow

f1458a0

Adding ib to mlx map

36ec590

boazhaim added the WIP label Nov 20, 2024

boazhaim requested review from samerd, kobibar and kedeme November 20, 2024 13:19

Changed to class

2c87ad7

samerd reviewed Nov 21, 2024

1

2ada768

kedeme reviewed Nov 24, 2024

boazhaim added 4 commits November 24, 2024 21:48

Working with negative testing

34e126b

pylint_ruff

07e165b

remove comments

34faba4

Adding a readme

7c99a29

boazhaim changed the title ~~Standby node health check~~ issue:4173694 adding Standby node health check Nov 27, 2024

boazhaim removed the WIP label Nov 27, 2024

Bring back the rings checks

88fd93e

boazhaim requested a review from alextabachnik November 27, 2024 09:01

Revmoing the jenkins build in case of changes in the scripts dir

8ad8adf

boazhaim commented Nov 27, 2024

alextabachnik approved these changes Nov 27, 2024

Removing debug prints and fixing the old rings regex

9951eb8

kobibar reviewed Dec 2, 2024

kobibar reviewed Dec 4, 2024

boazhaim added 4 commits December 4, 2024 15:29

PR comments

b175ce4

Update the README

6da7584

PR comments

e7fc91a

Linter+ruff

d75a57c

kedeme reviewed Dec 9, 2024

boazhaim added 4 commits December 9, 2024 13:47

Merge branch 'main' into Standby-node-health-check

5219ce8

PR comments fixes

0ccdf68

Adding a third option for rings status check

9ebc60e

ruff

5ec4b98

kobibar approved these changes Dec 12, 2024
