Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue:4173694 adding Standby node health check #283

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

boazhaim
Copy link
Collaborator

@boazhaim boazhaim commented Nov 20, 2024

What

Adding a standalone script that when executed, checks the if the standby node is configured correctly.

Why ?

A client request

How ?

More info can be found in the README

Testing ?

Tested each step for positive and negative.
Tested the whole script as it is.

Special triggers

Use the following phrases as comments to trigger different runs

  • bot:retest rerun Jenkins CI (to rerun GitHub CI, use "Checks" tab on PR page and rerun all jobs)
  • bot:upgrade run additional update tests

@boazhaim boazhaim added the WIP label Nov 20, 2024
Copy link
Collaborator

@samerd samerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work Boaz.
please see my comments.

)
result = False
if result:
print("All given IB interfaces are active")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if all given IB interfaces are down, need to have a critical error message.

Copy link
Collaborator

@kedeme kedeme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see my comments and fix - in general I think we should have severity (Error / Warning / Info for each message we print to the console.

@boazhaim boazhaim changed the title Standby node health check issue:4173694 adding Standby node health check Nov 27, 2024
@boazhaim boazhaim removed the WIP label Nov 27, 2024
help="Specify one or more fabric interfaces (at least one is required), eg ib0, mlx3_0",
)

parser.add_argument(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only support one management interface for monitoring,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if ret_code != 0:
print("Failed to run ibdev2netdev")
return {}
line_regex = re.compile(r"^([\w\d_]+) port \d ==> ([\w\d]+)")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is more efficient to make the line_regex a class variable rather than a method variable and compile it once.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is relevant since all the functions are used only once.
But I agree that it will be more readable to have them in one location.

Fixed :)

result = subprocess.run(
command,
shell=True,
stdout=subprocess.PIPE,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to redirect stderr to stdout: "stderr=subprocess.STDOUT"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the analysis we don't care about the stderr

try:
result = subprocess.run(
command,
shell=True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to set "shell=False" by default, you can add it as an argument to the function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to False, I don't think it should be an arg as it is also changes the command we need to pass to the run command (one string or a dict of string)

def _get_ib_to_mlx_port_mapping(cls):
ret_code, stdout = cls._run_command("ibdev2netdev")
if ret_code != 0:
print("Failed to run ibdev2netdev")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to send the error message to syslog for debug

):
print(
f"IB interface {ib_interface} is not active "
f"{ib_interfaces_status[ib_interface_to_validate]}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure printing the dictionary is a user friendly message:
"IB interface ib3 is not active {'State': 'down', 'Physical_state': 'polling'}"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

def _check_corosync_rings_status(cls):
ret_code, corosync_output = cls._run_command("corosync-cfgtool -s")
if ret_code != 0:
print("Failed to run corosync-cfgtool -s")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please output to the user more general messages related to HA interfaces rather than corosync/RING/...
You may write to log more informative messages for out debug.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


@classmethod
def _check_pcs_status(cls):
command = "pcs status"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to define all the commands as constants
It will be easier to maintain.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done



def main(args):
standby_node_checker = StandbyNodeHealthChecker(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to wrap the main with try...except for unexpected errors + allow the user to press CTRL+C to quit the test

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


## What the script is checking
1. checking if all given fabric interface are up.
2. Checking if all given management interface are up.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Checking if all given management interface are up.
2. Checking if all given management interfaces are up.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, per Kobi, we are checking only one management interface. I updated a few lines above to reflect it also

HA_STATUS_COMMAND = UFM_HA_CLUSTER_COMMAND.format("status")
IBDEV2NETDEV_COMMAND = "ibdev2netdev"
IBSTAT_COMMAND = "ibstat"
IP_LINK_SHOW_COMMAND = "ip --json link show"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure the --json option works on all platforms? on my RHEL7 it doesn't work?

ib_interface_to_validate, ib_interfaces_status
):
logger.warning(
"IB interface %s is not active",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happen if all IB interfaces are not active?
shouldn't we log error message and exit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to run all the checks even if one is failing.
A general warning will be in the summary if any interference is down.

def _parse_ip_link_output(cls, ip_link_output: str):
interfaces = {}
try:
link_data = json.loads(ip_link_output)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to verify the iblinkinfo json output format is available for all OSs -
other wise you will need to use other parser...

)
result = False
elif interfaces_status[self._mgmt_interface] != "up":
logger.warning(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happen if all management interfaces are "down" - shouldn't we get error log message and exit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same answer as above - we decided to do all the tests no matter what. And for now, only give a result per each input interface

def print_summary_information(self):
logger.info("")
logger.info("Executive summary:")
if len(self._summary_actions) > 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add comment - if we have something in the summary it means we had some failures...

Copy link
Collaborator

@kedeme kedeme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments and fix.

Copy link
Collaborator

@kobibar kobibar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Boaz,
I think some messages still need to be modified.
As agreed, let's give it to QA/VER for testing.
Regards,
Kobi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants