issue:4173694 adding Standby node health check #283
base: main
Conversation
Great work, Boaz.
Please see my comments.
```python
        )
        result = False
    if result:
        print("All given IB interfaces are active")
```
If all given IB interfaces are down, we need a critical error message.
Please see my comments and fix. In general, I think we should have a severity (Error / Warning / Info) for each message we print to the console.
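The severity idea could be sketched like this with the standard `logging` module. The function and logger names are illustrative, not the PR's actual code:

```python
import logging

# Route every console message through a logger so each line carries a
# severity prefix, as the reviewer asks. Logger name is a placeholder.
logging.basicConfig(format="%(levelname)s: %(message)s", level=logging.INFO)
logger = logging.getLogger("standby_health_check")

def classify_ib_status(active: int, total: int):
    """Map an IB interface check result to a (severity, message) pair."""
    if active == 0:
        # all interfaces down => critical, per the review comment above
        return "ERROR", f"All {total} IB interfaces are down"
    if active < total:
        return "WARNING", f"{total - active} of {total} IB interfaces are down"
    return "INFO", "All given IB interfaces are active"

def report_ib_interfaces(active: int, total: int):
    level, message = classify_ib_status(active, total)
    getattr(logger, level.lower())(message)  # logger.error / .warning / .info
```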
```python
    help="Specify one or more fabric interfaces (at least one is required), eg ib0, mlx3_0",
)

parser.add_argument(
```
We only support one management interface for monitoring,
fixed
```python
if ret_code != 0:
    print("Failed to run ibdev2netdev")
    return {}
line_regex = re.compile(r"^([\w\d_]+) port \d ==> ([\w\d]+)")
```
It is more efficient to make the line_regex a class variable rather than a method variable and compile it once.
I'm not sure it is relevant since all the functions are used only once.
But I agree that it will be more readable to have them in one location.
Fixed :)
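The agreed change might look like this. The class name and method are illustrative; the regex is the one visible in the diff:

```python
import re

class IbdevOutputParser:
    # Compiled once at class definition time instead of per call,
    # per the review suggestion. Matches e.g. "mlx5_0 port 1 ==> ib0 (Up)".
    LINE_REGEX = re.compile(r"^([\w\d_]+) port \d ==> ([\w\d]+)")

    @classmethod
    def parse(cls, output: str) -> dict:
        """Map net interface name -> mlx device from ibdev2netdev output."""
        mapping = {}
        for line in output.splitlines():
            match = cls.LINE_REGEX.match(line)
            if match:
                mlx_device, net_interface = match.groups()
                mapping[net_interface] = mlx_device
        return mapping
```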
```python
result = subprocess.run(
    command,
    shell=True,
    stdout=subprocess.PIPE,
```
I recommend to redirect stderr to stdout: "stderr=subprocess.STDOUT"
For the analysis we don't care about the stderr.
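The two capture strategies discussed could be sketched like this; the wrapper name and the `merge_stderr` flag are hypothetical, not part of the PR:

```python
import subprocess

def run_command(command: list, merge_stderr: bool = False):
    """Run a command and capture stdout.

    With merge_stderr=True, stderr is folded into stdout
    (stderr=subprocess.STDOUT), as the reviewer recommends;
    otherwise stderr is discarded, matching the author's stance.
    """
    result = subprocess.run(
        command,
        shell=False,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT if merge_stderr else subprocess.DEVNULL,
        text=True,
    )
    return result.returncode, result.stdout
```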
```python
try:
    result = subprocess.run(
        command,
        shell=True,
```
I recommend to set "shell=False" by default, you can add it as an argument to the function.
Changed to False. I don't think it should be an arg, as it also changes the command we need to pass to the run command (one string vs. a list of strings).
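The trade-off the author mentions is that `shell=False` needs the command as an argument list rather than one string. A small hedged sketch of bridging the two forms (helper name is hypothetical):

```python
import shlex

def to_argv(command):
    """Normalize a command to the list form required by shell=False.

    Accepts either a single command string ("pcs status") or an
    already-split argument list, so call sites don't have to change.
    """
    return shlex.split(command) if isinstance(command, str) else list(command)
```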
```python
def _get_ib_to_mlx_port_mapping(cls):
    ret_code, stdout = cls._run_command("ibdev2netdev")
    if ret_code != 0:
        print("Failed to run ibdev2netdev")
```
I recommend to send the error message to syslog for debug
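The syslog suggestion could be sketched with the standard `SysLogHandler`; the logger name and the `/dev/log` socket path are the usual Linux defaults, not values from the PR:

```python
import logging
import logging.handlers

def make_syslog_logger(name: str = "standby_health_check") -> logging.Logger:
    """Logger that sends debug details to syslog for later analysis."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    try:
        handler = logging.handlers.SysLogHandler(address="/dev/log")
    except OSError:
        # /dev/log may be absent (e.g. minimal containers); fall back to stderr.
        handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```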
```python
):
    print(
        f"IB interface {ib_interface} is not active "
        f"{ib_interfaces_status[ib_interface_to_validate]}"
```
Not sure printing the dictionary is a user friendly message:
"IB interface ib3 is not active {'State': 'down', 'Physical_state': 'polling'}"
Fixed
```python
def _check_corosync_rings_status(cls):
    ret_code, corosync_output = cls._run_command("corosync-cfgtool -s")
    if ret_code != 0:
        print("Failed to run corosync-cfgtool -s")
```
Please output to the user more general messages related to HA interfaces rather than corosync/RING/...
You may write more informative messages to the log for our debugging.
Fixed
```python
@classmethod
def _check_pcs_status(cls):
    command = "pcs status"
```
I recommend to define all the commands as constants
It will be easier to maintain.
Done
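The agreed constants might look like the following. The names `HA_STATUS_COMMAND`, `IBDEV2NETDEV_COMMAND`, `IBSTAT_COMMAND`, and `IP_LINK_SHOW_COMMAND` appear later in the diff; the value of `UFM_HA_CLUSTER_COMMAND` is my assumption:

```python
# External commands defined in one place for easier maintenance,
# per the review suggestion. UFM_HA_CLUSTER_COMMAND's value is a guess.
UFM_HA_CLUSTER_COMMAND = "ufm_ha_cluster {}"
HA_STATUS_COMMAND = UFM_HA_CLUSTER_COMMAND.format("status")
IBDEV2NETDEV_COMMAND = "ibdev2netdev"
IBSTAT_COMMAND = "ibstat"
COROSYNC_STATUS_COMMAND = "corosync-cfgtool -s"
PCS_STATUS_COMMAND = "pcs status"
IP_LINK_SHOW_COMMAND = "ip --json link show"
```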
```python
def main(args):
    standby_node_checker = StandbyNodeHealthChecker(
```
I recommend to wrap the main with try...except for unexpected errors + allow the user to press CTRL+C to quit the test
Done
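A minimal sketch of the agreed wrapper, assuming a `run_checks` callable standing in for the real entry point:

```python
def safe_main(run_checks):
    """Wrap the checker's main with handlers for Ctrl+C and surprises."""
    try:
        return run_checks()
    except KeyboardInterrupt:
        # Let the user abort the test cleanly with Ctrl+C.
        print("Interrupted by user, exiting.")
        return 130  # conventional exit code for SIGINT
    except Exception as exc:  # unexpected failure: report instead of traceback
        print(f"Unexpected error: {exc}")
        return 1
```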
```markdown
## What the script is checking
1. checking if all given fabric interface are up.
2. Checking if all given management interface are up.
```
Suggested change:
```diff
-2. Checking if all given management interface are up.
+2. Checking if all given management interfaces are up.
```
No, per Kobi, we are checking only one management interface. I updated a few lines above to reflect it also
```python
HA_STATUS_COMMAND = UFM_HA_CLUSTER_COMMAND.format("status")
IBDEV2NETDEV_COMMAND = "ibdev2netdev"
IBSTAT_COMMAND = "ibstat"
IP_LINK_SHOW_COMMAND = "ip --json link show"
```
Are you sure the --json option works on all platforms? On my RHEL7 it doesn't work.
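One way to handle the portability concern is to probe for `ip --json` support once and fall back to plain-text parsing when it is absent (older iproute2, e.g. RHEL 7). The function name is illustrative:

```python
import subprocess

def ip_supports_json() -> bool:
    """Return True if this host's `ip` binary accepts the --json flag."""
    try:
        probe = subprocess.run(
            ["ip", "--json", "link", "show"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:  # no `ip` binary at all
        return False
    # Old iproute2 exits non-zero with "Option \"--json\" is unknown".
    return probe.returncode == 0
```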
```python
    ib_interface_to_validate, ib_interfaces_status
):
    logger.warning(
        "IB interface %s is not active",
```
What happens if all IB interfaces are not active?
Shouldn't we log an error message and exit?
We decided to run all the checks even if one is failing.
A general warning will be in the summary if any interface is down.
```python
def _parse_ip_link_output(cls, ip_link_output: str):
    interfaces = {}
    try:
        link_data = json.loads(ip_link_output)
```
Need to verify the iblinkinfo JSON output format is available for all OSs -
otherwise you will need to use another parser...
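A hedged sketch of such a dual parser: try JSON first, then fall back to the classic `ip link show` text format. Function names and the text-format regex are my assumptions, not the PR's code:

```python
import json
import re

# Matches e.g. "2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 ... state UP mode ..."
IP_LINK_LINE = re.compile(r"^\d+: ([\w.@-]+): <[^>]*>.* state (\S+)")

def parse_ip_link_text(output: str) -> dict:
    """Fallback parser for plain-text `ip link show` output."""
    interfaces = {}
    for line in output.splitlines():
        match = IP_LINK_LINE.match(line)
        if match:
            name, state = match.groups()
            interfaces[name.split("@")[0]] = state.lower()
    return interfaces

def parse_ip_link(output: str) -> dict:
    """Accept either JSON (`ip --json link show`) or plain-text output."""
    try:
        link_data = json.loads(output)
    except json.JSONDecodeError:
        return parse_ip_link_text(output)
    return {entry["ifname"]: entry.get("operstate", "unknown").lower()
            for entry in link_data}
```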
```python
    )
    result = False
elif interfaces_status[self._mgmt_interface] != "up":
    logger.warning(
```
What happens if all management interfaces are "down" - shouldn't we get an error log message and exit?
Same answer as above - we decided to do all the tests no matter what, and for now only give a result per each input interface.
```python
def print_summary_information(self):
    logger.info("")
    logger.info("Executive summary:")
    if len(self._summary_actions) > 0:
```
please add comment - if we have something in the summary it means we had some failures...
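The invariant the reviewer wants documented could be sketched like this (function and attribute names are illustrative):

```python
def summarize(summary_actions: list) -> list:
    """Build the executive summary lines.

    `summary_actions` collects one entry per failed check, so a
    non-empty list means at least one check failed.
    """
    lines = ["Executive summary:"]
    if summary_actions:  # non-empty => we had some failures
        lines.append("The standby node is NOT configured correctly:")
        lines.extend(f"  - {action}" for action in summary_actions)
    else:
        lines.append("All checks passed, standby node looks healthy.")
    return lines
```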
see my comments and fix.
Hi Boaz,
I think some messages still need to be modified.
As agreed, let's give it to QA/VER for testing.
Regards,
Kobi
What
Adding a standalone script that, when executed, checks if the standby node is configured correctly.
Why?
A client request.
How?
More info can be found in the README.
Testing?
Tested each step for positive and negative cases.
Tested the whole script as it is.
Special triggers
Use the following phrases as comments to trigger different runs:
- `bot:retest` - rerun Jenkins CI (to rerun GitHub CI, use the "Checks" tab on the PR page and rerun all jobs)
- `bot:upgrade` - run additional update tests