
CLI support for SmartSwitch PMON #3271

Open
wants to merge 145 commits into base: master

Conversation

rameshraghupathy

@rameshraghupathy rameshraghupathy commented Apr 14, 2024

What I did

Enhanced the following CLIs to support SmartSwitch PMON as described in the PMON HLD documentation "https://github.com/sonic-net/SONiC/blob/d19d8933a43d0a31a4f3b2310f4336f289bca340/doc/smart-switch/pmon/smartswitch-pmon.md"

CLIs:
Added support for the new module "DPUX" to commands 1 and 2 below
1. "config chassis modules startup DPUX", where X ranges from 0 to the maximum number of DPUs in the SmartSwitch chassis minus 1
2. "config chassis modules shutdown DPUX"

Extended the following CLIs to support the new module "DPUX" and also provided an "all" option to display the "SWITCH" and all "DPUX" modules
1. "show reboot-cause" will remain the same; added "show reboot-cause all"
2. "show reboot-cause history" will remain the same; added "show reboot-cause history <module-name>", where the module name can be DPUX, SWITCH, or all.

Extended the following CLIs to support the new module "DPUX" and also provided an "all" option to display the "SWITCH" and all "DPUX" modules
1. "show system-health summary" will remain the same; added sub-command "show system-health summary <module-name>", where the module name can be DPUX, SWITCH, or all.
2. "show system-health monitor-list" will remain the same; added sub-command "show system-health monitor-list <module-name>", where the module name can be DPUX, SWITCH, or all.
3. "show system-health detail" will remain the same; added sub-command "show system-health detail <module-name>", where the module name can be DPUX, SWITCH, or all.
4. Added a new sub-command "show system-health dpu <module-name>", where the module name can be DPUX or all. This new sub-command provides additional DPU state details as mentioned in the HLD
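
To illustrate the module argument described above, here is a minimal sketch of how a value such as DPUX, SWITCH, or all could be expanded into the list of modules to display. The helper name `resolve_modules`, the `num_dpus` parameter, and the choice to treat a missing argument as "SWITCH only" are assumptions made for illustration; they are not taken from this PR's code.

```python
def resolve_modules(module_name, num_dpus):
    """Expand a CLI module argument into the modules to report on.

    Hypothetical helper for illustration only; not part of this PR.
    """
    dpus = [f"DPU{i}" for i in range(num_dpus)]

    if module_name is None:
        # Assumption: no argument keeps the original, switch-only output.
        return ["SWITCH"]
    if module_name == "all":
        # "all" covers the switch plus every DPU in the chassis.
        return ["SWITCH"] + dpus
    if module_name == "SWITCH" or module_name in dpus:
        return [module_name]

    expected = ", ".join(["SWITCH", "all"] + dpus)
    raise ValueError(f"Unknown module '{module_name}'; expected one of: {expected}")
```

For a chassis with two DPUs, `resolve_modules("all", 2)` yields the switch followed by DPU0 and DPU1, while `resolve_modules("DPU5", 2)` raises an error instead of silently printing an empty table. The "show system-health dpu" sub-command would need a variant that does not accept SWITCH.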

How I did it

  1. Kept the original CLI output unaltered
  2. Added sub command to support SmartSwitch "DPUs"
  3. Added additional code in chassisd, and in platform modules.py, chassis.py to support it
  4. Updated the DB tables as mentioned in the PMON HLD

How to verify it

  1. Build an image with the required files (refer to the other upstream PRs and the platform PRs)
    Required files:
    - This PR (including reboot_cause.py, chassis_modules.py, system_health.py)
    - The other PR (including module_base.py, chassis_base.py, docker-pmon.supervisord.conf.j2, chassisd, mock_module_base.py, and the appropriate database_config.json)
    - Platform "platform-cisco-8000" supporting PMON (module.py, chassis.py, inventory.py, pmon_daemon_control.json, and the required grpc and DB changes)
  2. Run the CLIs and see the new output

Previous command output (if the output of a command-line utility has changed)

root@sonic:~# show reboot-cause
Unknown

root@sonic:~# show reboot-cause history
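Name                 Cause       Time  User  Comment
-------------------  ----------  ----  ----  ----------------------------------------------------------------
2023_06_19_11_00_24  Power Loss  N/A   N/A   Unknown (First boot of SONiC version 202311.10869-dirty-2024044)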

New command output (if the output of a command-line utility has changed)

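root@sonic:~# show reboot-cause history all
Device  Name                 Cause       Time  User  Comment
------  -------------------  ----------  ----  ----  ----------------------------------------------------------------
SWITCH  2023_06_19_11_00_24  Power Loss  N/A   N/A   Unknown (First boot of SONiC version 202311.10869-dirty-2024044)

root@sonic:~# show reboot-cause history SWITCH
Device  Name                 Cause       Time  User  Comment
------  -------------------  ----------  ----  ----  ----------------------------------------------------------------
SWITCH  2023_06_19_11_00_24  Power Loss  N/A   N/A   Unknown (First boot of SONiC version 202311.10869-dirty-2024044)

root@sonic:~# show reboot-cause history DPU0
Device  Name  Cause  Time  User  Comment
------  ----  -----  ----  ----  -------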


@oleksandrivantsiv
Collaborator

Can you please add UT for the new functions?

vvolam previously approved these changes Oct 9, 2024

@vvolam vvolam left a comment


@rameshraghupathy As discussed offline, please update the shutdown CLI to follow the pre-shutdown steps listed in sonic-net/SONiC#1699. Shutdown or power-down should also follow the same pre-shutdown steps for the DPU.

@oleksandrivantsiv
Collaborator

@rameshraghupathy, @prgeor According to the Smart Switch PMON HLD, the DPU reboot cause and the reboot history should be stored in a file on the host side. However, I don't see this implemented here.

3.1.5.1 Need for consistent storage and access of DPU reboot cause, state and health
The smartswitch needs to know the reboot cause for DPUs. Please refer to the CLI section for the various options and their effects when executed on the switch and DPUs.
Each DPU will update its reboot cause history in the Switch ChassisStateDB upon boot up. The recent reboot-cause can be derived from that list of reboot-causes.
The get_reboot_cause will return the current reboot-cause of the module.
For persistent storage of the DPU reboot-cause and reboot-cause-history files use the existing host storage path and mechanism.

@rameshraghupathy
Author

@rameshraghupathy As discussed offline, please update the shutdown CLI to follow the pre-shutdown steps listed in sonic-net/SONiC#1699. Shutdown or power-down should also follow the same pre-shutdown steps for the DPU.
@vvolam Yes, this will be done eventually when the graceful reboot cases are tested.

@rameshraghupathy
Author

@rameshraghupathy, @prgeor According to the Smart Switch PMON HLD, the DPU reboot cause and the reboot history should be stored in a file on the host side. However, I don't see this implemented here.

3.1.5.1 Need for consistent storage and access of DPU reboot cause, state and health
The smartswitch needs to know the reboot cause for DPUs. Please refer to the CLI section for the various options and their effects when executed on the switch and DPUs.
Each DPU will update its reboot cause history in the Switch ChassisStateDB upon boot up. The recent reboot-cause can be derived from that list of reboot-causes.
The get_reboot_cause will return the current reboot-cause of the module.
For persistent storage of the DPU reboot-cause and reboot-cause-history files use the existing host storage path and mechanism.

@oleksandrivantsiv

  1. As we discussed, and as mentioned in the HLD, I'm adding support for persistent storage of the DPU reboot cause/history on the NPU in sonic-host-services. That will be another PR, as it is a different repo.
  2. The get_reboot_cause API will fetch the DPU reboot-cause as implemented by the vendor.
  3. On DPU bootup, during module init, the get_reboot_cause API will be used to fetch the cause, which will be written to the DB and also saved in the persistent storage.
  4. On an NPU reboot, the reboot-cause for all the DPUs will be populated into the DB from the persistent storage.
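
As a rough sketch of steps 3 and 4 above, the flow could look like the following. Everything here is an assumption made for illustration: a plain dict stands in for the DB, the `REBOOT_CAUSE|DPUX` key and the one-JSON-file-per-DPU layout are hypothetical, and the function names are not from sonic-host-services or this PR.

```python
import json
import os
import tempfile


def record_dpu_reboot_cause(db, storage_dir, dpu, cause):
    """Step 3: on DPU bootup, write the cause to the DB and to disk."""
    db[f"REBOOT_CAUSE|{dpu}"] = cause  # hypothetical key format
    os.makedirs(storage_dir, exist_ok=True)
    with open(os.path.join(storage_dir, f"{dpu}.json"), "w") as f:
        json.dump({"cause": cause}, f)


def restore_dpu_reboot_causes(db, storage_dir):
    """Step 4: on an NPU reboot, repopulate the DB from disk."""
    if not os.path.isdir(storage_dir):
        return  # nothing persisted yet
    for name in sorted(os.listdir(storage_dir)):
        if name.endswith(".json"):
            dpu = name[: -len(".json")]
            with open(os.path.join(storage_dir, name)) as f:
                db[f"REBOOT_CAUSE|{dpu}"] = json.load(f)["cause"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as storage:
        db = {}
        record_dpu_reboot_cause(db, storage, "DPU0", "Power Loss")

        # Simulate an NPU reboot: the DB starts empty, the files survive.
        db_after_reboot = {}
        restore_dpu_reboot_causes(db_after_reboot, storage)
        print(db_after_reboot)
```

The point of the sketch is only the ordering: the DB is the live view, and the file is the source of truth that survives an NPU reboot.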

@@ -110,8 +110,9 @@ def shutdown_chassis_module(db, chassis_module_name):

     if not chassis_module_name.startswith("SUPERVISOR") and \
        not chassis_module_name.startswith("LINE-CARD") and \
-       not chassis_module_name.startswith("FABRIC-CARD"):
-        ctx.fail("'module_name' has to begin with 'SUPERVISOR', 'LINE-CARD' or 'FABRIC-CARD'")
+       not chassis_module_name.startswith("FABRIC-CARD") and \

@gpunathilell gpunathilell Oct 15, 2024


We need additional validation to check that chassis_module_name is actually present (that is, a valid module name). If a user executes config chassis modules startup DPU5 on a system that does not have DPU5, it crashes the SmartSwitchConfigManagerTask in chassisd, which prevents any further startup or shutdown calls. The command output would still be "Starting up chassis module DPU1" or "Shutting down chassis module DPU1", but the only operation actually performed is the addition to or removal from CONFIG_DB.
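
A minimal sketch of the kind of check suggested here, assuming the set of modules actually present on the system is available to the CLI (where that set comes from is left open). The function name, the dict standing in for CONFIG_DB, and the `CHASSIS_MODULE|<name>` key with `admin_status` are assumptions for illustration, not the code in this PR.

```python
def set_module_admin_down(config_db, present_modules, module_name):
    """Reject unknown modules before anything is written to CONFIG_DB.

    config_db is a plain dict standing in for CONFIG_DB (assumption).
    present_modules is the set of module names that exist on this system.
    """
    if module_name not in present_modules:
        # Fail in the CLI so that chassisd never sees a bogus entry.
        raise ValueError(
            f"'{module_name}' is not present on this system; "
            f"valid modules: {', '.join(sorted(present_modules))}"
        )
    # Hypothetical key layout for the shutdown request.
    config_db[f"CHASSIS_MODULE|{module_name}"] = {"admin_status": "down"}
```

With `present_modules = {"DPU0", "DPU1"}`, a shutdown request for DPU5 raises an error and leaves `config_db` untouched, rather than printing a success message and writing an entry that the daemon cannot handle.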

changes such as 1. STATE_DB vs CHASSIS_STATE_DB and the key info