Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to remove dummy SAI objects after warm-reboot causing orchagent crash #1429

Open
Stephenxf opened this issue Oct 8, 2024 · 28 comments

Comments

@Stephenxf
Copy link

Summary

We have a bunch of switches with old 201811-based image to be upgraded to our next available image based off 202012. The warm-reboot upgrade went through, but the next warm-reboot (same 202012-based image, no upgrade) would end up with orchagent crash. The cause of that is syncd trying to remove some dummy SAI objects created by the previous warm-reboot upgrade but failing with SAI_STATUS_INVALID_PARAMETER error.

Details

1. With the 201811 image (before the warm-reboot upgrade), we have some QoS configs that create 5 SAI_OBJECT_TYPE_BUFFER_PROFILE entries.

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b10
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0d
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_STATIC_TH     : 15982720
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_STATIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b11
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0e
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b12
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 0
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 3
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b13
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 138112
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b14
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 53664
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

2. After warm-reboot upgrade to 202012-based image, the 5 objects are still present (with different oid's), and 5 dummy objects are also created. This might be a result of BRCM SAI upgrade.

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd3

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd4

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd5

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd6

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd7

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c60
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5d
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_STATIC_TH     : 15982720
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_STATIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c61
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5e
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c62
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 0
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 3
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c63
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 138112
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c64
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 53664
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

3. With one additional warm-reboot to the same 202012-based image, syncd detects discrepancy of SAI_OBJECT_TYPE_BUFFER_PROFILE between current view (10 objects) and temp view (5 objects).

WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_BUFFER_PROFILE on current view 10 is different than on temporary view: 5

Then the attempt to remove the 5 dummy objects via executeOperationsOnAsic leads to SAI errors, eventually syncd stopped and orchagent exited.

Oct  8 09:18:02.721171 fab01 ERR syncd#syncd: [none] SAI_API_BUFFER:brcm_sai_remove_buffer_profile:2423 Getting buffer pool data failed with error -5.
Oct  8 09:18:02.721255 fab01 ERR syncd#syncd: :- asic_handle_generic: remove SAI_OBJECT_TYPE_BUFFER_PROFILE RID: oid:0x1900000001 VID oid:0x19000000000bd3 failed: SAI_STATUS_INVALID_PARAMETER
Oct  8 09:18:02.721255 fab01 ERR syncd#syncd: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3, status: SAI_STATUS_INVALID_PARAMETER
Oct  8 09:18:02.721419 fab01 NOTICE syncd#syncd: :- executeOperationsOnAsic: asic apply took 0.014296 sec
Oct  8 09:18:02.721445 fab01 ERR syncd#syncd: :- executeOperationsOnAsic: Error while executing asic operations, ASIC is in inconsistent state: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3, status: SAI_STATUS_INVALID_PARAMETER
Oct  8 09:18:02.765644 fab01 NOTICE syncd#syncd: :- applyView: apply took 0.687701 sec
Oct  8 09:18:02.767236 fab01 ERR syncd#syncd: :- run: Runtime error: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3, status: SAI_STATUS_INVALID_PARAMETER
Oct  8 09:18:02.767335 fab01 NOTICE syncd#syncd: :- sendShutdownRequest: sending switch_shutdown_request notification to OA for switch: oid:0x21000000000000
Oct  8 09:18:02.769449 fab01 NOTICE syncd#syncd: :- sendShutdownRequestAfterException: notification send successfully
Oct  8 09:18:02.769614 fab01 ERR swss#orchagent: :- syncd_apply_view: Failed to notify syncd APPLY_VIEW -1
Oct  8 09:18:02.769962 fab01 ERR swss#orchagent: :- on_switch_shutdown_request: Syncd stopped
Oct  8 09:18:02.770225 fab01 INFO swss#/supervisord: orchagent free(): corrupted unsorted chunks
Oct  8 09:18:02.770952 fab01 INFO kernel: [   71.180447] traps: orchagent[2815] general protection ip:7f3abd15ae87 sp:7f3ab7ffe710 error:0 in libc-2.28.so[7f3abd143000+147000]
Oct  8 09:18:03.061421 fab01 INFO swss#supervisord 2024-10-08 09:18:03,060 INFO exited: orchagent (terminated by SIGSEGV (core dumped); not expected)

In this problem state, the ASIC_DB has 10 actual objects (including the 5 dummy ones) and 5 temp objects.

127.0.0.1:6379[1]> KEYS *BUFFER_PROFILE*
 1) "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2d"
 2) "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2c"
 3) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c63"
 4) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3"
 5) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c61"
 6) "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2e"
 7) "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2b"
 8) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c64"
 9) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c60"
10) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c62"
11) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd6"
12) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd4"
13) "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2f"
14) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd5"
15) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd7"

Next Steps

Can you please help us understand the potential issue here?

  • In 2, is it expected to have extra SAI_OBJECT_TYPE_BUFFER_PROFILE objects created? This is different from PORT_SERDES objects, which were not there before upgrade but introduced as default objects after upgrade. Are these extra objects supposed to stay even after warm-reboot process is complete?
  • In 3, is it expected to call SAI remove() API to remove those extra/dummy objects? If so, is it possibly a BRCM SAI bug that the remove() operation fails?

Thanks in advance for assistance.

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 9, 2024

hey, what do you mean dummy objects? syncd is creating objects based on real object id that could be discovered during warm boot to new firmware, then those objects will be created in asic-db without any attributes.

Since there is some issue here, it could be due to a bug in brcm or syncd, but i having a lot of similar issues with brcm, i would lean to that this is a bug on their side, probably, the same objects get different object IDs after warm boot

from your description it seems like you are doing :

  • cold boot to 201811
  • warm boot to 202012
  • warm boot to 202012

and that second warm boot fails.

for better analyze can you attach syslog from entire process from first cold boot to the last ?, if it's confidential, please send that to me directly on my email or teams

@Stephenxf
Copy link
Author

Regarding dummy objects, I was referring to the ones created in SaiSwitch::checkWarmBootDiscoveredRids():

1185           // this means that some new objects were discovered but they are
1186           // not present in current ASIC_VIEW, and we need to create dummy
1187           // entries for them
1188
1189           redisSetDummyAsicStateForRealObjectId(rid);

The problem seems to be that after the warm-reboot from 201811 to 202012, apart from the 5 real objects that are same as what existed on 201811 prior to upgrade, there are 5 additional objects that are the dummy ones with no attributes. I shared the saidump output earlier (after step 2); asic_db output showed the same:

$ redis-cli
127.0.0.1:6379> SELECT 1
OK
127.0.0.1:6379[1]> KEYS *BUFFER_PROFILE*
 1) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c62"
 2) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c64"
 3) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd4"
 4) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c60"
 5) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3"
 6) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c63"
 7) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd7"
 8) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c61"
 9) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd5"
10) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd6"
127.0.0.1:6379[1]>
127.0.0.1:6379[1]> HGETALL "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c62"
1) "SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE"
2) "SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC"
3) "SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH"
4) "3"
5) "SAI_BUFFER_PROFILE_ATTR_POOL_ID"
6) "oid:0x18000000000c5f"
7) "SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE"
8) "0"
127.0.0.1:6379[1]> HGETALL "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd4"
1) "NULL"
2) "NULL"
127.0.0.1:6379[1]>

I assume the 5 objects with oid ending with c60, c61, c62, c63 and c64 are the real objects; the 5 ones with oid ending with bd3, bd4, bd5, bd6 and bd7 are the dummy ones that are supposed to be transient, but remain there after warm-reboot is complete.

Yes, you got the steps all correct. I have records for all 3 steps (one cold boot followed by two warm boots). I'll get you the logs separately shortly. Thanks!

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 10, 2024

my suspicion is those additional 5 objects, are the same as they were on 202018, but broadcom changed something that their OIDs are differenet then previously, this actually happened before, and im not sure whether that was BUFFER_PROFILE as well,

so what was happening since those buffer profiles were created by orchagend and assigned to other objects, they maintain this OID value, but when doing GET after warm boot, other OID value was returned, im 99% sure that this is the case

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 10, 2024

please also provide recordings from /var/log/swss/sairedis.rec* files as they will be essential to have configuration

@Stephenxf
Copy link
Author

Appreciate your input! I've sent the sairedis.rec files. If it's something to do with Broadcom SAI, I wouldn't expect any change from them as the version is pretty old. In this situation, for the upgrade to proceed with warm-reboot, we would be looking for sort of a workaround. Is there any fix or workaround available as you know? Specifically:

  • (preferred) we can make changes in syncd and/or sairedis of our own codebase.
  • (doable) after the first warm boot (upgrade), we can trigger a process to clean up the problem state (e.g., update/delete some SAI objects) to facilitate next warm boots.

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 10, 2024

ok, so i analyzed logs you sent me, and those logs confirms my suspicion:

  1. you do cold boot to 201811 syncd discovered 0 (zero) buffer profiles and get orchagent created 5 buffer profiles (s1_saidump):
SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b10
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0d
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_STATIC_TH     : 15982720
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_STATIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b11
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0e
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b12
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 0
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 3
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b13
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 138112
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b14
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 53664
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

you have 5 profiles ending from b10 to b14

  1. you boot to 202012 warm boot and this is what you get in DB (s2_saidump)

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd3

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd4

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd5

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd6

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd7

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c60
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5d
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_STATIC_TH     : 15982720
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_STATIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c61
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5e
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c62
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 0
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 3
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c63
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 138112
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c64
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 53664
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

you have now 10 buffer profiles, and since OA generated only 5 buffer profiles, they were matched from previous existed ones:

with syncd comparison logic
buffer 0x19000000000c60 was matched with 0x19000000000b10
0x19000000000c61 with 0x19000000000b11
0x19000000000c62 with 0x19000000000b12 
0x19000000000c63 with 0x19000000000b13
0x19000000000c64 with 0x19000000000b14

you can check that with matching RID values from table RIDTOVID and VIDTORID but those tables exists only on redisdb not in saidump since saidump dont dump any non SAI related objects

VID value oid:0x19000000000b12 from 201811 will have the same RID as VID buffer 0x19000000000c62 for 202012

so objects from b10 to b14 on 201811 were correctly matched with temporary objects on 202012 and reassigned values from c60 to c64 (they could have different order, but the range is correct)

now for remaining objects from 202012 warm boot from bd3 to bd7, those 5 objects with null values in s2_asic_db, those are objects which are discovered during 1st warm boot after switch create when we do discovery logic

warm boot succeeded, previous 5 buffer profiles were remapped to new temporary 5 buffer profiles,
but those "new" 5 buffer profiles were discovered, and they are artifacts and have different VID values, but they actually correspond to those existing buffer profiles created in cold boot, because BRCM changed OID encoding on buffer profiles (or a bug encoding), and what happens buffer profiles created in orchagent are assigned in different places

and since previously those buffer profiles had RID values created by OA already have RID in database (RID from 201811) this RID is different from 202012 RID when broadcom changed OID encoding (or bug endoding)

so comparison logic is reassigning old RID since it's exists on ASIC_DB, but syncd is discovering new RID which don't match anything in APP_DB and creates dummy objects. those newly discovered RID values are probably correct ones (if broadcom changed encoding) and those RID which were returned in 201811 branch are now incorrect (unless enoding bug and both them should be the same)

of course from our perspective RID values for those buffers should be the same in 201811 and in 202012 branches

on 1st warm boot you can see what discovery returned:

 113 Oct  8 09:04:36.126100 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: discover took 0.075666 sec
 114 Oct  8 09:04:36.126100 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: discovered objects count: 2718
 115 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_PORT: 65
 116 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_VIRTUAL_ROUTER: 1
 117 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_STP: 1
 118 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP: 1
 119 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_POLICER: 1
 120 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_QUEUE: 1328
 121 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SCHEDULER: 3
 122 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SCHEDULER_GROUP: 715
 123 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_BUFFER_POOL: 5
 124 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_BUFFER_PROFILE: 10
 125 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP: 520
 126 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SWITCH: 1
 127 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_VLAN: 1
 128 Oct  8 09:04:36.127944 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_BRIDGE: 1
 129 Oct  8 09:04:36.128022 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_PORT_SERDES: 65
 130 Oct  8 09:04:36.134349 lnos-x1-a-fab01 NOTICE syncd#syncd: :- checkWarmBootDiscoveredRids: check warm boot RIDs

as you can see that it discovered 10 buffer profiles, this IS A BUG from broadcom, it happens because those buffer profiles are assigned on different places and in one place it returns one RID value on same buffer profiles and 2nd RID value on other places

you can see NEW (invalid RIDS) disovered on buffer profiles:

 133 Oct  8 09:04:36.161373 lnos-x1-a-fab01 NOTICE syncd#syncd: :- checkWarmBootDiscoveredRids: spotted new RID oid:0x1900000001 missing from current RID2VID (new VID oid:0x19000000000bd3) (SAI_OBJECT_TYPE_BUFFER_PROFILE) on WARM BOOT
 134 Oct  8 09:04:36.161966 lnos-x1-a-fab01 NOTICE syncd#syncd: :- checkWarmBootDiscoveredRids: spotted new RID oid:0x1900000002 missing from current RID2VID (new VID oid:0x19000000000bd4) (SAI_OBJECT_TYPE_BUFFER_PROFILE) on WARM BOOT
 135 Oct  8 09:04:36.162524 lnos-x1-a-fab01 NOTICE syncd#syncd: :- checkWarmBootDiscoveredRids: spotted new RID oid:0x1900000003 missing from current RID2VID (new VID oid:0x19000000000bd5) (SAI_OBJECT_TYPE_BUFFER_PROFILE) on WARM BOOT
 136 Oct  8 09:04:36.163090 lnos-x1-a-fab01 NOTICE syncd#syncd: :- checkWarmBootDiscoveredRids: spotted new RID oid:0x1900000004 missing from current RID2VID (new VID oid:0x19000000000bd6) (SAI_OBJECT_TYPE_BUFFER_PROFILE) on WARM BOOT
 137 Oct  8 09:04:36.163658 lnos-x1-a-fab01 NOTICE syncd#syncd: :- checkWarmBootDiscoveredRids: spotted new RID oid:0x1900000005 missing from current RID2VID (new VID oid:0x19000000000bd7) (SAI_OBJECT_TYPE_BUFFER_PROFILE) on WARM BOOT

and its creating dummy objects:

 256 Oct  8 09:04:55.076416 lnos-x1-a-fab01 NOTICE syncd#syncd: :- populateExistingObjects: creating temporary object for new discovered VID SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3
 257 Oct  8 09:04:55.076467 lnos-x1-a-fab01 NOTICE syncd#syncd: :- populateExistingObjects: creating temporary object for new discovered VID SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd4
 258 Oct  8 09:04:55.076518 lnos-x1-a-fab01 NOTICE syncd#syncd: :- populateExistingObjects: creating temporary object for new discovered VID SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd5
 259 Oct  8 09:04:55.076578 lnos-x1-a-fab01 NOTICE syncd#syncd: :- populateExistingObjects: creating temporary object for new discovered VID SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd6
 260 Oct  8 09:04:55.076628 lnos-x1-a-fab01 NOTICE syncd#syncd: :- populateExistingObjects: creating temporary object for new discovered VID SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd7

at this point there is inconsistent state in ASIC DB and syncd since 10 RIDs point to the same 5 objects, because of brcm bug

  1. then you do another warm boot 202012 to 202012

discovery on syncd still discovers 10 buffer profiles:

117 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: discover took 0.089495 sec
118 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: discovered objects count: 2718
119 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_PORT: 65
120 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_VIRTUAL_ROUTER: 1
121 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_STP: 1
122 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP: 1
123 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_POLICER: 1
124 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_QUEUE: 1328
125 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SCHEDULER: 3
126 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SCHEDULER_GROUP: 715
127 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_BUFFER_POOL: 5
128 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_BUFFER_PROFILE: 10
129 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP: 520
130 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SWITCH: 1
131 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_VLAN: 1
132 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_BRIDGE: 1
133 Oct  8 09:17:43.547277 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_PORT_SERDES: 65
134 Oct  8 09:17:43.551337 lnos-x1-a-fab01 NOTICE syncd#syncd: :- checkWarmBootDiscoveredRids: check warm boot RIDs

but since OA created only 5 of them now there is discrepency:

242 Oct  8 09:18:02.495061 lnos-x1-a-fab01 WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP on current view 5 is different than on temporary view: 1
243 Oct  8 09:18:02.495192 lnos-x1-a-fab01 WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_POLICER on current view 3 is different than on temporary view: 0
244 Oct  8 09:18:02.496285 lnos-x1-a-fab01 WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_BUFFER_POOL on current view 5 is different than on temporary view: 3
245 Oct  8 09:18:02.496411 lnos-x1-a-fab01 WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_BUFFER_PROFILE on current view 10 is different than on temporary view: 5
246 Oct  8 09:18:02.496553 lnos-x1-a-fab01 WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP on current view 10 is different than on temporary view: 1
247 Oct  8 09:18:02.496677 lnos-x1-a-fab01 WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_PORT_SERDES on current view 65 is different than on temporary view: 0
248 Oct  8 09:18:02.496759 lnos-x1-a-fab01 WARNING syncd#syncd: :- logViewObjectCount: object count is different on both view, there will be ASIC OPERATIONS!

you got 10 in DB but 5 created, since comparison logic correctly matched those 5 buffer profiles (non dummy ones) you can see that comparison logic scheduled remove for those 5 dummy ones:

201 Oct  8 09:18:02.707853 lnos-x1-a-fab01 NOTICE syncd#syncd: :- executeOperationsOnAsic: remove: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3
202 Oct  8 09:18:02.707914 lnos-x1-a-fab01 NOTICE syncd#syncd: :- executeOperationsOnAsic: remove: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd4
203 Oct  8 09:18:02.707965 lnos-x1-a-fab01 NOTICE syncd#syncd: :- executeOperationsOnAsic: remove: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd5
204 Oct  8 09:18:02.708015 lnos-x1-a-fab01 NOTICE syncd#syncd: :- executeOperationsOnAsic: remove: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd6
205 Oct  8 09:18:02.708065 lnos-x1-a-fab01 NOTICE syncd#syncd: :- executeOperationsOnAsic: remove: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd7

and that operation failed: (s3_syslog)

235 Oct  8 09:18:02.721171 lnos-x1-a-fab01 ERR syncd#syncd: [none] SAI_API_BUFFER:brcm_sai_remove_buffer_profile:2423 Getting buffer pool data failed with error -5.
236 Oct  8 09:18:02.721255 lnos-x1-a-fab01 ERR syncd#syncd: :- asic_handle_generic: remove SAI_OBJECT_TYPE_BUFFER_PROFILE RID: oid:0x1900000001 VID oid:0x19000000000bd3 failed: SAI_STATUS_INVALID_PARAMETER
237 Oct  8 09:18:02.721255 lnos-x1-a-fab01 ERR syncd#syncd: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3, status: SAI_STATUS_INVALID_PARAMETER
238 Oct  8 09:18:02.721419 lnos-x1-a-fab01 NOTICE syncd#syncd: :- executeOperationsOnAsic: asic apply took 0.014296 sec
239 Oct  8 09:18:02.721445 lnos-x1-a-fab01 ERR syncd#syncd: :- executeOperationsOnAsic: Error while executing asic operations, ASIC is in inconsistent state: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_BUFFER_P
240 Oct  8 09:18:02.765644 lnos-x1-a-fab01 NOTICE syncd#syncd: :- applyView: apply took 0.687701 sec
241 Oct  8 09:18:02.767236 lnos-x1-a-fab01 ERR syncd#syncd: :- run: Runtime error: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3, status: SAI_STATUS_INVALID_PARAMETER
242

causing OA to crash, BRCM is returning here SAI_STATUS_INVALID_PARAMETER, so it means its that NEW discovered RID in 1st warm boot that resulted in created dummy buffer profiles, is not VALID

this is broadcom BUG, it's caused via different RID values between 2 different version of their SDK, it's hard to tell whether they changed encoding on RID (vendors usually encode some special values inside RID for convenience) or this is a bug, and they forget to return correct value of RID in place where is returned the one for dummy object

what you can do is to ask BRCM whether NEW RID values ARE correct:

261 spotted new RID oid:0x1900000001
262 spotted new RID oid:0x1900000002
263 spotted new RID oid:0x1900000003
264 spotted new RID oid:0x1900000004
265 spotted new RID oid:0x1900000005

those are the one that are discovered and created a new dummy objects. if you do first cold boot, you can look into RIDTOVID table in redis ASIC_DB and check what are looking values for buffer_profiles after cold boot (they should contain 19 in hex value as the one above) but since those are not existing after warm boot, im guessing that after cold boot RID values will have extra bit set to 1 (or couple bits), or they could have continuous values

and not it's to broadcom to figure out, whether they forget to set to 1 extra bit or whether they wanted to remove those bits from RID encoding

Please take a look to actually any cold/warm boot db and do a dump from RIDTOVID table and post here RID VALUES

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 10, 2024

current logging in syncd does not show at which attribute each OID is discovered, but it could help to understand where the bug occurs on which attribute, based on SAI headers buffer_profile could be set here:

saibuffer.h-    SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE 
saiport.h-    SAI_PORT_ATTR_QOS_INGRESS_BUFFER_PROFILE_LIST,
saiport.h-    SAI_PORT_ATTR_QOS_EGRESS_BUFFER_PROFILE_LIST,
saiqueue.h-    SAI_QUEUE_ATTR_BUFFER_PROFILE_ID

but it;s not said that it's not returned on some other attribute as a bug

my guess would be on port_attr_qos

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 11, 2024

Appreciate your input! I've sent the sairedis.rec files. If it's something to do with Broadcom SAI, I wouldn't expect any change from them as the version is pretty old. In this situation, for the upgrade to proceed with warm-reboot, we would be looking for sort of a workaround. Is there any fix or workaround available as you know? Specifically:

  • (preferred) we can make changes in syncd and/or sairedis of our own codebase.
  • (doable) after the first warm boot (upgrade), we can trigger a process to clean up the problem state (e.g., update/delete some SAI objects) to facilitate next warm boots.

we probably can make workaround on our side, but it will be problematic, since it's not only that we keep those extra discovered objects in ASIC_DB, those are also kept in syncd memory, so most likely fix would need to be in the syncd code to make up for that, since those objects are discovered on 202012 warm boot branch

and to make that code change we would need to have those invalid RID values how they look like, and where they are discovered

we would need to have a log information added after that :
https://github.com/sonic-net/sonic-sairedis/blob/master/syncd/SaiDiscovery.cpp#L151 and that https://github.com/sonic-net/sonic-sairedis/blob/master/syncd/SaiDiscovery.cpp#L211

if succeeded, then we need to loggedd attribute ID (as string enum) and all RID values to make sure where those values are discovered

then we can prepare image with workaround for this specific invlaid RIDs for that specific branch

question is whether this is only for your specific case, or you have a bunch of devices in that state with different configurations

first we would need to add logging, do a warm boot, see where those rids are, and then add workaround based on received data

of course we can also prepare generic woarkaround that will ignore ALL dummy discovered BUFFER_PROFILES, maybe that's better ?? but probably it would be more strict to just restrict those also to specific attribute

@zhenggen-xu
Copy link

We will debug further to understand the RIDs changes at different stage.

I also agree ignoring the dummy discovered BUFFER_PROFILES would be a simpler solution, we can control that during the first warm-reboot , this will make the system clean for future upgrades and warm-reboots. Do you have any pointers which part of the code should be changed for this approach?

@Stephenxf
Copy link
Author

@kcudnik Thanks a lot for your analysis and suggestions! A few things I've tried so far:

  1. In SaiSwitch::checkWarmBootDiscoveredRids(), skipped the new RIDs that are not in RID2VID, i.e., not calling the following 3 functions:
    • m_warmBootDiscoveredVids.insert(vid)
    • m_warmBootNewDiscoveredVids.insert(vid)
    • redisSetDummyAsicStateForRealObjectId(rid)

As a result, after warm-reboot from 201811 to 202012, the ASIC_DB didn't have the dummy objects, but the next warm-reboot still failed most likely due to the discovered objects still in syncd memory.

  1. After warm-reboot from 201811 to 202012, manually removed the dummy BUFFER_PROFILE objects as well as the corresponding mappings from RIDTOVID and VIDTORID tables from ASIC_DB, but next warm-reboot still failed (more dummy objects got created) most likely due to the discovered objects still in memory and syncd in problematic state.

  2. Added new logs to SaiDiscovery::discover() as suggested. The 5 new RIDs that were discovered (i.e., oid:0x1900000001 to oid:0x1900000005) have one of the two attributes:

    • SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE
    • SAI_QUEUE_ATTR_BUFFER_PROFILE_ID

The existing RIDs that came from before warm-reboot (i.e., oid:0x1900000006 to oid:0x190000000a) also have one of these two attributes.

Another observation is that, with code boot followed by warm-reboot on 202012 only, the RIDs oid:0x1900000001 to oid:0x1900000005 never existed (only oid:0x1900000006 to oid:0x190000000a before and after warm-reboot).

I'll get you ASIC_DB dumps and syslog with the added logs shortly.

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 15, 2024

We will debug further to understand the RIDs changes at different stage.

I also agree ignoring the dummy discovered BUFFER_PROFILES would be a simpler solution, we can control that during the first warm-reboot , this will make the system clean for future upgrades and warm-reboots. Do you have any pointers which part of the code should be changed for this approach?

yes, i already know what needs to be changed, but it's not that straight forward, that this change will fix this, since if you first warmboot and ignore dummy RIDs during discovery process, it may still happen that when OA will do GET operation on those you still get an invalid RID since it's still in vendor SDK memory so better approach would be to ignore them during discovery and do 2nd warm boot, since it seems like invalid RIDs are then gone after that, but this does not guarantee that system will be stable, since if those invalid values are discovered, then there is a bug in vendor code, and most likely in long system uptime will lead inconsistency

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 15, 2024

@Stephenxf manual remove from ASIC_DB will not help, since entire db also sits in syncd memory, so when you will do warmboot agian you will still have invalid rids in asic db again putted by syncd.

skipping those methods you pointed its not enoug, they should noe be touched, invalid rid's should be ignored in discovery process, but this may be not enough, probably those RID's should be translated, since previous values are in syncd memory and asic db, then values restored from discovery on correct attributres:

SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE
SAI_QUEUE_ATTR_BUFFER_PROFILE_ID

are those were then reassigned with new (dummy) ones because of vendor bug, it seems like vendor misscalculate those, so probably if they were from 0x6 to 0xa first then as a bug they are combined from 0x1 to 0x5, but they should stay at the same number, for this particular system we can just add 0x5 value to each to them and that should be enough, but in production that will be hard to estimate what happens actually and how many buffer profiles are there

after deeper thought, just ignoring new dummy values in disocvery process ,maybe not enough, but we could try that if it will work, problem will be if we would need to modify any existing buffer profile after 1st warm boot, since previous RID seems to be invalid

if you have easy way to modify discovery code, just skip adding discovered RID when the attribute is one of those 2 above to not discover old/new buffer profiles on those attributes and let see if this workaround will solve the problem but log those values with log level WARN to see if they will back to correct ones after 2nd warm boot

in previous logs i saw that there was SET request on one of the buffer profiles in syslog, so this operation can fail since RID will not be valid

let me know if you will be able to ignore those 2 attributes during discovery

@Stephenxf
Copy link
Author

Sure, will try the change of skipping the discovered RIDs for attributes SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE and SAI_QUEUE_ATTR_BUFFER_PROFILE_ID. Will keep you posted with the results a bit later.

BTW, these 5 RIDs/attributes are supposed to be the ones created based on the RIDs prior to warm-reboot:

oid:0x1900000006	SAI_QUEUE_ATTR_BUFFER_PROFILE_ID
oid:0x1900000007	SAI_QUEUE_ATTR_BUFFER_PROFILE_ID
oid:0x1900000008	SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE
oid:0x1900000009	SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE
oid:0x190000000a	SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE

These are the 5 extra ones most likely introduced by bcmsai during warm-reboot to 202012:

oid:0x1900000001	SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE
oid:0x1900000002	SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE
oid:0x1900000003	SAI_QUEUE_ATTR_BUFFER_PROFILE_ID
oid:0x1900000004	SAI_QUEUE_ATTR_BUFFER_PROFILE_ID
oid:0x1900000005	SAI_QUEUE_ATTR_BUFFER_PROFILE_ID

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 15, 2024

this is interesting that in 2 cases first we have 3 ingress buffer profiles, and then 2
and queue buffers also 2 fist, next 3 of them, are you sure its correct ?

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 15, 2024

I analysed logs you provided and besides dummy buffer profiles, there is an extra operation:

Oct 15 01:33:29.066469 lnos-x1-a-fab01 ERR syncd#syncd: :- processQuadEvent: VID: oid:0x19000000000c54 RID: oid:0x1900000006
Oct 15 01:33:29.066507 lnos-x1-a-fab01 ERR syncd#syncd: :- processQuadEvent: attr: SAI_BUFFER_PROFILE_ATTR_SHARED_STATIC_TH: 15982720

but it's outside apply view, so we should be good here, i dont see any errors on that in syslog, so assummed OA changed this value, and RID value matches, so if you ignore those extra during discovery process, we should be good

and from logs i found this:

SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE oid:0x1900000001 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE oid:0x1900000002 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE oid:0x1900000008 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE oid:0x1900000009 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE oid:0x190000000a (SAI_OBJECT_TYPE_BUFFER_PROFILE)
SAI_QUEUE_ATTR_BUFFER_PROFILE_ID oid:0x1900000003 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
SAI_QUEUE_ATTR_BUFFER_PROFILE_ID oid:0x1900000004 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
SAI_QUEUE_ATTR_BUFFER_PROFILE_ID oid:0x1900000005 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
SAI_QUEUE_ATTR_BUFFER_PROFILE_ID oid:0x1900000006 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
SAI_QUEUE_ATTR_BUFFER_PROFILE_ID oid:0x1900000007 (SAI_OBJECT_TYPE_BUFFER_PROFILE)

it's weird that those values are in different places, but hey, its a bug :P

@Stephenxf
Copy link
Author

@kcudnik it's looking promising now! I made the change to skip those two objects. It took me a few iterations to get the IDs and TYPEs correct. I guess I added too many logs so syslog started rate-limiting :) I just cleaned up the code a bit. A new image is being built. I'll get the syslog later.

So basically this is the change:

@@ -185,6 +185,16 @@ void SaiDiscovery::discover(
                 }
             }

+            if ((md->objecttype == SAI_OBJECT_TYPE_QUEUE && md->attrid == SAI_QUEUE_ATTR_BUFFER_PROFILE_ID) ||
+                (md->objecttype == SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP && md->attrid == SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE)) {
+                SWSS_LOG_WARN("OBJECT_ID: RID %s (%s), SKIP attr %s %s (%s)",
+                        sai_serialize_object_id(rid).c_str(),
+                        sai_serialize_object_type(mk.objecttype).c_str(),
+                        md->attridname,
+                        sai_serialize_object_id(attr.value.oid).c_str(),
+                        sai_serialize_object_type(m_sai->objectTypeQuery(attr.value.oid)).c_str());
+                continue;
+            }
+
             discover(attr.value.oid, discovered); // recursion
         }
         else if (md->attrvaluetype == SAI_ATTR_VALUE_TYPE_OBJECT_LIST)

Not sure why the attribute SAI_QUEUE_ATTR_BUFFER_PROFILE_ID came with type SAI_OBJECT_TYPE_QUEUE instead of SAI_OBJECT_TYPE_BUFFER_PROFILE, but anyway with this change, I assume we skip all SAI_OBJECT_TYPE_BUFFER_PROFILE objects (both real and dummy ones) in the "discover" process; then the real objects will be newly created (rather than "discovered"). Is my understanding correct?

Following the first warm-reboot (upgrade from 201811 to 202012), I did warm-reboot a couple of more times, no issue identified, the number of SAI_OBJECT_TYPE_BUFFER_PROFILE objects was always 5 (i.e., oid:0x1900000006 to oid:0x190000000a).

Appreciate your great and consistent help! Let me get some logs and output for further analysis and confirmation.

@Stephenxf
Copy link
Author

@kcudnik sent the logs and dumps separately. The workaround looks good so far with basic testing. Will verify further in the next few days. Now the discovered objects no longer have SAI_OBJECT_TYPE_BUFFER_PROFILE.

After warm-reboot from 201811 to 202012:

Oct 15 23:35:11.277206 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: discover took 0.288548 sec
Oct 15 23:35:11.277206 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: discovered objects count: 2703
Oct 15 23:35:11.277238 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_PORT: 65
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_VIRTUAL_ROUTER: 1
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_STP: 1
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP: 1
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_POLICER: 1
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_QUEUE: 1328
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SCHEDULER: 3
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SCHEDULER_GROUP: 715
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP: 520
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SWITCH: 1
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_VLAN: 1
Oct 15 23:35:11.277489 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_BRIDGE: 1
Oct 15 23:35:11.277527 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_PORT_SERDES: 65

After one more warm-reboot from 202012 to 202012:

Oct 15 23:50:58.765380 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: discover took 0.460236 sec
Oct 15 23:50:58.765404 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: discovered objects count: 2703
Oct 15 23:50:58.765581 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_PORT: 65
Oct 15 23:50:58.765581 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_VIRTUAL_ROUTER: 1
Oct 15 23:50:58.765581 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_STP: 1
Oct 15 23:50:58.765581 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP: 1
Oct 15 23:50:58.765581 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_POLICER: 1
Oct 15 23:50:58.765581 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_QUEUE: 1328
Oct 15 23:50:58.765581 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SCHEDULER: 3
Oct 15 23:50:58.765613 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SCHEDULER_GROUP: 715
Oct 15 23:50:58.765613 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP: 520
Oct 15 23:50:58.765632 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_SWITCH: 1
Oct 15 23:50:58.765632 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_VLAN: 1
Oct 15 23:50:58.765651 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_BRIDGE: 1
Oct 15 23:50:58.765651 lnos-x1-a-fab01 NOTICE syncd#syncd: :- discover: SAI_OBJECT_TYPE_PORT_SERDES: 65

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 16, 2024

Not sure why the attribute SAI_QUEUE_ATTR_BUFFER_PROFILE_ID came with type SAI_OBJECT_TYPE_QUEUE instead of SAI_OBJECT_TYPE_BUFFER_PROFILE, but anyway with this change, I assume we skip all SAI_OBJECT_TYPE_BUFFER_PROFILE objects (both real and dummy ones) in the "discover" process; then the real objects will be newly created (rather than "discovered"). Is my understanding correct?

hey, not sure how you found this, because from all syslog analysis you provided all discover methods, return object type BUFFER_PROFILE on attribute SAI_QUEUE_ATTR_BUFFER_PROFILE_ID on object QUEUE, in my understanding you found a QUEUE, which on attribute BUFFER_PROFILE_ID returns object of type QUEUE instead of BUFFER_PROFILE ?

from your syslog this is alll i found:

 (SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP), attr SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE oid:0x1900000001 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
 (SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP), attr SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE oid:0x1900000002 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
 (SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP), attr SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE oid:0x1900000008 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
 (SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP), attr SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE oid:0x1900000009 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
 (SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP), attr SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE oid:0x190000000a (SAI_OBJECT_TYPE_BUFFER_PROFILE)
 (SAI_OBJECT_TYPE_QUEUE), attr SAI_QUEUE_ATTR_BUFFER_PROFILE_ID oid:0x1900000003 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
 (SAI_OBJECT_TYPE_QUEUE), attr SAI_QUEUE_ATTR_BUFFER_PROFILE_ID oid:0x1900000004 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
 (SAI_OBJECT_TYPE_QUEUE), attr SAI_QUEUE_ATTR_BUFFER_PROFILE_ID oid:0x1900000005 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
 (SAI_OBJECT_TYPE_QUEUE), attr SAI_QUEUE_ATTR_BUFFER_PROFILE_ID oid:0x1900000006 (SAI_OBJECT_TYPE_BUFFER_PROFILE)
 (SAI_OBJECT_TYPE_QUEUE), attr SAI_QUEUE_ATTR_BUFFER_PROFILE_ID oid:0x1900000007 (SAI_OBJECT_TYPE_BUFFER_PROFILE)

so at least object type matches on attribute, what don't match is the number of buffer profiles which is 10

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 16, 2024

@kcudnik sent the logs and dumps separately. The workaround looks good so far with basic testing. Will verify further in the next few days. Now the discovered objects no longer have SAI_OBJECT_TYPE_BUFFER_PROFILE.

After warm-reboot from 201811 to 202012:

ok, that's good, but that's not all, so because we ignored those buffer profiles, and ASIC_DB contains 5 buffer profiles as expected, the problem may still exists on those objects queues/ipg

so what we want to achieve is this:

  • warm boot 1st time to 202012
  • warm boot 2nd time to 202012

and those should be successful , but also if for example we remove that discovery workaorund, at 3rd warm boot to 202012, discovery will discover only 5 buffer profiles and those with RIDs that are in ASIC DB, but i actually doubt about that, i think the problem is still there

and there is easy check for that actually, there is command that saiplayer can issue - it's "inspect asic" this will compare ASIC db with everything that is on the device and will report any errors in logs, you can probably run this on running system, by running "./saiplayer -i test.rec" command in syncd docker, test.rec can be empty file, but if everything is fine you should not see any error messages in syncd syslog, if there will be any missmatches between ASIC and ASIC_DB then you will see error messages in syslog giving you attr name and values with REDIS and actual SAI

i highly recommend you to make this test after 2nd or 3rd warm boot, i probably expect that there will be errors on those buffer profiles not matching RID values since im some how not convinced this error would fix itself on vendor side

please let me know how this test will go

also test.rec can look like this:

2019-06-06.22:51:50.087674|a|SYNCD_INSPECT_ASIC
2019-06-06.22:51:52.783261|A|SAI_STATUS_SUCCESS

and command can be like : ./saiplayer test.rec

although i dont remember if 202012 branch contains inspect asic feature

@Stephenxf
Copy link
Author

yeah, saiplayer is there in 202012. I ran /usr/bin/saiplayer -i test.rec on the warm-rebooted device and saw a bunch of "inspectAsic: failed to match ..." errors in syncd. However, the mismatches are not limited to BUFFER_PROFILE but all different other objects, e.g.,

Oct 16 09:10:31.509159 lnos-x1-a-fab01 ERR syncd#syncd: :- inspectAsic: failed to match SAI_SCHEDULER_GROUP_ATTR_SCHEDULER_PROFILE_ID REDIS attr 'oid:0x16000000000fde' with ASIC attr 'oid:0x1600000001' for SAI_OBJECT_TYPE_SCHEDULER_GROUP:oid:0x1d11173782001d
Oct 16 09:10:31.509343 lnos-x1-a-fab01 ERR syncd#syncd: :- inspectAsic: failed to match SAI_QUEUE_ATTR_WRED_PROFILE_ID REDIS attr 'oid:0x13000000000fe2' with ASIC attr 'oid:0x0' for SAI_OBJECT_TYPE_QUEUE:oid:0x5d011500000005
Oct 16 09:10:31.509343 lnos-x1-a-fab01 ERR syncd#syncd: :- inspectAsic: failed to match SAI_QUEUE_ATTR_BUFFER_PROFILE_ID REDIS attr 'oid:0x19000000000fcb' with ASIC attr 'oid:0x1900000006' for SAI_OBJECT_TYPE_QUEUE:oid:0x5d011500000005
Oct 16 09:10:31.510288 lnos-x1-a-fab01 ERR syncd#syncd: :- inspectAsic: failed to match SAI_ACL_COUNTER_ATTR_TABLE_ID REDIS attr 'oid:0x7000000000fea' with ASIC attr 'oid:0x700000004' for SAI_OBJECT_TYPE_ACL_COUNTER:oid:0x40900000040
Oct 16 09:10:31.511078 lnos-x1-a-fab01 ERR syncd#syncd: :- inspectAsic: failed to match SAI_ACL_COUNTER_ATTR_TABLE_ID REDIS attr 'oid:0x7000000000fea' with ASIC attr 'oid:0x700000004' for SAI_OBJECT_TYPE_ACL_COUNTER:oid:0x4090000004a

I also ran saiplayer on another device that is up from cold boot only. It showed similar errors with all objects, so I'm not sure if it can be used as an indicator if anything has gone wrong.

The other thing is, if the workaround works for warm-reboot upgrade from 201811 to 202012, we may keep it in 202012 so additional warm-reboots will go through the same process. We also don't have plan to upgrade further beyond 202012 for these switches. If this is the case, do you still have the concern about the discovery?

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 17, 2024

hmm, this code is not actually mine, but seems like there is some issue with translation, those values usually should match, i actually tested this on master on virtual switch and i don't get any failed matches, i checked 202012 version of inspect asic and it seems to be the same, but i think it's missing some translation, for example:

'oid:0x7000000000fea' with ASIC attr 'oid:0x700000004' 

those 2 should be match but it's clearly that oid from REDIS is VID and from asic is RID, so there is translation step missing there, i will double check that once more and will prodice PR fixing that if necessary, not sure if should i make 202012 version too

if we are not planning to move from 202012 to higher version then i guess we can keep workaround in place, but is there any official info on not planning to update 202012? since i work in this project too long to know that this can change any time XD

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 17, 2024

Hey, there is issue with inspect assic RIDs are compared to VIDs, here is fix:
#1434

you can try apply it and check again, you can still get some errors for pointers, since pointers will not match after warm boot, or at some point lists may not be equal, since some lists are allowed to be not in order, for example QOS information

but this fix should be more than enough to find which attribures and objects ids report invalid OIDs for buffer profile

@Stephenxf
Copy link
Author

@kcudnik Thanks for the patch for saiplayer! I applied it and did see some mismatches with warm-reboot upgrade 201811->202012, which were not seen with warm-reboot 202012->202012 after cold reboot.

The mismatches are mainly on 3 objects on my setup:

  • SAI_QUEUE_ATTR_WRED_PROFILE_ID
  • SAI_HOSTIF_ATTR_NAME
  • SAI_QOS_MAP_ATTR_MAP_TO_VALUE_LIST

I'll share more details and the output with you shortly.

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 21, 2024

so i analyzed provided logs, and where we get and return not_implemented_1 it means that attribute at index 1 is not supported GET command, and in this case actually checking attributes is ignored, since code is doing GET for all attributes at once, so in no success operation, none of attributes are checked, this code should be updated to remove not supported attributes 1 by by 1 and check those where GET succeeded, and actually not implemented should be reported to brcm to fix this, since when we are doing GET on attribute X, that means this X attribute is in ASIC DB, and it was create or set by OA operations, so GET also should be supported on that operation. current logging is not enough so we don't even know which attribute is this. note here that getting attributes from ASIC_DB is in random order since this is a hash.

from logs i see that not implemented attributes are on:

SAI_OBJECT_TYPE_BUFFER_POOL:oid:0x1800000001: SAI_STATUS_ATTR_NOT_IMPLEMENTED_0
SAI_OBJECT_TYPE_HOSTIF_TABLE_ENTRY:oid:0x2300000001: SAI_STATUS_NOT_IMPLEMENTED
SAI_OBJECT_TYPE_PORT:oid:0x100000001: SAI_STATUS_ATTR_NOT_IMPLEMENTED_1
SAI_OBJECT_TYPE_SWITCH:oid:0xb960122100000000: SAI_STATUS_ATTR_NOT_IMPLEMENTED_4

for those:

failed to match SAI_ACL_COUNTER_ATTR_BYTES REDIS attr '46290' with ASIC attr '46769' for SAI_OBJECT_TYPE_ACL_COUNTER:oid:0x40900000033
failed to match SAI_ACL_COUNTER_ATTR_BYTES REDIS attr '71052' with ASIC attr '71531' for SAI_OBJECT_TYPE_ACL_COUNTER:oid:0x50900000048
failed to match SAI_ACL_COUNTER_ATTR_PACKETS REDIS attr '436' with ASIC attr '440' for SAI_OBJECT_TYPE_ACL_COUNTER:oid:0x40900000033
failed to match SAI_ACL_COUNTER_ATTR_PACKETS REDIS attr '550' with ASIC attr '554' for SAI_OBJECT_TYPE_ACL_COUNTER:oid:0x50900000048

i;m guessing those values are changed because of counter counting those, but normally attributes create/set should not be changed only read_only attributes should be allowed to change

this:

failed to match SAI_ACL_TABLE_ATTR_FIELD_ACL_RANGE_TYPE REDIS attr '2:SAI_ACL_RANGE_TYPE_L4_DST_PORT_RANGE,SAI_ACL_RANGE_TYPE_L4_SRC_PORT_RANGE' with ASIC attr '1:SAI_ACL_RANGE_TYPE_L4_DST_PORT_RANGE' for SAI_OBJECT_TYPE_ACL_TABLE:oid:0x700000004
failed to match SAI_ACL_TABLE_ATTR_FIELD_ACL_RANGE_TYPE REDIS attr '2:SAI_ACL_RANGE_TYPE_L4_DST_PORT_RANGE,SAI_ACL_RANGE_TYPE_L4_SRC_PORT_RANGE' with ASIC attr '1:SAI_ACL_RANGE_TYPE_L4_DST_PORT_RANGE' for SAI_OBJECT_TYPE_ACL_TABLE:oid:0x700000005

this is a bug definitely, there were 2 items on the list, there is 1 after

failed to execute get api on SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x20600000001: SAI_STATUS_FAILURE
failed to execute get api on SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x22000600000003: SAI_STATUS_INVALID_ATTR_VALUE_MAX
failed to execute get api on SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x23000600000004: SAI_STATUS_INVALID_ATTR_VALUE_MAX
failed to execute get api on SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x44000600000002: SAI_STATUS_INVALID_ATTR_VALUE_MAX
failed to execute get api on SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x45000600000005: SAI_STATUS_INVALID_ATTR_VALUE_MAX

those i have no idea :(

on warm boot:

failed to match SAI_HOSTIF_ATTR_NAME REDIS attr 'Ethernet0' with ASIC attr '' for SAI_OBJECT_TYPE_HOSTIF:oid:0x44000d00000021
failed to match SAI_QUEUE_ATTR_WRED_PROFILE_ID REDIS attr 'oid:0x71300000001' with ASIC attr 'oid:0x0' for SAI_OBJECT_TYPE_QUEUE:oid:0x1011500000004

those seems a bug also, returning not valid ethernet port name as empty string, and returning oid 0 on wred profile where it was set

what is interesting, that none of those errors are actually reporting BUFFER_PROFILE OID issue, where i would expect that there will be 5 objects with different OID values, this is really interesting XD

maybe they are not reported because you added code to ignore them on discovery ? but any way, even if you ignored them, then inspect ASIC should do a GET operation on all objects in ASCI_DB and they will return some of those invalid RIDs that's really interesting why we don't see it here hmmm

i would suspect that those "dummy" ones are for example like on QUEUE and INGRESS_PRIORIRTY group which we see here:#1429 (comment) but those QUEUES and IPGS may be not actually configured by OA, i;m not sure if OA is configuring ALL visible QUEUES and IPGs on the device, in contrast discovery method id discovering all existing objects, so it could happen that let sya switch has 100 queues and OA configures only 80 of them, and then because of a bug 20 of not configured ones contains some bug and returns dome invalie buffer_profile ID, that would be mu suspicion

@kcudnik
Copy link
Collaborator

kcudnik commented Oct 21, 2024

SAI_STATUS_INVALID_ATTR_VALUE_MAX in terms of SAI codes is actually SAI_STATUS_ATTR_NOT_IMPLEMENTED_1 (because status codes are negative)

@kcudnik
Copy link
Collaborator

kcudnik commented Nov 22, 2024

hey, did brcm get back on this ?

@Stephenxf
Copy link
Author

Hi @kcudnik sorry for the delayed response. I don't expect Broadcom to respond on this as the version in question is pretty dated. During the last few weeks we built a setup with traffic enabled to verify warm-reboot upgrade (201811->202012) with the workaround you helped us come up with. Some extensive tests were done and the results looked good. Last week we went ahead and did the same on multiple switches in production, and no issues were observed either. As we discussed earlier, shall I post the PR with this workaround? This is the diff (only applied to 202012):

diff --git a/syncd/SaiDiscovery.cpp b/syncd/SaiDiscovery.cpp
index a241fab..50c487d 100644
--- a/syncd/SaiDiscovery.cpp
+++ b/syncd/SaiDiscovery.cpp
@@ -185,6 +185,37 @@ void SaiDiscovery::discover(
                 }
             }
 
+            /*
+             * WORKAROUND: after warm-reboot upgrade to 202012/rel_3.x, the new BRCM SAI will return additional BUFFER_PROFILE
+             * objects with OIDs that are different from the ones in orchagent, most likely due to a bug in BRCM SAI upgrade.
+             * As a result, syncd will create dummy objects for the extra BUFFER_PROFILE attributes during discovery process.
+             * However, these dummy objects are not in ASIC, hence a subsequent warm-reboot that tries to remove these objects
+             * will fail and lead to orchagent crash.
+             *
+             * The workaround here is to ignore all discovered BUFFER_PROFILE objects (both real ones and extra ones). The real
+             * objects will be created again from configs. The extra ones will not be in ASIC_DB or ASIC but in BRCM SDK, which
+             * should be fine, especially no further upgrade of BRCM SAI will occur in our case.
+             *
+             * More details can be found from https://github.com/sonic-net/sonic-sairedis/issues/1429
+             */
+            if ((md->objecttype == SAI_OBJECT_TYPE_QUEUE && md->attrid == SAI_QUEUE_ATTR_BUFFER_PROFILE_ID) ||
+                (md->objecttype == SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP && md->attrid == SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE)) {
+                SWSS_LOG_WARN("RID %s (%s), skipping attr %s %s (%s)",
+                        sai_serialize_object_id(rid).c_str(),
+                        sai_serialize_object_type(mk.objecttype).c_str(),
+                        md->attridname,
+                        sai_serialize_object_id(attr.value.oid).c_str(),
+                        sai_serialize_object_type(m_sai->objectTypeQuery(attr.value.oid)).c_str());
+                continue;
+            }
+
+            SWSS_LOG_DEBUG("RID %s (%s), getting attr %s %s (%s)",
+                    sai_serialize_object_id(rid).c_str(),
+                    sai_serialize_object_type(mk.objecttype).c_str(),
+                    md->attridname,
+                    sai_serialize_object_id(attr.value.oid).c_str(),
+                    sai_serialize_object_type(m_sai->objectTypeQuery(attr.value.oid)).c_str());
+
             discover(attr.value.oid, discovered); // recursion
         }
         else if (md->attrvaluetype == SAI_ATTR_VALUE_TYPE_OBJECT_LIST)

@kcudnik
Copy link
Collaborator

kcudnik commented Nov 27, 2024

Please move this work sound to workaround.cpp and use it from there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants