Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect when the firmware responds with the wrong response #241

Merged
merged 4 commits into from
Dec 18, 2023

Conversation

puddly
Copy link
Contributor

@puddly puddly commented Dec 16, 2023

Detected with a Raspbee II (0x26780700) (thanks @Citizen-2CB8A24A):

# Watchdog TTL write (seq=101)
2023-12-16 00:31:24.247 DEBUG (MainThread) [zigpy_deconz.api] Sending CommandId.write_parameter{'parameter_id': <NetworkParameter.watchdog_ttl: 38>, 'parameter': b'<\x00\x00\x00'} (seq=101)
2023-12-16 00:31:24.248 DEBUG (MainThread) [zigpy_deconz.uart] Send: 0b65000c000500263c000000

# Response received, but the wrong one (device_state_changed??) (seq=101)
2023-12-16 00:31:25.470 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x0e65000700aa00
2023-12-16 00:31:25.471 DEBUG (MainThread) [zigpy_deconz.api] Received command CommandId.device_state_changed{'status': <Status.SUCCESS: 0>, 'frame_length': 7, 'device_state': DeviceState(network_state=<NetworkState2.CONNECTED: 2>, device_state=<DeviceStateFlags.APSDE_DATA_INDICATION|APSDE_DATA_REQUEST_FREE_SLOTS_AVAILABLE|32: 42>), 'reserved': 0} (seq 101)

# Another device_state_changed received (seq=102) right after
2023-12-16 00:31:25.648 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x0e66000700aa00
2023-12-16 00:31:25.650 DEBUG (MainThread) [zigpy_deconz.api] Received command CommandId.device_state_changed{'status': <Status.SUCCESS: 0>, 'frame_length': 7, 'device_state': DeviceState(network_state=<NetworkState2.CONNECTED: 2>, device_state=<DeviceStateFlags.APSDE_DATA_INDICATION|APSDE_DATA_REQUEST_FREE_SLOTS_AVAILABLE|32: 42>), 'reserved': 0} (seq 102)

From the log, you can see that the write_parameter request doesn't receive the correct response. It's almost as if the enqueued device_state_changed notification overwrites the expected write_parameter response.

A separate exception type is used to detect this rare problem and the watchdog TTL will be re-written until it succeeds.

Copy link

codecov bot commented Dec 16, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (cc1f018) 98.52% compared to head (23d5670) 98.68%.

Files Patch % Lines
zigpy_deconz/exception.py 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #241      +/-   ##
==========================================
+ Coverage   98.52%   98.68%   +0.15%     
==========================================
  Files           7        7              
  Lines         884      910      +26     
==========================================
+ Hits          871      898      +27     
+ Misses         13       12       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@puddly
Copy link
Contributor Author

puddly commented Dec 17, 2023

It looks like this problem isn't limited to just device_state_changed. Here is another log, this time for a completely different command:

2023-12-17 22:04:06.794 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_confirm (0,)
2023-12-17 22:04:06.795 DEBUG (MainThread) [zigpy_deconz.uart] Send: 0x04c80007000000
2023-12-17 22:04:06.799 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x04c8050800010022
2023-12-17 22:04:06.801 DEBUG (MainThread) [zigpy_deconz.api] Command Command.aps_data_indication (1, <DataIndicationFlags.Always_Use_NWK_Source_Addr: 1>)

A seq of 0xC8 is used for the aps_data_confirm request and the firmware erroneously responds with aps_data_indication.

To fix this, I'm globally enabling "invalid response" retries for all commands.

@puddly puddly merged commit bc7e960 into zigpy:dev Dec 18, 2023
13 of 14 checks passed
@manup
Copy link

manup commented Dec 18, 2023

Hi @puddly some background information, this is due the nature of a few async commands. While most commands act in a sync way there is the exception of ZGP and device state "notifications" which can be send by the firmware at any time. This is done so that the host application is notified immediately that new APS indications, APS confirm or other changes are available to poll.

In the deCONZ implementation all commands which are send to the firmware are handled in a queue which tracks the sequence number + command id. All incoming commands are checked against this queue. If above mentioned notifications like device state arrive, new respective commands to query APS indications/confirm are scheduled, to speed up everything compared to plain polling. This part is not open source yet but I can provide you the sources if that is of any help.

@puddly
Copy link
Contributor Author

puddly commented Dec 18, 2023

Thank you! I think I have this implemented in zigpy-deconz by triggering an event to instantly poll every time device_state is sent back either within any command response or by these notifications.

From the logs above, it seems like the sequence number desynchronizes, as the response sequence number matches the request sequence number from an unrelated command sent nearly 200ms in the past. Currently, I just endlessly retry if this is the case, as the chances of this occurring are very low.

In the future, it would be nice to reserve a sequence number for unsolicited notifications (e.g. 0x00 or 0xFF) and avoid using it during normal communications. Or maybe pick a random sequence number during a notification so that it does not collide with a pending one?

But it is good to know that aps_data_indication, device_state_changed, and zigbee_green_power are the only frames that can do this. I think this can help simplify the logic in a followup PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants