-
Couldn't load subscription status.
- Fork 173
[websocket]: Handling timeouts/errors in from tcp_transport #882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[websocket]: Handling timeouts/errors in from tcp_transport #882
Conversation
|
Hello,
After some more testing it did not solve our issue. There is still a bug in the TLS error handling while sending data through the socket. I will create a new issue and document. |
Thanks @MaxVandenbussche for your feedback and testing this changeset. Edit: Have reconsidered and updated to fix only the abort-in-abort issue, please check if it solves the issue |
eb79b1c to
3ee37e2
Compare
|
Thanks, I will test this later today. |
|
@MaxVandenbussche did it solve the tls double free error? I have created the issue on striking always this error, as this is a complex mutual exchange data over tls/websocket, the logic should be understood correctly before we try to suggest any change. So, if the both patch @david-cermak has point solves the problem we need to go from that. I will push this to the testing devices tomorrow. |
We set 40 devices in the same network among 3 deco m5 router with this patch for testing, in the 1.5.0 at least once a day one board throw exception in the error handling as tls;doublefree;doubleclose, so, now we can verify exactly what will be the patch result and can use gdb with backtrace with the error, this shall accelerate the David/gmle side to implement and fixes |
|
Hello, I'd like to come back to this issue. This is still blocking at our side.
|
|
it does solved the TLS double free error but create another monster inside, a massive memory leak upon websocket fall/reconnection, wifi failure and so on, so, all the patches applied didnt solve the websocket usage. We are testing a new version from @surengab (https://github.com/gabsuren/esp-protocols-1/tree/fix/ws_race_on_abort) that has been tracked, it has a minor memory leak after the connection set and fall for the first time, but later there is no more memory leak. (need to fix the error on mismatch types in log), as we are running in github here #898 (comment) The first storm/stom test passed well at WiFi drop, block and reconnection issue with a very large network and crowded environment, so, no leaks or failure in the websocket. TESTS DONE The websocket testing uses a storm/stom large set of chunks of data, datasets, from 1k to 50k, and simulates all drops and wifi connections. after the basic level, we do the simultaneous send and receive storm/stom form 1 to 50 while send and receive we conduct a third teste that execute the wifi level, so this way we can grab different scenarios and concurrencies, and provide a stable result. this logs the memory range from everything and heap trace, so, after all the check the results from the trace. testing ongoing now! |
WebSocket Client Timeout Handling Fix
Overview
This document describes a critical fix for the ESP WebSocket client that resolves timeout handling issues that could cause the client to enter a corrupted state and emit fragmented/random data.
Issue Description
Problem Summary
The WebSocket client was not properly handling transport timeout errors, causing it to:
Original Issue Report
Symptoms
Root Cause Analysis
The Core Problem
The WebSocket client's timeout detection logic was fundamentally flawed:
Why This Failed
Ambiguous Return Value:
esp_transport_read()returns0for both:network_timeout_ms)Insufficient Detection: The condition
rlen == 0 && client->last_opcode == WS_TRANSPORT_OPCODES_NONEwas unreliable because:State Corruption: When timeout was misidentified as valid data:
WEBSOCKET_EVENT_DATAwithlen=0and potentially invalid opcodeTransport Layer Analysis
TCP Transport Component Architecture
The ESP-IDF transport layer provides sophisticated error reporting through multiple layers:
Transport Layer Error Codes
The transport layer distinguishes between different types of "zero" returns:
ESP_ERR_ESP_TLS_CONNECTION_TIMEOUTERR_TCP_TRANSPORT_CONNECTION_TIMEOUTESP_ERR_ESP_TLS_TCP_CLOSED_FINERR_TCP_TRANSPORT_CONNECTION_CLOSED_BY_FIN00Transport Layer Implementation
The transport layer (
transport_ssl.c) already handles these cases properly:The Fix
Solution Approach
Instead of trying to guess timeout conditions from WebSocket frame state, the fix leverages the transport layer's existing error reporting infrastructure.
Key Changes
1. Enhanced Timeout Detection
2. Improved Error Handling for Negative Returns
3. Invalid Opcode Detection
Benefits of the Fix
1. Leverages Existing Infrastructure
2. Comprehensive Error Handling
3. Prevents State Corruption
4. Better Debugging
Testing the Fix
Test Scenarios
Timeout Test
Empty Message Test
Connection Closure Test
Corruption Test
Expected Behavior After Fix
Implementation Details
Files Modified
components/esp_websocket_client/esp_websocket_client.cesp_websocket_client_recv()functionDependencies
esp_transport.h- Transport layer interfaceesp_tls.h- TLS error handlingesp_log.h- Logging functionalityConfiguration
No configuration changes required. The fix works with existing WebSocket client configuration.
Backward Compatibility
The fix is fully backward compatible:
Performance Impact
The fix has minimal performance impact:
Conclusion
This fix resolves a critical issue in the WebSocket client by properly leveraging the transport layer's error reporting infrastructure. The solution is:
The fix prevents the WebSocket client from entering corrupted states and ensures reliable operation in all network conditions.
References