SSL EOF leads to persistent non-delivery of messages #750

gdt · 2023-10-01T11:45:16Z

I am using 1.6.1, NetBSD 9 amd64, python 3.10. The broker is up-to-date mosquitto, on a remote NetBSD 9 amd64 system. I am using 8883 with a real certificate, and generally everything works well.

I have a script which polls a UPS and publishes messages usually once a minute, but at intervals of up to 1s if something interesting has happened. On that system, I took the WAN interface down, releasing the lease, removing addresses and default route, and ifconfig down, waited an hour and brought it back. Obviously I won't have data in the meantime but I expect it to recover.

In the log (stdout/stderr) of the program I see (with the json shortened; there are actually 14 keys):

SUBMITTED time {'time': 1696117812.470824, 'runtime': 5940.0}
PUBLISHED 16075
SUBMITTED time {'time': 1696117873.4957478, 'runtime': 5940.0}
Exception in thread Thread-1 (_thread_main):
Traceback (most recent call last):
  File "/usr/pkg/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/pkg/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 3591, in _thread_main
    self.loop_forever(retry_first_connection=True)
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 1756, in loop_forever
    rc = self._loop(timeout)
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 1181, in _loop
    rc = self.loop_write()
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 1577, in loop_write
    rc = self._packet_write()
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 2464, in _packet_write
    write_length = self._sock_send(
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 649, in _sock_send
    return self._sock.send(buf)
  File "/usr/pkg/lib/python3.10/ssl.py", line 1206, in send
    return self._sslobj.write(data)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2426)
PUBLISHED 16076
SUBMITTED time {'time': 1696117934.524036, 'runtime': 5940.0}
SUBMITTED time {'time': 1696117994.8527699, 'runtime': 5940.0}

So the "EOF" looks like write failed. Probably this is "no route to host". However I don't get a disconnect callback. And I did get a publish callback! But surely the broker didn't get the message.
Then, future publish calls happen, with no exceptions. But no publish callbacks.

Obviously I can write a watchdog to close/open if I don't get pubacks to python.

But it seems that the ssl error should be caught and the connection should be judged non-functional. Or at least some later publish should cause failure and to close the connection. The user program should not get an exception thrown other than in response to an API call and then only documented exceptions.

I don't think this is related to threading/locking, as the program only did one publish call, and without the network being down, runs for months.

The text was updated successfully, but these errors were encountered:

gdt · 2023-10-04T11:55:36Z

(I did write a watchdog, but calling close and then reopen did not result in messages flowing.)

Reading client.py, I see _packet_write calling _sock_send on line 2506. A number of exceptions are caught, but there is no general catc, and obviously the one above is not caught. It seems obvious that an exception that has not been considered carefully risks leaving the socket in a broken state, as happens to me. And therefore that other exceptions should lead to the close/retry step.

warped-rudi · 2024-01-04T09:19:54Z

Are there any updates on this issue? I'm seeing it as well when the internet connection gets lost temporarily.

PierreF · 2024-01-07T20:32:48Z

It seems you are using loop_start, and some exception aren't correctly handled which cause the loop thread to crash. Once crashed no more processing it done.

I've made a PR (#797) which should fix your problem.

gdt · 2024-01-08T01:29:02Z

Thanks for addressing this. I am in fact using loop_start. Modulo variables to set my username/etc. and to enable TLS, the code is at https://github.com/merbanan/rtl_433/blob/master/examples/rtl_433_mqtt_relay.py . But, aside from this failure, everything else is fine. I am grabbing the patch in the PR, will patch my system, and restart it.

warped-rudi · 2024-01-09T06:46:57Z

I gave this patch a try and first tests were positive. Thanks!

github-actions bot added the Status: Available No one has claimed responsibility for resolving this issue. label Oct 1, 2023

gdt mentioned this issue Oct 1, 2023

ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2483) #637

Open

PierreF mentioned this issue Jan 7, 2024

Correctly mark connection as broken on SSL error and don't crash loop_forever #797

Merged

MattBrittan added the Type: Bug label Jan 8, 2024

PierreF closed this as completed in #797 Jan 20, 2024

anschweitzer mentioned this issue Apr 1, 2024

TLS error on send crashes proactor thegridelectric/gridworks-proactor#99

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SSL EOF leads to persistent non-delivery of messages #750

SSL EOF leads to persistent non-delivery of messages #750

gdt commented Oct 1, 2023 •

edited

Loading

gdt commented Oct 4, 2023

warped-rudi commented Jan 4, 2024

PierreF commented Jan 7, 2024

gdt commented Jan 8, 2024

warped-rudi commented Jan 9, 2024

SSL EOF leads to persistent non-delivery of messages #750

SSL EOF leads to persistent non-delivery of messages #750

Comments

gdt commented Oct 1, 2023 • edited Loading

gdt commented Oct 4, 2023

warped-rudi commented Jan 4, 2024

PierreF commented Jan 7, 2024

gdt commented Jan 8, 2024

warped-rudi commented Jan 9, 2024

gdt commented Oct 1, 2023 •

edited

Loading