
Unable to send 10 MB messages through the network #543

Open · evshary opened this issue Mar 14, 2025 · 7 comments · May be fixed by #547

evshary (Contributor) commented Mar 14, 2025

While running a ping-pong test across the network, large messages cannot be received.
With a 1 MB message, some data may be lost; with a 10 MB message, nothing is ever received.

We test with the package https://github.com/ZettaScaleLabs/ros2-simple-performance in the following topology:

     host 1                   host 2
[ping ----- zenohd] ----- [zenohd ----- pong]
  • Host 1: ros2 run simple_performance ping --ros-args -p warmup:=1.0 -p size:=10000000 -p samples:=10 -p rate:=1
  • Host 2: ros2 run simple_performance pong

We can't receive data on the pong side, though it works when the message size is smaller.

Here are some investigations on the pong side:

  • Analyzing with gdb shows that the callback inside SubscriptionData is never triggered.
  • With RUST_LOG=z=trace enabled, Zenoh never logs the message payload.
  • We do see the packets in Wireshark.
  • With the same configuration as rmw_zenoh, the Zenoh examples z_ping and z_pong work.

The issue was originally reported here.

evshary (Contributor, author) commented Mar 14, 2025

Okay, the issue is related to congestion control. We need to use BLOCK instead of DROP, or the large message will never be sent to the other side. That means we should set the reliability to RELIABLE and the history to KEEP_ALL.

pub_opts.congestion_control = Z_CONGESTION_CONTROL_BLOCK;

ros2 run simple_performance ping --ros-args -p reliability:=RELIABLE -p history:=KEEP_ALL -p warmup:=1.0 -p size:=10000000 -p samples:=100000 -p rate:=1
ros2 run simple_performance pong --ros-args -p reliability:=RELIABLE -p history:=KEEP_ALL
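For reference, here is roughly what this amounts to at the plain Zenoh API level. This is a minimal sketch assuming the zenoh 1.x Rust API and a tokio runtime, not rmw_zenoh's actual C/C++ code; the key expression and payload are placeholders:

```rust
use zenoh::qos::CongestionControl;

#[tokio::main]
async fn main() {
    // Open a Zenoh session with the default configuration.
    let session = zenoh::open(zenoh::Config::default()).await.unwrap();

    // Declare the publisher with BLOCK congestion control so that large
    // payloads are never silently dropped when the transmission queue fills up.
    let publisher = session
        .declare_publisher("example/large_payload") // placeholder key expression
        .congestion_control(CongestionControl::Block)
        .await
        .unwrap();

    // Publish a 10 MB payload, mirroring the failing test case above.
    publisher.put(vec![0u8; 10_000_000]).await.unwrap();
}
```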

However, in the navigation2 scenario, the map server uses KEEP_LAST with a depth of 1:
https://github.com/ros-navigation/navigation2/blob/085d235db0ef0d189ca19315d7c69a359778fc93/nav2_map_server/src/map_server/map_server.cpp#L118
This will cause problems when loading a large map.

Hugal31 (Contributor) commented Mar 14, 2025

I had the same issue. I think all reliable topics should use the BLOCK congestion control.

Hugal31 linked a pull request (#547) on Mar 14, 2025 that will close this issue.

JEnoch (Contributor) commented Mar 14, 2025

I understand that using CongestionControl.BLOCK for all reliable topics is tempting, but I would be very careful with this:
any slow subscriber, or any subscriber across a congested Wi-Fi link, would block the reliable publisher in the robot and could thus slow down all the traffic inside the robot.

In DDS, the RELIABLE QoS means that reliability is ensured only for the samples currently in the Writer's history cache.
I.e. with HISTORY KEEP_LAST(1), messages may still be lost. Only HISTORY KEEP_ALL ensures full reliability (within the limits of the RESOURCE_LIMITS QoS), and in that case the Writer may block the application code.
That's why in rmw_zenoh CongestionControl is set to BLOCK only if the QoS is RELIABLE + KEEP_ALL, which literally means "I don't want to lose any data".
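Expressed as code, that rule boils down to something like the following sketch. This is illustrative Rust, not rmw_zenoh's actual implementation (which is written in C++); the RosReliability and RosHistory types are hypothetical stand-ins for the ROS 2 QoS fields:

```rust
use zenoh::qos::CongestionControl;

// Hypothetical stand-ins for the relevant ROS 2 QoS fields,
// only to illustrate the rule described above.
enum RosReliability { Reliable, BestEffort }
enum RosHistory { KeepLast(usize), KeepAll }

// BLOCK only for RELIABLE + KEEP_ALL ("I don't want to lose any data");
// every other combination may drop samples under congestion.
fn congestion_control_for(reliability: &RosReliability, history: &RosHistory) -> CongestionControl {
    match (reliability, history) {
        (RosReliability::Reliable, RosHistory::KeepAll) => CongestionControl::Block,
        _ => CongestionControl::Drop,
    }
}
```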

Now, my understanding is that the current issue is with a "sporadic" publication (I guess the map is published only once) over a congested Wi-Fi link. The publisher QoS is set to TRANSIENT_LOCAL, meaning the AdvancedPublisher is used by rmw_zenoh. There could be other solutions, such as enabling end-to-end reliability for such a topic.

Another idea we're working on is to allow the router to change/override the QoS per topic when routing outside the robot.
If you have a Lidar publishing big point clouds internally on a RELIABLE topic, you don't want an external RViz over Wi-Fi to slow down or block your robot, while losing some point clouds over Wi-Fi might be acceptable.

YuanYuYuan (Contributor) commented:

As explained in the Fast DDS documentation on sending large data, the current configuration of rmw_zenoh follows the same behavior.

It is recommended to fine-tune other QoS settings and parameters based on the specific use case for transmitting large data. For instance, real-time video streaming has different requirements compared to sending an HD map as a one-time transfer.

Hugal31 (Contributor) commented Mar 17, 2025

> As explained in the Fast DDS documentation on sending large data, the current configuration of rmw_zenoh follows the same behavior.

I'm not certain rmw_zenoh follows the same behavior. If I understand the doc you linked, Fast-DDS "Reliable + KeepLast(1)" would still ensure the latest sample in the buffer is received by the subscribers, unless it is erased by a new sample. This sounds like the correct behavior. rmw_zenoh, on the other hand, will stop trying to send a sample if it cannot fit in the queue before the drop timeout, even if the topic is set to Reliable + KeepLast(1).

I understand the risk of a "big" topic blocking a reliable topic, but I'd argue it's more the "big" topic's fault for not using a dedicated queue and transport. Still, if you prefer to keep Reliable + KeepLast topics dropping, I'd like to at least consider making transient-local topics blocking. I hope that makes sense.

> Another idea we're working on is to allow the router to change/override the QoS per topic

That would be nice. For now, though, I use TCP+UDP between my peers (within the same host), but for the routers (talking over Wi-Fi) I use TCP?prio=0-5, TCP?prio=6-7 and UDP, so a lossy Wi-Fi link doesn't congest all the reliable topics.
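For reference, a rough sketch of that kind of setup on a peer, assuming the zenoh 1.x Rust API; the addresses and ports are placeholders, and the ?prio= endpoint metadata is copied from the description above (its exact syntax and support may differ between Zenoh versions):

```rust
#[tokio::main]
async fn main() {
    let mut config = zenoh::Config::default();

    // Dedicate separate links to different priority ranges, plus a UDP link
    // for best-effort traffic, so congestion on one link does not stall the
    // others. Addresses and ports are placeholders.
    config
        .insert_json5(
            "connect/endpoints",
            r#"["tcp/192.168.1.10:7447?prio=0-5",
                "tcp/192.168.1.10:7448?prio=6-7",
                "udp/192.168.1.10:7449"]"#,
        )
        .unwrap();

    let _session = zenoh::open(config).await.unwrap();
}
```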

Thank you for the quick replies!

Hugal31 (Contributor) commented Mar 17, 2025

> Another idea we're working on is to allow the router to change/override the QoS per topic when routing outside the robot.
> If you have a Lidar publishing big point clouds internally on a RELIABLE topic, you don't want an external RViz over Wi-Fi to slow down or block your robot, while losing some point clouds over Wi-Fi might be acceptable.

Actually, what would make the most sense (although it is not in Zenoh's current architecture) would be for the subscriber to be able to specify the desired reliability. That way, your robot can work with a reliable/blocking point cloud while RViz is happy with a best-effort one. In addition, this semantic already exists in ROS 1 and ROS 2.

YuanYuYuan (Contributor) commented:

Hi @Hugal31!

> Fast-DDS "Reliable + KeepLast(1)" would still ensure the latest sample in the buffer is received by the subscribers, unless it is erased by a new sample.

That's correct. I meant that users could lose data if they keep updating the queue without setting the history QoS to KEEP_ALL. We're discussing how to ensure reliability when sending a single large piece of data at most once.

> Actually, what would make the most sense (although it is not in Zenoh's current architecture) would be for the subscriber to be able to specify the desired reliability.

In fact, that was our previous design, but we decided to configure this on the publisher side. To be clear, the current question is how to properly map ROS 2 Reliability + KEEP_LAST(N) + the actual sending behavior onto Zenoh's CongestionControl + Reliability.
