
Unable to send 10 MB messages through the network #543

Open · evshary opened this issue Mar 14, 2025 · 7 comments · May be fixed by #547

evshary (Contributor) commented Mar 14, 2025

While running a ping-pong test across the network, large messages cannot be received.
With a 1 MB message, some data may be lost; with a 10 MB message, nothing is ever received.

We test with the package https://github.com/ZettaScaleLabs/ros2-simple-performance in the following topology:

     host 1                   host 2
[ping ----- zenohd] ----- [zenohd ----- pong]
  • Host 1: ros2 run simple_performance ping --ros-args -p warmup:=1.0 -p size:=10000000 -p samples:=10 -p rate:=1
  • Host 2: ros2 run simple_performance pong

We can't receive data on the pong side, though it works when the message size is smaller.

Here are some investigations on the pong side:

  • Analyzing with gdb shows that the callback inside SubscriptionData is never triggered.
  • With RUST_LOG=z=trace enabled, Zenoh never logs the message payload.
  • We do see the packets in Wireshark.
  • With the same configuration as rmw_zenoh, the Zenoh examples z_ping and z_pong work.

The issue was originally reported here.

evshary (Contributor, author) commented Mar 14, 2025

Okay, the issue is related to congestion control. We need to use BLOCK instead of DROP, or the large message will never be sent to the other side. That means we should set the reliability to RELIABLE and the history to KEEP_ALL.

pub_opts.congestion_control = Z_CONGESTION_CONTROL_BLOCK;

ros2 run simple_performance ping --ros-args -p reliability:=RELIABLE -p history:=KEEP_ALL -p warmup:=1.0 -p size:=10000000 -p samples:=100000 -p rate:=1
ros2 run simple_performance pong --ros-args -p reliability:=RELIABLE -p history:=KEEP_ALL
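For reference, here is roughly what this amounts to at the plain Zenoh API level. This is a minimal sketch assuming the zenoh 1.x Rust API and a tokio runtime, not rmw_zenoh's actual C/C++ code; the key expression and payload are placeholders:

```rust
use zenoh::qos::CongestionControl;

#[tokio::main]
async fn main() {
    // Open a Zenoh session with the default configuration.
    let session = zenoh::open(zenoh::Config::default()).await.unwrap();

    // Declare the publisher with BLOCK congestion control so that large
    // payloads are never silently dropped when the transmission queue fills up.
    let publisher = session
        .declare_publisher("example/large_payload") // placeholder key expression
        .congestion_control(CongestionControl::Block)
        .await
        .unwrap();

    // Publish a 10 MB payload, mirroring the failing test case above.
    publisher.put(vec![0u8; 10_000_000]).await.unwrap();
}
```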

However, in the navigation2 scenario, the map server uses KEEP_LAST with a depth of 1:
https://github.com/ros-navigation/navigation2/blob/085d235db0ef0d189ca19315d7c69a359778fc93/nav2_map_server/src/map_server/map_server.cpp#L118
This will cause problems when loading a large map.

Hugal31 (Contributor) commented Mar 14, 2025

I had the same issue. I think all reliable topics should use the BLOCK congestion control.

Hugal31 linked a pull request (#547) on Mar 14, 2025 that will close this issue.

JEnoch (Contributor) commented Mar 14, 2025

I understand that using CongestionControl.BLOCK for all reliable topics is tempting, but I would be very careful with this:
any slow subscriber, or any subscriber across a congested Wi-Fi link, would block the reliable publisher in the robot and could thus slow down all the traffic inside the robot.

In DDS, the RELIABLE QoS means that reliability is ensured only for the samples currently in the Writer's history cache.
I.e. with HISTORY KEEP_LAST(1), messages may still be lost. Only HISTORY KEEP_ALL ensures full reliability (within the limits of the RESOURCE_LIMITS QoS), and in that case the Writer may block the application code.
That's why in rmw_zenoh CongestionControl is set to BLOCK only if the QoS is RELIABLE + KEEP_ALL, which literally means "I don't want to lose any data".
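Expressed as code, that rule boils down to something like the following sketch. This is illustrative Rust, not rmw_zenoh's actual implementation (which is written in C++); the RosReliability and RosHistory types are hypothetical stand-ins for the ROS 2 QoS fields:

```rust
use zenoh::qos::CongestionControl;

// Hypothetical stand-ins for the relevant ROS 2 QoS fields,
// only to illustrate the rule described above.
enum RosReliability { Reliable, BestEffort }
enum RosHistory { KeepLast(usize), KeepAll }

// BLOCK only for RELIABLE + KEEP_ALL ("I don't want to lose any data");
// every other combination may drop samples under congestion.
fn congestion_control_for(reliability: &RosReliability, history: &RosHistory) -> CongestionControl {
    match (reliability, history) {
        (RosReliability::Reliable, RosHistory::KeepAll) => CongestionControl::Block,
        _ => CongestionControl::Drop,
    }
}
```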

Now, my understanding is that the current issue is with a "sporadic" publication (I guess the map is published only once) over a congested Wi-Fi link. The publisher QoS is set to TRANSIENT_LOCAL, meaning the AdvancedPublisher is used by rmw_zenoh. There could be other solutions, such as enabling end-to-end reliability for such a topic.

Another idea we're working on is to allow the router to change/override the QoS per topic when routing outside the robot.
If you have a Lidar publishing big point clouds internally on a RELIABLE topic, you don't want an external RViz over Wi-Fi to slow down or block your robot, while losing some point clouds over Wi-Fi might be acceptable.

YuanYuYuan (Contributor) commented:

As explained in the Fast DDS documentation on sending large data, the current configuration of rmw_zenoh follows the same behavior.

It is recommended to fine-tune other QoS settings and parameters based on the specific use case for transmitting large data. For instance, real-time video streaming has different requirements compared to sending an HD map as a one-time transfer.

Hugal31 (Contributor) commented Mar 17, 2025

> As explained in the Fast DDS documentation on sending large data, the current configuration of rmw_zenoh follows the same behavior.

I'm not certain rmw_zenoh follows the same behavior. If I understand the doc you linked, Fast-DDS "Reliable + KeepLast(1)" would still ensure the latest sample in the buffer is received by the subscribers, unless it is erased by a new sample. This sounds like the correct behavior. rmw_zenoh, on the other hand, will stop trying to send a sample if it cannot fit in the queue before the drop timeout, even if the topic is set to Reliable + KeepLast(1).

I understand the risk of a "big" topic blocking a reliable topic, but I'd argue it's more the "big" topic's fault for not using a dedicated queue and transport. Still, if you prefer to keep Reliable + KeepLast topics dropping, I'd like to at least consider making transient-local topics blocking. I hope that makes sense.

> Another idea we're working on is to allow the router to change/override the QoS per topic

That would be nice. For now, though, I use TCP+UDP between my peers (within the same host), but for the routers (talking over Wi-Fi) I use TCP?prio=0-5, TCP?prio=6-7 and UDP, so a lossy Wi-Fi link doesn't congest all the reliable topics.
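For reference, a rough sketch of that kind of setup on a peer, assuming the zenoh 1.x Rust API; the addresses and ports are placeholders, and the ?prio= endpoint metadata is copied from the description above (its exact syntax and support may differ between Zenoh versions):

```rust
#[tokio::main]
async fn main() {
    let mut config = zenoh::Config::default();

    // Dedicate separate links to different priority ranges, plus a UDP link
    // for best-effort traffic, so congestion on one link does not stall the
    // others. Addresses and ports are placeholders.
    config
        .insert_json5(
            "connect/endpoints",
            r#"["tcp/192.168.1.10:7447?prio=0-5",
                "tcp/192.168.1.10:7448?prio=6-7",
                "udp/192.168.1.10:7449"]"#,
        )
        .unwrap();

    let _session = zenoh::open(config).await.unwrap();
}
```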

Thank you for the quick replies!

Hugal31 (Contributor) commented Mar 17, 2025

> Another idea we're working on is to allow the router to change/override the QoS per topic when routing outside the robot.
> If you have a Lidar publishing big point clouds internally on a RELIABLE topic, you don't want an external RViz over Wi-Fi to slow down or block your robot, while losing some point clouds over Wi-Fi might be acceptable.

Actually, what would make the most sense (although it is not in Zenoh's current architecture) would be for the subscriber to be able to specify the desired reliability. That way, your robot can work with a reliable/blocking point cloud while RViz is happy with a best-effort one. In addition, this semantic already exists in ROS 1 and ROS 2.

YuanYuYuan (Contributor) commented:

Hi @Hugal31!

> Fast-DDS "Reliable + KeepLast(1)" would still ensure the latest sample in the buffer is received by the subscribers, unless it is erased by a new sample.

That's correct. I meant that users could lose data if they keep updating the queue without setting the history QoS to KEEP_ALL. We're discussing how to ensure reliability when sending a single large piece of data at most once.

> Actually, what would make the most sense (although it is not in Zenoh's current architecture) would be for the subscriber to be able to specify the desired reliability.

In fact, that was our previous design, but we decided to configure this on the publisher side. To be clear, the current question is how to properly map ROS 2 Reliability + KEEP_LAST(N) + the actual sending behavior onto Zenoh's CongestionControl + Reliability.
