Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Session restart example #218

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dcorbeil
Copy link
Contributor

Hi,

A couple weeks back I asked on the Discord about how to handle the situation where in a multi-router network, the router that the zenoh-pico client is connected to goes down. The recommendation that I was given was to check whether the session is still valid is by periodically calling z_check(). I tried doing just that but it didn't work. Even if the router that the zenoh-pico client was connected to went down for whatever reason, z_check() kept returning true. For the use-case of a publisher I ended up using the return value from z_publisher_put(). This works well for one restart.

The problem that I run into is that the second restart triggers a USAGE FAULT:

uart:~$ *** Booting Zephyr OS build zephyr-v20701  ***
[00:00:01.642,000] <inf> net_config: Initializing network
[00:00:01.642,000] <inf> net_config: Waiting interface 1 (0x20001c58) to be up...
[00:00:01.931,000] <err> eth_stm32_hal: Failed to enqueue frame into RX queue: -115
[00:00:01.932,000] <inf> net_config: Interface 1 (0x20001c58) coming up
[00:00:01.932,000] <inf> net_config: IPv4 address: 192.0.2.4
[00:00:02.033,000] <inf> net_config: IPv6 address: 2001:db8::2
uart:~$ Opening Zenoh Session...OK
Declaring publisher for 'demo/example/zenoh-pico-pub'...OK
Putting Data ('demo/example/zenoh-pico-pub': '[   0] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   1] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   2] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   3] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   4] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   5] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   6] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   7] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   8] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   9] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[  10] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[  11] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[  12] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[  13] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[  14] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[  15] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[  16] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Error publishing message (-100)
Closing Zenoh Session...OK!
Opening Zenoh Session...OK
Declaring publisher for 'demo/example/zenoh-pico-pub'...OK
Putting Data ('demo/example/zenoh-pico-pub': '[   0] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   1] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   2] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   3] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   4] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   5] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   6] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   7] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   8] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   9] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Error publishing message (-100)
Closing Zenoh Session...OK!
Opening Zenoh Session...OK
Declaring publisher for 'demo/example/zenoh-pico-pub'...OK
[00:00:44.337,000] <err> os: ***** USAGE FAULT *****
[00:00:44.337,000] <err> os:   Illegal load of EXC_RETURN into PC
[00:00:44.337,000] <err> os: r0/a1:  0x200025f0  r1/a2:  0x0802ddff  r2/a3:  0x00000010
[00:00:44.337,000] <err> os: r3/a4:  0xffffffff r12/ip:  0xffffffff r14/lr:  0x0801985f
[00:00:44.337,000] <err> os:  xpsr:  0xe000ec00
[00:00:44.337,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:44.337,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:44.337,000] <err> os: Current thread: 0x20002300 (unknown)
[00:00:44.400,000] <err> os: Halting system

Setup:

  • Microcontroller 1x
    • User code: z_pub_session_retry.c
    • Board: Nucleo-F429zi
    • Build system: PlatformIO as per the README.md
    • OS: ZephyrRTOS v2.7.1 default from PlatformIO
  • Router 2x
    • zenohd v0.7.0-rc
    • Board: Raspberry Pi CM4
    • OS: Custom buildroot

Steps to reproduce:

  1. Follow Zephyr instruction from the zenoh-pico README.md to build and flash the z_pub_session_retry.c example from this PR
  2. Start a zenohd on two different machines that can function has a zenoh router (I started it by running RUST_LOG=DEBUG zenohd. The debug output makes it very easy to know which one is connected to the zenoh-pico client)
  3. Upload/start the microcontroller running the example
  4. Wait for the publisher from the example to start publishing
  5. Note which of the two routers the client connected itself to (Let's call it router X)
  6. Terminate (Ctrl+C) router X
  7. Note that the client reconnects itself to router Y
  8. Restart router X
  9. Terminate router Y
  10. Notice the USAGE FAULT on the microcontroller

Expected behavior:

  • Be able to switch back and forth between the two routers indefinitely at runtime

My intent with this PR is two easily show the code that I'm using and solve this issue and find the solution to that problem.

I initially ran into a slightly different issue that this one while using a fairly recent version of Zephyr (30efd04) but for the sake of simplicity and removing variables I'm reporting the issue as if I'm using PlatformIO. I suppose that you guys are all using PlaftormIO for zenoh-pico development?

@@ -222,7 +222,7 @@
*/
#ifndef Z_BATCH_SIZE_RX
#define Z_BATCH_SIZE_RX \
65535 // Warning: changing this value can break the communication
3072 // Warning: changing this value can break the communication
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not intend to merge changes made to this file. They are just to make the session opening work as my microcontroller doesn't have enough RAM for the default value.

@dcorbeil
Copy link
Contributor Author

dcorbeil commented Jun 1, 2023

I was able to reproduce the error again in the debugger but got a slightly different one than previously mentioned in this PR.

uart:~$ *** Booting Zephyr OS build zephyr-v2.7.1  ***
[00:00:01.957,000] <inf> net_config: Initializing network
[00:00:01.957,000] <inf> net_config: Waiting interface 1 (0x20001a94) to be up...
[00:00:02.446,000] <err> eth_stm32_hal: Failed to enqueue frame into RX queue: -115
[00:00:02.447,000] <inf> net_config: Interface 1 (0x20001a94) coming up
[00:00:02.447,000] <inf> net_config: IPv4 address: 192.0.2.4
[00:00:02.547,000] <inf> net_config: IPv6 address: 2001:db8::2
[00:00:02.547,000] <inf> net_config: IPv6 address: 2001:db8::2
uart:~$ Opening Zenoh Session...OK
Declaring publisher for 'demo/example/zenoh-pico-pub'...OK
uart:~$ Putting Data ('demo/example/zenoh-pico-pub': '[   0] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   1] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   2] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   3] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   4] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   5] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   6] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Error publishing message (-100)
Closing Zenoh Session...OK!
uart:~$ Opening Zenoh Session...OK
Declaring publisher for 'demo/example/zenoh-pico-pub'...OK
Putting Data ('demo/example/zenoh-pico-pub': '[   0] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   1] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   2] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   3] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   4] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Error publishing message (-100)
Closing Zenoh Session...OK!
uart:~$ Opening Zenoh Session...OK
[00:00:31.872,000] <err> os: ***** BUS FAULT *****
[00:00:31.872,000] <err> os:   Imprecise data bus error
[00:00:31.872,000] <err> os: r0/a1:  0x000009ff  r1/a2:  0x0000aaaa  r2/a3:  0x00000003
[00:00:31.872,000] <err> os: r3/a4:  0x0000aaaa r12/ip:  0x00000000 r14/lr:  0x08020813
[00:00:31.872,000] <err> os:  xpsr:  0x81080000
[00:00:31.872,000] <err> os: Faulting instruction address (r15/pc): 0x08020816
[00:00:31.872,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:31.872,000] <err> os: Current thread: 0x20002788 (main)
[00:00:31.933,000] <err> os: Halting system

Sadly gdb doesn't give the callstack from before the signal handler was called. Whereas when I'm using the Zephyr commit I pointed at earlier I was able to. Maybe they added that between 2.7.1 and 3.3.0 or there is a config option that needs to be turned on that was magically turned on behind my back? I don't know.

Anyways, using ``addr2line` I was able to pin point exactly which line triggered the signal handler by doing:

/path/to/zephyr/sdk/arm-zephyr-eabi/bin/arm-zephyr-eabi-addr2line -e build/zephyr/zephyr.elf 0x08020816

And it pointed to /opt/dynonavionics/san-juan-zephyr-dev/zephyrproject/zephyr/lib/os/heap.h:179. Does that ring a bell?

@cguimaraes
Copy link
Member

Hi @dcorbeil .
I am currently without access to my development environment, but I can try to help.

It seems to me that it fails on the 3rd open. Can you confirm that this is consistent with other failed runs?
If so, it might be that there is some memory leak and it is exhausting the memory.

@dcorbeil
Copy link
Contributor Author

dcorbeil commented Jun 5, 2023

@cguimaraes
Yes it does fail on the 3rd open and is consistently failing at the 3rd open.

My mind immediately went to a memory leak and it's exhausting the memory as well so I checked the heap by using this function:

#include <zephyr/sys/sys_heap.h>
extern struct sys_heap _system_heap;

void print_sys_heap()
{
    int err;
    struct sys_memory_stats stats;

    err = sys_heap_runtime_stats_get(&_system_heap, &stats);
    if (err) {
        printf("Failed to read kernel system heap statistics (err %d)\n", err);
    }

    printf("free:           %zu\t", stats.free_bytes);
    printf("allocated:      %zu\t", stats.allocated_bytes);
    printf("max. allocated: %zu\n", stats.max_allocated_bytes);
}

and everything seems ok. There is always at least 10k bytes available.

When using Zephyr > 3.3.0, I was able to get the callstack of where it fails exactly and it always fails at zp_start_read_task() or zp_start_lease_task(), whichever is called second on the 3rd open. The exact line at which it fails is _z_task_t *task = (_z_task_t *)z_malloc(sizeof(_z_task_t)); in either functions.

@cguimaraes
Copy link
Member

If the memory keeps stable I wouldn't say it is a leaking issue. It might still be a memory fragmentation issue.

If I am not mistaken Zephyr also has a function to tell you the biggest continuous block of memory. Can you double check that?

@dcorbeil
Copy link
Contributor Author

dcorbeil commented Jun 7, 2023

Are you talking about heap_print_info()?

For fun I tried allocating 700 int pointers and then freeing them right after to try to fragment the memory. No change. It still crashes at the 3rd open.

@cguimaraes
Copy link
Member

I would prefer to see the statistics with heap_print_info before each z_open.

When you allocate and deallocate right after, it might be that the memory will be in a very similar state.

@dcorbeil
Copy link
Contributor Author

dcorbeil commented Jun 9, 2023

Here is a log with a call to heap_print_info right before opening the call to z_open. For some reason the error is now different. The one thing that stays consistent is that an error occurs after the 3rd restart.

uart:~$ *** Booting Zephyr OS build zephyr-v20701  ***
[00:00:03.653,000] <inf> net_config: Initializing network
[00:00:03.653,000] <inf> net_config: Waiting interface 1 (0x20001c58) to be up...
[00:00:04.125,000] <err> eth_stm32_hal: Failed to enqueue frame into RX queue: -115
[00:00:04.125,000] <inf> net_config: Interface 1 (0x20001c58) coming up
[00:00:04.125,000] <inf> net_config: IPv4 address: 192.0.2.4
[00:00:04.225,000] <inf> net_config: IPv6 address: 2001:db8::2
[00:00:04.225,000] <inf> net_config: IPv6 address: 2001:db8::2
uart:~$ Heap at 0x2000fb80 contains 2559 units in 12 buckets

  bucket#    min units        total      largest      largest
             threshold       chunks      (units)      (bytes)
  -----------------------------------------------------------
        1            2            1            2           12
       11         2048            1         2512        20092

20104 free bytes, 240 allocated bytes, overhead = 132 bytes (0.6%)


uart:~$ *** Booting Zephyr OS build zephyr-v20701  ***
[00:00:01.532,000] <inf> net_config: Initializing network
[00:00:01.532,000] <inf> net_config: Waiting interface 1 (0x20001c58) to be up...
[00:00:01.811,000] <err> eth_stm32_hal: Failed to enqueue frame into RX queue: -115
[00:00:01.811,000] <inf> net_config: Interface 1 (0x20001c58) coming up
[00:00:01.812,000] <inf> net_config: IPv4 address: 192.0.2.4
[00:00:01.913,000] <inf> net_config: IPv6 address: 2001:db8::2
uart:~$ Heap at 0x2000fb80 contains 2559 units in 12 buckets

  bucket#    min units        total      largest      largest
             threshold       chunks      (units)      (bytes)
  -----------------------------------------------------------
        1            2            1            2           12
       11         2048            1         2512        20092

20104 free bytes, 240 allocated bytes, overhead = 132 bytes (0.6%)
Opening Zenoh Session...OK
Declaring publisher for 'demo/example/zenoh-pico-pub'...OK
Putting Data ('demo/example/zenoh-pico-pub': '[   0] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   1] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   2] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   3] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   4] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   5] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   6] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   7] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   8] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   9] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Error publishing message (-100)
Closing Zenoh Session...OK!
Heap at 0x2000fb80 contains 2559 units in 12 buckets

  bucket#    min units        total      largest      largest
             threshold       chunks      (units)      (bytes)
  -----------------------------------------------------------
        1            2            1            2           12
       11         2048            1         2512        20092

20104 free bytes, 240 allocated bytes, overhead = 132 bytes (0.6%)
Opening Zenoh Session...OK
Declaring publisher for 'demo/example/zenoh-pico-pub'...OK
Putting Data ('demo/example/zenoh-pico-pub': '[   0] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   1] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   2] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   3] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Putting Data ('demo/example/zenoh-pico-pub': '[   4] [STSTM32]{nucleo-F767ZI} Pub from Zenoh-Pico!')...
Error publishing message (-100)
Closing Zenoh Session...OK!
Heap at 0x2000fb80 contains 2559 units in 12 buckets

  bucket#    min units        total      largest      largest
             threshold       chunks      (units)      (bytes)
  -----------------------------------------------------------
        1            2            1            2           12
       11         2048            1         2512        20092

20104 free bytes, 240 allocated bytes, overhead = 132 bytes (0.6%)
Opening Zenoh Session...OK
Declaring publisher for 'demo/example/zenoh-pico-pub'...OK
[00:00:32.883,000] <err> os: ***** USAGE FAULT *****
[00:00:32.883,000] <err> os:   Illegal load of EXC_RETURN into PC
[00:00:32.883,000] <err> os: r0/a1:  0x200025f0  r1/a2:  0x0802e035  r2/a3:  0x00000010
[00:00:32.883,000] <err> os: r3/a4:  0xffffffff r12/ip:  0xffffffff r14/lr:  0x08019a1f
[00:00:32.883,000] <err> os:  xpsr:  0xe000ec00
[00:00:32.883,000] <err> os: Faulting instruction address (r15/pc): 0x00000000
[00:00:32.883,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:32.883,000] <err> os: Current thread: 0x20002300 (unknown)
[00:00:32.946,000] <err> os: Halting system

@cguimaraes
Copy link
Member

@dcorbeil can you share your project with us? Not only the main.c but also the prj.conf files.
I just got a nucleo-f429zi and I would like to replicate exactly the same issue on my side.

@cguimaraes
Copy link
Member

I think I am able to replicate the same issue.
I will try to debug it further.

@dcorbeil did you try to use the most recent Zephyr version 3.3.0 or 3.4.0?

@dcorbeil
Copy link
Contributor Author

dcorbeil commented Jun 19, 2023

@cguimaraes
Sure no problem. Here it is.

I tried Zephyr 3.3.0 and 3.4.0 with similar results. Of course when using 3.4.0, I had to pass the iface to net_if_ipv<4|6>_maddr_<join|leave> because Zephyr changed the signature to those functions. You'll see the commits where I changed the parameters back and forth passed to it. Feel free to experiment and message me if you run into issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants