SSR-BRS crashes after varying amount of time #371

janhenhan · 2023-07-16T17:08:06Z

Hi all,

Great to see you guys are still going strong developing SSR after over a decade. Congrats on the 0.6 release!

Recently, I've increased the number of network messages I send to ssr-brs (for example, 20 sources that each receive attribute updates at a rate of 100 Hz). Unfortunately, that came with a big decrease in the stability of the SSR.

I am experiencing unexpected crashes after varying amounts of time: sometimes it runs fine for hours, other times only for minutes. At first I thought this might be the fault of the older FUDI interface (some open issues here describe similar crashes with the older network interface), so I switched to the more recent WebSocket interface. Unfortunately, the crashes happen there as well. The messages I send all seem to contain values within a valid range, i.e. as far as I can tell, no particular message crashes ssr-brs.

I attached lldb to the process, but the messages mean very little to me. Most of the time it is a bad access in the cleanup:
"
Process 45934 stopped

thread # 13, stop reason = EXC_BAD_ACCESS (code=1, address=0xbeadde8ca818)
frame # 0: 0x000000010000d2e4 ssr-brs`apf::CommandQueue::push(apf::CommandQueue::Command*) [inlined] apf::CommandQueue::_cleanup(this=0x0000000100110588, cmd=0x000060000021a800) at commandqueue.h:173:12 [opt]
170 void _cleanup(Command* cmd)
171 {
172 assert(cmd != nullptr);
-> 173 cmd->cleanup();
174 delete cmd;
175 }
176
Target 0: (ssr-brs) stopped.
"

Any thoughts on what this means or how I could prevent it, to get ssr-brs to a more robust state again? These bad_accesses happen somewhere in APF? Any other logs that would help? I'm on a M1 Mac.

Many thanks!

mgeier · 2023-07-19T18:31:09Z

Thanks for the report!

This sounds like a nasty bug, I hope we can find the cause and fix it.

It kinda sounds like a use-after-free bug where the cmd pointer is accessed after it has been freed somewhere else. However, it is freed literally in the next line, and not somewhere else ...

Smells a bit like undefined behavior ...

> These bad_accesses happen somewhere in APF?

Well, yes, the CommandQueue is used to send messages from the control thread to the audio thread (and back).
It might be a problem in the APF, but not necessarily.

> Any other logs that would help?

I don't know. It seems the problem happens when calling the cleanup() function, but before this function is actually executed.

> I'm on a M1 Mac.

That's a good hint. I have the feeling that our ring buffer implementation might not be correct on ARM processors.

Are you running the SSR natively or via Rosetta?

The first thing I would try is to use atomics in our ring buffer and see if that changes anything.
Currently, I don't have a lot of time, but maybe I can try a few things next week.
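
To illustrate what "using atomics in the ring buffer" could mean, here is a minimal sketch of a single-producer/single-consumer pointer queue. It is not the actual APF code; the names `SpscPointerQueue`, `Message` and `run_demo` are hypothetical and exist only for this example. The idea is that the write index is published with release ordering and read with acquire ordering, so the consumer cannot see the new index before it sees the slot contents. Without such ordering, a weakly ordered CPU like ARM is allowed to reorder those writes, whereas the stronger ordering on x86 often hides the problem.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical single-producer/single-consumer ring buffer of pointers.
// Illustration only, NOT the APF implementation.
template<typename T>
class SpscPointerQueue
{
public:
  // One slot is kept empty to distinguish "full" from "empty".
  explicit SpscPointerQueue(std::size_t capacity)
    : _slots(capacity + 1), _read(0), _write(0)
  {
    assert(capacity > 0);
  }

  // Call only from the producer thread. Returns false if the queue is full.
  bool push(T* item)
  {
    const std::size_t w = _write.load(std::memory_order_relaxed);
    const std::size_t next = (w + 1) % _slots.size();
    if (next == _read.load(std::memory_order_acquire)) return false;
    _slots[w] = item;
    // Release: the slot write (and earlier writes to *item) become visible
    // to the consumer before the updated write index does.
    _write.store(next, std::memory_order_release);
    return true;
  }

  // Call only from the consumer thread. Returns nullptr if the queue is empty.
  T* pop()
  {
    const std::size_t r = _read.load(std::memory_order_relaxed);
    // Acquire: pairs with the release store in push().
    if (r == _write.load(std::memory_order_acquire)) return nullptr;
    T* item = _slots[r];
    // Release: the slot has been read before the producer may reuse it.
    _read.store((r + 1) % _slots.size(), std::memory_order_release);
    return item;
  }

private:
  std::vector<T*> _slots;
  std::atomic<std::size_t> _read;
  std::atomic<std::size_t> _write;
};

struct Message { int value; };

// Sends `count` heap-allocated messages (values 1..count) to a second thread,
// which deletes them after use, and returns the sum of the received values.
long run_demo(int count)
{
  SpscPointerQueue<Message> queue(8);
  long sum = 0;
  std::thread consumer([&] {
    int received = 0;
    while (received < count)
    {
      if (Message* m = queue.pop())
      {
        sum += m->value;
        delete m;
        ++received;
      }
      else
      {
        std::this_thread::yield();  // queue empty, let the producer run
      }
    }
  });
  for (int i = 1; i <= count; ++i)
  {
    Message* m = new Message{i};
    while (!queue.push(m)) std::this_thread::yield();  // queue full, retry
  }
  consumer.join();  // after join(), reading `sum` here is safe
  return sum;
}
```

Calling `run_demo(1000)` should return 500500 if every message arrives exactly once. A test like this passing does not prove that the ordering is correct, but running the real queue under ThreadSanitizer on the M1 might help to confirm or rule out this theory.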

janhenhan · 2023-07-19T19:19:40Z

Thanks Matthias! It would be really great if you can find the time to have a look at some point :)

> Are you running the SSR natively or via Rosetta?

I'm running a native M1 ARM build.

For what it's worth, one perhaps questionable observation: SSR seems to crash much sooner when I start it as a subprocess from Python than when I wait for it to crash in the debugger. But that may just be subjective, or within the wide range of times it runs before crashing.
