iio_buffer_destroy() sometimes hangs when items are enqueued #1211

Open
catkira opened this issue Nov 5, 2024 · 101 comments · May be fixed by #1212
catkira commented Nov 5, 2024

The hang is caused by iiod_responder_wait_done() in iiod-responder.c. The responder thread is probably not stopped correctly and therefore cannot be joined.

Test scenario: latest master libiio, network context to an SDR.

catkira commented Nov 5, 2024

The problem seems to be that iiod_rw_all() is not returning.

@pcercuei do you have a quick idea how to fix this?

catkira commented Nov 5, 2024

void do_cancel(struct iiod_client_pdata *io_ctx) in network-unix.c gets called, but it doesn't seem to abort the recv() call ^^

nunojsa commented Nov 5, 2024

> void do_cancel(struct iiod_client_pdata *io_ctx) in network-unix.c gets called, but it doesn't seem to abort the recv() call ^^

I guess that only matters for cancellable operations. If that flag is set to true, do_cancel() should do the trick. But if we are not getting into poll(2), then that cancel_fd should not matter.

But in that case MSG_DONTWAIT should be set and recv(2) should not block...
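
For context, the general shape of the cancellable-receive path being discussed is roughly this; a minimal sketch only, assuming a cancel_fd that do_cancel() writes to, not the actual network-unix.c code:

#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Sketch of a cancellable recv(): poll() on the socket and on a cancel_fd
 * so a blocked reader can be woken when the operation is cancelled. */
static ssize_t recv_cancellable(int sock_fd, int cancel_fd,
                                void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd[2] = {
        { .fd = sock_fd,   .events = POLLIN },
        { .fd = cancel_fd, .events = POLLIN },
    };
    int ret;

    ret = poll(pfd, 2, timeout_ms);
    if (ret < 0)
        return -errno;
    if (ret == 0)
        return -ETIMEDOUT;
    if (pfd[1].revents & POLLIN)
        return -EBADF;              /* cancelled: abort the read */

    /* Socket is readable; MSG_DONTWAIT so we never block here */
    return recv(sock_fd, buf, len, MSG_DONTWAIT);
}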

catkira commented Nov 5, 2024

cancellable &= !(flags & MSG_DONTWAIT); evaluates to true in my case

catkira commented Nov 5, 2024

it seems like wait_cancellable(io_ctx, true, timeout_ms); is blocking in my case and never returns

catkira commented Nov 5, 2024

something with the cancel_fd is not working

nunojsa commented Nov 5, 2024

Yeah, it seems there's some subtle bug in that logic... Because do_cancel() should do the trick

catkira commented Nov 5, 2024

The problem is that pfd[1].revents is always 0, even after cancel was called.

read...
network_recv...
cancellable = 1
wait_cancellable...
destroying...
iio_buffer_cancel...
task_stop
task stop: 0 0 id: 28940624
do cancel...
do cancel done
iio_buffer_cancel done
free_buffer...
free1
pfd[1].revents = 0
task stop: 0 1 id: 28626704

catkira commented Nov 5, 2024

Something is wrong with the cancel_fd: it has the wrong value when poll() is called after cancelling the buffer.

catkira commented Nov 5, 2024

The other cancel_fd is from the other buffer that my application has open, so that's not the problem. I have to debug a bit more...

catkira commented Nov 6, 2024

It's a bit difficult to debug, because every iio context and every buffer has its own thread, all of which use functions from network.c, and there is no real documentation about the internals of libiio ^^

catkira commented Nov 6, 2024

@nunojsa I have some more findings: it goes into an endless loop at 100% CPU load here

libiio/iiod-responder.c

Lines 571 to 572 in 790a6c9

for (tmp = priv->readers; tmp->r_next; )
tmp = tmp->r_next;

catkira commented Nov 6, 2024

In C++ this would just be a std::vector and so much easier to debug

catkira commented Nov 6, 2024

@pcercuei it looks like the same reader is added twice to the list, which causes an endless loop:

[2024-11-06 11:34:23.757] [control_center::iio_buf_reader] [warning] ---     dequeue: timed out    ---
thread = 1778366016, adding reader 179795776
list item 0 = 181190880
list item 1 = 180965088
list item 2 = 179790880
list item 3 = 180897360
list item 4 = 180966064
list item 5 = 180047712
list item 6 = 180763824
list item 7 = 179200736
list item 8 = 180901488
list item 9 = 181200016
list item 10 = 183120608
list item 11 = 178781568
list item 12 = 181129104
list item 13 = 180895728
list item 14 = 179365760
list item 15 = 181301104
thread = 1786758720, wait_cancellable done
b ts: 0 1
c ts: 0 1
x1
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
x2
task stop: 0 1 id: 31090272
thread = 1786758720, network_recv done
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1786758720, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 1786758720, wait_cancellable done
thread = 1786758720, network_recv done
thread = 1786758720, read done
thread = 1786758720, read...
thread = 1786758720, network_recv cancellable = 1
thread = 1786758720, wait_cancellable...
thread = 1786758720, poll cancel_fd 20
thread = 1778366016, adding reader 179795776
list item 0 = 181190880
list item 1 = 180965088
list item 2 = 179790880
list item 3 = 180897360
list item 4 = 180966064
list item 5 = 180047712
list item 6 = 180763824
list item 7 = 179200736
list item 8 = 180901488
list item 9 = 181200016
list item 10 = 183120608
list item 11 = 178781568
list item 12 = 181129104
list item 13 = 180895728
list item 14 = 179365760
list item 15 = 181301104
list item 16 = 178759280
thread = 1778366016, discard reader 181190880
[2024-11-06 11:34:23.758] [control_center::iio_buf_reader] [warning] ---     dequeue: timed out    ---
b ts: 0 1
c ts: 0 1
x1
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
thread = 1778366016, adding reader 181190880
list item 0 = 180965088
list item 1 = 179790880
list item 2 = 180897360
list item 3 = 180966064
list item 4 = 180047712
list item 5 = 180763824
list item 6 = 179200736
list item 7 = 180901488
list item 8 = 181200016
thread = 1795151424, poll cancel_fd 20
thread = 1795151424, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
x2
list item 9 = 183120608
list item 10 = 178781568
list item 11 = 181129104
task stop: 0 1 id: 31090272
list item 12 = 180895728
list item 13 = 179365760
list item 14 = 181301104
list item 15 = 178759280
list item 16 = 179795776
list item 17 = 179795776
list item 18 = 179795776
[... "list item N = 179795776" repeats identically for every N up to 100, at which point the i > 100 guard calls exit(1) ...]
	/* Add it to the readers list */
	printf("thread = %u, adding reader %d\n", pthread_self(), io);
	if (!priv->readers) {
		priv->readers = io;
	} else {
		unsigned int i = 0;
		for (tmp = priv->readers; tmp->r_next; ) {
			printf("list item %d = %d\n", i++,tmp);
			tmp = tmp->r_next;
			if (i > 100) exit(1);
		}
		tmp->r_next = io;
	}
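
In other words, appending a node that is already the tail of a singly-linked list makes it point to itself; a standalone illustration in plain C (not libiio code):

#include <stdio.h>
#include <stddef.h>

struct node { struct node *next; };

int main(void)
{
    struct node a = { NULL }, b = { NULL };
    struct node *tmp;

    a.next = &b;                /* list is a -> b */

    /* Buggy second insertion of b: the tail we find is b itself,
     * so b ends up pointing to itself and the list now has a cycle. */
    for (tmp = &a; tmp->next; )
        tmp = tmp->next;        /* tmp stops at &b */
    tmp->next = &b;             /* b.next == &b */

    printf("b.next == &b ? %s\n", b.next == &b ? "yes" : "no");
    return 0;
}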

nunojsa commented Nov 6, 2024

> it looks like the same reader is added twice to the list, which causes an endless loop

Interesting findings...

We can obviously add a protection so that we error out gracefully (or just ignore the entry) when adding an entry that already exists (which I guess should not be possible), even though this is only a real problem when the repeated entries are sequential, so that entry->next == entry (which pretty much means list_empty, but not in this case I guess). More important, though, would be to find out why the same entry gets added twice...

@catkira thanks for debugging this

catkira commented Nov 6, 2024

Yes, I tried this workaround, but then it still hangs sometimes at other locations (probably for the same underlying reason):

	/* Add it to the readers list */
	printf("thread = %u, adding reader %d\n", pthread_self(), io);
	bool found = false;
	if (priv->readers) {
		for (tmp = priv->readers; tmp->r_next; ) {
			if (tmp == io)
			{
				printf("reader is already in list!!\n");
				found = true;
			}
			tmp = tmp->r_next;
		}
	}
	if (!found) {
		if (!priv->readers) {
			priv->readers = io;
		} else {
			unsigned int i = 0;
			for (tmp = priv->readers; tmp->r_next; ) {
				printf("list item %d = %d\n", i++,tmp);
				tmp = tmp->r_next;
				if (i > 100) exit(1);
			}
			tmp->r_next = io;
		}
	}

catkira commented Nov 6, 2024

I still get endless loops at the same location even when the same item is not inserted twice.
[screenshot]
Maybe memory corruption??

catkira commented Nov 6, 2024

	/* Add it to the readers list */
	printf("thread = %u, adding reader %d\n", pthread_self(), io);
	bool found = false;
	if (priv->readers) {
		for (tmp = priv->readers; tmp->r_next; ) {
			if (tmp == io)
			{
				printf("reader is already in list!!\n");
				found = true;
				exit(1);
			}
			if (tmp == tmp->r_next) {
				printf("loop detected!!!");
				exit(1);
			}
			tmp = tmp->r_next;
		}
	}
	if (!found) {
		if (!priv->readers) {
			priv->readers = io;
		} else {
			unsigned int i = 0;
			for (tmp = priv->readers; tmp->r_next; ) {
				printf("list item %d = %d\n", i++,tmp);
				tmp = tmp->r_next;
				if (i > 100) exit(1);
			}
			tmp->r_next = io;
		}
	}
thread = 2539071040, adding reader 182425776
loop detected!!!
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, try lock...
thread = 2588911168, locked
thread = 2588911168, discard reader 182140576
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 3178092864, adding reader 182140576
b ts: 0 1
c ts: 0 1
x1
thread = 2597303872, poll cancel_fd 15
thread = 2597303872, pfd[0].revents = 4, pfd[1].revents = 0, ret = 1
x2
task stop: 0 1 id: 32557408
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, try lock...
thread = 2588911168, locked
thread = 2588911168, discard reader 182140576
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
thread = 2588911168, pfd[0].revents = 1, pfd[1].revents = 0, ret = 1
thread = 2588911168, wait_cancellable done
thread = 2588911168, network_recv done
thread = 2588911168, read done
thread = 2588911168, read...
thread = 2588911168, network_recv cancellable = 1
thread = 2588911168, wait_cancellable...
thread = 2588911168, poll cancel_fd 15
terminate called without an active exception
Aborted (core dumped)

it must be memory corruption

nunojsa commented Nov 6, 2024

> it must be memory corruption

Ouch... Might very well be, but let's hope it's not :). I see your "loop detected!!!", and looking at iiod_io_get_response_async() I wonder how that could happen, as we do io->r_next = NULL; then I started to look at __iiod_io_cancel_unlocked(). Maybe it's worth putting some debug in there. It might be the place where the loop is created...

nunojsa commented Nov 6, 2024

BTW, I'm not sure if the code you pasted is the one you used but if it is, I think we need to check for an existing object before doing io->r_next = NULL;. Because, if the object already exists, that assignment will cut the list at that point and your loop will never find the repeated object because tmp->r_next == NULL. Right?

So __iiod_io_cancel_unlocked() should not be the problem and adding a struct iiod_io object that is already in the list cannot be permitted...

catkira commented Nov 6, 2024

I think io->r_next is set to NULL, because it will always be inserted at the end of the list.

catkira commented Nov 6, 2024

When I change the code that adds items to the reader list like this, it does not hang anymore.
The printf("io->r_next = %d\n", io->r_next); showed that the next pointer was not set to 0, but instead pointed to the node itself. Manually setting it to 0 seems to fix it. But why do I have to set it to 0 here, when io->r_next = NULL; is already done earlier??

	/* Add it to the readers list */
	printf("thread = %u, adding reader %d\n", pthread_self(), io);
	bool found = false;
	if (priv->readers) {
		for (tmp = priv->readers; tmp->r_next; ) {
			if (tmp == io)
			{
				printf("reader is already in list!!\n");
				found = true;
				exit(1);
			}
			if (tmp == tmp->r_next) {
				printf("loop detected!!!\n");
				exit(1);
			}
			tmp = tmp->r_next;
		}
	}
	if (!found) {
		if (!priv->readers) {
			priv->readers = io;
		} else {
			unsigned int i = 0;
			for (tmp = priv->readers; tmp->r_next; ) {
				printf("list item %d = %d\n", i++,tmp);
				tmp = tmp->r_next;
				if (i > 100) exit(1);
			}
			tmp->r_next = io;
			printf("io->r_next = %d\n", io->r_next);
			io->r_next = NULL;
		}
	}

catkira linked a pull request Nov 6, 2024 that will close this issue
catkira commented Nov 6, 2024

You can see my change in the PR; the second-to-last commit contains all the debug printfs.

pcercuei commented Nov 6, 2024

If you suspect memory corruption, try to compile with GCC's built-in address sanitizer.
Add -fsanitize=address to the C flags, and -static-libasan to the linker flags.

If there is any use-after-free, or a buffer overflow, it will detect it.

catkira commented Nov 6, 2024

I tried it. It freezes without any sanitizer output.

nunojsa commented Nov 6, 2024

> I think io->r_next is set to NULL, because it will always be inserted at the end of the list.

Sure, it marks the end of the list. My point is that if you're adding an object that already exists in the list and do io->r_next = NULL, you're cutting the list at that point, right? So your code that tries to detect that the same object is being added won't really work (unless I'm missing something).

> The printf("io->r_next = %d\n", io->r_next); showed that the next pointer was not set to 0, but instead pointed to the node itself.

And this really seems to suggest we're adding an object that is already in the list.

> Add -fsanitize=address to the C flags, and -static-libasan to the linker flags.

Likely a nice thing to add to cmake for a debug build...

pcercuei commented Nov 6, 2024

> @pcercuei it looks like the same reader is added twice to the list, which causes an endless loop

Yeah, that's something that is definitely not supported (nor should it be). The real question, then, is why it is added twice.

Note that there is one iiod_responder instance per buffer created, plus a main iiod_responder for things like reading attributes.
When you call iio_buffer_cancel(), only the iiod_responder that corresponds to the buffer is canceled. At that point, since the underlying socket is deemed unreliable, the "free block" and "free buffer" commands go through the main iiod_responder instead.

I suspect that you are getting a race in the main "iiod_responder", maybe by calling iio_buffer_cancel() while another thread in your program is busy reading an attribute, or by calling iio_buffer_cancel() in a signal handler (that case is explicitly written as unsupported in the doc). From what I can read in the code, iiod_responder does have some multi-thread protection in place, but it does not protect against two threads issuing two requests on the same iiod_io.

catkira commented Nov 6, 2024

yes I also think that it is some kind of race condition, because I only encounter this when I run the code on a fast CPU.

catkira commented Nov 7, 2024

I tried to do it with mutex locks, but it did not work; my understanding of the code was probably not deep enough. However, getting a deeper understanding of code one has not written can easily take a week or more. The next problem is that this bug is not easy to reproduce; even on my setup I have to let my application run for about 1 min to trigger it. If you want to fix it, you probably have to create a test app first that reliably triggers the bug.

catkira commented Nov 7, 2024

I have removed all the debug stuff; the change is pretty small now and well isolated. I admit that it is not the most elegant solution possible, BUT it's much better than hidden multi-threading bugs.

nunojsa commented Nov 8, 2024

> I tried to do it with mutex locks, but it did not work; my understanding of the code was probably not deep enough. However, getting a deeper understanding of code one has not written can easily take a week or more.

Yeah, I was wondering the same. If we serialize it, there will always be a penalty associated with it (even if we can make the locking better than your quick tests). Not sure if @pcercuei meant something else with the locking approach. And FWIW, I have the same issue as you regarding deep understanding of the code (I'm mostly working on the kernel side of things).

> I have removed all the debug stuff; the change is pretty small now and well isolated.

We don't do squash commits, sorry. In theory, only fast-forward merges would be best (IMO), but GitHub does not allow it (in the GUI). But I'm still not sure we have a proper solution for merging... See more below.

> It's not a pool per thread, it's one io element per thread.

Yeah, my bad.

> Yes, the pool would have to be large enough for all threads that are created during runtime. It is only growing if there is a new thread, which usually does not happen that much (otherwise I would not be the first person to discover these multithreading problems).

I guess that's true, but still not very elegant... The real problem (not sure if it is in practice) with any pool alternative is the client_id thing. If I'm not missing anything, that means we can, in theory, have multiple threads using the same client_id (0). Hence, I would say (again, in theory) that we can mess up badly here:

https://github.com/analogdevicesinc/libiio/blob/main/iiod-responder.c#L288

I mean, can we safely know which iiod_io we should wake for the incoming reply?! I don't think so... And to make it work with a unique id, I think bigger changes are needed (also on the iiod side).

> Giving back elements would require intrusive changes at many code locations, I would try to avoid that.

It should not be. The way I understand it, only readers matter for now, so the logical place for me would be after removing the IO from the readers list, which should mean we're done with it.

FWIW, I guess the main issue your app demonstrated is that accessing an IIO context for reading/writing attributes from different threads can be problematic. Is that correct? Is the mutex hang that bad? Or would a workaround on the app side be feasible (like a context per thread, as bad as that sounds)? I ask because anything more than serializing default_io is likely more difficult (and so needs time).

catkira commented Nov 8, 2024

I had another crash (hang) :/ so my last fix apparently does not work. What seems to work is adding just io->r_next = NULL; at the right location. I have updated my PR accordingly. This basically cuts the list after the first occurrence of the double-inserted item, which could be a problem if that item is not at the end of the list.

catkira commented Nov 8, 2024

In my app I am continuously reading buffers and also reading and setting some attributes at the same time. Serializing it with a mutex on the app side would not work if that also had to include the buffer reads/writes. If I just had to serialize the attribute access, that would be OK.

catkira commented Nov 8, 2024

My approach to fixing this properly would be to slowly migrate the project to C++ and rewrite one compilation unit at a time, using std containers and pointers, possibly also C++ mutexes and condition variables, and classes instead of priv structs where possible. It would make onboarding new people into the code much easier and faster.

nunojsa commented Nov 8, 2024

> If I just had to serialize the attribute access, that would be OK.

Could you try that? From what I understood, only those accesses use the default_io, which always has client_id = 0. For buffering (and events), I think we have an iiod_io per block (the block index being the IO client_id), so I don't think that would be an issue.
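
A minimal sketch of what that app-side serialization could look like; attr_xfer_fn is just a placeholder for whatever libiio attribute call the app uses, not a libiio function:

#include <pthread.h>

static pthread_mutex_t attr_lock = PTHREAD_MUTEX_INITIALIZER;

/* Funnel every attribute read/write through one mutex so only one thread
 * at a time talks to the context's default_io. */
static int attr_xfer_serialized(int (*attr_xfer_fn)(void *args), void *args)
{
    int ret;

    pthread_mutex_lock(&attr_lock);
    ret = attr_xfer_fn(args);
    pthread_mutex_unlock(&attr_lock);

    return ret;
}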

catkira commented Nov 8, 2024

I am thinking now that the default_io usage is not the only multithreading problem, because otherwise my last fix attempt should have worked.

pcercuei commented Nov 8, 2024

> Yeah, I was wondering the same. If we serialize it, there will always be a penalty associated with it (even if we can make the locking better than your quick tests). Not sure if @pcercuei meant something else with the locking approach. And FWIW, I have the same issue as you regarding deep understanding of the code (I'm mostly working on the kernel side of things).

What would the "penalty associated" be? Given that right now you get all kinds of issues if you try to do parallel access, you already need to serialize the API calls, so I don't see how adding some locking to make everything go smoothly would come with a penalty.

> My approach to fixing this properly would be to slowly migrate the project to C++ and rewrite one compilation unit at a time, using std containers and pointers, possibly also C++ mutexes and condition variables, and classes instead of priv structs where possible. It would make onboarding new people into the code much easier and faster.

Microcontrollers generally don't support C++.

catkira commented Nov 8, 2024

> Microcontrollers generally don't support C++.

C++ is getting more and more popular in the uC community. All my uC code is C++ and it works fine.

Microcontrollers just run asm instructions, and a C++ compiler can generate asm instructions. What exactly do you mean by "generally don't support C++"? Do you mean there is not a C++ compiler for every uC?

nunojsa commented Nov 8, 2024

> My approach to fixing this properly would be to slowly migrate the project to C++ and rewrite one compilation unit at a time, using std containers and pointers, possibly also C++ mutexes and condition variables, and classes instead of priv structs where possible.

You're knocking at the wrong door :). You would need to convince people other than me for that. There are also plenty of ways to shoot yourself in the foot with C++, even more so when thinking about classes and the way C++ makes inheritance easier than composition... I know, I know, it's trendy to say this, but I would consider and internally propose Rust if it were supported on more archs.

But again, let's not make this an argument between languages :)

> What would the "penalty associated" be? Given that right now you get all kinds of issues if you try to do parallel access, you already need to serialize the API calls, so I don't see how adding some locking to make everything go smoothly would come with a penalty.

@catkira saw his app hanging sometimes... Could we read/write attrs from different threads without any issue in libiio 0.25? If so, this definitely feels like a regression, which is never nice for users.

> I am thinking now that the default_io usage is not the only multithreading problem, because otherwise my last fix attempt should have worked.

That might very well be... But I'm still not convinced your solution was race-free.

nunojsa commented Nov 8, 2024

> Microcontrollers just run asm instructions, and a C++ compiler can generate asm instructions. What exactly do you mean by "generally don't support C++"? Do you mean there is not a C++ compiler for every uC?

I think sometimes what you can do is very limited and you can't really use all C++ features. But yeah, things might have changed; I haven't played with bare uCs for some time now. But I think we're bikeshedding here... I'm fairly sure that rewriting libiio is off the table.

catkira commented Nov 8, 2024

Code without standard containers or wrapped complexity is just hard to read for me, i.e. this block:

	/* Discard the entry from the readers list */
	if (io == priv->readers) {
		priv->readers = io->r_next;
	} else if (priv->readers) {
		for (tmp = priv->readers; tmp->r_next; tmp = tmp->r_next) {
			if (tmp->r_next == io) {
				tmp->r_next = io->r_next;
				break;
			}
		}
	}

There is a comment, so if I just assume that the code does what the comment says, then it's easy. But if I want to verify the code, it becomes difficult.

pcercuei commented Nov 8, 2024

> @catkira saw his app hanging sometimes... Could we read/write attrs from different threads without any issue in libiio 0.25? If so, this definitely feels like a regression, which is never nice for users.

Yes. And it is still possible when talking to a 0.25 IIOD server. This is enforced by a mutex, iiod_client->lock, which wraps every command/response to IIOD.

Now we need a similar locking mechanism, but one that works at the iiod_io level instead.
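
A rough sketch of that idea; the busy field and io_exec_command() below are hypothetical, not the actual libiio structures:

#include <pthread.h>

struct io_like {
    pthread_mutex_t busy;   /* hypothetical per-io lock */
    /* ... command/response state ... */
};

/* Hold the io for the whole command/response exchange, so two threads
 * cannot interleave requests on the same io. */
static int io_exec_command(struct io_like *io)
{
    int ret = 0;

    pthread_mutex_lock(&io->busy);
    /* 1. send the command on this io       */
    /* 2. wait for and consume the response */
    pthread_mutex_unlock(&io->busy);

    return ret;
}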

> > I am thinking now that the default_io usage is not the only multithreading problem, because otherwise my last fix attempt should have worked.
>
> That might very well be... But I'm still not convinced your solution was race-free.

Can we see the diff?

nunojsa commented Nov 8, 2024

> Code without standard containers or wrapped complexity is just hard to read for me, i.e. this block

Yeah, I tend to agree. This could be wrapped in some list_add() / list_del() thingy. But this is a simple linked list, so not that messy. And again, it's arguable... For me, when I look at modern C++ code, it's very hard to understand what it's doing.
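
A sketch of what such helpers could look like (the r_next field name is taken from the snippets above; the struct and function names are illustrative, not the actual libiio code). Note that the duplicate check runs before io->r_next is touched, for the reason discussed earlier:

struct io_node { struct io_node *r_next; };

/* Append io to the tail, refusing duplicates. Returns 0 on success. */
static int reader_list_add_tail(struct io_node **head, struct io_node *io)
{
    struct io_node *tmp;

    if (!*head) {
        io->r_next = NULL;
        *head = io;
        return 0;
    }

    /* Check for an existing entry before touching io->r_next,
     * otherwise a duplicate insert would cut the list here. */
    for (tmp = *head; ; tmp = tmp->r_next) {
        if (tmp == io)
            return -1;          /* already queued */
        if (!tmp->r_next)
            break;              /* tmp is the current tail */
    }

    io->r_next = NULL;
    tmp->r_next = io;
    return 0;
}

/* Unlink io from the list, if present. */
static void reader_list_del(struct io_node **head, struct io_node *io)
{
    struct io_node *tmp;

    if (*head == io) {
        *head = io->r_next;
        return;
    }
    for (tmp = *head; tmp && tmp->r_next; tmp = tmp->r_next) {
        if (tmp->r_next == io) {
            tmp->r_next = io->r_next;
            break;
        }
    }
}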

> Yes. And it is still possible when talking to a 0.25 IIOD server. This is enforced by a mutex, iiod_client->lock, which wraps every command/response to IIOD.
>
> Now we need a similar locking mechanism, but one that works at the iiod_io level instead.

Alright, if it was like this before, then the same serialization should not be a problem. And, FWIW, we want to be really fast when streaming...

> Can we see the diff?

I think it's somewhere in this discussion... But it's the pool-based stuff we're talking about.

pcercuei commented Nov 8, 2024

Can you test pcercuei@85e8499?

catkira commented Nov 8, 2024

it immediately goes into a deadlock

pcercuei commented Nov 8, 2024

Ok, this one should work better: pcercuei@524e979

nunojsa commented Nov 8, 2024

Yeah, the question is which thread wakes up first when pthread_cond_broadcast() is called?! Not sure there's a guarantee on that...

catkira commented Nov 8, 2024

The last discard-item code block that I posted is pretty clear after looking at it for a minute, but this code confuses me. First, I would put the next = io->r_next into the for-loop header; or does iiod_responder_signal_io() already modify io->r_next? It does not look like it to me. But besides that, I don't see anything here that actually removes an item from the list. Does the comment mean that iiod_responder_signal_io() should eventually lead to an item being removed?

	/* Discard the entry from the readers list */
	for (io = priv->readers; io; io = next) {
		next = io->r_next;
		iiod_responder_signal_io(io, priv->thrd_err_code);
	}

catkira commented Nov 8, 2024

> Ok, this one should work better: pcercuei@524e979

it occasionally hangs

pcercuei commented Nov 8, 2024

> Yeah, the question is which thread wakes up first when pthread_cond_broadcast() is called?! Not sure there's a guarantee on that...

It absolutely does not matter, does it?
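
The general pattern behind that answer (illustrative C, not the libiio code): with a predicate-based wait, every thread woken by pthread_cond_broadcast() re-checks its own condition under the mutex, so the wake-up order is irrelevant:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool done;

static void wait_for_done(void)
{
    pthread_mutex_lock(&lock);
    while (!done)                       /* re-checked on every wake-up */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

static void signal_done(void)
{
    pthread_mutex_lock(&lock);
    done = true;
    pthread_cond_broadcast(&cond);      /* all waiters wake and re-check */
    pthread_mutex_unlock(&lock);
}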

nunojsa commented Nov 8, 2024

What if the thread waiting on iiod_io_get_response_async() wins against the one in iiod_io_wait_for_response()? Wouldn't it set io->r_done back to false? Or am I seeing ghosts?

pcercuei commented Nov 8, 2024

> What if the thread waiting on iiod_io_get_response_async() wins against the one in iiod_io_wait_for_response()? Wouldn't it set io->r_done back to false? Or am I seeing ghosts?

Yes, then change the check in iiod_io_get_response_async() to if (io->r_done || io->r_enqueued) { ... }

catkira commented Nov 8, 2024

I don't know if it's relevant. The hangs in my app occur when the block read request is waiting (with a timeout) because my SDR is not sending data. In the meantime I am doing attr r/w stuff from another thread. When my SDR is sending data, I never get the hangs, so that probably also has something to do with it. The hangs also seem to be associated with buffer_cancel() calls, which I occasionally do when there is no data from the SDR (because I need to clear the DMA buffers).

Cancelling a buffer that is sitting in the poll() call waiting for data, while at the same time doing attr r/w stuff, seems to trigger the problem.

pcercuei commented Nov 8, 2024

> I don't know if it's relevant. The hangs in my app occur when the block read request is waiting (with a timeout) because my SDR is not sending data. In the meantime I am doing attr r/w stuff from another thread. When my SDR is sending data, I never get the hangs, so that probably also has something to do with it. The hangs also seem to be associated with buffer_cancel() calls, which I occasionally do when there is no data from the SDR (because I need to clear the DMA buffers).

I assume that after your iio_buffer_cancel() call, you free the blocks and destroy the buffer?

The "free block" and "destroy buffer" commands are done on the iiod-responder's default iiod_io, so those will race against an attr r/w command, and that's why it is relevant.

catkira commented Nov 8, 2024

The problem is that unit tests (if there are any) do not cover these abort cases, because there is no SDR source that emulates this behavior, and normal SDRs like the Pluto also don't show it, because they either send data or no data at all. A dummy IIO device that can be configured for different behaviors would be nice to have.

It's super difficult and slow to develop code that the developer cannot immediately test.

nunojsa commented Nov 8, 2024

> Yes, then change the check in iiod_io_get_response_async() to if (io->r_done || io->r_enqueued) { ... }

Yeah, that should do it, but I think your first suggestion of a new busy mutex would likely be easier. The only thing with the mutex approach is that, in theory, an app can mess up in a way where one thread locks the IO busy lock and another one unlocks it, which is far from ideal (e.g. an app using one thread for enqueuing blocks and another thread for dequeuing, as odd as that looks?!).

> A dummy IIO device that can be configured for different behaviors would be nice to have.

There's already a dummy IIO device in the kernel, but TBH I'm not sure what you can do with it. Definitely not emulating high-speed DMA buffers.

mhennerich commented:

I think we should integrate IIO-EMU into libiio.
https://github.com/analogdevicesinc/iio-emu
It might be useful for testing, but it currently only implements the old v0.x interface...

pcercuei commented Nov 8, 2024

> I think we should integrate IIO-EMU into libiio. https://github.com/analogdevicesinc/iio-emu It might be useful for testing, but it currently only implements the old v0.x interface...

It could be re-implemented as a Libiio backend.
