-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Build/test fail on architectures affected by the time_t transition #4876
Comments
That chunk of code is:
Which is interesting because that's the third assertion, meaning that the other two passed. This means that
Digging through headers, it looks like suseconds_t is the same underlying type as time_t:
What does What happens if you apply this patch? All it does is change that function to actually initialise (and check) the full 64-bits:
I don't expect it to fix anything, but if it changes the failure, we might learn something from the way the failure changes |
I don't see anything else in this source file that seems sensitive to the size changing, it mostly uses all its own types and only uses time_t et al to interact with the outside.
I don't see how this could be happening, we only ever use the |
Hi @elliefm, I tried the patch but the test still failed on a armel machine |
Interesting. The assertion is testing that we got some value back from the call other than the one we initialised. So I wonder if perhaps the In that case, I'd suggest patching out that one assertion and see what happens. We won't have sub-second granularity in our time distortion, but maybe that's fine. I think the only set of tests that actually use the time distorting features are in search_expr.testc, which so far aren't even running, because the whole test program crashes out much earlier when one of the backend test's set_up functions fails (due to this assert). |
Hi @elliefm, I can test some other patches if you want |
No, without patch we have the original issue (see build logs)
What can I test now ? |
I didn't mean remove the patch, just remove the failing assert. Here's a new patch, to replace the previous one:
That will stop the tests from crashing out when microseconds aren't available, and then we'll see what starts failing instead... |
Hi @elliefm, it's better but there is another error :
|
That looks promising, it made it a lot further before crashing this time. I see a handful of failures from the tests that didn't crash, but that's not surprising. I'll see if I can see a likely cause of the crash, and if so I'll throw you another patch to try. |
Oh, I finally understand this comment now, I was confused earlier. These are 32-bit platforms, but they will now have a 64-bit I noticed in the build logs a lot of extra warnings, because the build compiles with -O2 rather than -O0, and gcc generally reports a lot more warnings when optimisations are enabled, because it makes a deeper analysis of the code as part of the optimisation. So I'll also do some full builds with optimisations on to reproduce it locally, and see if I can fix the things it warns about. |
Are you able to inject a whole git branch, rather than manually adding individual patches? If you can't inject a git branch, can you inject a github PR? I've created https://github.com/elliefm/cyrus-imapd/tree/v38/debian-time_t-transition to keep track of the work I'm doing here, hopefully it's easy for you to test the changes I make there, but please let me know what you need. :) |
You can discard the assert patch from a few days ago, now that I understand the gettimeofday problem better it's no longer needed. I think the first commit on my branch should fix cunit/timeofday.testc properly |
Hi, I tried this patch but it still fail:
|
I've made a PR to make it easy to produce a single patch that includes all the fixes I'm trying. Can you try this patch please: https://github.com/cyrusimap/cyrus-imapd/pull/4887.patch I'm not sure if it will fix the stack smashing problem or not, but it also adds some extra debug logging so that if the problem still happens, hopefully we can see where. |
Hi @elliefm , here is the new result:
|
Thanks, that gets us somewhere! I've just added some more debug logging. Can you download the patch again from the same link (it should be a bit different this time) and try that? |
Thanks @elliefm . Here are the new logs:
|
Thanks, that's narrowed things down! I can see that the stack smashing seems to be happening somewhere in I'll add a bunch more debug logging, and get back to you when I have another patch for you to try. It might take me a couple of days to get back to you, I need to figure out how to usefully log these libical objects! 😅 |
I'm assuming libical already builds fine and passes all its tests on these platforms? If libical itself is broken there's not much point debugging Cyrus until it's fixed. Hmm, this looks like a build log from libical on armel, showing some test failures related to timezones. I don't know if this is the correct build log for the libical being used by Cyrus here, or if it's unrelated. What I don't see in that build log, though, is the stack being smashed, so I can't trivially blame the stack smashing on libical. I suppose libical might be failing and returning some error flag, and we're not handling it properly and smashing the stack ourselves as a side effect. |
Hi, you are looking for an old build log. The package is now named libical3 and passed : https://buildd.debian.org/status/package.php?p=libical3 |
Thanks and no problem to wait: Debian 13 will be released around 2025 😉 |
...
I hope that's a joke. Unless this is solved in the next few weeks, it's going to start impacting the overall t64 transition in Debian. |
@guimard Can you fetch a fresh copy of the patch from the same link as before, and try that please? I've added a bunch of debug logging to I've also made a small-but-real change that might fix an edge case arising from compiling with optimisations enabled on ARM. I'm crossing my fingers that this change fixes the stack smashing problem, but if not, hopefully the extra debug logging at least narrows the search! Thanks :) |
Let's go 😉 |
I used last commits from v38/debian-time_t-transition, and get the following errors. (I included also warnings about time_t)
|
Thanks. Turns out I forgot to push my latest changes last time, whoops! I've just done that now. Can you please fetch a fresh copy and go again? |
Hi, here are the new errors:
|
Please keep Debian-only comments inside bugs.debian.org. And there is no such issue since cyrus-imapd was automatically removed from testing and will return in it only after having being fixed |
Hmm. Looks like the edge case fix didn't help, but at least the additional logging has narrowed it down a bit more. I've added some more logging yet again, fingers crossed that maybe this time we might find the culprit! Can you grab a fresh copy of https://github.com/cyrusimap/cyrus-imapd/pull/4887.patch to replace what you already had and try again? Thanks :) |
Hi, here is the result:
|
Thanks! |
That's fascinating, we hit "about to call utc_timespan_cb" but didn't hit "utc_timespan_cb: entered". That means it's either something wrong with icalcomponent_foreach_recurrence, or something wrong with how we're calling it. I've added some more logging again, including pulling some function parameters out into local variables before the call. I don't know if that'll make any difference, but let's see what happens and what it logs... Can you re-fetch https://github.com/cyrusimap/cyrus-imapd/pull/4887.patch and try again? Thanks :) |
Thanks, here are the new logs:
|
Well then! At this point I'm pretty sure the stack smashing is happening in In the meantime, I've made one more change, this time it just disables the bit we know is broken. I don't think that's a solution, but it'll let the tests keep running beyond this point, so we can see what else is failing. Can you fetch a fresh copy of https://github.com/cyrusimap/cyrus-imapd/pull/4887.patch and retry please? Thanks :) |
Hi, |
Here is the result:
|
Thanks! I've just pushed another patch that disables "two bounded rrules" too. Can you try that please :) |
It's running. So it confirms that the bug is in libical ? |
Build and test succeed with your last patch:
|
I think it confirms that the stack smashing is occurring inside libical. It's not yet known whether that's because of a bug in libical, or a bug in how we're calling it. I guess it could also be an architectural limitation -- perhaps this platform doesn't provide enough stack space for libical to work with, or something. I don't know much about the gcc stack protector functionality, so I don't know if simply running out of stack space would cause this particular symptom. My next step will be putting together a regression test for libical that runs as close to the same code as possible as these cyrus tests, and providing that as a patch against libical, so we can try to reproduce it there. If we can, hopefully we can then add enough diagnostic information to the test to isolate the problem, like we've been doing here. |
Great! At least now we know there's no other obvious problems on this platform. :) Is the package build system able to run Cassandane? |
Hi, I think I provided all Cyrus's features in my package. So I think the response is |
@elliefm, what can I do to help here ? |
Hi @elliefm, do you plan to work on this issue ? |
I haven't made any progress in figuring out how to patch libical to keep investigating the stack smashing on armel/armhf, and I'm probably not going to get around to that. From what I can tell, these architectures are mostly used by mobile phones, old Raspberry Pis, etc. There's no good reason to run a real IMAP server on those, so it's not a priority for us. It would be fine with us if Debian just didn't package Cyrus-IMAP for those architectures. There was a few good commits in my draft PR, and I would like to clean that up and get the good bits landed. I hope to do that soon, cause it's on my list and I would like to cross it off! |
Hi, Cyrus-Imapd is going to be expelled from Debian 13 due to this issue. Is there any progress ? |
Nothing new, all my earlier fixes landed on master months ago, and will be in the next set of point releases. From what I could tell at the time, the last remaining problem seemed to be something wrong with libical, but I didn't know how to patch that with debug logging to investigate further, and it didn't seem important -- it does not make any sense to run an email server on a phone. I thought that cyrus-imapd would be not packaged for the affected platforms, but not removed entirely. I've raised this with the team to discuss further. Will libical be in Debian 13? |
From Debian #1068223.
I can launch 3.8.2 build on such architecture if you give me a patch to test.
Best regards,
The text was updated successfully, but these errors were encountered: