-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash during navitia::type::Data::clone_from when applying disruption #3113
Comments
Hi @eturck, Sorry to hear about all the crashes, it's quite embarrassing to be honest. By having a look at the call stack you provided, it reminds me of some memory corruption problem we've had in the past, working on some optimisations in georef. We were messing up the data but could only see the problem when the data was used during serialization. The root cause and the actual crash were completely unrelated... This brings me to a few questions:
Funny enough, when we last saw that type of crash on our platform, we had to bisect on our production servers to root cause the issue.... Not a glorious time at all :( Sorry I couldn't help you much, but I hope this could give you a few leads. |
I've asked the team to retrieve the version that had the issue I mentioned... Hopefully this might be 2.87 🤞 |
We have had a random crash on our prod caused by a non initialized int in A*. here is the PR which fixed it: #2846 Note that, this PR has been merged into v2.82. As @nonifier has suggested, if you're short of ideas about what's happening in the code, you may have to bisect to find which the very PR caused the crash. Let's us know |
Sorry for the delay I was away the last few days.
@xlqian since this was merged in v2.82 we have this fix so it is something else :(. I guess we are going to need to resort to bisect.
Our main idea is trying to try to launch gatling or something like this and updating disruptions at the same time to see if we can make this happen semi-reliably to make a git bisect worthwhile. |
I'd suggest plug a local memory-sanitized version of navitia on your prod disruption-queue, and request it a bit on the side... |
Thanks @pbougue for the suggestion. I don't know much about memory sanitizers but I'll try it out and see if I can find something. |
Welcome, that's a grouped effort here :) |
Well, I finally dived into Sanitizers, but I'm not sure of if I found something or not... The potentially good news is that I managed to have 3 separate crashes while running Kraken compiled with AdressSanitizer during cloning of the data. The sad part is I managed to do it only twice in the span of half an hour at the start, but not once in the five or six hours after that trying... This morning I manage to do it a third time (also at the start of the day), not once since either. I don't know if I did something wrong, but there is basically no informations whatsoever though... Here is the gist of it: https://gist.github.com/eturck/2ee7c7107f9eee98c362f7700f5ef5ca I fear that I did something wrong since @pbougue warned me about potentially lots of noise. I also ran ThreadSanitizer since number of requests seemed to play a role in our production crash.
I guess it could be related? But I fear we are way (way) passed my knowledges of multithreaded applications... I'll be on Element in the afternoon to talk about all of this. Thank you for your help. |
Hi @eturck ! Thanks for you detailed investigation ! Some thoughts about this : Some differences between the stackoverflow answer and navitia code :
Reading the Navitia code, I don't really understand why it works :man (or I read it wrongly). For me :
So for me, the main thread may run the reading operation to completion before the write thread has finished writing, which I think could result in the data races found by ThreadSanitizer. Maybe you can try to modify navitia to bring it more in-line with the stackoverflow answer :
Or even completely reproduce the stackoverflow answer with 3 threads. |
Thanks for the feedback, I really appreciate the time you all took to take a look at this. I can't say I really understand the code either. I had the same interrogations on the order of execution. I saw that @TeXitoi was the one responsible for this https://github.com/CanalTP/navitia/pull/632/files#diff-930d109fe1ca9c60e82996bda4dd2f8aR538 :), If he ever stumble upon this message and want to shed some light on those lines... I was really hopeful with this data race but I unfortunately wasn't able to reproduce it a second time... I had a few others more or less related to data cloning, sometimes there was tens of data race on some totally unrelated parts too. Lots of noise but not once again this one. I spend three days trying a few things to have a way to at least semi reliably reproduce this crash or data race. Even if it happened once every hour it would be frequent enough to maybe draw a conclusion on a fix, but right now there is just nothing and I really have no clear path to investigate more. My current plan is:
|
To check if the data race comes from the this data clone trick, maybe it is possible to assign a low priority on the write thread, so that this write thread will be scheduled after the read in the main ? |
Hey, @pascal-vs suggested an easy way to test whether the data clone trick is responsible for the crashes : add a sleep() in the write thread before the serializing to the archive. This should ensure that the read occurs before the write, and thus trigger some segfault. |
Another update, probably the last one. But mostly positive ;). We started by re-enabling disruption on 1.62, there was still some crashes even with the hardware changes. We tried looking for something in boost issue tracker or mailing list with @pbench but nothing was resembling this crash, I really don't know what could cause it. If you want to close this issue feel free to do it. There will probably be no changes on our side for quite a while on this. Thank you everyone for your ideas and your help! |
Hello everyone.
We have been using disruptions from chaos in production for a while.
Our current install of navitia consists of, on the same server :
Until this week disruptions were only plugged on kraken-iv. We sometimes had a crash but it was not frequent (once per month). We talked about it with @kinnou02 once or twice but never found a solution.
Yesterday, we plugged disruptions on kraken-prod and it quickly went wrong. We had one crash yesterday and 5 more between 7:17am and 8;50am today (it's peak hour for us if it has any impacts). However, the same exact disruption has been applied to kraken-iv at the exact same time, on the exact same server, and no crash on kraken-iv, probably excluding a problem on the disruption itself.
If you want the full stacktrace and syslog output : https://gist.github.com/eturck/afd981ed550432f53aaf76a8b9eb0df5
What I see :
tcmalloc: large alloc 2314885530818461696
. I don't know how we are trying to allocate that much RAM really :). The only thing I saw about this value is that in hex it's 0x2020202020204000. It's always this exact same value when kraken crashes. I don't know if we are really trying to allocate this much RAM or it's just a display glitch.What I know :
I am not super hopeful that you can help me on this issue. But let me know if you have any ideas or something similar on kraken or another project already happened to you and you have some inputs to share.
Thank you in advance.
The text was updated successfully, but these errors were encountered: