wasm: out of memory with lots of HeapIdle #3237
I'm not surprised the C heap and the TinyGo garbage collector aren't playing nice together. Getting some more debug info is definitely the way to go. If the crash is easily reproducible, we should think about what other info we might want to get out of the TinyGo allocator that would make this easier to hunt down.
Thanks @dgryski - just to clarify, malloc/free for C are still serviced by TinyGo (https://github.com/tinygo-org/tinygo/blob/release/src/runtime/arch_tinygowasm.go#L97). Would we expect this to also not play nice? I suppose it could have been Go code too - but perhaps a pool of references in a map is a pattern that we wouldn't expect in Go code and could be triggering some bad behavior.

Happy to run with custom builds to see what we can find, and I'll poke around to see if I can find something to instrument myself too. If anyone is interested, it's relatively simple to run our test with tinygo, go, and docker installed (tinygo is invoked from PATH, so a custom TinyGo build would be used as-is).
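For context, the exported malloc/free there are roughly shaped like the sketch below (a paraphrase rather than the exact runtime source, and the package name is made up): every C allocation is a Go slice pinned in a map, so C's heap effectively lives inside the GC-managed heap.

```go
// Package walloc is a hypothetical stand-in for the wasm libc shims in the
// TinyGo runtime; only the general shape is meant to be accurate.
package walloc

import "unsafe"

// Each C allocation is backed by a Go slice; keeping the slice in this map
// pins it so the garbage collector never reclaims memory C still holds.
var allocs = make(map[uintptr][]byte)

//export malloc
func libcMalloc(size uintptr) unsafe.Pointer {
	if size == 0 {
		return nil
	}
	buf := make([]byte, size) // allocated on the TinyGo GC heap
	ptr := unsafe.Pointer(&buf[0])
	allocs[uintptr(ptr)] = buf
	return ptr
}

//export free
func libcFree(ptr unsafe.Pointer) {
	if ptr == nil {
		return
	}
	// Dropping the map entry makes the backing slice collectable again.
	delete(allocs, uintptr(ptr))
}
```

So every C malloc both grows the map and allocates through the GC, which is why a C-heavy workload can still stress the Go collector.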
It'll just print those memstat lines. It's obvious when the GC is pausing, as most requests take milliseconds while the paused requests take about a second, with live objects not really trending up. The pausing becomes more frequent until finally the heap grows - then the pauses are less frequent until they become frequent again. Feels like heap fragmentation, but I'm not sure without more investigation. The build page also has an …
It's hard to say what's going on without more information. That said, the GC is not at all optimized, and a 1 second pause time doesn't surprise me at all with such a large heap.
I suspect Envoy may be preventing the heap from growing further.

For improving performance, I first tried an approach replacing the current map-based tracking of C allocations, and then went ahead and made the bigger change of having the Go GC use a fixed-size heap allocated by mimalloc and letting mimalloc own the wasm heap (I happened to use mimalloc for having a self-contained malloc that is easier to tweak, but I will also try enabling dlmalloc in TinyGo's wasi-libc and see if it performs similarly). Even with a 128MB fixed-size heap for Go, our benchmark runs to completion, and much faster: fewer pauses, and they stay around ~100ms throughout rather than the ~1s we see for a comparably sized heap with normal TinyGo. I suspect some work could be done to allow multiple arenas so the heap doesn't have to be fixed-size, which I believe is how the upstream Go GC uses system-allocated buffers. The changes are here.

Notably, I doubt there is interest in actually adding this GC to TinyGo, so I am going to continue looking into how it could be plugged in by the app. I tried overriding the runtime's functions via linkname; I'm not too sure the linkname approach is even supposed to work, but while I could get the TinyGo compilation to succeed with this code, the result was an invalid wasm file. I'll look more into whether this can be made pluggable more cleanly, but if anyone happens to have suggestions it would be great to hear them.

One big issue I saw is with https://github.com/tinygo-org/tinygo/blob/release/src/runtime/gc_none.go#L38 - I wonder if this and maybe all the other functions could also be kept as stubs to allow overriding them externally.
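To illustrate what I mean by the linkname approach, the sketch below is roughly the shape of it. It is entirely hypothetical: the runtime symbol names and signatures are assumptions that may not match any particular TinyGo version, and extAlloc/extFree are made-up placeholders, not real mimalloc bindings.

```go
// Package customgc sketches an application-provided allocator/GC that is
// wired over the runtime's entry points via go:linkname.
package customgc

import "unsafe"

//go:linkname initHeap runtime.initHeap
func initHeap() {
	// an external allocator such as mimalloc needs no Go-side heap setup
}

//go:linkname alloc runtime.alloc
func alloc(size uintptr, layout unsafe.Pointer) unsafe.Pointer {
	// delegate to the external allocator; a real GC would also record the
	// allocation so a later collection can find and reclaim it
	return extAlloc(size)
}

//go:linkname free runtime.free
func free(ptr unsafe.Pointer) {
	extFree(ptr)
}

//go:linkname markRoots runtime.markRoots
func markRoots(start, end uintptr) {
	// scan [start, end) for pointers into the externally managed heap
}

// extAlloc and extFree stand in for calls into the external allocator
// (for example mi_malloc/mi_free); they are placeholders, not real APIs.
func extAlloc(size uintptr) unsafe.Pointer { return nil }
func extFree(ptr unsafe.Pointer)           {}
```

The appeal is that none of this needs to live in the TinyGo repository; the stubs in the runtime would just need to stay overridable.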
I think I had some confusion with … This would be a huge unblocker for us (and maybe for TinyGo too, e.g. #3162 (review) :) ). Hopefully the change makes sense.
Did we get any further investigating this? Can we place the blame on the GC, or elsewhere?
Swapping in bdwgc removed the problem - not definitive, but I guess it would most likely put the blame on the GC. I haven't tried the new precise GC yet.
@anuraaga - I've got a TinyGo wasi program that is running out of memory on Spin Cloud, and I was wondering if you have an example of how you swapped out the garbage collector when compiling TinyGo to wasm?
Hi @gmlewis - I had created the nottinygc library to integrate them, but I found there are still too many cases of memory leaks to maintain it as a stable library. I described some of the issues I had in detail here: tetratelabs/proxy-wasm-go-sdk#450 (comment). You could still give the package a try just to see how it works for you. I think many of the issues are solvable with a large amount of drive, so if anyone is interested, maybe they can resurrect the approach with the hints in that issue / codebase.
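If you do want to experiment with it, the wiring is roughly the following (a sketch only - the build flags are quoted from memory and may differ by TinyGo version, so check the nottinygc README for the exact invocation):

```go
package main

import (
	// Importing the package is the only Go-side change; it takes effect when
	// the binary is built with TinyGo's custom GC, roughly:
	//   tinygo build -o main.wasm -target=wasi -gc=custom -tags=custommalloc .
	_ "github.com/wasilibs/nottinygc"
)

func main() {
	// application code is unchanged; allocations are now served by the
	// replacement collector instead of TinyGo's built-in one
}
```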
@dgryski - I'm running on a Mac M2 Max and followed the BUILDING.md steps (with your 1-file change), but stopped at the … step. Now, when I build my Spin app, I get:

error: could not find wasi-libc, perhaps you need to run `make wasi-libc`?

I'll try again on a Linux Mint 21.3 Cinnamon box to see if I can get it to work there. Thank you!
If you're using the leaking gc, then neither my change nor using … will make any difference.
Oh, sorry, that's the default that …
@gmlewis Yes, that's correct. That will switch from the leaking collector to the default conservative one.
@dgryski - I built your version both on Linux and on Mac... it turns out that all I really needed to do was indeed follow the directions and type "make wasi-libc" before building tinygo. I tried both …; with them, the wasm endpoint now hangs, whereas previously it would respond in under one second.
@gmlewis If you have a solid reproducer (hang vs. 1 second response), and you could bisect the git commits and figure out which commit caused the regression, that would be a huge help. Is the same program available somewhere, or is it proprietary?
It's not open source, but I can try to bisect the git commits or alternatively try to make a simpler test case.
Any luck getting a smaller reproducer you can share?
No, sorry. I switched this wasm app to MoonBit and the problem went away.
I am debugging an OOM panic with a wasm binary and want to check if this is a fact of life. The binary is handling HTTP requests in Envoy:

https://github.com/corazawaf/coraza-proxy-wasm

After a while, it crashes with that OOM panic, from a C++ function that's calling malloc which I have linked into the binary (recall the fix to allow malloc to not be corrupted for polyglot).
I print memstats after each request, and they look like the below. There is no upward trend in the number of live objects, so I don't think we have a memory leak. HeapIdle remains high throughout the program - is this because the memory gets fragmented, and something we can't improve? Or is there by any chance a possible bug?

I'm not too sure why it panics even though there's still plenty of headroom from this 512MB heap up to the theoretical 4GB max; this may be an Envoy quirk. But either way I'd expect 512MB to be enough, especially given the amount of idle heap. So I'm wondering if there is a way to debug what could cause this and perhaps reduce heap fragmentation somehow. Thanks!
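(For reference, the per-request instrumentation above is just a runtime.ReadMemStats call after each request; a minimal sketch is below. The field names assume TinyGo's MemStats exposes these upstream-Go fields, which may vary by version.)

```go
package main

import (
	"fmt"
	"runtime"
)

// printMemStats is called after each request. Mallocs-Frees approximates the
// number of live objects, while a persistently large HeapIdle alongside a
// growing HeapSys points at fragmentation rather than a leak.
func printMemStats() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("sys=%d heapSys=%d heapIdle=%d heapInuse=%d live=%d\n",
		ms.Sys, ms.HeapSys, ms.HeapIdle, ms.HeapInuse, ms.Mallocs-ms.Frees)
}

func main() {
	printMemStats()
}
```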