Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in ILC in the CI #110836

Closed
MichalStrehovsky opened this issue Dec 19, 2024 · 12 comments · Fixed by #111011
Closed

Crash in ILC in the CI #110836

MichalStrehovsky opened this issue Dec 19, 2024 · 12 comments · Fixed by #111011
Labels
area-NativeAOT-coreclr in-pr There is an active PR which will close this issue when it is merged

Comments

@MichalStrehovsky
Copy link
Member

We've been seeing ILC crashes in the CI for a while. This may or may not be related to #109800.

Symptom of this one is just "exited with code 57005". E.g. here: https://dev.azure.com/dnceng-public/public/_build/results?buildId=898321&view=logs&j=6a7e26fa-36e7-5a45-28af-dc6c8e6724e6&t=089ef86b-599a-543a-2c8c-82b31601e3ec

I left the compilation of Microsoft.Extensions.FileProviders.Composite.Tests (-rc Checked -lc Release) running in a loop overnight and out of 2000 iterations, I got 5 crashes with dumps. So at least I can repro it. All the dumps are from the scanning phase so we could possibly speed it up further by exiting after scanner is done (not going to run into the bug after).

I tried to make sense of the dump, but I'm at loss, I'm not good with GC and this is some sort of corruption.

Crash is here because MethodTable was null:

>	[Inline Frame] ilc.exe!MethodTable::HasComponentSize() Line 228	C++
 	[Inline Frame] ilc.exe!SVR::my_get_size(Object *) Line 11673	C++
 	ilc.exe!SVR::gc_heap::mark_object_simple(unsigned char * * po, int thread) Line 28160	C++
 	[Inline Frame] ilc.exe!SVR::gc_heap::mark_through_cards_helper(unsigned char * *) Line 41722	C++
 	ilc.exe!SVR::gc_heap::mark_through_cards_for_uoh_objects(void(SVR::gc_heap::*)(unsigned char * *, int) fn, int gen_num, int relocating, SVR::gc_heap * hpt) Line 47244	C++
 	ilc.exe!SVR::gc_heap::mark_phase(int condemned_gen_number) Line 30150	C++
 	ilc.exe!SVR::gc_heap::gc1() Line 22500	C++
 	ilc.exe!SVR::gc_heap::garbage_collect(int n) Line 24679	C++
 	ilc.exe!SVR::gc_heap::gc_thread_function() Line 7293	C++
 	ilc.exe!SVR::gc_heap::gc_thread_stub(void * arg) Line 37778	C++
 	ilc.exe!CreateNonSuspendableThread::__l2::<lambda>(void * argument) Line 592	C++

We read the MethodTable out of a presumed object at 0x00000235a567a7f8, however the bytes are all zeros (except for the first byte that is 1, presumably we just marked it), so MethodTable is null.

Searching through the memory for references to this address finds a couple hits:

0:005> s -q 0x00000000 L?0xffffffffffffffff 235A567A7F8
000000fd`136fe3b8  00000235`a567a7f8 00000000`11ad8690
000000fd`136fe608  00000235`a567a7f8 00000000`11ad8690
000000fd`136fe728  00000235`a567a7f8 00000235`b0d1fab8
00000235`847a48f8  00000235`a567a7f8 00000000`00000000
00000235`a5400068  00000235`a567a7f8 00000000`00000000
00000235`a54000f8  00000235`a567a7f8 00000235`a50e4950

Since g_lowest_address is 0x0000023584c00000, the first 3 hits are not in GC range and I'm going to ignore them.

I don't know what's the reference at 847a48f8, so I'm going to ignore it for now (there's tons of other GC-like references around it, but no MT pointer, likely some queue within the GC - do they get put in the heap range?).
Reference at a5400068 is the _source field on a XNodeNavigator instance. The object is marked. The _parent field is null and _nameTable points to some bogus value in the middle of another object. The object looks to be regurgitated.

Looking for who references it:

0:005> s -q 0x00000000 L?0xffffffffffffffff 235A5400060
00000235`847a48f0  00000235`a5400060 00000235`a567a7f8
00000235`a5400038  00000235`a5400060 00007ff7`6c1dfef8

The first hit is in the 847a4 range again, going to ignore it. The second is from a XPathChildIterator instance. The name field is intact and says "assembly". The nav field points to the XNodeNavigator.

Looking further for the references to the XPathChildIterator:

0:005> s -q 0x00000000 L?0xffffffffffffffff 235A5400028
00000235`847a48c0  00000235`a5400028 00000235`a57e51d8
00000235`84838a30  00000235`a5400028 00000275`98914bd8
00000235`a50e9a10  00000235`a5400028 00000000`00000000
00000275`989149d0  00000235`a5400028 00000000`00000000
00000275`98914a00  00000235`a5400028 00000235`a5400028
00000275`98914a08  00000235`a5400028 00000000`00000000

235A50E9A08 is a XPathNodeIterator.Enumerator, it is however not marked. The 275* addresses are out of heap range.

This is basically all I have. I don't know how the GC was made to look at this dead object.

The XPath references make it sound like it could be related to #108743 which was another mystery bug.

I can share the dumps I have or would appreciate any advice on how to root cause this further. I heard of "stress log" before, not sure if that could be helpful here (and if/how that works on native AOT).

Cc @VSadov

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Dec 19, 2024
Copy link
Contributor

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

@jkotas
Copy link
Member

jkotas commented Dec 19, 2024

#45557 (comment) are my go-to stresslog settings to diagnose crashes like this one. Could you please try to reproduce it with this setting? I can take a look at a crash dump with stresslog.

The typical cause of crash like this is that a live object reference was not reported during recent GC, the GC collected the dead objects, and these dead objects became live again a subsequent GC. Stresslog should have enough breadcrumbs to figure out what happened.

@MichalStrehovsky
Copy link
Member Author

#45557 (comment) are my go-to stresslog settings to diagnose crashes like this one. Could you please try to reproduce it with this setting? I can take a look at a crash dump with stresslog.

Thank you! Got a dump with those stresslog settings (had to split it in 2 files to keep it under Github's size limits):

ilc.exe.26652.zip
symbols.zip

@jkotas
Copy link
Member

jkotas commented Dec 20, 2024

I am sorry - the stresslog settings for naot have to be slightly different - the LogLevel config is named StressLogLevel in NAOT for some reason. Could you please try again?

set DOTNET_StressLog=1
set DOTNET_StressLogLevel=6
set DOTNET_StressLogSize=2000000
set DOTNET_TotalStressLogSize=40000000

@MichalStrehovsky
Copy link
Member Author

I am sorry - the stresslog settings for naot have to be slightly different

Got one with those settings. The size is the same though, I hope the settings did kick in this time. If not, I can repro this much faster since I'm short-cirtcuiting this to only run scanner and exit. (Symbols are the same as above.)

ilc.exe.85208.zip

@jkotas
Copy link
Member

jkotas commented Dec 21, 2024

This is basically all I have. I don't know how the GC was made to look at this dead object.

This is Gen0 GC and this object is not in Gen0. For Gen0 marking, all pointers from other generations into Gen0 are treated as roots.

This marking is done in mark_through_cards_helper. mark_through_cards_helper goes through all chunks of older generations that may contain pointers into the younger generation and marks the objects in the younger generation. It does not mark the object in the older generation that contain the pointer.

@jkotas
Copy link
Member

jkotas commented Dec 21, 2024

The size is the same though, I hope the settings did kick in this time.

It did kick in. I can tell that the dump has the good stuff now. Unfortunately, SOS has troubles dumping it: dotnet/diagnostics#5125

@jkotas
Copy link
Member

jkotas commented Dec 21, 2024

Here is the stresslog dump (with a hacky workaround for the SOS issue): StressLog.txt

@jkotas
Copy link
Member

jkotas commented Dec 24, 2024

This is the part that looks wrong:

15a10   0.051825800 : `GC`JIT`                 IGCHeap::Promote: Promote GC Root *000000BC0637E4A0 = 0000028F86680038 MT = 00007FF615749688(MethodTable: 00007FF615749688) // ilc!S_P_Xml_System_Xml_XPath_XPathNodeIterator_Enumerator::`vftable'

15a10   0.051825700 : `GC`JIT`                 IGCHeap::Promote: Promote GC Root *000000BC0637E498 = 0000028F86676E18 MT = 00007FF61570BCB8(MethodTable: 00007FF61570BCB8) // ilc!ILCompiler_Compiler_ILCompiler_UsageBasedMetadataManager_LinkAttributesReader::`vftable'

15a10   0.051824100 : `JIT`                Scanning method 00007FF61494BADF (ilc!ILCompiler_Compiler_ILCompiler_ProcessLinkerXmlBase__ProcessAssemblies+0xbf)

We stopped for GC in ProcessLinkerXmlBase.ProcessAssemblies and reported two GC roots: XPathNodeIterator.Enumerator and UsageBasedMetadataManager.LinkAttributesReader.

Looking at the disassembly of ProcessAssemblies:

ilc!ILCompiler_Compiler_ILCompiler_ProcessLinkerXmlBase__ProcessAssemblies:
00007ff6`1494ba20 55              push    rbp
00007ff6`1494ba21 4157            push    r15
00007ff6`1494ba23 4156            push    r14
00007ff6`1494ba25 4155            push    r13
00007ff6`1494ba27 57              push    rdi
00007ff6`1494ba28 56              push    rsi
00007ff6`1494ba29 53              push    rbx
00007ff6`1494ba2a 4881ecf0020000  sub     rsp,2F0h
00007ff6`1494ba31 488dac2420030000 lea     rbp,[rsp+320h]
00007ff6`1494ba39 0f57e4          xorps   xmm4,xmm4
00007ff6`1494ba3c 0f29a520fdffff  movaps  xmmword ptr [rbp-2E0h],xmm4
00007ff6`1494ba43 48b860fdffffffffffff mov rax,0FFFFFFFFFFFFFD60h
00007ff6`1494ba4d 0f296405d0      movaps  xmmword ptr [rbp+rax-30h],xmm4
00007ff6`1494ba52 0f296405e0      movaps  xmmword ptr [rbp+rax-20h],xmm4
00007ff6`1494ba57 0f296405f0      movaps  xmmword ptr [rbp+rax-10h],xmm4
00007ff6`1494ba5c 4883c030        add     rax,30h
00007ff6`1494ba60 75eb            jne     ilc!ILCompiler_Compiler_ILCompiler_ProcessLinkerXmlBase__ProcessAssemblies+0x2d (00007ff6`1494ba4d)
00007ff6`1494ba62 488bd9          mov     rbx,rcx
00007ff6`1494ba65 488bf2          mov     rsi,rdx
00007ff6`1494ba68 403836          cmp     byte ptr [rsi],sil
00007ff6`1494ba6b 488d0de6dddf00  lea     rcx,[ilc!S_P_Xml_Linq_System_Xml_XPath_XNodeNavigator::`vftable' (00007ff6`15749858)]
00007ff6`1494ba72 e819f86900      call    ilc!RhpNewFast (00007ff6`14feb290)
00007ff6`1494ba77 488bf8          mov     rdi,rax
00007ff6`1494ba7a 488b5608        mov     rdx,qword ptr [rsi+8]
00007ff6`1494ba7e 488d4f08        lea     rcx,[rdi+8]
00007ff6`1494ba82 e819fa6900      call    ilc!RhpAssignRefAVLocation (00007ff6`14feb4a0)
00007ff6`1494ba87 488b5610        mov     rdx,qword ptr [rsi+10h]
00007ff6`1494ba8b 488d4f10        lea     rcx,[rdi+10h]
00007ff6`1494ba8f e80cfa6900      call    ilc!RhpAssignRefAVLocation (00007ff6`14feb4a0)
00007ff6`1494ba94 488b5618        mov     rdx,qword ptr [rsi+18h]
00007ff6`1494ba98 488d4f18        lea     rcx,[rdi+18h]
00007ff6`1494ba9c e8fff96900      call    ilc!RhpAssignRefAVLocation (00007ff6`14feb4a0)
00007ff6`1494baa1 4889bd40fdffff  mov     qword ptr [rbp-2C0h],rdi
00007ff6`1494baa8 488d0d8944d300  lea     rcx,[ilc!_Str_assembly (00007ff6`1567ff38)]
00007ff6`1494baaf 48898d48fdffff  mov     qword ptr [rbp-2B8h],rcx
00007ff6`1494bab6 488d0ddb71cc00  lea     rcx,[ilc!_Str_ (00007ff6`15612c98)]
00007ff6`1494babd 48898d50fdffff  mov     qword ptr [rbp-2B0h],rcx
00007ff6`1494bac4 488d0dbddbdf00  lea     rcx,[ilc!S_P_Xml_System_Xml_XPath_XPathNodeIterator_Enumerator::`vftable' (00007ff6`15749688)]
00007ff6`1494bacb e8c0f76900      call    ilc!RhpNewFast (00007ff6`14feb290) <----------------
....

We should have also seen a GC root for XPath.XNodeNavigator that is stored at [rbp-2C0h].

@jkotas
Copy link
Member

jkotas commented Dec 24, 2024

Here is jitdump.txt for ProcessAssemblies: jitdump.txt . (This jitdump was not generated with the exact same of binaries as the crashdump above, so there are some differences between the code in the crash dump and the code in the jit dump.)

@jkotas
Copy link
Member

jkotas commented Dec 24, 2024

We seem to be failing to report rbp-0x2B8 after a live object reference was stored into it here:

IN000e: 000085 mov      gword ptr [rbp-0x2B8], rdi

rbp-0x2B8 is part of stack allocated ref class temp V80. I do not see it being reported in GC info. There seems to be a problem with GC reporting for stack allocated ref class temps.

MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this issue Dec 31, 2024
Fixes dotnet#110836.

When we extended managed CorInfoImpl to support object stack allocation in dotnet#104411, there was one more spot that assumed valuetypes only in `GatherClassGCLayout` that we missed. This resulted in not reporting any GC pointers in base types.
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Dec 31, 2024
@MichalStrehovsky
Copy link
Member Author

Thank you @jkotas! I think I was able to follow what you did. I don't think I would be able to figure this out without your help. I was able to come up with a fix though.

@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Jan 2, 2025
github-actions bot pushed a commit that referenced this issue Jan 2, 2025
Fixes #110836.

When we extended managed CorInfoImpl to support object stack allocation in #104411, there was one more spot that assumed valuetypes only in `GatherClassGCLayout` that we missed. This resulted in not reporting any GC pointers in base types.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-NativeAOT-coreclr in-pr There is an active PR which will close this issue when it is merged
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

2 participants