matchedNodeElmReader alloc crash #338
I seem to be crashing when trying to stream in an 8,437,865,894-element, 1.9B-node mesh. It is a mix of wedges and tets. I was trying to stream it in to 160 Broadwell nodes, each running 4 processes, and I think the trouble started with what I told PBS here:
If I am reading the output right, the frame
MPT: #13 0x000000000047e733 in apf::setCoords (m=m@entry=0x3312d300,
is trying to handle 30M verts, which I would not expect to be a problem in terms of memory usage, so I think I am hitting an index size issue.
What branch/commit were these tests using?
Discussion notes:
MGEN_write3D (c1d05c1)
Is there a quick way to convert from long to int? As you feared, there is propagation; the fallout reaches as far as [ 38%] Building CXX object mds/CMakeFiles/mds.dir/apfBox.cc.o
Note I also had to create a PCU_Max_Long, which I hopefully replicated correctly from PCU_Max_Int.
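For reference, a minimal sketch of the kind of long max reduction being described, written against plain MPI rather than PCU internals (the real PCU_Max_Long presumably mirrors PCU_Max_Int and routes through PCU's own collective machinery; the name below is illustrative):

```c++
#include <mpi.h>

/* Hedged sketch: a max reduction over long, analogous to PCU_Max_Int.
 * PCU's actual implementation uses its own collectives; MPI_Allreduce
 * is used here only to show the intent. */
long Max_Long_sketch(long x)
{
  long max = x;
  MPI_Allreduce(MPI_IN_PLACE, &max, 1, MPI_LONG, MPI_MAX, MPI_COMM_WORLD);
  return max;
}
```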
I got around the casting issue above but am hitting new issues.
Where do you define the GID typedef? The error makes it look like it's defined in an anonymous namespace.
It is/was in apf here: Line 32 in fe15d56
I'm hacking at this now.
Reproducer: it is not as small as I would like, but here is a path to the case on the viz nodes for you to grab (note it grabs inputs from one directory above in the run line, so it is safest to just grab the dir above where I am running it).
Case DIR /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-12-30/MGEN4mner_noIDX/mner
Note though that the convert takes 2 minutes (with -g).
I decided to try serial. The long-mod code makes it through construct in serial but then segfaults at line 138 in mds_apf.c, on return gmi_find(m->user_model, ....). What TotalView showed me for the mesh (m) was pretty messed up, so it's not REALLY getting through construct cleanly. I have a new run stopping just after construct to check that, but it is probably time to build with a memory sanitizer. Update: it got to my break just before delete [] m.elements on line 826 of matchedNodeElemReader. m.elements are junk, as the first entry is 140007343915089 according to TotalView. I will move my search to where this is assigned to see what is going wrong.
It looks like line 705 is a problem: gmi_fscanf(f, 1, "%u", elmVtx+j); does not put a long into elmVtx according to TotalView, which does seem to know that elmVtx is apf::Gid[6]. Here is a screenshot. I suppose there is a different format to read long ints? I will try to dig and find it, but if someone can share it I will see if that is the last problem.
Found it. Changed %u to %ld and it seems to be working on the small case with 8 processes. Still crashing on NAS for the big case in the same place, though. What does C do when q = t/p and t is a long, p is an int, and q is an int? Does it automatically cast? This is what we are doing in setCoords where it is crashing.
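A minimal sketch of the fix just described, assuming the reader's gmi_fscanf wrapper forwards its format string to the C scanf family (the helper readElmVtx and its loop are illustrative, not the actual matchedNodeElmReader code):

```c++
#include <cstdio>

typedef long Gid; /* matches apf::Gid on this branch */

/* Reading a 64-bit vertex id needs %ld; %u only converts a 32-bit
 * unsigned value, leaving the rest of the Gid slot untouched. */
static void readElmVtx(std::FILE* f, Gid* elmVtx, int nvtx)
{
  for (int j = 0; j < nvtx; ++j)
    if (std::fscanf(f, "%ld", elmVtx + j) != 1)
      std::fprintf(stderr, "failed to read vertex %d\n", j);
}
```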
There is an implicit conversion happening, so t is converted to an int and the division happens. This will obviously cause issues if t > INT_MAX (2^32/2 - 1 for a 32-bit integer, which is about 2 billion). If you are using gcc or clang you can use compiler flags to catch this; see here for an example on Compiler Explorer. EDIT: I was initially wrong. Both t and p are converted to long for the division, and then the implicit (narrowing) conversion happens during the assignment. See the cppreference section on arithmetic operator conversions. This can be verified with the following code:
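(The snippet referenced above was not captured in this thread; here is a small stand-in that demonstrates the same point about where the narrowing happens:)

```c++
#include <cstdio>
#include <climits>

int main()
{
  long t = 3000000000L; /* larger than INT_MAX */
  int p = 1;
  int q = t / p;        /* division is done in long, then narrowed to int */
  long ql = t / p;      /* same division, no narrowing */
  std::printf("q (int) = %d, ql (long) = %ld, INT_MAX = %d\n", q, ql, INT_MAX);
}
```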
Thanks for the advice. Before I went to sleep last night (and before I saw this), I put conditional print statements into setCoords, near this code: /* Force each peer to have exactly mySize verts. */ int start = PCU_Exscan_Int(nverts); PCU_Comm_Begin(); while (nverts > 0) { ... They produced no output. Maybe my code was wrong, or perhaps lion_eprint is buffering? If it is not, then the theory that it is n which is blowing up is false. Is there a way to force lion_eprint to empty its buffer? I suppose I can also dig back to past messages from Cameron and try an addr2line command to confirm what line number we are actually crashing on in setCoords. We also have core files if someone wants to tell me how to mine them for information. I guess I can also try to use the hints from Jacob, but I will need to read up there too. NAS does not provide a very modern gcc (6.2 is what they provide and what I am using).
addr2line is not helping (or I am not using it correctly). The frame is
MPT: #12 0x0000000000523ab2 in PCU_Comm_Pack ()
and I ran
kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> addr2line -e /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader 0x0000000000508bd8
That said, setCoords calls PCU_Comm_Pack only twice: the place we have been staring at,
and the second pack of the coordinates themselves. Since that last one is 3 doubles long, I don't see any way it could be anything but the first one, which means n MUST be blowing up, which I guess means lion_eprint is not emptying its buffer to help me debug this. I am not sure what changed, but I am also no longer getting arguments (e.g. nverts=xxx) in the stack trace, viz.
MPT: #9 0x0000000000527ddd in noto_realloc ()
But I am still getting this as the first ERROR output; there are 29 of them. Not sure if that tells us anything, but at least not all 80 are reporting this error before crashing.
Core file interrogation with gdb is not helping either:
kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> gdb /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader core.19660
For help, type "help".
warning: core file may not match specified executable file.
lion_eprint: Are you capturing stdout and stderr? The verbosity level set here (core/test/matchedNodeElmReader.cc, Line 790 in 6dd96db)
is correct for the added call(s) (Line 32 in 4d659af).
addr2line and gdb: The usage looks OK to me. As a sanity check, what is the output of the following command?
kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> file /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader |
Yes, I am looking in both stdout and stderr. Does lion_eprint flush its buffer? If not, given that these are likely the lines written right before a crash, they may not get flushed before MPI kills the job. Is there a function to flush the buffer?
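If lion_eprint ultimately writes through a C stdio stream (stderr by default), an explicit flush right after the call should get the diagnostic out even when the job is killed moments later. A hedged sketch (header name and the helper below are assumptions, not the project's API documentation):

```c++
#include <cstdio>
#include <lionPrint.h> /* assumed header providing lion_eprint in this tree */

static void report_pack_size(long n)
{
  lion_eprint(1, "setCoords: pack size n=%ld\n", n);
  std::fflush(stderr); /* force buffered output out before a potential crash */
}
```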
Nice.
Distributed read is done in 15 seconds on 160 cores and has already overtaken a single-file read that started an hour earlier.
Sadly:
MPT: #8 0x000000000048f485 in PCU_Comm_Pack (to_rank=to_rank@entry=159,
The OPTIMIZE=OFF version gives:
*** Error in `/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader': free(): invalid pointer: 0x0000000276f5ea00 ***
Actually, I confused myself. That is the code that reads a single file, and now that I point addr2line at the right executable it is line 100 of the OPTIMIZE=ON exec, landing in the templated tag code (line 59: template). So we are still having an issue here with tags. The OPTIMIZE=OFF run is churning slowly and will hopefully tell us more and/or be fast enough to allow me to launch TotalView on a devel <= 2 hour qsub.
The OPTIMIZE=OFF build says this, so size is blowing up:
MPT: #6 0x00002aaaac3e0247 in raise () from /lib64/libc.so.6
and it is being caught by the code @cwsmith added here at line 144 (sure enough, I do get that error message):
137 int PCU_Comm_Pack(int to_rank, const void* data, size_t size)
I have a TotalView session on 160 processes to try to find the issue here, but I am pretty certain that I am feeding it a classification array that is int* while the template seems stuck on the data being long*.
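A hedged sketch of the kind of guard being referred to (not the exact code at pcu.c line 144): refuse a pack size that cannot be represented as an int before it wraps inside the message buffers.

```c++
#include <climits>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int Comm_Pack_checked(int to_rank, const void* data, size_t size)
{
  if (size > (size_t)INT_MAX) {
    std::fprintf(stderr,
        "pack to rank %d of %zu bytes exceeds INT_MAX\n", to_rank, size);
    std::abort(); /* the real check reports an error at this point */
  }
  /* ... append data to the outgoing buffer for to_rank ... */
  (void)data;
  return 0;
}
```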
I took a skim through the templated code you were asking about:
Line 14 in 111d9d6
Not necessarily a bug, but the fact that you commented out the failure indicates you hit this in the past! Essentially this is your base template, so anything that doesn't match the other specializations (only int and double are provided) will use the definition there. So, Lines 17 to 22 in 111d9d6
Possibly a bug because the sum of a bunch of ints might not fit into an int. You set it to a Line 79 in 111d9d6
The following isn't a bug at the moment since Line 35 in 111d9d6
Line 46 in 111d9d6
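To illustrate the dispatch rule described above (names here are illustrative, not the actual apfConvertTags.h contents): the primary template catches any T that has no explicit specialization, so only int and double get special handling and everything else, including long/apf::Gid data, falls through to the base definition.

```c++
#include <cstdio>

template <class T>
struct TagKind {                 /* primary template: the fallback */
  static const char* name() { return "base template (no specialization)"; }
};
template <>
struct TagKind<int> {            /* explicit specialization for int */
  static const char* name() { return "int specialization"; }
};
template <>
struct TagKind<double> {         /* explicit specialization for double */
  static const char* name() { return "double specialization"; }
};

int main()
{
  std::printf("int    -> %s\n", TagKind<int>::name());
  std::printf("double -> %s\n", TagKind<double>::name());
  std::printf("long   -> %s\n", TagKind<long>::name()); /* falls to the base */
}
```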
@KennethEJansen Point me at the code where you are giving a |
Thanks Jacob. I am not sure if @cwsmith has opened up access to this, but the code is here. Or, here is the calling code, on line 819:
736 readClassification(ff, mesh.localNumVerts, &(mesh.classification));
and that routine allocates the memory with
597 void readClassification(FILE* f, int localNumVtx, int** classification) {
and ...
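For context, a hedged sketch of what a routine with that signature typically does (the actual body in matchedNodeElmReader.cc is not reproduced here): the classification data is allocated and read as plain int, which is what makes the int*-vs-long* question above matter.

```c++
#include <cstdio>
#include <cstdlib>

void readClassification_sketch(std::FILE* f, int localNumVtx, int** classification)
{
  /* allocate one int per local vertex and fill it from the input file */
  *classification = static_cast<int*>(std::calloc(localNumVtx, sizeof(int)));
  for (int i = 0; i < localNumVtx; ++i)
    if (std::fscanf(f, "%d", *classification + i) != 1) {
      std::fprintf(stderr, "readClassification: bad entry %d\n", i);
      break;
    }
}
```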
Is either the MGEN_Parted or MGEN_write3D branch on the SCOREC github up to date? I need to take a look at the template definition. |
It has the bugs you noted above. I will push my interpretation of your fixes to _Parted now. Update: I pushed without checking that it compiles, which it does not, so I will try to fix that.
I don't see how it's possible that core/test/matchedNodeElmReader.cc Lines 819 to 820 in 20cdfc0
Is using Lines 70 to 73 in 20cdfc0
What's making you think that, when you call this, T is
I was just going by the fact that when it crashed, size > INT_MAX, but maybe that is not the reason. In TotalView at the scene of the crime, start is negative, so my assumption above was wrong. Digging through the rubble to find the bomb.
We have a bigger problem. total = max + 1, where max is supposed to be the max across ranks of the globalToVert keys, is not coming back with the value it should. I have verified that the last element does have that node in it, so I guess I need to figure out what is going wrong in the construction of globalToVert, since that is where it is supposed to come from?
The first computation of the max is correct, but it seems that in the process getMax or PCU_Max_Long destroys globalToVert, as it has junk soon after. Running again to see how that happens. Is this routine intended to operate on key-value stores?
TotalView licenses were all taken, so I wrote some code to try to check this theory, around
Gid max = getMax(globalToVert);
APF_ITERATE(GlobalToVert, globalToVert, it) {
The idea is to print the first and a near-last "key" before and after the computation of the max that TotalView showed destroying the globalToVert data structure. Curiously, I got this type of output from ranks 0 to 138 for the printing before the max computation (but nothing from higher ranks and, odder still, one rank only did one of the two writes). This is not exactly proof that the max is munging globalToVert, but it is odd, unless you see something I have done badly (yes, I do see that I have a %ld for self, but ...).
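Here is a cleaned-up, self-contained variant of the kind of check described (header names and the helper are assumptions; apf::GlobalToVert is the std::map keyed by global id that the reader uses):

```c++
#include <apfConvert.h> /* assumed to declare apf::GlobalToVert */
#include <PCU.h>
#include <cstdio>

/* Print the first and last keys of globalToVert, so a before/after pair of
 * calls around the max reduction shows whether the map is being corrupted. */
static void dumpEnds(const apf::GlobalToVert& g2v, const char* when)
{
  if (g2v.empty()) return;
  long first = g2v.begin()->first;
  long last = g2v.rbegin()->first;
  std::fprintf(stderr, "rank %d %s: first=%ld last=%ld\n",
               PCU_Comm_Self(), when, first, last);
}
```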
OK. I might be missing something, but you don't do ... For memory bugs like this I've had luck with any of the following approaches in the past:
You are correct. The code did have the ifirst=0 bug. Magically, I ran it outside of TotalView as I went to bed last night and it ran through. Note this had all your bug fixes except these, which I rolled back because I could not get them to compile. I am offline most of today, but I am leaving the smoking gun here for further review and/or help with how to get long tags working properly. Also pasting a grep that shows their prolific use.
diff --git a/apf/apfConvertTags.h b/apf/apfConvertTags.h
Apparently we are NOT out of the woods yet. Chef did not like the mesh, it seems:
kjansen@pfe26:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN190_Parted/Chef/4320-160-Chef> tail -f output.11201203
MPT: --------stack traceback-------
Note that this is an OLD version of chef, but I am not thinking that any of our changes really affect Chef. I am pretty sure we have worked with meshes whose GLOBAL entity counts would overflow ints or even unsigned ints but, if successfully split across 160 ranks, the on-part entity counts are far below that. I can try to break it into smaller steps. This was breaking it from 160 to 4320 (splitFactor 27), if we think that will make a difference. The strange split factor is an attempt to find a final part count close to 64K (160*27*15 = 64800 = 98.8% of 64K), so I would only waste 1.2% of my resources on Theta if I move the job there, and yet it fits nicely into several of NAS's core counts (20, 24, and 40, but not the 28-core Broadwell).
Cut back to a splitFactor of 9 and got this error (running PUMI version 2.2.0, Git hash ba64b1d, on CascadeLake). I have never seen anything like this before, so perhaps it is a hardware/software mismatch. I am trying one more time with splitFactor 2 before retreating back to more familiar hardware. Update: that failed, and so did the familiar hardware, so I think we have to conclude that matchedNodeElemReader is making an error that verify did not find. I am going back to try to fix the long-tag issue. I uncommented those lines of code and the compile generates an error at [100%] Building CXX object test/CMakeFiles/matchedNodeElmReader.dir/matchedNodeElmReader.cc.o. With no formal training in C++ and not even grasping what a templated function is, I am out of my element here. I suppose it is obvious that I copied and pasted the int form of the functions and replaced int with long, which is obviously not sufficient.
I just noticed that the double version uses int for the entry count, so maybe that is my problem. This compiled, so I will give it a go.
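For the long-tag question, a hedged sketch using the public apf tag API (createLongTag/setLongTag). The helper below is illustrative and is not the apfConvertTags.h template, but it shows the point from the comment above: the entry count stays an int even when the tag payload is long.

```c++
#include <apfMesh2.h>

apf::MeshTag* attachLongTag(apf::Mesh2* m, apf::MeshEntity* e,
                            const char* name, const long* data, int entries)
{
  apf::MeshTag* t = m->findTag(name);
  if (!t)
    t = m->createLongTag(name, entries); /* entries is an int count */
  m->setLongTag(e, t, data);             /* the payload itself is long */
  return t;
}
```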
That build leads to different crashing in Chef (after rerunning matchedNodeElmReader, of course):
[0] Zoltan ERROR in DD_Update_Local (line 278 of ../../src/Utilities/DDirectory/DD_Update.c): Invalid input parameter
a gazillion times. Is Zoltan trying to go back to global IDs? I will try some alternate partitioning approaches.
Switching from graph to zrib dodged that issue and successfully planned the partition. I suppose I should not be surprised at how long the actual migration is taking: nothing for about an hour. I only have 26 minutes left according to top, and I am guessing it won't complete. Even if it does, with the two reading processes consuming 38 and 35 GB, and them needing to push 1/3 of that onto two others, that puts me at 38*(1+2/3) + 35*(1+2/3) = 121 GB, which is perilously close to the 128 GB on this node. Other nodes might be worse, so I am going to have to find beefier nodes for this first step, since I can't do the 1->2 split and still land on the part count I want (only odd factors get me to 64800 parts starting from 160, e.g., 160x3x9x15 = 64800, which is about the closest I can get to 64*1024 if I want this to be a balanced optimum between running on ALCF 64 cores per node and NAS 40 cores per node).
So it is really hard to guess how long this is going to take, as this is the (lack of) progress when I am about out of time:
local zrib came close to working before running out of memory AFTER the mesh had been migrated. I am trying it again on a larger number of nodes and am hopeful it will work. This raises an efficiency question. @matthb2 and @cwsmith, do either of you know how to control the process layout to make something like local zrib completely "on-node"? It looks to me like the first stage "works". That is to say, if I want to part from 160 to 480 and I run this on 160 nodes, chef is smart enough to put one part on each node so that, on each node, I have one process reading the super-heavy part and 2 processes "waiting" until the reading and partitioning from 160 to 480 is planned. My question is, is it possible to force that local partitioning to be completely local to a node, that is, so that the two processes waiting to receive the read mesh split into 3 parts get the local part numbers derived from the "read" part number? I guess this must mean that I need my process id to follow chef's notion of new part numbering, right? Maybe this is already the case, but if it is not, I would think it could be a big win (but I guess only if the code exploited this locality and/or MPI was smart enough to see that this message did not have to leave the node).
The following worked:
Job Resource Usage Summary for 11205702.pbspl1.nas.nasa.gov
Current directory is /nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN190_Parted/Chef/4320-480-Chef
Job Resource Usage Summary for 11207073.pbspl1.nas.nasa.gov
Job Resource Usage Summary for 11207325.pbspl1.nas.nasa.gov
Note that if you add up the chef times you see that Chef completed in a couple of minutes, but I followed the Chef run with a convert from posix to syncio format and this apparently produced a hang, viz. my forensics suggest that the Chef part of the job started at 1:57 and finished at 2:02 (5 minutes), and the converter, which pipes its log to a separate file, opened that log file but did nothing for the remaining 115 minutes. Stderr reports nothing beyond the PBS job-killed line. I really want to thank you both for your help with this very challenging case.
UPDATE3: Realizing I should tell you all what branch these fixes (and the remaining problem) are on. Not exactly an apt name, but that is what it looked like from the first signs of a problem. The above celebration was premature. A twice-bigger grid has been failing miserably and we have found several more overflows. @jacobmerson, we did finally get the -undefined flag added with @cwsmith's help, and it was great at finding a few overflowing int products that I did not realize needed casting to land on a long RHS of the equality without overflow. However, we are apparently still missing something, as verify is still reporting an error, and a stack trace of
3245 MPT: (gdb) #0 0x00002aaaaecaa7da in waitpid () from /lib64/libpthread.so.0
The following errors were also reported early in the program (presumably from undefined):
/home5/kjansen/SCOREC-core/core/pcu/pcu_coll.c:34:3: runtime error: null pointer passed as argument 1, which is declared to never be null
but it does not crash or give me the stack that leads to this code. grep also did not give me clues how the code gets here. I can't get TotalView to launch on this executable either. I guess I will try an exec without -undefined.
714 FILE* fc = fopen(filename , "r");
Though going forward, I also see that line 83 of apfBox.cc also hits that function. My TV skills (actually my C++ skills) are not good enough to discern which one is getting to that point with NULLs to merge, but I suppose both are a one-per-rank event and thus both are candidates. Probably neither is causing the INT_MAX issue though. UPDATE2: TV confirms that it is apf::BoxBuilder that is the stack source that gets there with NULL on the first 2 args. Unless someone thinks this could screw things up, I am going to assume this message from UBSAN is not affecting my problem. The above is all with the gnu 6.2 compiler. I am also contemplating a 9.3 build as NAS now has this available. I can dig back over your messages, but if you have recommendations regarding flags to try that would be great (since, thanks to your ticket, I do now know how to set "extra" flags and not have them clobbered).
My conversion from 320 parts to 160 parts completed, and this was the case I fed to TotalView that I described above. The conversion code was written two hours past when I usually sleep, so it could be wrong, BUT it does crash during the creation of the remotes rather than during verify, so there's that. While waiting for another 160-node allocation I have been reviewing the mesh and I don't see any errors in it. On one hand, this is the only mesh I can get into TotalView (256-core license, so the 320-part original mesh is a no-go).
I restored the writing of debug information, compiled with gcc 9.3 (the newest available), searched the docs for the flags @jacobmerson listed above (found none of them, so I stuck with -fsanitize=undefined), and reran the 320-part case (no surgery to come down to 160). It looks like pretty much the same crash, BUT now we have debug.txt information that was written at the time of the remote construction. Please critique the following autopsy:
MPT: --------stack traceback-------
MPT: -----stack traceback ends-----
Note the slab nature of the mesh and its ordering is going to make process m have a common boundary with m-1 and m+1. Looking at these numbers, they seem reasonable, and I don't see any reason why verify should come to think that there are messages greater than INT_MAX. The above shows 110,594,040 TOTAL messages, which is far less than INT_MAX. The mesh has 4B vertices distributed across 320 processes in its first stream-partition, so that is about 13M vertices per part. It is simply impossible to generate 2B packets of information, so I guess something is again getting corrupted. That said, I feel like we are probably logging the PROCESS of finding the remotes, but I am not sure we are actually logging the remotes of a given part, which is what verify is checking, right? That is, once the process is complete, the part boundary information knows nothing about these long global IDs; rather, they are only a crutch to allow parts to recognize when they have remotes on other ranks. So, while I think we now have that process working correctly with longs, it is still not clear to me why verify thinks there is a problem (but this is likely because I don't understand what pcu is doing and lack the C++ skills to figure it out). Sleep may help but I doubt it. I am also running a case with verify turned off just for grins.
Humbly reporting that the above is full of misstatements. While there are ONLY 13.5M verts on a part, I failed to remember that mds tracks edge and face remotes as well (e ≈ 4.5v, f ≈ 6v, both approximate for this mesh that is a mix of wedges and tets), AND the code also sends messages to self. Finally, it packs this all into ONE message for all entities and overflows INT_MAX. I broke the messages into one entity dimension per message, and it looks like this keeps the message size < INT_MAX. I think it would not be TOO hard to put in some logic to do or not do this split based on an upper bound on the size of the message, obtained by looking at #e, #f, #v on a part (see the sketch below). Of course, some headroom could also be gained by not sending messages to yourself, as I think this is ALWAYS going to be the largest message and would seem to be altogether unnecessary. That is pervasive in the code though, so, unless I am missing a need for such a thing, this was a deep-rooted mistake that will now be not so easy to remove. Seems like it would be influencing performance pretty heavily as well?
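A minimal sketch of the size-based split decision suggested above (the 4.5x and 6x ratios and the bytes-per-entity figure are the rough numbers from this comment, not measured constants):

```c++
#include <climits>
#include <cstddef>

/* Decide whether the remote-matching exchange should be sent as one
 * message per entity dimension instead of a single combined message. */
bool needsPerDimensionMessages(long nVerts, long nEdges, long nFaces,
                               std::size_t bytesPerEntity)
{
  long totalEntities = nVerts + nEdges + nFaces;
  std::size_t worstCase =
      static_cast<std::size_t>(totalEntities) * bytesPerEntity;
  return worstCase > static_cast<std::size_t>(INT_MAX);
}
```

With 13.5M verts per part and the ratios above, that is roughly 13.5M * (1 + 4.5 + 6) ≈ 155M entities per part, so anything beyond about 14 bytes per entity already pushes a single combined message past INT_MAX, consistent with the overflow described above.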
I think some of the self sends are used to deal with matched meshes. |
Sorry if this has been covered before, but how big of a mesh should I be able to stream in with this build setting?
kjansen@pfe26:~/SCOREC-core/buildMGEN_write3D> more doConfigure14_18
#!/bin/bash -ex
# For Chef
cmake \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DSCOREC_CXX_WARNINGS=OFF \
  -DSCOREC_CXX_OPTIMIZE=ON \
  -DSCOREC_CXX_SYMBOLS=ON \
  -DENABLE_ZOLTAN=ON \
  -DENABLE_SIMMETRIX=ON \
  -DPCU_COMPRESS=ON \
  -DSIM_MPI="mpt" \
  -DSIM_PARASOLID=ON \
  -DMDS_SET_MAX=1024 \
  -DMDS_ID_TYPE=long \
  -DIS_TESTING=ON \
  -DMESHES=/projects/tools/SCOREC-core/meshes \
  -DCMAKE_INSTALL_PREFIX=$PWD/install \
  ../core
#-DMDS_ID_TYPE=int or long
I have progressively thrown more nodes at it (doubling compute nodes 4 times now even though the number of mesh nodes only doubled from a prior successful run) but keep getting this stack trace
MPT: #7 0x00002aaab019c61a in abort () from /lib64/libc.so.6
MPT: #8 0x000000000049109b in reel_fail (
MPT: format=format@entry=0x49be6e "realloc(%p, %lu) failed")
MPT: at /home5/kjansen/SCOREC-core/core/pcu/reel/reel.c:24
MPT: #9 0x0000000000490fc8 in noto_realloc (p=0x5a756a90,
MPT: size=18446744056901788296)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/noto/noto_malloc.c:60
MPT: #10 0x0000000000490626 in pcu_push_buffer (b=0x3afd84b8,
MPT: size=size@entry=18446744056901788288)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_buffer.c:37
MPT: #11 0x0000000000490917 in pcu_msg_pack (m=m@entry=0x6b79c0 <global_pmsg>,
MPT: id=id@entry=639, size=size@entry=18446744056901788288)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu_msg.c:133
MPT: #12 0x000000000048eb5f in PCU_Comm_Pack (to_rank=to_rank@entry=639,
MPT: data=data@entry=0x2357cb30, size=18446744056901788288)
MPT: at /home5/kjansen/SCOREC-core/core/pcu/pcu.c:141
MPT: #13 0x000000000047e733 in apf::setCoords (m=m@entry=0x3312d300,
MPT: coords=0x2357cb30, nverts=3015292, globalToVert=...)
MPT: at /home5/kjansen/SCOREC-core/core/apf/apfConstruct.cc:202
MPT: #14 0x000000000043722c in main (argc=, argv=)
MPT: at /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:832
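As an aside on the size in frames #9 through #12 above: 18446744056901788288 is exactly what a negative 64-bit byte count (here -16,807,763,328, i.e. an overflowed intermediate) becomes when it is implicitly converted to the size_t parameter of PCU_Comm_Pack. A small illustration:

```c++
#include <cstdio>
#include <cstddef>

int main()
{
  long long bad = -16807763328LL;                    /* negative byte count from an overflow */
  std::size_t size = static_cast<std::size_t>(bad);  /* what the size_t parameter receives */
  std::printf("size seen by pack/realloc: %zu\n", size);
  /* prints 18446744056901788288 on a 64-bit size_t, matching the trace above */
}
```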