Data size based aggregation #4548
Conversation
fyi @pnorbert and @eisenhauer This is my strawman for data-size based aggregation. Over the next couple of days I plan to comment in various locations in the changeset to ask questions about the direction this approach is taking, with the goal of getting your feedback.
With my current implementation, I have found a broken case where 3 of 5 ranks get an aggregator cache hit, while 2 of 5 do not. I'll try to describe the problem by first showing the expected outputs when the aggregation type is In both cases I run with 5 MPI processes, 3 subfiles, and 2 timesteps. First look at the data produced when using EveryoneWritesSerial - bpls
EveryoneWritesSerial - bp5dbg
Now look at the data produced when using DataSizeBased - bpls
DataSizeBased - bp5dbg
Note when reading the output above that the size-based partitioning produced at Any idea where I should be looking to correct that problem? It seems like it's in the metadata index file, but beyond that, I'm not sure. Also, it appears to me that the other writer data locations for timestep 2 are correct, which is confusing, unless I'm interpreting the data wrong. One last note: when repeating the above tests with only one timestep, the test passes in both cases, and the data looks fine. To me that seems to indicate the problem is related to repeating some of the initialization in
Hmm. A quick question. Are you using SelectiveAggregationMetadata? I.E. is the UseSelectiveMetadataAggregation engine parameter in its default "true" value and the first branch of this if in BP5Writer being taken?
Because if not, if you're falling into TwoLevelAggregationMetadata, then all the unfortunate effects that I might have anticipated for >6K nodes are hitting you now...
Ah yes, great question, I assume you asked because you noticed the re-ordering between timesteps. I just confirmed that in this case, it's taking the
The aggregator rank in each MPI chain receives the final data position from the last rank in the chain. Is that because there is an assumption there that the aggregator rank is going to be writing to the same file the next time around?
Just wanted to make sure, because the TwoLevel stuff uses those aggregators, and if they're changing, potentially bad things. Also, as a debugging aid, make sure this
Something else to note while I'm thinking about it. You'll notice that WriteData() is also being called in FlushData(). FlushData is a user-visible BP5 call that allows the application to write partial timestep data in the middle of a timestep (I.E. before EndStep()). This is useful if you are pushing memory limits and don't have room to buffer. So, for example, you could do BeginStep(),Put(),Put(),FlushData(),Put(),Put(),FlushData(),Put(),EndStep(). Each of those flushes and the end step results in data (but not metadata) going to disk. The FlushPosSizeInfo vector on rank 0 accumulates all the info from each rank for where its data landed and how much data it was on each Flush. Note that as a natural consequence of writing piece-wise, the data from rank X is not contiguous in the data file, but broken up into multiple segments. In order to find the data at Offset M, timestep T, from rank N, we have to have the offset and size of each chunk, so this all ends up in a single metadataindex record for that timestep. The relevant bit for you is that WriterMapRecords only occur between timesteps. Which means that we don't have the ability to let a rank change what file it is writing to in the middle of a timestep. So, the first time WriteData() is called in a timestep (which might be in a FlushData or in EndStep), you are free to change the WriterSubfileMap. But you should not change it again in that timestep. You're probably not playing with those examples yet, but this is something that will require a little bit of extra logic to make sure it doesn't happen.
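The constraint above — the writer-to-subfile map may only change on the first WriteData() of a timestep — can be sketched with a small guard. This is a hypothetical illustration; `SubfileMapGuard`, `MaybeAdopt`, and the other names are not ADIOS2 API.

```cpp
#include <vector>

// Hypothetical sketch: only allow the writer-to-subfile map to change before
// the first WriteData() of a timestep; later flushes in the same step (from
// FlushData or EndStep) must reuse it, because WriterMapRecords only occur
// between timesteps.
struct SubfileMapGuard
{
    std::vector<int> map;        // rank -> subfile index
    bool frozenThisStep = false; // set once WriteData() has run this step

    void BeginStep() { frozenThisStep = false; }

    // Returns true if the proposed map was adopted, false if it came too late.
    bool MaybeAdopt(const std::vector<int> &proposed)
    {
        if (frozenThisStep)
            return false; // too late: a WriteData() already used the old map
        map = proposed;
        return true;
    }

    void OnWriteData() { frozenThisStep = true; }
};
```

A repartitioning pass would call `MaybeAdopt` once per step, before the first flush, and ignore a `false` return until the next `BeginStep()`.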
By the way, on the Extra debug output:
For subsequent time steps, it seems to print nothing instead of those characters from the buffer. I wonder if that block didn't get updated when some other bit of the code did, due to it not getting exercised and validated?
Ah, yes, that's probably the case. Those items are size_t, and maybe the type of buf changed. Which may mean those index calcs into it are off too. I can't look tonight, but will check tomorrow.
OK, try substituting this code into the matching bits of WriteMetadataFileIndex()
(I won't put these changes into a PR now because that would just make merging your code more difficult when it comes to that.)
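To illustrate the kind of index-calculation bug suspected above (a hypothetical example, not the actual WriteMetadataFileIndex() internals): if a buffer of size_t records is held as raw bytes, every field access must be scaled by the element size, otherwise reads land between fields and come back as garbage or zeros.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Illustration only: read the N-th 64-bit field out of a byte buffer.
// If the buffer type changed from size_t* to char* and the index math was
// not rescaled, the old "buf[i]" style indexing would be off by a factor
// of sizeof(uint64_t).
uint64_t ReadField(const std::vector<char> &buf, size_t fieldIndex)
{
    uint64_t v;
    // memcpy avoids alignment/aliasing issues when reinterpreting bytes
    std::memcpy(&v, buf.data() + fieldIndex * sizeof(uint64_t), sizeof(v));
    return v;
}
```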
Thanks Greg, that works and jibes with both the extra debug printing I was doing, as well as with what So I'm just curious if this summary below 1) contains all the relevant information you would expect is needed to indicate correct writing of the data, and 2) if it actually looks correct to you based on the writer chains at each time step (it looks correct to me). Debug output showing writer chains and metadata
And here is the comparison of what is produced by Side-by-side bpls output
Let me know if you spot something that looks wrong, or if I haven't included complete details.
The metadatafile index stuff does look like what I'd expect. However, things obviously aren't right even with just a single timestep, so something is clearly going badly. I'd like to try to reproduce this output and would be hopeful that if I step through with the debugger on the reader side I might see what the issue is. Can you tell me what/how you're running on the writer side so I can give that a go?
Well, if I run the test with just a single timestep, everything looks fine and works. But when I run the test with two timesteps, somehow the first timestep gets messed up in retrospect or something.
😅 That's for sure!
That would be so great, thank you! I'm just running variations of one of the new tests I added, which you can run with all the defaults as:
The
Many combinations of parameters don't result in a failure or a messed-up output file like the example above does. I suspect it's probably related to how some of the ranks change the subfile they're writing between steps. You can try with 1 timestep instead of 2, and should see that it works. Note: I need to
OK, I'll give this a go. I've got a couple of calls yet this afternoon, so it won't be right away. Just FYI, in case you get there first, my plan is to look at what is happening in BP5Reader::ReadData(). The BP5Deserializer looks at the Get()s that happen in a timestep and translates those into a ReadList of data that it needs. That list looks like: from WriterRank W, Timestep T, read Length bytes from Offset S. ReadData()'s responsibility is to translate that into actual data access. One of the first things it does is to translate W and T to a SubfileNum. Looking at what's going wrong there should be informative.
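A minimal sketch of that (W, T) → SubfileNum translation, assuming writer-map records are stored keyed by the timestep at which they took effect (a hypothetical structure, not the actual BP5Reader code): since WriterMapRecords are only emitted on steps where the map changed, the lookup has to fall back to the most recent map at or before T.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical lookup: mapsByStep[t] holds the rank->subfile map that took
// effect at timestep t. The subfile for (writerRank, timestep) comes from the
// most recent map at or before that timestep.
int SubfileFor(const std::map<size_t, std::vector<int>> &mapsByStep,
               size_t timestep, size_t writerRank)
{
    auto it = mapsByStep.upper_bound(timestep); // first map strictly after T
    assert(it != mapsByStep.begin());           // a map must exist at or before T
    --it;                                       // step back to the active map
    return it->second.at(writerRank);
}
```

If ReadData() were picking the wrong record here (e.g. using the newest map for an older timestep), it would read from the wrong subfile exactly when ranks change files between steps.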
Ok thanks for that information, I'll start looking in
output.txt
But we're not finding the data that should be there. You say that this works with writing just one timestep. Is it possible that writing the 2nd timestep is killing some of the data from the first?
Whatever is going on, subfile 1 has zeroes up through Offset 12288, and that doesn't look right.
Huh, that is interesting. There is initialization stuff that gets done on every timestep now (it used to only be done once), and I'm just selectively guarding against doing some specific things within there. Maybe more bits need to be guarded against running 1) on a timestep after the first, or 2) in the case of an aggregator output file cache hit. Like seeking to position 0 in a file, for example. I wrote a tiny standalone program just to read the file (mainly so I could run it with a single process), and then added some debug printing to the reader. I see a couple of weird things off the bat: single process reader debug output
It's odd to me that the start offset is 0 regardless of timestep or writer rank. It's also odd that three of the read lengths are off by 8 bytes (something I also see in
Hmmmm, I wonder: if rank X opens a file already opened and written by rank Y, should it be doing that in append mode, or else the file will get overwritten?
That was the problem, the hint from running
Great! Yeah, you may be running into filesystem oddities with this approach. Are you closing files after the write responsibility shifts? If not, I can imagine that if the responsibility shifts back, you might need to close and re-open to get a sane file pointer.
Also that clue is huge 😆 Your help was invaluable in finding this bug I wrote, thank you @eisenhauer! I still have 2 things to handle which you pointed out in previous comments:
Currently each rank has a separate transport manager for every file it ends up writing to. I'm keeping each of those files open throughout the run, to avoid unnecessary Does that seem like a bad idea? Maybe I need to seek to some location in that case? Or does
Just to clarify this one: multiple Put() calls shouldn't change anything for you. You still get a single contiguous data block. It's calls to PerformDataWrite() that would potentially mess you up. That would result in multiple calls to WriteData() in a single timestep.
This is where I think things might get funky, because you are outside of normal Posix semantics. If you write data in a file and leave it open, then someone else opens that file in append mode and adds to it, what of that append does the first file see? Can you just fseek() in the first file to the new end? Or will the first file insist that it knows where the end of the file is and refuse? After all, that's got to be a really rare circumstance that an open file is extended behind your back, so it seems like a corner case that might be dropped, not considered, or optimized away. I'm curious to know!
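One way to answer the fseek() question empirically is a tiny POSIX experiment (this assumes a local POSIX-compliant filesystem; Lustre across nodes could behave differently): one descriptor keeps the file open while a second descriptor appends, then the first asks for SEEK_END. Under POSIX semantics the file size lives in the shared inode, not in the descriptor, so the first descriptor does see the growth.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Experiment: fd1 keeps the file open while fd2 (a separate open, standing in
// for another rank) appends behind its back. Does lseek(fd1, 0, SEEK_END)
// report the new, larger size? On local POSIX filesystems it does, because
// the size is a property of the inode shared by both descriptors.
off_t EndAfterForeignAppend(const char *path)
{
    int fd1 = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd1, "AAAA", 4);                     // fd1's offset is now 4
    int fd2 = open(path, O_WRONLY | O_APPEND); // independent second open
    write(fd2, "BBBB", 4);                     // extends the file to 8 bytes
    close(fd2);
    off_t end = lseek(fd1, 0, SEEK_END);       // expect 8, not 4
    close(fd1);
    return end;
}
```

On a parallel filesystem the client-side caching of file size is exactly where this could break, so this test would need repeating in the target environment.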
Ok, yeah, thanks for clarifying. I'll make sure to only repartition ranks and rebuild MPI chains once per time step.
Well boo, I thought my zero-blocks problem was solved, but it has cropped up again on a real HPC with a Lustre filesystem. In that environment the problem can happen even on a single timestep 😱 but doesn't seem fully reproducible. Once I saw two blocks in a single time step get zeroed out; another time I saw just one block get zeroed. In both of those cases, the zeroed-out block(s) were written by the first rank in a chain of multiple writers. One thing I'm wondering now that I have seen a problem with a single timestep: maybe there is something happening in between the
Hmm. If even a single step is being messed up, I'm at a bit of a loss. The interweaving of things in and around transports is complex, though. Maybe Norbert might have ideas? If it were multiple steps, and early blocks being zeroed by writes to the same file from other ranks, I'd be suspicious of running afoul of filesystem behaviour. If that were happening, I might consider a different approach where no file is ever written by more than one rank. To preserve the writer-side flexibility you want to have, always have rank M write to data.M when it is the aggregator. In the worst case you might have N data files for N ranks, but in the situation where you had relative stability and few repartitions, you'd probably not be that different from what we're doing now (I.E. many ranks would never write). You could even look in each aggregation set and try to choose a rank that has written before as the rank0 of that group, to minimize new file creation. (The worry with lots of data files is that a reader has to have a lot open, which may put pressure on FDs, etc. But we've got code in place on the reader side to handle that already, so it's maybe not as big a worry as it might otherwise be.)
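That last suggestion — choosing a rank that has written before as the rank0 of each aggregation group — could look something like this hypothetical helper (the names and structure are illustrative, not PR code):

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Sketch: within one aggregation group, promote a rank that has already
// written a data.M file to the head (position 0) of the group, so that
// repartitioning creates as few new subfiles as possible.
std::vector<int> ReorderGroupHead(const std::vector<int> &group,
                                  const std::set<int> &priorWriters)
{
    std::vector<int> out = group;
    for (size_t i = 0; i < out.size(); ++i)
    {
        if (priorWriters.count(out[i]))
        {
            std::swap(out[0], out[i]); // move a prior writer to the head
            break;
        }
    }
    return out; // unchanged if no member has written before
}
```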
It occurred to me that one other difference between
Hmm. Maybe. Particularly if all are write/create... It's at least a theory.
Not sure if I should celebrate yet, but I've done a few small runs since I added back a communicator to the
I swear I knew this a long time ago, but had to double check. So I wrote a test to check POSIX open. Indeed, if multiple processes create the same file for writing, it sometimes occurs that the content some process writes will be strangely (but always entirely) missing from the file when everyone has closed it. The theory is that such a process gets a file descriptor (FD) that is overshadowed by a different FD with the same name, and all the writes from that process disappear. The test showed this even in WSL (in a single ext4 vhd on a single nvme), no need to test on Lustre and multiple nodes.
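A deterministic, single-process version of this hazard can be reproduced with two descriptors instead of two processes (a sketch, not the actual test Norbert ran): a second open with O_TRUNC clobbers data already written through the first descriptor, and the first descriptor's later writes land past the truncation point, leaving a zero-filled hole — the same symptom as the zeroed blocks seen on Lustre.

```cpp
#include <fcntl.h>
#include <unistd.h>

// fd1 writes 8 bytes, then fd2 re-creates the same file with O_TRUNC
// (standing in for another rank creating the file). fd1's offset is still 8,
// so its next write leaves a hole at bytes 0..7 that reads back as zeros.
long ZeroBytesAfterClobber(const char *path)
{
    int fd1 = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd1, "AAAAAAAA", 8);                                // offset now 8
    int fd2 = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644); // clobbers!
    close(fd2);                                               // length now 0
    write(fd1, "CCCC", 4); // lands at offset 8, creating a hole at 0..7
    close(fd1);

    int rd = open(path, O_RDONLY);
    char buf[16];
    ssize_t n = read(rd, buf, sizeof buf); // file is 12 bytes long here
    close(rd);
    long zeros = 0;
    for (ssize_t i = 0; i < n; ++i)
        if (buf[i] == '\0')
            ++zeros; // the hole reads back as zero bytes
    return zeros;
}
```

Opening with O_APPEND instead of O_TRUNC when the file already exists avoids both the truncation and the stale-offset problem, which matches the fix found earlier in this thread.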
The Transport::OpenChain() function was added to adios to support the EW/EWS aggregation. Besides that, nothing in adios writes to one file from multiple processes, let alone creates one file from multiple processes. So indeed, I knew this a few years ago.
In the most recent commit (
I'm currently working on figuring out what those tests are doing and how the new aggregator may have broken them. But if anyone has a moment to look through the results here and sees anything obvious, I would welcome any clues that could help narrow or otherwise speed up my search 😀 Thanks!
So for the *.Unique failures: those are not failures per se, but an indication that the output has changed in some form from any prior version that we have stored. In order to ensure that we continue to be able to read these new files, the .zip versions of the output should be moved into the output_archive/zipped_output directory with an appropriately descriptive name. I'd ignore these for now; you're only getting them because you changed the default aggregator and it naturally had different output.

You've got a number of DataWrite failures. This is an indication that maybe what you put into place to make sure that we used the same layout for an entire timestep (I.E. didn't change after we had done an initial DataWrite) might not be working. The easiest test to debug this with is probably Staging.1x1DataWrite.BP5. It looks like that is failing the first time we do performdatawrites in a timestep, so it shouldn't be hard to track down (hopefully).

The NullBlocks tests involve someone with zero-length data blocks. Maybe that's messing you up somehow?

Most of the rest seem to be Append tests. Tackle those last, as those are likely the most unpleasant. Fingers crossed that fixing the other issues might fix this one too.
Thanks @eisenhauer for characterizing the failures so nicely, that's a great summary. I had started with one of the Append failures, but after reading your comments I'll set that aside for the time being. Instead I'll start with
Huh. No "Test" results in CDash for this PR or #4611. What causes that?
I think open.cdash.org is experiencing some issue across all projects this week, see e.g. https://open.cdash.org/index.php?project=CMake, hopefully it's just a short-term disruption.
Getting closer now; at least 8 of the 17 test failures were resolved by the commit I pushed yesterday to fix problems with One test I'm not sure what to do about is Any thoughts on how to reconcile these two seemingly opposing requirements?
I'm down to a single unexpected test failure, this one. I have narrowed down the cause to this The failure seems to happen every time on windows/msmpi, at least according to CI, but I can reproduce it intermittently on linux as well. For some reason, even though I configured the examples on my windows machine, they didn't get built, so I'm debugging on linux at the moment. Of all the heat transfer example variations, the BP4/5 MxN are the only ones that fail. Notably, I can never get the Mx1 or MxM version to fail, only the MxN. BP4 can fail because the second part, the I have no idea what, in my code, is sensitive to the fact that the file being read was originally written with 2x2=4 processes, while the current task is writing with 1x3=3 processes, but it seems to matter. Any suggestions/ideas/insights? Thanks!
Regarding this comment: I was eventually able to run it on windows, but failed to reproduce there at all. I could only reproduce intermittently on linux, and after a lot of debugging, I discovered that I could do so with just a single timestep (making this feel different from similar issues I faced in the past). In the end, Unfortunately, I still don't have any idea why the breaking commit had that effect, or why the problem didn't seem to happen for other aggregation types.
I think as of The two test failures on the el8 jobs are just
@pnorbert Regarding commit I did verify that changing the xml transport configuration so that Note that the problem was originally caused by I have moved the pre-flush barrier into the
Add a new aggregation method which behaves just like the existing EveryoneWritesSerial method, except it builds the writer chains after first partitioning ranks in such a way that each chain writes approximately the same amount of data. The current partitioning method is a greedy approach that is fast but not guaranteed to produce optimal partitionings. However, this has been done in such a way that adding new partitioning methods should be straightforward.
- Add a new aggregation approach on the writer side, where ranks are grouped into MPIChains of roughly equal data size, for writing.
- BP5Writer: collect some member fields into a new structure containing the things that should be cached. The purpose of this is to avoid opening the same files repeatedly for subsequent time steps, when it can be avoided.
- BP5Writer: initialization logic to support data-size based aggregation (which is off by default and enabled with a configuration setting).
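The greedy, roughly-equal-size partitioning described above can be sketched as longest-processing-time-first assignment (an illustration of the general technique; the PR's actual algorithm may differ): visit ranks from largest data size to smallest, always appending to the chain with the smallest running total. Fast, but not guaranteed optimal, as the commit message notes.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch: partition ranks into numChains writer chains of roughly equal
// total data size, using a greedy largest-first heuristic.
std::vector<std::vector<int>> PartitionBySize(const std::vector<size_t> &sizes,
                                              size_t numChains)
{
    // Visit ranks in order of decreasing data size.
    std::vector<int> order(sizes.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return sizes[a] > sizes[b]; });

    std::vector<std::vector<int>> chains(numChains);
    std::vector<size_t> totals(numChains, 0);
    for (int rank : order)
    {
        // Append each rank to the currently lightest chain.
        size_t lightest =
            std::min_element(totals.begin(), totals.end()) - totals.begin();
        chains[lightest].push_back(rank);
        totals[lightest] += sizes[rank];
    }
    return chains;
}
```

The first rank placed in each chain would become that chain's aggregator; swapping in a different heuristic only means replacing this one function.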