
Conversation

scottwittenburg
Collaborator

@scottwittenburg scottwittenburg commented Jun 12, 2025

Add a new aggregation approach on the writer side, where ranks are grouped for writing into MPIChains of roughly equal total data size.

  • Add a new helper module to handle partitioning of ranks based on data size. It should be easy to add new partitioning strategies, and eventually we should allow selecting one at runtime using config options.
  • Add an MPIChain constructor that will put ranks into substreams based on a computed data-size-based partitioning of the ranks.
  • Update BP5Writer to collect some member fields into a new structure containing the things that should be cached. The purpose of this is to avoid opening the same files repeatedly for subsequent time steps, when that can be avoided.
  • Update BP5Writer initialization logic to support data-size-based aggregation (which is off by default and enabled with a configuration setting).
  • Add a new test that writes unbalanced data. The test generates unbalanced data for any number of ranks and any number of timesteps, and can write to any number of subfiles. Currently the default number of ranks is used to write 5 timesteps to 3 subfiles.
  • Update some BP5-only tests to repeat for each of the 4 aggregation types: DataSizeBased (added in this PR), EveryoneWrites, EveryoneWritesSerial, and TwoLevelShm.

@scottwittenburg
Collaborator Author

fyi @pnorbert and @eisenhauer

This is my strawman for data-size-based aggregation. Over the next couple of days I plan to comment in various locations in the changeset to ask questions about the direction this approach is taking, with the goal of getting your feedback.

@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch from 758fb60 to 1956453 Compare June 13, 2025 00:55
@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch from c9de252 to 4408611 Compare June 18, 2025 01:56
@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 18, 2025

With my current implementation, I have found a broken case in which 3 of 5 ranks get an aggregator cache hit while the other 2 do not.

I'll try to describe the problem by first showing the expected outputs when the aggregation type is EveryoneWritesSerial, and then comparing those to the outputs when the aggregation type is DataSizeBased.

In both cases I run with 5 MPI processes, 3 subfiles, and 2 timesteps.

First, look at the data produced when using EveryoneWritesSerial, and note that the test passes while producing this data.

EveryoneWritesSerial - bpls
$ ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bpls -d -D ./unbalanced_output.bp/
  uint64_t  GlobalArray  2*{5, 15}
        step 0: 
          block 0: [0:4,  0: 0]
    (0,0)    0 15 30 45 60 
          block 1: [0:4,  1: 2]
    (0,0)    1 2 16 17 31 32
    (3,0)    46 47 61 62 
          block 2: [0:4,  3: 5]
    (0,0)    3 4 5 18 19 20
    (2,0)    33 34 35 48 49 50
    (4,0)    63 64 65 
          block 3: [0:4,  6: 9]
    (0,0)    6 7 8 9 21 22
    (1,2)    23 24 36 37 38 39
    (3,0)    51 52 53 54 66 67
    (4,2)    68 69 
          block 4: [0:4, 10:14]
    (0,0)    10 11 12 13 14 25
    (1,1)    26 27 28 29 40 41
    (2,2)    42 43 44 55 56 57
    (3,3)    58 59 70 71 72 73
    (4,4)    74 
        step 1: 
          block 0: [0:4,  0: 4]
    (0,0)    74 75 76 77 78 89
    (1,1)    90 91 92 93 104 105
    (2,2)    106 107 108 119 120 121
    (3,3)    122 123 134 135 136 137
    (4,4)    138 
          block 1: [0:4,  5: 6]
    (0,0)    79 80 94 95 109 110
    (3,0)    124 125 139 140 
          block 2: [0:4,  7: 9]
    (0,0)    81 82 83 96 97 98
    (2,0)    111 112 113 126 127 128
    (4,0)    141 142 143 
          block 3: [0:4, 10:13]
    (0,0)    84 85 86 87 99 100
    (1,2)    101 102 114 115 116 117
    (3,0)    129 130 131 132 144 145
    (4,2)    146 147 
          block 4: [0:4, 14:14]
    (0,0)    88 103 118 133 148
EveryoneWritesSerial - bp5dbg
$ PYTHONPATH=~/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg ./unbalanced_output.bp/
========================================================
    Index Table File: ./unbalanced_output.bp//md.idx
========================================================
------------------------------------------------------------------------------------------------------------------
|        Version string            |Major|Minor|Patch|unused|Endian|BP version|BP minor|Active|ColumnMajor|unused|
|          32 bytes                |  1B |  1B |  1B |  1B  |  1B  |    1B    |   1B   |  1B  |    1B     |  23B |
+----------------------------------+-----+-----+-----+------+------+----------+--------+------+-----------+------+
| ADIOS-BP v2.:.0 Index Table      |  2  |  :  |  0  |      | yes  |     5    |   2    |  no  |    no     |      |
------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 'w', length = [64]
  WriterMap: Writers = 5  Aggregators = 3  Subfiles = 3
  =====================
  |  Rank  |  Subfile |
  ---------------------
  |      0 |        0 |
  |      1 |        0 |
  |      2 |        1 |
  |      3 |        1 |
  |      4 |        2 |
  =====================
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 0     | MetadataPos = 0          |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:0
 Writer 1 data loc:4096
 Writer 2 data loc:0
 Writer 3 data loc:4096
 Writer 4 data loc:0
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 1     | MetadataPos = 1248       |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:8192
 Writer 1 data loc:12288
 Writer 2 data loc:8192
 Writer 3 data loc:12288
 Writer 4 data loc:4096
--------------------------------------------------------------------------------------------------
========================================================
    MetaMetadata File: ./unbalanced_output.bp/mmd.0
========================================================
 File size = 628
 --------------------------------------------------
 |  Record |  Offset  |  ID length |  Info length |
 --------------------------------------------------
 |       0 |        0 |         12 |          600 |
 --------------------------------------------------
========================================================
    Metadata File: ./unbalanced_output.bp/md.0
========================================================
Step 0: 
  Offset = 0
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 80
    Writer 1: md size 232 offset 312
    Writer 2: md size 232 offset 544
    Writer 3: md size 232 offset 776
    Writer 4: md size 232 offset 1008
  Attribute metadata entries: 
    Writer 0: md size 0 offset 1240
    Writer 1: md size 0 offset 1240
    Writer 2: md size 0 offset 1240
    Writer 3: md size 0 offset 1240
    Writer 4: md size 0 offset 1240
========================================================
Step 1: 
  Offset = 1248
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 1328
    Writer 1: md size 232 offset 1560
    Writer 2: md size 232 offset 1792
    Writer 3: md size 232 offset 2024
    Writer 4: md size 232 offset 2256
  Attribute metadata entries: 
    Writer 0: md size 0 offset 2488
    Writer 1: md size 0 offset 2488
    Writer 2: md size 0 offset 2488
    Writer 3: md size 0 offset 2488
    Writer 4: md size 0 offset 2488

Now look at the data produced when using DataSizeBased (which is associated with test failure).

DataSizeBased - bpls
$ ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bpls -d -D ./unbalanced_output.bp/
  uint64_t  GlobalArray  2*{5, 15}
        step 0: 
          block 0: [0:4,  0: 0]
    (0,0)    0 0 0 0 0 
          block 1: [0:4,  1: 2]
    (0,0)    1 2 16 17 31 32
    (3,0)    46 47 61 62 
          block 2: [0:4,  3: 5]
    (0,0)    3 4 5 18 19 20
    (2,0)    33 34 35 48 49 50
    (4,0)    63 64 65 
          block 3: [0:4,  6: 9]
    (0,0)    0 0 0 0 0 0
    (1,2)    0 0 0 0 0 0
    (3,0)    0 0 0 0 0 0
    (4,2)    0 0 
          block 4: [0:4, 10:14]
    (0,0)    0 0 0 0 0 0
    (1,1)    0 0 0 0 0 0
    (2,2)    0 0 0 0 0 0
    (3,3)    0 0 0 0 0 0
    (4,4)    0 
        step 1: 
          block 0: [0:4,  0: 4]
    (0,0)    74 75 76 77 78 89
    (1,1)    90 91 92 93 104 105
    (2,2)    106 107 108 119 120 121
    (3,3)    122 123 134 135 136 137
    (4,4)    138 
          block 1: [0:4,  5: 6]
    (0,0)    79 80 94 95 109 110
    (3,0)    124 125 139 140 
          block 2: [0:4,  7: 9]
    (0,0)    81 82 83 96 97 98
    (2,0)    111 112 113 126 127 128
    (4,0)    141 142 143 
          block 3: [0:4, 10:13]
    (0,0)    0 0 0 0 0 0
    (1,2)    0 0 0 0 0 0
    (3,0)    0 0 0 0 0 0
    (4,2)    0 0 
          block 4: [0:4, 14:14]
    (0,0)    88 103 118 133 148
DataSizeBased - bp5dbg
$ PYTHONPATH=~/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg ./unbalanced_output.bp/
========================================================
    Index Table File: ./unbalanced_output.bp//md.idx
========================================================
------------------------------------------------------------------------------------------------------------------
|        Version string            |Major|Minor|Patch|unused|Endian|BP version|BP minor|Active|ColumnMajor|unused|
|          32 bytes                |  1B |  1B |  1B |  1B  |  1B  |    1B    |   1B   |  1B  |    1B     |  23B |
+----------------------------------+-----+-----+-----+------+------+----------+--------+------+-----------+------+
| ADIOS-BP v2.:.0 Index Table      |  2  |  :  |  0  |      | yes  |     5    |   2    |  no  |    no     |      |
------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 'w', length = [64]
  WriterMap: Writers = 5  Aggregators = 0  Subfiles = 3
  =====================
  |  Rank  |  Subfile |
  ---------------------
  |      0 |        1 |
  |      1 |        2 |
  |      2 |        2 |
  |      3 |        1 |
  |      4 |        0 |
  =====================
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 0     | MetadataPos = 0          |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:4096
 Writer 1 data loc:4096
 Writer 2 data loc:0
 Writer 3 data loc:0
 Writer 4 data loc:0
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 'w', length = [64]
  WriterMap: Writers = 5  Aggregators = 0  Subfiles = 3
  =====================
  |  Rank  |  Subfile |
  ---------------------
  |      0 |        0 |
  |      1 |        2 |
  |      2 |        2 |
  |      3 |        1 |
  |      4 |        1 |
  =====================
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 1     | MetadataPos = 1248       |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:8192
 Writer 1 data loc:12288
 Writer 2 data loc:8192
 Writer 3 data loc:8192
 Writer 4 data loc:12288
--------------------------------------------------------------------------------------------------
========================================================
    MetaMetadata File: ./unbalanced_output.bp/mmd.0
========================================================
 File size = 628
 --------------------------------------------------
 |  Record |  Offset  |  ID length |  Info length |
 --------------------------------------------------
 |       0 |        0 |         12 |          600 |
 --------------------------------------------------
========================================================
    Metadata File: ./unbalanced_output.bp/md.0
========================================================
Step 0: 
  Offset = 0
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 80
    Writer 1: md size 232 offset 312
    Writer 2: md size 232 offset 544
    Writer 3: md size 232 offset 776
    Writer 4: md size 232 offset 1008
  Attribute metadata entries: 
    Writer 0: md size 0 offset 1240
    Writer 1: md size 0 offset 1240
    Writer 2: md size 0 offset 1240
    Writer 3: md size 0 offset 1240
    Writer 4: md size 0 offset 1240
========================================================
Step 1: 
  Offset = 1248
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 1328
    Writer 1: md size 232 offset 1560
    Writer 2: md size 232 offset 1792
    Writer 3: md size 232 offset 2024
    Writer 4: md size 232 offset 2256
  Attribute metadata entries: 
    Writer 0: md size 0 offset 2488
    Writer 1: md size 0 offset 2488
    Writer 2: md size 0 offset 2488
    Writer 3: md size 0 offset 2488
    Writer 4: md size 0 offset 2488

Note when reading the output above that the size-based partitioning produced at step 0 was [[4],[3,0],[2,1]], while at step 1 the partitioning was [[0],[3,4],[2,1]]. Considering that information, there's at least one noticeable problem: the Writer 0 data loc:8192 for the second timestep seems like it should actually be Writer 0 data loc:4096, since only one rank (rank 4) wrote to file 0 on the previous timestep.

Any idea where I should be looking to correct that problem? It seems like it's in the metadata index file, but beyond that, I'm not sure. Also, it appears to me that the other writer data locations for the second timestep are correct, which is confusing, unless I'm interpreting the data wrong.

One last note: when repeating the above tests with only one time step, the test passes in both cases and the data looks fine. To me that seems to indicate the problem is related to repeating some of the initialization in InitTransports() and InitBPBuffer() more than once.

@eisenhauer
Member

Hmm. A quick question: are you using SelectiveAggregationMetadata? I.e., is the UseSelectiveMetadataAggregation engine parameter at its default "true" value, and is the first branch of this if in BP5Writer being taken?

    if (m_Parameters.UseSelectiveMetadataAggregation)
    {
        SelectiveAggregationMetadata(TSInfo);
    }
    else
    {
        TwoLevelAggregationMetadata(TSInfo);
    }

Because if not, and you're instead falling into TwoLevelAggregationMetadata, then all the unfortunate effects that I might have anticipated for >6K nodes are hitting you now...

@scottwittenburg
Collaborator Author

A quick question. Are you using SelectiveAggregationMetadata?

Ah yes, great question, I assume you asked because you noticed the re-ordering between timesteps. I just confirmed that in this case, it's taking the SelectiveAggregationMetadata path.

@scottwittenburg
Collaborator Author

The aggregator rank in each MPI chain receives the final data position from the last rank in the chain. Is that because there is an assumption that the aggregator rank is going to be writing to the same file the next time around?

@eisenhauer
Member

Just wanted to make sure, because the TwoLevel stuff uses those aggregators, and if they're changing, potentially bad things happen. Also, as a debugging aid, make sure this #ifdef DUMPDATALOCINFO evaluates to true so that you get debugging output from writer rank 0 without having to rely on bp5dbg. I'm assuming that you're not doing any data flushes, so what happens in FlushData is irrelevant. So you'll be tracking the m_StartDataPos that comes into SelectiveAggregationMetadata. Whatever happens in WriteData, that should accurately reflect where that rank's data starts in whatever file it lands in. If it doesn't, nothing else will go well after that.

@eisenhauer
Member

Something else to note while I'm thinking about it. You'll notice that WriteData() is also being called in FlushData(). FlushData is a user-visible BP5 call that allows the application to write partial timestep data in the middle of a timestep (i.e. before EndStep()). This is useful if you are pushing memory limits and don't have room to buffer. So, for example, you could do BeginStep(), Put(), Put(), FlushData(), Put(), Put(), FlushData(), Put(), EndStep(). Each of those flushes, and the EndStep(), results in data (but not metadata) going to disk.

The FlushPosSizeInfo vector on rank 0 accumulates the info from each rank about where its data landed and how much data there was on each flush. Note that, as a natural consequence of writing piece-wise, the data from rank X is not contiguous in the data file but broken up into multiple segments. In order to find the data at Offset M, timestep T, from rank N, we have to have the offset and size of each chunk, so this all ends up in a single metadata index record for that timestep.

The relevant bit for you is that WriterMapRecords only occur between timesteps, which means we don't have the ability to let a rank change which file it is writing to in the middle of a timestep. So, the first time WriteData() is called in a timestep (which might be in a FlushData or in EndStep), you are free to change the WriterSubfileMap, but you should not change it again in that timestep. You're probably not playing with those examples yet, but this is something that will require a little bit of extra logic to make sure it doesn't happen.

@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 25, 2025

By the way, on the master branch, when I #define DUMPDATALOCINFO, this is what it outputs for the first time step:

Extra debug output:
315: Flush count is :0
315: Write Index positions = {
315: Writer 0 has data at: 
315: loc:O
315: Writer 1 has data at: 
315: loc:S
315: Writer 2 has data at: 
315: loc:-
315: Writer 3 has data at: 
315: loc:B
315: Writer 4 has data at: 
315: loc:P
315: Writer 5 has data at: 
315: loc: 
315: Writer 6 has data at: 
315: loc:v
315: Writer 7 has data at: 
315: loc:2
315: Writer 8 has data at: 
315: loc:.
315: Writer 9 has data at: 
315: loc::
315: Writer 10 has data at: 
315: loc:.
315: Writer 11 has data at: 
315: loc:0
315: Writer 12 has data at: 
315: loc: 
315: Writer 13 has data at: 
315: loc:I
315: Writer 14 has data at: 
315: loc:n
315: Writer 15 has data at: 
315: loc:d
315: Writer 16 has data at: 
315: loc:e
315: Writer 17 has data at: 
315: loc:x
315: Writer 18 has data at: 
315: loc: 
315: Writer 19 has data at: 
315: loc:T
315: Writer 20 has data at: 
315: loc:a
315: Writer 21 has data at: 
315: loc:b
315: Writer 22 has data at: 
315: loc:l
315: Writer 23 has data at: 
315: loc:e

For subsequent time steps, it seems to print nothing instead of those characters from the buffer. I wonder if that block didn't get updated when some other bit of the code did, due to it not getting exercised and validated?

@eisenhauer
Member

For subsequent time steps, it seems to print nothing instead of those characters from the buffer. I wonder if that block didn't get updated when some other bit of the code did, due to it not getting exercised and validated?

Ah, yes, that's probably the case. Those items are size_t and maybe the type of buf changed. Which may mean those index calcs into it are off too. I can't look tonight, but will check tomorrow.

@eisenhauer
Member

OK, try substituting this code into the matching bits of WriteMetadataFileIndex()

    // Step record
    record = StepRecord;
#ifndef DUMPDATALOCINFO
    size_t StepRecordStartPos = pos;
#endif
    helper::CopyToBuffer(buf, pos, &record, 1); // record type
    d = (3 + ((FlushPosSizeInfo.size() * 2) + 1) * m_Comm.Size()) * sizeof(uint64_t);
    helper::CopyToBuffer(buf, pos, &d, 1); // record length
    helper::CopyToBuffer(buf, pos, &MetaDataPos, 1);
    helper::CopyToBuffer(buf, pos, &MetaDataSize, 1);
    d = static_cast<uint64_t>(FlushPosSizeInfo.size());
    helper::CopyToBuffer(buf, pos, &d, 1);

    for (int writer = 0; writer < m_Comm.Size(); writer++)
    {
        for (size_t flushNum = 0; flushNum < FlushPosSizeInfo.size(); flushNum++)
        {
            // copy the (position, size) pair for this flush
            helper::CopyToBuffer(buf, pos, &FlushPosSizeInfo[flushNum][2 * writer], 2);
        }
        helper::CopyToBuffer(buf, pos, &m_WriterDataPos[writer], 1);
    }

    m_FileMetadataIndexManager.WriteFiles((char *)buf.data(), buf.size());
#ifndef DUMPDATALOCINFO
    std::cout << "WriterMapRecordType is: " << (buf.data() + StepRecordStartPos)[0] << std::endl;
    size_t *BufPtr = (size_t*)(buf.data() + StepRecordStartPos + 1);
    std::cout << "WriterMapRecordLength is: " << *BufPtr++ << std::endl;
    std::cout << "MetadataPos is: " << *BufPtr++ << std::endl;
    std::cout << "MetadataSize is: " << *BufPtr++ << std::endl;
    std::cout << "Flush count is :" << *BufPtr++ << std::endl;
    std::cout << "Write Index positions = {" << std::endl;

    for (size_t i = 0; i < static_cast<size_t>(m_Comm.Size()); ++i)
    {
        std::cout << "Writer " << i << " has data at: " << std::endl;
        for (size_t j = 0; j < FlushPosSizeInfo.size(); ++j)
        {
            std::cout << "loc:" << *BufPtr++;
            std::cout << " siz:" << *BufPtr++ << std::endl;
        }
        std::cout << "loc:" << *BufPtr++ << std::endl;
    }
    std::cout << "}" << std::endl;
#endif
    m_FileMetadataIndexManager.FlushFiles();

(I won't put these changes into a PR now because that would just make merging your code more difficult when it comes to that.)

@scottwittenburg
Collaborator Author

OK, try substituting this code into the matching bits of WriteMetadataFileIndex()

Thanks Greg, that works and jibes with both the extra debug printing I was doing and with what bp5dbg shows. Having this is nice, though, because I have been printing the source information, and what you have here confirms that it is transferred to the buffer correctly.

So I'm just curious whether the summary below 1) contains all the relevant information you would expect is needed to indicate correct writing of the data, and 2) actually looks correct to you based on the writer chains at each time step (it looks correct to me).

Debug output showing writer chains and metadata
Step 0:

    Rank data sizes: [48, 80, 128, 160, 208]
    Partitioning resulted in 3 substreams:
    0: [4 ]
    1: [3 0 ]
    2: [2 1 ]

    WriterMapRecordType is: s
    WriterMapRecordLength is: 64
    MetadataPos is: 0
    MetadataSize is: 1248
    Flush count is :0

    Write Index positions = {
    Writer 0 has data at:
    loc:4096
    Writer 1 has data at:
    loc:4096
    Writer 2 has data at:
    loc:0
    Writer 3 has data at:
    loc:0
    Writer 4 has data at:
    loc:0
    }

Step 1:

    Rank data sizes: [208, 80, 128, 160, 48]
    Partitioning resulted in 3 substreams:
    0: [0 ]
    1: [3 4 ]
    2: [2 1 ]

    WriterMapRecordType is: s
    WriterMapRecordLength is: 64
    MetadataPos is: 1248
    MetadataSize is: 1248
    Flush count is :0

    Write Index positions = {
    Writer 0 has data at:
    loc:4096
    Writer 1 has data at:
    loc:12288
    Writer 2 has data at:
    loc:8192
    Writer 3 has data at:
    loc:8192
    Writer 4 has data at:
    loc:12288
    }

And here is the comparison of what is produced by EveryoneWritesSerial aggregation vs DataSizeBased:

Side-by-side bpls output
       EveryoneWritesSerial          |             DataSizeBased
-----------------------------------  |  -----------------------------------
  uint64_t  GlobalArray  2*{5, 15}   |    uint64_t  GlobalArray  2*{5, 15}
        step 0:                      |          step 0:
          block 0: [0:4,  0: 0]      |            block 0: [0:4,  0: 0]
    (0,0)    0 15 30 45 60           |      (0,0)    0 0 0 0 0
          block 1: [0:4,  1: 2]      |            block 1: [0:4,  1: 2]
    (0,0)    1 2 16 17 31 32         |      (0,0)    1 2 16 17 31 32
    (3,0)    46 47 61 62             |      (3,0)    46 47 61 62
          block 2: [0:4,  3: 5]      |            block 2: [0:4,  3: 5]
    (0,0)    3 4 5 18 19 20          |      (0,0)    3 4 5 18 19 20
    (2,0)    33 34 35 48 49 50       |      (2,0)    33 34 35 48 49 50
    (4,0)    63 64 65                |      (4,0)    63 64 65
          block 3: [0:4,  6: 9]      |            block 3: [0:4,  6: 9]
    (0,0)    6 7 8 9 21 22           |      (0,0)    0 0 0 0 0 0
    (1,2)    23 24 36 37 38 39       |      (1,2)    0 0 0 0 0 0
    (3,0)    51 52 53 54 66 67       |      (3,0)    0 0 0 0 0 0
    (4,2)    68 69                   |      (4,2)    0 0
          block 4: [0:4, 10:14]      |            block 4: [0:4, 10:14]
    (0,0)    10 11 12 13 14 25       |      (0,0)    0 0 0 0 0 0
    (1,1)    26 27 28 29 40 41       |      (1,1)    0 0 0 0 0 0
    (2,2)    42 43 44 55 56 57       |      (2,2)    0 0 0 0 0 0
    (3,3)    58 59 70 71 72 73       |      (3,3)    0 0 0 0 0 0
    (4,4)    74                      |      (4,4)    0
        step 1:                      |          step 1:
          block 0: [0:4,  0: 4]      |            block 0: [0:4,  0: 4]
    (0,0)    74 75 76 77 78 89       |      (0,0)    74 75 76 77 78 89
    (1,1)    90 91 92 93 104 105     |      (1,1)    90 91 92 93 104 105
    (2,2)    106 107 108 119 120 121 |      (2,2)    106 107 108 119 120 121
    (3,3)    122 123 134 135 136 137 |      (3,3)    122 123 134 135 136 137
    (4,4)    138                     |      (4,4)    138
          block 1: [0:4,  5: 6]      |            block 1: [0:4,  5: 6]
    (0,0)    79 80 94 95 109 110     |      (0,0)    79 80 94 95 109 110
    (3,0)    124 125 139 140         |      (3,0)    124 125 139 140
          block 2: [0:4,  7: 9]      |            block 2: [0:4,  7: 9]
    (0,0)    81 82 83 96 97 98       |      (0,0)    81 82 83 96 97 98
    (2,0)    111 112 113 126 127 128 |      (2,0)    111 112 113 126 127 128
    (4,0)    141 142 143             |      (4,0)    141 142 143
          block 3: [0:4, 10:13]      |            block 3: [0:4, 10:13]
    (0,0)    84 85 86 87 99 100      |      (0,0)    0 0 0 0 0 0
    (1,2)    101 102 114 115 116 117 |      (1,2)    0 0 0 0 0 0
    (3,0)    129 130 131 132 144 145 |      (3,0)    0 0 0 0 0 0
    (4,2)    146 147                 |      (4,2)    0 0
          block 4: [0:4, 14:14]      |            block 4: [0:4, 14:14]
    (0,0)    88 103 118 133 148      |      (0,0)    88 103 118 133 148

Let me know if you spot something that looks wrong, or if I haven't included complete details.

@eisenhauer
Member

The metadata file index stuff does look like what I'd expect. However, things obviously aren't right even with just a single timestep, so something is clearly going badly. I'd like to try to reproduce this output, and I'm hopeful that if I step through with the debugger on the reader side I might see what the issue is. Can you tell me what/how you're running on the writer side so I can give that a go?

@scottwittenburg
Collaborator Author

However, things obviously aren't right even with just a single timestep,

Well, if I run the test with just a single timestep, everything looks fine and works. But when I run the test with two timesteps, somehow the first timestep gets messed up in retrospect, or something.

so something is clearly going badly

😅 That's for sure!

Can you tell me what/how you're running on the writer side so I can give that a go?

That would be so great, thank you! I'm just running variations of one of the new tests I added, which you can run with all the defaults as:

ctest -R Engine.BP.DSATest.TestWriteUnbalancedData.BP5.DSB.MPI --verbose

The --verbose argument to ctest shows the underlying command it runs, and you can run that directly if you want to change the number of MPI processes or any of the default arguments (which are: aggregation type, number of subfiles, and number of time steps). This is what I'm doing to produce the broken bp file: 5 ranks, DataSizeBased aggregation, 3 subfiles, 2 timesteps. Here's the command:

mpiexec "-n" "5" \
    "./bin/Test.Engine.BP.DataSizeAggregate.MPI" \
    "--gtest_filter=DSATest.TestWriteUnbalancedData" \
    "DataSizeBased" "3" "2"

Many combinations of parameters don't result in failure or a messed-up output file like the example above does. I suspect it's probably related to how some of the ranks change the subfile they're writing to between steps. You can try with 1 timestep instead of 2, and you should see that it works.

Note: I need to rm -rf ./unbalanced_output.bp from the build tree between runs, or it sometimes causes the test to fail. Maybe because some of the files don't get overwritten? Or running with more ranks and subsequently fewer leaves files in the directory that break things? 🤷

@eisenhauer
Member

eisenhauer commented Jun 26, 2025

OK, I'll give this a go. I've got a couple of calls yet this afternoon, so it won't be right away. Just FYI, in case you get there first, my plan is to look at what is happening in BP5Reader::ReadData(). The BP5Deserializer looks at the Get()s that happen in a timestep and translates those into a ReadList of data that it needs. That list looks like: from WriterRank W, data from Timestep T, read Length bytes from Offset S. ReadData()'s responsibility is to translate that into actual data access. One of the first things it does is to translate W and T into a SubfileNum. Looking at what's going wrong there should be informative.

@scottwittenburg
Collaborator Author

Ok thanks for that information, I'll start looking in BP5Reader::ReadData() as well, since I think it will be helpful for me to have at least some familiarity with the read side.

@eisenhauer
Member

output.txt
Attaching a file with output from bpls with some extra diagnostic output. It looks like we're going to the right data file and the right place in the data file to read for timestep 0, writer rank 0.

(base) eisen@Endor build % bin/bpls -D -d unbalanced_output.bp/
  uint64_t  GlobalArray  2*{5, 15}
        step 0: 
          block 0: [0:4,  0: 0]
WritermapIndex for Timestep 0 is 0
Read from Timestep 0 WriterRank 0 Going to Subfilenum 1
Subfilenum 1 has name unbalanced_output.bp/data.1
DataPosition for Timestep 0 WriterRank 0 is 4096
Offset of Data in block is 0 Final offset in subfile 1 is 4096
    (0,0)    0 0 0 0 0 

But we're not finding the data that should be there. You say that this works with writing just one timestep. Is it possible that writing the 2nd timestep is killing some of the data from the first?

@eisenhauer
Member

Whatever is going on, subfile 1 has zeroes up through Offset 12288, and that doesn't look right.

(base) eisen@Endor build % od -A x unbalanced_output.bp/data.1 
0000000    000000  000000  000000  000000  000000  000000  000000  000000
*
0003000    000130  000000  000000  000000  000147  000000  000000  000000
0003010    000166  000000  000000  000000  000205  000000  000000  000000
0003020    000224  000000  000000  000000  000000  000000  000000  000000
0003030

@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 26, 2025

subfile 1 has zeroes up through Offset 12288, and that doesn't look right

Huh, that is interesting. There is initialization stuff that gets done on every timestep now (it used to only be done once), and I'm just selectively guarding against doing some specific things within there. Maybe more bits need to be guarded against running 1) on a timestep after the first, or 2) in the case of an aggregator output file cache hit. Like seeking to position 0 in a file, for example.

I wrote a tiny standalone program just to read the file (mainly so I could run it with a single process), and then added some debug printing to the reader. I see a couple weird things off the bat:

single process reader debug output
rank 0 size 1
ReadRequest:   Timestep: 0  WriterRank: 4  StartOffset: 0  ReadLength: 200
ReadData SubFileNum: 0
ReadRequest:   Timestep: 0  WriterRank: 0  StartOffset: 0  ReadLength: 40
ReadData SubFileNum: 1
ReadRequest:   Timestep: 0  WriterRank: 3  StartOffset: 0  ReadLength: 160
ReadData SubFileNum: 1
ReadRequest:   Timestep: 0  WriterRank: 1  StartOffset: 0  ReadLength: 80
ReadData SubFileNum: 2
ReadRequest:   Timestep: 0  WriterRank: 2  StartOffset: 0  ReadLength: 120
ReadData SubFileNum: 2
Step data:
0 1 2 3 4 5 0 0 0 0 0 0 0 0 0 0 16 17 18 19 20 0 0 0 0 0 0 0 0 0 0 31 32 33 34 35 0 0 0 0 0 0 0 0 0 0 46 47 48 49 50 0 0 0 0 0 0 0 0 0 0 61 62 63 64 65 0 0 0 0 0 0 0 0 0 
ReadRequest:   Timestep: 1  WriterRank: 0  StartOffset: 0  ReadLength: 200
ReadData SubFileNum: 0
ReadRequest:   Timestep: 1  WriterRank: 3  StartOffset: 0  ReadLength: 160
ReadData SubFileNum: 1
ReadRequest:   Timestep: 1  WriterRank: 4  StartOffset: 0  ReadLength: 40
ReadData SubFileNum: 1
ReadRequest:   Timestep: 1  WriterRank: 1  StartOffset: 0  ReadLength: 80
ReadData SubFileNum: 2
ReadRequest:   Timestep: 1  WriterRank: 2  StartOffset: 0  ReadLength: 120
ReadData SubFileNum: 2
Step data:
74 75 76 77 78 79 80 81 82 83 0 0 0 0 88 89 90 91 92 93 94 95 96 97 98 0 0 0 0 103 104 105 106 107 108 109 110 111 112 113 0 0 0 0 118 119 120 121 122 123 124 125 126 127 128 0 0 0 0 133 134 135 136 137 138 139 140 141 142 143 0 0 0 0 148

It's odd to me that the start offset is 0 regardless of timestep or writer rank. It's also odd that three of the read lengths are off by 8 bytes (something I also see in bp5dbg output). Based on the amounts of data I debug printed on each rank, I would think 200 above should be 208, 120 should be 128, and 40 should be 48.

@scottwittenburg
Collaborator Author

Hmmmm, I wonder: if rank X opens a file already opened and written by rank Y, should it be doing that in append mode? Otherwise the file will get truncated and overwritten.

@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 26, 2025

If rank X opens a file already opened and written by rank Y, should it be doing that in append mode? Otherwise the file will get truncated and overwritten.

That was the problem, the hint from running od was the clue that eventually led to the solution. In 13d7f06 (force-pushed), I have patched it so that the problem case now passes the test and produces the expected data!
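A minimal stdio sketch of the failure mode (a hypothetical standalone demo with made-up paths, not ADIOS2 code): when a second rank re-opens an existing subfile in a truncating mode ("wb"), the first rank's block is silently dropped, and the second rank's write at its assigned offset leaves a hole of zeros before it, which matches the zeroed regions seen in the od output above. Opening in a non-truncating mode instead would preserve the earlier block.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Rank Y creates the subfile and writes its block; rank X later re-opens
// the same subfile with "wb" (which truncates), then writes at its own
// assigned offset. Returns the final file contents for inspection.
std::string DemoTruncateOnReopen(const char *path)
{
    FILE *y = std::fopen(path, "wb"); // "rank Y" creates the subfile
    std::fwrite("YYYY", 1, 4, y);
    std::fflush(y); // Y's block is in the file

    FILE *x = std::fopen(path, "wb"); // "rank X" re-opens: this TRUNCATES
    std::fseek(x, 4, SEEK_SET);       // X seeks to its assigned offset
    std::fwrite("XXXX", 1, 4, x);
    std::fclose(x);
    std::fclose(y);

    // Read back: bytes 0-3 are now a hole of zeros where "YYYY" used to be.
    FILE *f = std::fopen(path, "rb");
    char buf[8];
    std::fread(buf, 1, sizeof(buf), f);
    std::fclose(f);
    return std::string(buf, sizeof(buf));
}
```

Had rank X opened with a non-truncating mode (e.g. "r+b" or append), Y's four bytes would survive instead of reading back as zeros.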

@eisenhauer
Member

Great! Yeah, you may be running into filesystem oddities with this approach. Are you closing files after the write responsibility shifts? If not I can imagine that if the responsibility shifts back you might need to close and re-open to get a sane file pointer.

@scottwittenburg
Collaborator Author

But we're not finding the data that should be there. You say that this works with writing just one timestep. Is it possible that writing the 2nd timestep is killing some of the data from the first?

Also that clue is huge 😆

Your help was invaluable in finding this bug I wrote, thank you @eisenhauer!

I still have 2 things to handle which you pointed out in previous comments:

  1. two-level metadata aggregation: either metadata swizzling or else separate comm for metadata chains
  2. guarding against Bad Things when doing multiple Put() within a timestep

@scottwittenburg
Collaborator Author

Are you closing files after the write responsibility shifts? If not I can imagine that if the responsibility shifts back you might need to close and re-open to get a sane file pointer.

Currently each rank has a separate transport manager for every file it ends up writing to. I'm keeping each of those files open throughout the run, to avoid unnecessary Open() in case a rank ever switches back to a file it has already written to previously.

Does that seem like a bad idea? Maybe I need to seek to some location in that case? Or does WriteFileAt(DataVec.data(), DataVec.size(), m_StartDataPos) handle writing somewhere away from the current file position?

@eisenhauer
Member

2. guarding against Bad Things when doing multiple Put() within a timestep

Just to clarify this one, multiple Put() shouldn't change anything for you. You still get a single contiguous data block. It's calls to PerformDataWrite() that would potentially mess you up. That would result in multiple calls to WriteData() in a single timestep.

@eisenhauer
Member

Are you closing files after the write responsibility shifts? If not I can imagine that if the responsibility shifts back you might need to close and re-open to get a sane file pointer.

Currently each rank has a separate transport manager for every file it ends up writing to. I'm keeping each of those files open throughout the run, to avoid unnecessary Open() in case a rank ever switches back to a file it has already written to previously.

Does that seem like a bad idea? Maybe I need to seek to some location in that case? Or does WriteFileAt(DataVec.data(), DataVec.size(), m_StartDataPos) handle writing somewhere away from current fp?

This is where I think things might get funky because you are outside of normal Posix semantics. If you write data in a file and leave it open, then someone else opens that file in append mode and adds to it, what of that append does the first file see? Can you just fseek() in the first file to the new end? Or will the first file insist that it knows where the end of the file is and refuse? After all, that's got to be a really rare circumstance that an open file is extended behind your back, so seems like a corner case that might be dropped, not considered, or optimized away. I'm curious to know!
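The "extended behind your back" question can be sketched with two handles in one process (a hypothetical standalone demo, not ADIOS2 code). On a local Linux filesystem, fseek(SEEK_END) re-queries the underlying file size, so the first handle does see bytes appended through a second handle; the caveat is that a distributed filesystem client (e.g. Lustre) may cache the size, which is exactly where behavior could diverge.

```cpp
#include <cassert>
#include <cstdio>

// Open a file, keep it open, let a "second rank" append through another
// handle, then ask the first handle to seek to the new end of file.
long EndSeenByFirstHandle(const char *path)
{
    FILE *first = std::fopen(path, "wb");
    std::fwrite("AAAA", 1, 4, first);
    std::fflush(first); // 4 bytes in the file

    FILE *second = std::fopen(path, "ab"); // "another rank" appends
    std::fwrite("BBBB", 1, 4, second);
    std::fclose(second); // file is now 8 bytes

    // Does the first (still-open) handle see the extension?
    std::fseek(first, 0, SEEK_END); // SEEK_END consults the current size
    long end = std::ftell(first);
    std::fclose(first);
    return end;
}
```

On local POSIX filesystems this returns the extended size (8 here); whether a Lustre client observes the same without a sync/barrier is the open question in this thread.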

@scottwittenburg
Collaborator Author

2. guarding against Bad Things when doing multiple Put() within a timestep

Just to clarify this one, multiple Put() shouldn't change anything for you. You still get a single contiguous data block. It's calls to PerformDataWrite() that would potentially mess you up. That would result in multiple calls to WriteData() in a single timestep.

Ok, yeah, thanks for clarifying. I'll make sure to only repartition ranks and rebuild mpi chains once per time step.

@scottwittenburg
Collaborator Author

Well boo, I thought my zero blocks problem was solved, but it has cropped up again on a real HPC system with a Lustre filesystem. In that environment the problem can happen even with a single timestep 😱 but doesn't seem fully reproducible. One run had two blocks in a single time step get zeroed out; another time just one block got zeroed. In both cases, the zeroed-out block(s) were written by the first rank in a chain of multiple writers.

One thing I'm wondering now that I have seen a problem with a single timestep: maybe there is something happening in between the BP5Writer() constructor and the first WriteData() that depends on something that happens in InitAggregator(), InitTransports(), and InitBPBuffer()? I know BeginStep() happens in between there, but it doesn't appear to me that anything there depends on initialization of aggregators, transports, or the BP buffer.

@eisenhauer
Member

Hmm. If even a single step is being messed up, I'm at a bit of a loss. The interweaving of things in and around transports is complex, though. Maybe Norbert might have ideas?

If it were multiple steps, and early blocks being zeroed by writes to the same file from other ranks I'd be suspicious of running afoul of filesystem behaviour. If that were happening, I might consider a different approach where no file is ever written by more than one rank. To preserve the writer side flexibility you want to have, always have rank M write to data.M when it is the aggregator. In the worst case you might have N data files for N ranks, but in the situation where you had relative stability and few repartitions, you'd probably not be that different than what we're doing now (I.E. many ranks would never write). You could even look in each aggregation set and try to choose a rank that has written before as the rank0 of that group to minimize new file creation. (The worry with lots of data files is that a reader has to have a lot open which may put pressure on FDs, etc. But we've got code in place on the reader side to handle that already, so it's maybe not as big a worry as it might otherwise be.)

@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 30, 2025

It occurred to me that one other difference between EveryoneWritesSerial and DataSizeBased aggregation for a single timestep is that in the latter case, I'm currently calling OpenFiles() without a communicator (to avoid MPI deadlock). Is it possible that Lustre could have problems with even a very few ranks opening the same file simultaneously? Maybe it's worth splitting the communicators of the writer chains at that point so that all ranks that actually need to open files at one time can do so in a synchronized manner.

@eisenhauer
Member

It occurred to me that one other difference between EveryoneWritesSerial and DataSizeBased aggregation for a single timestep is that in the latter case, I'm currently calling OpenFiles() without a communicator (to avoid MPI deadlock). Is it possible that Lustre could have problems with even a very few ranks opening the same file simultaneously? Maybe it's worth splitting the communicators of the writer chains at that point so that all ranks that actually need to open files at one time can do so in a synchronized manner.

Hmm. Maybe. Particularly if all are write/create... It's at least a theory.
BTW, I'm off to Guam this week, so responding from odd time zones.

@scottwittenburg
Collaborator Author

Hmm. Maybe. Particularly if all are write/create... It's at least a theory.

Not sure if I should celebrate yet, but I've done a few small runs since I added back a communicator to the OpenFiles() calls, and I've not seen the issue with zeroed out blocks again. I wasn't able to do any runs on Friday without encountering the problem, so I'm kinda hopeful. 🤞

@pnorbert
Contributor

pnorbert commented Jul 1, 2025

It occurred to me that one other difference between EveryoneWritesSerial and DataSizeBased aggregation for a single timestep is that in the latter case, I'm currently calling OpenFiles() without a communicator (to avoid MPI deadlock). Is it possible that Lustre could have problems with even a very few ranks opening the same file simultaneously? Maybe it's worth splitting the communicators of the writer chains at that point so that all ranks that actually need to open files at one time can do so in a synchronized manner.

I swear I knew this a long time ago, but had to double check. So I wrote a test to check POSIX open. Indeed, if multiple processes create the same file for writing, it sometimes occurs that the content some process writes will be strangely (but always entirely) missing from the file when everyone has closed it. The theory is that such a process gets a descriptor for a file object that is then overshadowed by another descriptor created under the same name, and all the writes from that process disappear.

The test showed this even in WSL (in a single ext4 vhd on a single nvme), no need to test on Lustre and multiple nodes.

@pnorbert
Contributor

pnorbert commented Jul 1, 2025

The Transport::OpenChain() function was added to adios to support the EW/EWS aggregation. Besides that, nothing in adios writes to one file from multiple processes, let alone creates one file from multiple processes. So indeed, I knew this a few years ago.

commit 6207bbea8bc1f08077f4cfd86e29e4bc27cb4358
Author: Norbert Podhorszki <[email protected]>
Date:   Mon Jun 21 09:47:50 2021 -0400

    aggregation for BP5 - step 1: data files (index not done yet)

commit 0b5262f33a48649e04cc91de5b1c0e36b7335c79
Author: Norbert Podhorszki <[email protected]>
Date:   Mon Jun 21 09:46:35 2021 -0400

    added Transport function to open file by multiple processes in a chain

commit 7e8fbd90f934a818cf28bec91fd938113d7074bf
Author: Norbert Podhorszki <[email protected]>
Date:   Fri Jun 18 07:35:16 2021 -0400

    towards aggregation in BP5

@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch from 62a70c4 to 9a1ca59 Compare July 3, 2025 18:54
@scottwittenburg
Collaborator Author

scottwittenburg commented Jul 30, 2025

In the most recent commit (bc366af (force-pushed)) I changed the default aggregator to be the new data-size based approach, and that seems to have caused up to 17 tests to fail. Here's the list:

Archive.Attribute.bp5.Unique
Archive.Common.bp5.Unique
Engine.BP.*/BPAppendAfterStepsP.Test/*.BP5.MPI
Engine.BP.*/BPAppendAfterStepsP.Test/*.BP5.Serial
Engine.BP.*/BPParameterSelectStepsP.Stream/*.BP5.MPI
Engine.BP.*/BPParameterSelectStepsP.Stream/*.BP5.Serial
Engine.BP.BPWriteAppendReadTestADIOS2.ADIOS2BPWriteAppendRead2D2x4.BP5.MPI
Engine.BP.BPWriteAppendReadTestADIOS2.ADIOS2BPWriteAppendRead2D2x4.BP5.Serial
Engine.BP.BPWriteAppendReadTestADIOS2.ADIOS2BPWriteAppendReadAggregate.BP5.MPI
Engine.BP.BPWriteAppendReadTestADIOS2.ADIOS2BPWriteAppendReadAggregate.BP5.Serial
Engine.BP.BPWriteAppendReadTestADIOS2.ADIOS2BPWriteAppendReadVaryingAggregation.BP5.MPI
Engine.BP.BPWriteAppendReadTestADIOS2.ADIOS2BPWriteAppendReadVaryingAggregation.BP5.Serial
Engine.BP.BPWriteReadMultiblockTest.MultiblockNullBlocks.BP5.MPI
Engine.BP.BPWriteReadMultiblockTest.MultiblockNullBlocks.BP5.Serial
Engine.BP.BPWriteReadMultiblockTest.MultiblockPerformDataWrite.BP5.MPI
Engine.BP.BPWriteReadMultiblockTest.MultiblockPerformDataWrite.BP5.Serial
Staging.1x1DataWrite.BP5

I'm currently working on figuring out what those tests are doing and how the new aggregator may have broken them. But if anyone has a moment to look through the results here and sees anything obvious, I would welcome any clues that could help narrow or otherwise speed up my search 😀 Thanks!

@eisenhauer
Member

So for the *.Unique failures: those are not failures per se, but an indication that the output has changed in some form from any prior version that we have stored. In order to ensure that we continue to be able to read these new files, the .zip versions of the output should be moved into the output_archive/zipped_output directory with an appropriately descriptive name. I'd ignore these for now; you're only getting them because you changed the default aggregator and it naturally had different output.

You've got a number of DataWrite failures. This is an indication that maybe what you put into place to make sure that we used the same layout for an entire timestep (I.E. didn't change after we had done an initial DataWrite) might not be working. The easiest test to debug this with is probably Staging.1x1DataWrite.BP5. It looks like that is failing the first time we do PerformDataWrite() in a timestep, so it shouldn't be hard to track down (hopefully).

The NullBlocks tests involve someone with zero-length data blocks. Maybe that's messing you up somehow?

Most of the rest seem to be Append tests. Tackle those last, as those are likely the most unpleasant. Fingers crossed that fixing the other issues might fix those too.

@scottwittenburg
Collaborator Author

Thanks @eisenhauer for characterizing the failures so nicely, that's a great summary. I had started with one of the Append failures, but I'll set that aside for the time being, after reading your comments. Instead I'll start with Staging.1x1DataWrite.BP5 as you suggest.

@eisenhauer
Member

Huh. No "Test" results in CDash for this PR or #4611 . What causes that?

@scottwittenburg
Collaborator Author

No "Test" results in CDash for this PR or #4611 . What causes that?

I think open.cdash.org is experiencing some issue across all projects this week, see e.g. https://open.cdash.org/index.php?project=CMake, hopefully it's just a short-term disruption.

@scottwittenburg
Collaborator Author

Getting closer now, at least 8 of the 17 test failures were resolved by the commit I pushed yesterday to fix problems with Append. We can't see the results on CDash at the moment, and there's also some degradation on GitHub today, making it hard to see test results there. But on my machine, it looks like most of the 17 failures are now passing.

One test I'm not sure what to do about is testing/adios2/engine/bp/TestBPParameterSelectSteps.cpp. There seems to be a feature that you can open a file for reading as soon as you have opened it for writing, before you actually do any BeginStep() and Put(). I'm assuming the directory that needs to be present for the reader gets created in InitTransports(), which requires that we have already initialized the aggregator. In the case of data-size based aggregation, that initialization must be deferred until the first WriteData(), because we have to know how much data each rank needs to write.

Any thoughts on how to reconcile these two seemingly opposing requirements?

@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch 2 times, most recently from 959d2e9 to eb0f93a Compare August 21, 2025 18:47
@scottwittenburg
Collaborator Author

scottwittenburg commented Aug 22, 2025

I'm down to a single unexpected test failure, this one. I have narrowed down the cause to this commit (force-pushed) I added in my effort to fix the select steps (streaming case) test described here. The summary of that commit is that I separated InitTransports() into two functions: initialization of the metadata transports now happens during construction in all cases, while initialization of the data transport happens after aggregator initialization. Aggregator initialization still only happens once, during construction, unless doing DataSizeBased aggregation, in which case it happens on every timestep as part of writing. While it fixed the select steps (streaming) case, it seems to have broken this heat transfer example.

The failure seems to happen every time on Windows/MSMPI, at least according to CI, but I can reproduce it intermittently on Linux as well. For some reason, even though I configured the examples on my Windows machine, they didn't get built, so I'm debugging on Linux at the moment.

Of all the heat transfer example variations, the BP4/5 MxN are the only ones that fail. Notably, I can never get the Mx1 or MxM version to fail, only the MxN. BP4 can fail because the second part, the readHeat.cpp part, is configured with engine type set to BPFile, which produces a BP5 file. If I change the engine type in that config file to BP4, I cannot get it to fail, same goes if I leave it BPFile and set AggregationType to anything but DataSizeBased.

I have no idea what, in my code, is sensitive to the fact that the file being read was originally written with 2x2=4 processes, while the current task is writing with 1x3=3 processes, but it seems to matter.

Any suggestions/ideas/insights? Thanks!

@scottwittenburg
Collaborator Author

scottwittenburg commented Aug 26, 2025

Regarding this comment: I was eventually able to run it on Windows, but failed to reproduce there at all. I could only reproduce intermittently on Linux, and after a lot of debugging, I discovered that I could do so with just a single timestep (making this feel different from similar issues I faced in the past). In the end, 34e4933 (force-pushed) seems to have fixed the problem, and the comment there explains the need for the barrier.

Unfortunately, I still don't have any idea why the breaking commit had that effect, or why the problem didn't seem to happen for other aggregation types.

@scottwittenburg
Collaborator Author

scottwittenburg commented Aug 26, 2025

I think as of 1e63a7a (force-pushed) there are no more unexpected test failures with the default aggregation type set to DataSizeBased, as can be seen in the associated cdash results, here.

The two test failures on the el8 jobs are just Archive.Attribute.bp5.Unique and Archive.Common.bp5.Unique, which would go away by updating those files for DataSizeBased aggregation. However, my opinion is that, instead, we should retain the previous default aggregation type until the code can be tested on more hpc systems. I have updated a handful of the BP5 tests to run for all aggregation types though, to help prevent regressions related to aggregation type.

@scottwittenburg
Collaborator Author

scottwittenburg commented Aug 26, 2025

@pnorbert Regarding commit 34e4933 (force-pushed): I cannot reproduce the HeatTransfer.BP[4|5].MxN problem with EveryoneWritesSerial aggregation by commenting that fix out, and in fact, I was reminded that I did try that already in my local testing.

I did verify that changing the xml transport configuration so that stdio is not used also prevents the problem for DataSizeBased aggregation.

Note that the problem was originally caused by de32cfb (force-pushed), but I cannot see any connection between that change and the seeming race condition in flushing/writing files.

I have moved the pre-flush barrier into the WriteData_EveryoneWrites() method, and since it doesn't seem harmful in any case (at least not to me), I have left it unguarded.

Add a new aggregation method which behaves just like the existing
EveryoneWritesSerial method, except it builds the writer chains after
first partitioning ranks in such a way that each chain writes
approximately the same amount of data.

The current partitioning method is a greedy approach that is fast
but not guaranteed to produce optimal partitionings. However, this
has been done in such a way that adding new partitioning methods
should be straightforward.
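The greedy strategy described in the commit message can be sketched roughly like this (a hypothetical standalone helper with made-up names, not the PR's actual partitioner): sort ranks by data size, largest first, and assign each rank to the chain with the smallest running total. This is the classic LPT heuristic, which is fast and known to come within a constant factor of the optimal balance.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Greedy (LPT-style) partitioning of ranks into nChains write chains.
// sizes[r] is the number of bytes rank r wants to write this timestep.
// Returns chainOfRank, where chainOfRank[r] is the chain index for rank r.
std::vector<size_t> GreedyPartition(const std::vector<uint64_t> &sizes, size_t nChains)
{
    // Visit ranks in order of decreasing data size.
    std::vector<size_t> order(sizes.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return sizes[a] > sizes[b]; });

    std::vector<uint64_t> totals(nChains, 0);
    std::vector<size_t> chainOfRank(sizes.size(), 0);
    for (size_t r : order)
    {
        // Assign this rank to the currently lightest chain.
        size_t c = std::min_element(totals.begin(), totals.end()) - totals.begin();
        chainOfRank[r] = c;
        totals[c] += sizes[r];
    }
    return chainOfRank;
}
```

For example, with the per-rank byte counts seen in the 5-rank debug output earlier in this thread (200, 40, 160, 80, 120) and 3 subfiles, this heuristic balances the three chains to 200 bytes each.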
@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch from 50336b0 to d175547 Compare August 26, 2025 22:22
@scottwittenburg scottwittenburg marked this pull request as ready for review August 26, 2025 22:53