
Fused Neighbor Sampling reproducibility issue #7831

Open

taran2210 opened this issue Oct 28, 2024 · 2 comments

Comments

@taran2210
Setting the seed and repeating fused neighborhood sampling on the same input does not reproduce the same subgraph. I have identified a fix that is slower but makes the sampled subgraphs reproducible.

To Reproduce

Steps to reproduce the behavior:

  1. Set OMP_NUM_THREADS to >= 2
  2. Define NeighborSampler with fused=True
  3. Set dgl.seed, dgl.random.seed before calling dgl.dataloading.NeighborSampler.sample_blocks
  4. Repeat step 3. and compare blocks[0].srcdata['feat']
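To illustrate why the steps above can diverge, here is a minimal pure-Python sketch (not DGL's actual implementation) of how a concurrent id-hash-map hands out compact IDs in first-insert order: with a fixed seed each thread samples the same nodes, but the global ID assignment still depends on which thread finishes first. The function name `assign_ids` and the data are hypothetical.

```python
def assign_ids(per_thread_samples, completion_order):
    # Loosely mimics a concurrent id-hash-map assigning compact IDs in
    # first-insert order, which depends on which thread finishes first.
    mapping = {}
    for t in completion_order:
        for node in per_thread_samples[t]:
            if node not in mapping:
                mapping[node] = len(mapping)
    return mapping

# Same seed => same per-thread samples; only the thread scheduling differs.
per_thread = [[5, 3], [3, 9], [7, 5]]
ids_run1 = assign_ids(per_thread, completion_order=[0, 1, 2])
ids_run2 = assign_ids(per_thread, completion_order=[2, 0, 1])

assert ids_run1.keys() == ids_run2.keys()  # same node *set*...
assert ids_run1 != ids_run2  # ...but different compact IDs, so srcdata order differs
```

The node set is identical across runs, but the ID-to-node mapping (and hence the ordering of `blocks[0].srcdata['feat']`) is not.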

The following change makes the steps above produce the same output on every run:

diff --git a/src/graph/sampling/neighbor/neighbor.cc b/src/graph/sampling/neighbor/neighbor.cc
index 5393a200..471600dd 100644
--- a/src/graph/sampling/neighbor/neighbor.cc
+++ b/src/graph/sampling/neighbor/neighbor.cc
@@ -14,6 +14,7 @@

 #include <tuple>
 #include <utility>
+#include <execution>

 #include "../../../array/cpu/concurrent_id_hash_map.h"
 #include "../../../c_api_common.h"
@@ -488,6 +489,7 @@ SampleNeighborsFused(
         }
       }
       IdType offset = new_nodes_vec[lhs_node_type].size();
+      auto length = offset;
       new_nodes_vec[lhs_node_type].resize(global_prefix_col.back());
       for (int thread_id = 0; thread_id < num_threads_col; ++thread_id) {
         memcpy(
@@ -496,6 +498,7 @@ SampleNeighborsFused(
             src_nodes_local[thread_id].size() * sizeof(IdType));
         offset += src_nodes_local[thread_id].size();
       }
+    std::sort(std::execution::par_unseq, new_nodes_vec[lhs_node_type].begin() + length, new_nodes_vec[lhs_node_type].end());
     }
   }

This makes the sampling slower but reproducible. Would there be any alternative way to make multithreaded fused sampling reproducible?
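The idea behind the patch can be sketched in pure Python (this is an illustration of the technique, not DGL's code): the per-thread buckets are concatenated in a scheduler-dependent order, and sorting only the freshly appended region makes the final layout independent of that order. The helper name `append_new_nodes` is hypothetical.

```python
def append_new_nodes(new_nodes, per_thread_new):
    # `length` marks where the nodes appended in this step begin,
    # mirroring the `auto length = offset;` line in the patch.
    length = len(new_nodes)
    for bucket in per_thread_new:  # concatenation order is scheduler-dependent
        new_nodes.extend(bucket)
    # The fix: sort only the freshly appended region, so the result no
    # longer depends on which thread finished first.
    new_nodes[length:] = sorted(new_nodes[length:])
    return new_nodes

# Two runs with identical samples but different thread completion orders:
run1 = append_new_nodes([1, 4], [[9, 2], [8], [6]])
run2 = append_new_nodes([1, 4], [[6], [8], [9, 2]])
assert run1 == run2 == [1, 4, 2, 6, 8, 9]
```

Note that the prefix `[1, 4]` (nodes already assigned IDs) is left untouched; only the new suffix is normalized, which is why the patch sorts from `begin() + length` rather than the whole vector.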

Environment

  • DGL Version (e.g., 1.0): 2.4
  • Backend Library & Version : Torch 2.3.1
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): source
  • Build command you used (if compiling from source): cmake -DBUILD_TYPE='release' -DUSE_LIBXSMM=ON -DUSE_OPENMP=ON -DBUILD_SPARSE=OFF -DUSE_EPOLL=OFF ..
  • Python version: 3.9
  • CPU: Intel Xeon
@frozenbugs
Collaborator

frozenbugs commented Nov 7, 2024

Yes, this is a known issue: our implementation of concurrent_id_hash_map can't guarantee determinism while maintaining high performance. Feel free to file a PR that adds an if-else branch to use a deterministic solution when the user specifies a random seed, or adds another flag to control the behavior.

https://github.com/dmlc/dgl/blob/master/graphbolt/src/concurrent_id_hash_map.cc
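The flag-controlled branch suggested above could look like the following pure-Python sketch. Everything here is hypothetical: the function name `finalize_new_nodes` and the environment variable `DGL_DETERMINISTIC` are illustrative placeholders, not existing DGL APIs.

```python
import os

def finalize_new_nodes(new_nodes, length, deterministic=None):
    # Hypothetical flag-controlled branch: pay the sorting cost only when
    # the user asked for reproducibility; keep the fast path otherwise.
    if deterministic is None:
        # Illustrative opt-in, e.g. an env var (DGL_DETERMINISTIC is made up).
        deterministic = os.environ.get("DGL_DETERMINISTIC", "0") == "1"
    if deterministic:
        # Sort only the region appended in this step (see the diff above).
        new_nodes[length:] = sorted(new_nodes[length:])
    return new_nodes

# Deterministic mode normalizes the new suffix; fast mode leaves it as-is.
assert finalize_new_nodes([1, 4, 9, 2], 2, deterministic=True) == [1, 4, 2, 9]
assert finalize_new_nodes([1, 4, 9, 2], 2, deterministic=False) == [1, 4, 9, 2]
```

This keeps the default path unchanged for users who don't need reproducibility, matching the trade-off described in the comment above.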

@mfbalin
Collaborator

mfbalin commented Nov 25, 2024


GraphBolt's hashmap should be deterministic, both for CUDA and CPU implementations.
