
Fused Neighbor Sampling reproducibility issue #7831

Open

taran2210 opened this issue Oct 28, 2024 · 2 comments

Comments

@taran2210
Setting the seed and repeating fused neighborhood sampling on the same input does not reproduce the same subgraph. I have identified a fix that is slower but makes the sampled subgraphs reproducible.

To Reproduce

Steps to reproduce the behavior:

  1. Set OMP_NUM_THREADS to >= 2
  2. Define NeighborSampler with fused=True
  3. Set dgl.seed, dgl.random.seed before calling dgl.dataloading.NeighborSampler.sample_blocks
  4. Repeat step 3. and compare blocks[0].srcdata['feat']
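To illustrate why the steps above can diverge, here is a minimal pure-Python sketch (not DGL's actual implementation) of how a concurrent id-hash-map hands out compact IDs in first-insert order: with a fixed seed each thread samples the same nodes, but the global ID assignment still depends on which thread finishes first. The function name `assign_ids` and the data are hypothetical.

```python
def assign_ids(per_thread_samples, completion_order):
    # Loosely mimics a concurrent id-hash-map assigning compact IDs in
    # first-insert order, which depends on which thread finishes first.
    mapping = {}
    for t in completion_order:
        for node in per_thread_samples[t]:
            if node not in mapping:
                mapping[node] = len(mapping)
    return mapping

# Same seed => same per-thread samples; only the thread scheduling differs.
per_thread = [[5, 3], [3, 9], [7, 5]]
ids_run1 = assign_ids(per_thread, completion_order=[0, 1, 2])
ids_run2 = assign_ids(per_thread, completion_order=[2, 0, 1])

assert ids_run1.keys() == ids_run2.keys()  # same node *set*...
assert ids_run1 != ids_run2  # ...but different compact IDs, so srcdata order differs
```

The node set is identical across runs, but the ID-to-node mapping (and hence the ordering of `blocks[0].srcdata['feat']`) is not.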

The following change makes the steps above produce the same output on every run:

diff --git a/src/graph/sampling/neighbor/neighbor.cc b/src/graph/sampling/neighbor/neighbor.cc
index 5393a200..471600dd 100644
--- a/src/graph/sampling/neighbor/neighbor.cc
+++ b/src/graph/sampling/neighbor/neighbor.cc
@@ -14,6 +14,7 @@

 #include <tuple>
 #include <utility>
+#include <execution>

 #include "../../../array/cpu/concurrent_id_hash_map.h"
 #include "../../../c_api_common.h"
@@ -488,6 +489,7 @@ SampleNeighborsFused(
         }
       }
       IdType offset = new_nodes_vec[lhs_node_type].size();
+      auto length = offset;
       new_nodes_vec[lhs_node_type].resize(global_prefix_col.back());
       for (int thread_id = 0; thread_id < num_threads_col; ++thread_id) {
         memcpy(
@@ -496,6 +498,7 @@ SampleNeighborsFused(
             src_nodes_local[thread_id].size() * sizeof(IdType));
         offset += src_nodes_local[thread_id].size();
       }
+    std::sort(std::execution::par_unseq, new_nodes_vec[lhs_node_type].begin() + length, new_nodes_vec[lhs_node_type].end());
     }
   }

This makes the sampling slower but reproducible. Would there be any alternative way to make multithreaded fused sampling reproducible?
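The idea behind the patch can be sketched in pure Python (this is an illustration of the technique, not DGL's code): the per-thread buckets are concatenated in a scheduler-dependent order, and sorting only the freshly appended region makes the final layout independent of that order. The helper name `append_new_nodes` is hypothetical.

```python
def append_new_nodes(new_nodes, per_thread_new):
    # `length` marks where the nodes appended in this step begin,
    # mirroring the `auto length = offset;` line in the patch.
    length = len(new_nodes)
    for bucket in per_thread_new:  # concatenation order is scheduler-dependent
        new_nodes.extend(bucket)
    # The fix: sort only the freshly appended region, so the result no
    # longer depends on which thread finished first.
    new_nodes[length:] = sorted(new_nodes[length:])
    return new_nodes

# Two runs with identical samples but different thread completion orders:
run1 = append_new_nodes([1, 4], [[9, 2], [8], [6]])
run2 = append_new_nodes([1, 4], [[6], [8], [9, 2]])
assert run1 == run2 == [1, 4, 2, 6, 8, 9]
```

Note that the prefix `[1, 4]` (nodes already assigned IDs) is left untouched; only the new suffix is normalized, which is why the patch sorts from `begin() + length` rather than the whole vector.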

Environment

  • DGL Version (e.g., 1.0): 2.4
  • Backend Library & Version : Torch 2.3.1
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): source
  • Build command you used (if compiling from source): cmake -DBUILD_TYPE='release' -DUSE_LIBXSMM=ON -DUSE_OPENMP=ON -DBUILD_SPARSE=OFF -DUSE_EPOLL=OFF ..
  • Python version: 3.9
  • CPU: Intel Xeon
@frozenbugs
Collaborator

frozenbugs commented Nov 7, 2024

Yes, this is a known issue: our implementation of concurrent_id_hash_map can't guarantee determinism while maintaining high performance. Feel free to file a PR that adds an if-else branch to use a deterministic solution when the user specifies a random seed, or adds another flag to control the behavior.

https://github.com/dmlc/dgl/blob/master/graphbolt/src/concurrent_id_hash_map.cc
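The flag-controlled branch suggested above could look like the following pure-Python sketch. Everything here is hypothetical: the function name `finalize_new_nodes` and the environment variable `DGL_DETERMINISTIC` are illustrative placeholders, not existing DGL APIs.

```python
import os

def finalize_new_nodes(new_nodes, length, deterministic=None):
    # Hypothetical flag-controlled branch: pay the sorting cost only when
    # the user asked for reproducibility; keep the fast path otherwise.
    if deterministic is None:
        # Illustrative opt-in, e.g. an env var (DGL_DETERMINISTIC is made up).
        deterministic = os.environ.get("DGL_DETERMINISTIC", "0") == "1"
    if deterministic:
        # Sort only the region appended in this step (see the diff above).
        new_nodes[length:] = sorted(new_nodes[length:])
    return new_nodes

# Deterministic mode normalizes the new suffix; fast mode leaves it as-is.
assert finalize_new_nodes([1, 4, 9, 2], 2, deterministic=True) == [1, 4, 2, 9]
assert finalize_new_nodes([1, 4, 9, 2], 2, deterministic=False) == [1, 4, 9, 2]
```

This keeps the default path unchanged for users who don't need reproducibility, matching the trade-off described in the comment above.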

@mfbalin
Collaborator

mfbalin commented Nov 25, 2024


GraphBolt's hashmap should be deterministic, both for CUDA and CPU implementations.
