[Realm] Race in Realm::atomic<unsigned int>::fetch_or_acqrel() #1798

Open
Jacobfaib opened this issue Dec 2, 2024 · 5 comments

Comments

@Jacobfaib
Contributor

  Atomic write of size 4 at 0x7b0c0008d990 by thread T39:
    #0 __tsan_atomic32_fetch_or ../../../../libsanitizer/tsan/tsan_interface_atomic.cpp:696 (libtsan.so+0x683c0)
    #1 std::__atomic_base<unsigned int>::fetch_or(unsigned int, std::memory_order) /tmp/conda-croot/legate/_build_env/x86_64-conda-linux-gnu/include/c++/11.2.0/bits/atomic_base.h:666 (librealm.so.1+0x1dd236)
    #2 Realm::atomic<unsigned int>::fetch_or_acqrel(unsigned int) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/atomics.inl:158 (librealm.so.1+0x1dd236)
    #3 Realm::UnfairMutex::lock() /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/mutex.inl:171 (librealm.so.1+0x1d9021)
    #4 Realm::AutoLock<Realm::UnfairMutex>::AutoLock(Realm::UnfairMutex&) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/mutex.inl:329 (librealm.so.1+0x1dd45b)
    #5 Realm::SequenceAssembler::add_span(unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:579 (librealm.so.1+0x1b5b6f)
    #6 Realm::SequenceAssembler::import(Realm::SequenceAssembler&) const /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:479 (librealm.so.1+0x1b546e)
    #7 Realm::XferDesPlaceholder::remove_reference() /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:5696 (librealm.so.1+0x1d3a7e)
    #8 Realm::XferDesQueue::enqueue_xferDes_local(Realm::XferDes*, bool) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:6021 (librealm.so.1+0x1d5285)
    #9 Realm::SingleXDQChannel<Realm::MemcpyChannel, Realm::MemcpyXferDes>::enqueue_ready_xd(Realm::XferDes*) <null> (librealm.so.1+0x20e186)
    #10 Realm::SimpleXferDesFactory::create_xfer_des(unsigned long, int, int, unsigned long long, std::vector<Realm::XferDesPortInfo, std::allocator<Realm::XferDesPortInfo> > const&, std::vector<Realm::XferDesPortInfo, std::allocator<Realm::XferDesPortInfo> > const&, int, Realm::XferDesRedopInfo, void const*, unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:4505 (librealm.so.1+0x1ccf7d)
    #11 Realm::TransferOperation::create_xds() /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/transfer.cc:5427 (librealm.so.1+0x230ef6)
    #12 Realm::TransferOperation::allocate_ibs() /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/transfer.cc:5044 (librealm.so.1+0x22d54c)
    #13 Realm::TransferOperation::start_or_defer() /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/transfer.cc:4883 (librealm.so.1+0x22c15c)
    #14 Realm::IndexSpace<3, long long>::copy(std::vector<Realm::CopySrcDstField, std::allocator<Realm::CopySrcDstField> > const&, std::vector<Realm::CopySrcDstField, std::allocator<Realm::CopySrcDstField> > const&, std::vector<Realm::CopyIndirection<3, long long>::Base const*, std::allocator<Realm::CopyIndirection<3, long long>::Base const*> > const&, Realm::ProfilingRequestSet const&, Realm::Event, int) const /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/transfer.cc:5541 (librealm.so.1+0x27107a)
    #15 Legion::Internal::CopyAcrossUnstructuredT<3, long long>::execute(Legion::Internal::Operation*, Legion::Internal::PredEvent, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::PhysicalTraceInfo const&, bool, bool, unsigned int) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/legion/region_tree.inl:10562 (liblegion.so.1+0x3670bf5)
    #16 Legion::Internal::RegionTreeForest::indirect_across(Legion::RegionRequirement const&, Legion::RegionRequirement const&, Legion::RegionRequirement const&, Legion::RegionRequirement const&, Legion::Internal::InstanceSet const&, Legion::Internal::InstanceSet const&, std::vector<Legion::Internal::IndirectRecord, std::allocator<Legion::Internal::IndirectRecord> >&, Legion::Internal::InstanceSet const&, std::vector<Legion::Internal::IndirectRecord, std::allocator<Legion::Internal::IndirectRecord> >&, Legion::Internal::InstanceSet const&, Legion::Internal::CopyOp*, unsigned int, unsigned int, unsigned int, unsigned int, bool, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::PredEvent, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::ApUserEvent, std::map<Realm::Reservation, bool, std::less<Realm::Reservation>, std::allocator<std::pair<Realm::Reservation const, bool> > > const&, Legion::Internal::PhysicalTraceInfo const&, std::set<Legion::Internal::RtEvent, std::less<Legion::Internal::RtEvent>, std::allocator<Legion::Internal::RtEvent> >&, bool, bool, bool, bool) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/legion/region_tree.cc:2732 (liblegion.so.1+0x2b29ef7)
    #17 Legion::Internal::CopyOp::perform_copy_across(unsigned int, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::ApUserEvent, Legion::Internal::ApUserEvent, Legion::Internal::ApEvent, Legion::Internal::ApEvent, Legion::Internal::PredEvent, Legion::Internal::InstanceSet const&, Legion::Internal::InstanceSet const&, Legion::Internal::InstanceSet const*, Legion::Internal::InstanceSet const*, Legion::Internal::PhysicalTraceInfo const&, std::set<Legion::Internal::RtEvent, std::less<Legion::Internal::RtEvent>, std::allocator<Legion::Internal::RtEvent> >&, bool) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/legion/legion_ops.cc:7202 (liblegion.so.1+0x2617f75)
    #18 Legion::Internal::CopyOp::trigger_mapping() /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/legion/legion_ops.cc:7092 (liblegion.so.1+0x2616fe8)
    ...

 Previous write of size 8 at 0x7b0c0008d990 by thread T2:
    #0 operator new(unsigned long) ../../../../libsanitizer/tsan/tsan_new_delete.cpp:64 (libtsan.so+0x6fedf)
    #1 Realm::SequenceAssembler::ensure_mutex() /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:451 (librealm.so.1+0x1b52c2)
    #2 Realm::SequenceAssembler::add_span(unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:611 (librealm.so.1+0x1b5d12)
    #3 Realm::XferDes::update_pre_bytes_write(int, unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:2299 (librealm.so.1+0x1c16f4)
    #4 Realm::XferDesQueue::update_pre_bytes_write(unsigned long long, int, unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:5822 (librealm.so.1+0x1d42cb)
    #5 Realm::XferDes::update_bytes_write(int, unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:2268 (librealm.so.1+0x1c145e)
    #6 Realm::XferDes::SequenceCache<&Realm::XferDes::update_bytes_write>::add_span(int, unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.inl:393 (librealm.so.1+0x1e180d)
    #7 Realm::AddressSplitXferDes<3, long long>::progress_xd(Realm::AddressSplitChannel*, Realm::TimeLimit) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/transfer.cc:2640 (librealm.so.1+0x38803b)
    #8 Realm::XDQueue<Realm::AddressSplitChannel, Realm::AddressSplitXferDesBase>::do_work(Realm::TimeLimit) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.inl:166 (librealm.so.1+0x3bc8fe)

 Location is heap block of size 40 at 0x7b0c0008d990 allocated by thread T2:
    #0 operator new(unsigned long) ../../../../libsanitizer/tsan/tsan_new_delete.cpp:64 (libtsan.so+0x6fedf)
    #1 Realm::SequenceAssembler::ensure_mutex() /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:451 (librealm.so.1+0x1b52c2)
    #2 Realm::SequenceAssembler::add_span(unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:611 (librealm.so.1+0x1b5d12)
    #3 Realm::XferDes::update_pre_bytes_write(int, unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:2299 (librealm.so.1+0x1c16f4)
    #4 Realm::XferDesQueue::update_pre_bytes_write(unsigned long long, int, unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:5822 (librealm.so.1+0x1d42cb)
    #5 Realm::XferDes::update_bytes_write(int, unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.cc:2268 (librealm.so.1+0x1c145e)
    #6 Realm::XferDes::SequenceCache<&Realm::XferDes::update_bytes_write>::add_span(int, unsigned long, unsigned long) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.inl:393 (librealm.so.1+0x1e180d)
    #7 Realm::AddressSplitXferDes<3, long long>::progress_xd(Realm::AddressSplitChannel*, Realm::TimeLimit) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/transfer.cc:2640 (librealm.so.1+0x38803b)
    #8 Realm::XDQueue<Realm::AddressSplitChannel, Realm::AddressSplitXferDesBase>::do_work(Realm::TimeLimit) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/transfer/channel.inl:166 (librealm.so.1+0x3bc8fe)
    #9 Realm::BackgroundWorkManager::Worker::do_work(long long, Realm::atomic<bool>*) /tmp/conda-croot/legate/work/arch-conda/skbuild_core/_deps/legion-src/runtime/realm/bgwork.cc:600 (librealm.so.1+0x730e65)
@eddy16112
Contributor

Could you please try adding ptr = mutex.load(); before https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/transfer/channel.cc#L457?

If two threads call ensure_mutex at the same time and the mutex is not yet initialized, both of them may call new, but only one thread will set the mutex to its newly allocated new_mutex. The other thread will return ptr, but that ptr is still NULL.
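
For illustration, here is a minimal sketch of the lazy-initialization pattern described above, written against std::atomic rather than Realm::atomic. The names LazyMutexHolder and all_threads_agree are hypothetical and do not come from the Realm sources. With std::atomic, a failed compare_exchange_strong writes the currently stored value back into its first argument, so the losing thread can return that value; whether Realm's compare_exchange behaves the same way is not confirmed here, which is why reloading the pointer is the conservative choice.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the class under discussion; NOT Realm's code.
class LazyMutexHolder {
public:
  ~LazyMutexHolder() { delete mutex_.load(std::memory_order_acquire); }

  // Returns the single shared mutex, allocating it on first use.
  std::mutex *ensure_mutex() {
    // Fast path: this acquire load pairs with the release half of the CAS.
    std::mutex *ptr = mutex_.load(std::memory_order_acquire);
    if (ptr != nullptr)
      return ptr;

    // Slow path: allocate a candidate and try to install it.
    std::mutex *fresh = new std::mutex;
    if (mutex_.compare_exchange_strong(ptr, fresh,
                                       std::memory_order_acq_rel,
                                       std::memory_order_acquire)) {
      return fresh; // we won the race
    }

    // We lost the race. std::atomic has stored the winner's pointer in
    // ptr, so ptr is non-null here. Never return a stale null pointer.
    delete fresh;
    return ptr;
  }

private:
  std::atomic<std::mutex *> mutex_{nullptr};
};

// Hypothetical helper: n threads race on ensure_mutex(); returns true if
// every thread observed the same non-null pointer.
bool all_threads_agree(int n) {
  LazyMutexHolder holder;
  std::vector<std::mutex *> seen(n, nullptr);
  std::vector<std::thread> threads;
  for (int i = 0; i < n; ++i)
    threads.emplace_back(
        [&holder, &seen, i] { seen[i] = holder.ensure_mutex(); });
  for (auto &t : threads)
    t.join();
  for (std::mutex *p : seen)
    if (p == nullptr || p != seen[0])
      return false;
  return true;
}
```

This only illustrates the null-return hazard and one way to avoid it. It does not claim to explain the ThreadSanitizer report above.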

@Jacobfaib
Contributor Author

I think it suffices to just do return mutex.load(), no? This seems to pass.

@eddy16112
Contributor

Yep, return mutex.load() is more concise. I will create a PR for the fix.

@eddy16112
Contributor

@Jacobfaib Can we close this issue now?

@Jacobfaib
Contributor Author

Actually, the fix does not seem to have taken. I have sporadically reproduced the failure since; let me double-check.
