
Why does UCC consistently insist on using software to simulate PUT/GET operations? #13544

@TroyMitchell911

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

5.0.x

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Linux
  • Computer hardware: RISC-V
  • Network type: eth

Details of the problem

I've run into an issue using MPI+UCC+UCX, and I'm not sure whether it is related to Open MPI itself. I'm posting here anyway in the hope of getting some help; please excuse the noise and let me know if this belongs elsewhere.

I'm running two systems (sharing the same DRAM) on a single chip, each with its own dedicated network interface card for wireup. I'm modifying a POSIX template to explore the possibility of shared memory across nodes. First, I changed it to INTERNODE to enable cross-node functionality, and then I implemented my own memory allocation functions. Currently, pt2pt works perfectly. However, when I tested allreduce, I found that all RDMA operations were being emulated using AM. This appears to be decided by UCC, since UCX reports that the RMA lane is available. Why is this happening? Here is the information I can provide.

The UCX log (including debug prints I added):

[1764656842.337472] [a:31893:0]           mpool.c:281  UCX  DEBUG mpool tl_ucp_req_mp: allocated chunk 0x2ae862cb44 of 6228 bytes with 8 elements
[1764656842.337930] [a:31893:0]          ucp_ep.c:408  UCX  DEBUG created ep 0x3f93ede000 to <no debug data> from api call
[1764656842.338350] [a:31893:0]          ucp_ep.c:2954 UCX  WARN    Lane 0 iface flags: PUT_SHORT=YES PUT_BCOPY=YES GET_SHORT=YES GET_BCOPY=YES
[1764656842.338376] [a:31893:0]          ucp_ep.c:2963 UCX  WARN    *************put_short: 4294967295, iface->put_max_short: 4294967295*********
[1764656842.338401] [a:31893:0]          ucp_ep.c:3011 UCX  WARN    RMA lane 0: max_put_short=4294967295, max_get_short=4294967295
[1764656842.338421] [a:31893:0]      ucp_worker.c:1892 UCX  WARN    !!!!!!!!!!!!!!!!!!!!!!!!rma_emul: 0!!!!!!!!!!!!!!!!!!!, rma_lanes_map = 1
[1764656842.338442] [a:31893:0]      ucp_worker.c:1905 UCX  INFO    UCC_UCP_CONTEXT intra-node cfg#0 tag(mytest/memory)  rma(mytest/memory)  amo(mytest/memory)  am(mytest/memory)
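
For anyone comparing against their own logs, here is a minimal Python sketch that pulls the locality and the per-feature transport out of a configuration summary line like the last one above. The line layout (`<name> <locality> cfg#<n> feature(transport/device) ...`) is inferred only from this single log line, and `parse_cfg_line` is a hypothetical helper, not part of UCX or UCC.

```python
import re

# Sample copied from the log above. The layout is an assumption based on
# this one line: "<name> <locality> cfg#<n> feature(transport/device) ...".
LINE = ("UCC_UCP_CONTEXT intra-node cfg#0 tag(mytest/memory)  "
        "rma(mytest/memory)  amo(mytest/memory)  am(mytest/memory)")


def parse_cfg_line(line):
    """Return (locality, {feature: (transport, device)}) for one cfg line."""
    head = re.search(r"(\S+)\s+(\S+)\s+cfg#(\d+)", line)
    if head is None:
        raise ValueError("not a cfg summary line")
    # Each feature is printed as name(transport/device).
    features = {
        name: (transport, device)
        for name, transport, device in re.findall(
            r"(\w+)\(([^/()]+)/([^()]+)\)", line)
    }
    return head.group(2), features


locality, features = parse_cfg_line(LINE)
print(locality, features["rma"])
```

In this log the RMA feature is mapped to mytest/memory, and the locality token is intra-node even though the two ranks run on different hosts.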

And I got:

[1764656842.339034] [a:31893:0]   +----------------------------------+-------------------------------------------------------------+
[1764656842.339054] [a:31893:0]   | UCC_UCP_CONTEXT intra-node cfg#0 | remote memory write by ucp_put* from host memory to host    |
[1764656842.339063] [a:31893:0]   +----------------------------------+-----------------------------------------------+-------------+
[1764656842.339072] [a:31893:0]   |                           0..inf | software emulation                            | mytest/memory |
[1764656842.339079] [a:31893:0]   +----------------------------------+-----------------------------------------------+-------------+
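
To check a longer log for the same symptom, a small Python sketch like the following can list every protocol-table row that falls back to software emulation. It assumes the table layout shown above (a two-cell header row followed by three-cell data rows); `emulated_ranges` is a hypothetical helper name, and the parsing is based on this one table only.

```python
# Table text copied from the log above, with the log prefixes removed.
TABLE = """\
+----------------------------------+-------------------------------------------------------------+
| UCC_UCP_CONTEXT intra-node cfg#0 | remote memory write by ucp_put* from host memory to host    |
+----------------------------------+-----------------------------------------------+-------------+
|                           0..inf | software emulation                            | mytest/memory |
+----------------------------------+-----------------------------------------------+-------------+
"""


def emulated_ranges(table_text):
    """Return (operation, size_range, transport) for each emulated row."""
    operation = None
    hits = []
    for raw in table_text.splitlines():
        # Skip anything before the first '|' so log prefixes are ignored.
        start = raw.find("|")
        if start < 0:
            continue  # border line or unrelated log output
        cells = [c.strip() for c in raw[start:].strip().strip("|").split("|")]
        if len(cells) == 2:
            operation = cells[1]  # header row: config name | operation
        elif len(cells) == 3 and "software emulation" in cells[1]:
            hits.append((operation, cells[0], cells[2]))
    return hits


for op, size_range, transport in emulated_ranges(TABLE):
    print(f"{op}: {size_range} emulated over {transport}")
```

For the table above this reports a single row: the ucp_put* path is emulated for the whole 0..inf size range over mytest/memory.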

I suspect this is because the configuration is reported as intra-node here. However, I have set the inter-node flag, and the two hosts communicate normally (if the inter-node flag is not set, MPI+UCC+UCX cannot connect the two hosts).

Here is my command:

mpirun --allow-run-as-root   -np 1 -host 10.0.90.205:4  -x UCX_LOG_LEVEL=info -x UCX_TLS=mytest -x UCX_NET_DEVICES=eth0 --mca pml ucx --mca coll_ucc_enable 1 --mca coll_ucc_priority 100  --mca pml_ucx_tls any --mca pml_ucx_devices any   /opt/mpitest/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency :  -host 10.0.90.212:4 -np 1 -x UCX_NET_DEVICES=end0 --mca pml ucx --mca coll_ucc_enable --mca coll_ucc_priority 100 --mca pml_ucx_tls any --mca pml_ucx_devices any -x UCX_TLS=mytest -x UCX_LOG_LEVEL=info  /opt/mpitest/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency

I use Ethernet to bring up the other system and the UCT transport mytest to move the data. mytest uses a shared-memory region (defined in the device tree) as its transport.
Can anyone offer some advice?
