@Nasf-Fan Nasf-Fan commented Dec 5, 2025

When handling a collective RPC, a failure may occur before the RPC handler is invoked for local processing on the node. In that case crt_hg_reply_send() may be triggered. Later, crt_rpc_handler_common() calls crt_hg_reply_error_send(), which replies to the same RPC a second time. The second reply has been observed to fail with NA_BUSY, which can block or lose the completion callback of the first reply. The reference held on the RPC is then never released. Such an RPC leak may trigger an assertion in a UCX environment when the related CaRT context is destroyed.
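To illustrate the failure mode described above, here is a minimal, self-contained sketch of a "reply at most once" guard. It is not the actual patch and does not use the CaRT or Mercury APIs: every name in it (`toy_rpc`, `toy_transport_send`, `toy_reply_once`, `toy_reply_done`, and the `TOY_*` codes) is hypothetical, and the busy behavior of the toy transport only approximates the NA_BUSY case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only; NOT the CaRT/Mercury types. */
enum toy_rc { TOY_OK = 0, TOY_BUSY = -1, TOY_ALREADY = -2 };

struct toy_rpc {
	int  refs;          /* references currently held on the RPC */
	bool reply_pending; /* a reply was posted and its completion is still outstanding */
	bool replied;       /* a reply has already been issued for this RPC */
};

/* Drop one reference on the RPC. */
static void toy_rpc_decref(struct toy_rpc *rpc)
{
	assert(rpc->refs > 0);
	rpc->refs--;
}

/* Completion callback of a reply: releases the reference taken when it was posted. */
static void toy_reply_done(struct toy_rpc *rpc)
{
	rpc->reply_pending = false;
	toy_rpc_decref(rpc);
}

/* Toy transport: refuses a second send while one is still in flight. */
static enum toy_rc toy_transport_send(struct toy_rpc *rpc)
{
	if (rpc->reply_pending)
		return TOY_BUSY;
	rpc->reply_pending = true;
	rpc->refs++; /* held until toy_reply_done() runs */
	return TOY_OK;
}

/* Guarded reply: the transport is reached at most once per RPC. */
static enum toy_rc toy_reply_once(struct toy_rpc *rpc)
{
	enum toy_rc rc;

	if (rpc->replied)
		return TOY_ALREADY; /* skip the repeated reply entirely */

	rc = toy_transport_send(rpc);
	if (rc == TOY_OK)
		rpc->replied = true;
	return rc;
}
```

With such a guard, a common handler that tries to send an error reply after an earlier reply was already posted gets `TOY_ALREADY` back and never reaches the transport, so the first reply's completion callback can still run and release its reference.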

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions bot commented Dec 5, 2025

Ticket title is 'Segmentation fault against UCX provider during CR test'
Status is 'In Progress'
Labels: '2.8pp,scrubbed_2.6.5,scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-17861

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17861_3_b26 branch from 03cf778 to b84c7bf Compare December 5, 2025 06:19

Nasf-Fan commented Dec 8, 2025

Test stage Functional Hardware Medium completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17230/2/execution/node/1497/log

test_pool_destroy_with_io failed because of DAOS-18327, which is unrelated to this patch. The stage will be retested.

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17861_3_b26 branch from b84c7bf to ee2d494 Compare December 9, 2025 03:02
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17861_3_b26 branch 2 times, most recently from 4b649a5 to ffb8565 Compare December 10, 2025 12:48

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17861_3_b26 branch from ffb8565 to 18c058e Compare December 11, 2025 02:42

When handling a collective RPC, a failure may occur before the RPC
handler is invoked for local processing on the node. In that case
crt_hg_reply_send() may be triggered. Later, crt_rpc_handler_common()
calls crt_hg_reply_error_send(), which replies to the same RPC a
second time. The second reply has been observed to fail with NA_BUSY,
which can block or lose the completion callback of the first reply.
The reference held on the RPC is then never released. Such an RPC
leak may trigger an assertion in a UCX environment when the related
CaRT context is destroyed.

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17861_3_b26 branch from 18c058e to 89bbff4 Compare December 12, 2025 06:07
@Nasf-Fan Nasf-Fan closed this Dec 31, 2025
@Nasf-Fan Nasf-Fan deleted the Nasf-Fan/DAOS-17861_3_b26 branch December 31, 2025 07:27

3 participants