YARN-11836. Fixed AM log fetching with YARN CLI #7813

Open
wants to merge 1 commit into
base: trunk

Contributor

@p-szucs p-szucs commented Jul 17, 2025

Change-Id: I04678a04503e01e67f1075f3fc543887cf80ac47

Description of PR

YARN-10767 introduced a bug: the YARN Logs CLI cannot fetch the AM logs with the "-am" option if the user is not in the admin ACLs.

That commit changed how the AM logs are requested: it fetches the "id" of the active RM from the HA service and requests the logs from that RM.

Reproduction:

The issue can be reproduced by running the "yarn logs -applicationId ‹appId› -am 1" command as a user who does not have admin access.

In the RM logs of the test cluster I can see the following error, which states that the user doesn't have permission to call 'getServiceState':

IPC Server handler 0 on default port 8033, call Call#3 Retry#0 org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus
org.apache.hadoop.security.AccessControlException: User systest doesn't have permission to call 'getServiceState'
at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:433)
at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:398)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:243)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.getServiceStatus(AdminService.java:396)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.getServiceStatus(HAServiceProtocolServerSideTranslatorPB.java:148)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:6154)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1247)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1964)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3200)

Currently, the execOnActiveRM method in WebAppUtils throws an exception stating "No active RM is available" when RMHAUtils.findActiveRMHAId returns null. However, that method also returns null when the caller lacks permission to check the service states. I think at this point we could fall back to the original code and try to find the active RM by iterating through the configured RMs.
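The fallback described above can be sketched roughly as follows. This is a simplified illustration, not the code in the patch: `ActiveRmFallbackSketch`, `execOnActiveRm`, and the `activeIdLookup` supplier are hypothetical stand-ins (the supplier plays the role of RMHAUtils.findActiveRMHAId), and the sketch passes RM ids directly to the callback instead of resolving web app URLs from a Configuration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Supplier;

public class ActiveRmFallbackSketch {

  /**
   * Runs func against the active RM. activeIdLookup is a hypothetical
   * stand-in for the HA lookup and may return null, either because no RM
   * is active or because the caller may not query the HA state.
   */
  static <T, R> R execOnActiveRm(List<String> rmIds,
                                 Supplier<String> activeIdLookup,
                                 BiFunction<String, T, R> func,
                                 T arg) throws Exception {
    String activeId = activeIdLookup.get();
    if (activeId != null && rmIds.contains(activeId)) {
      // Fast path: the lookup told us which RM is active.
      return func.apply(activeId, arg);
    }
    // Fallback: the lookup gave no answer, so try each RM in turn.
    Exception last = null;
    for (String rmId : rmIds) {
      try {
        return func.apply(rmId, arg);
      } catch (RuntimeException e) {
        last = e; // remember the failure and try the next RM
      }
    }
    throw new Exception("No active RM is available", last);
  }

  public static void main(String[] args) throws Exception {
    List<String> rmIds = Arrays.asList("rm1", "rm2");
    // Pretend rm1 is standby (rejects the call) and rm2 is active.
    BiFunction<String, String, String> fetch = (rmId, appId) -> {
      if (!rmId.equals("rm2")) {
        throw new IllegalStateException(rmId + " is not active");
      }
      return "AM logs for " + appId + " from " + rmId;
    };
    // Non-admin user: the lookup returns null, so the fallback loop runs.
    System.out.println(execOnActiveRm(rmIds, () -> null, fetch, "app_1"));
    // prints "AM logs for app_1 from rm2"
  }
}
```

With this shape, an admin user still gets the direct lookup, while a non-admin user pays for failed attempts against standby RMs before reaching the active one.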

The issue only happens in HA mode, and only with the "-am" option; without this option, the AM logs can be retrieved together with the aggregated logs.

How was this patch tested?

Tested manually on a cluster by fetching application logs

  • with and without "-am" option
  • HA and non-HA mode
  • with admin/non-admin user

The aggregated logs and AM logs could be fetched in every case.

When I stopped the first RM and the user could not check which RM was active, the active RM was found only after the connection to the first RM timed out after 30 retries. With an admin user, the active RM was found on the first attempt.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 52s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 92m 28s trunk passed
+1 💚 compile 0m 49s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 42s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 39s trunk passed
+1 💚 mvnsite 0m 53s trunk passed
+1 💚 javadoc 1m 0s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 48s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 45s trunk passed
+1 💚 shadedclient 41m 55s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 35s the patch passed
+1 💚 compile 0m 39s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 39s the patch passed
+1 💚 compile 0m 33s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 33s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 26s the patch passed
+1 💚 mvnsite 0m 35s the patch passed
+1 💚 javadoc 0m 42s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 39s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 44s the patch passed
+1 💚 shadedclient 41m 58s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 5m 47s hadoop-yarn-common in the patch passed.
+1 💚 asflicense 0m 36s The patch does not generate ASF License warnings.
196m 26s
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7813/1/artifact/out/Dockerfile
GITHUB PR #7813
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux e951d5a7eab0 5.15.0-143-generic #153-Ubuntu SMP Fri Jun 13 19:10:45 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / c15cc96
Default Java Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7813/1/testReport/
Max. process+thread count 579 (vs. ulimit of 5500)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7813/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@p-szucs p-szucs marked this pull request as ready for review July 18, 2025 12:45
Contributor

@K0K0V0K K0K0V0K left a comment

Thanks @p-szucs for identifying the problem and working on a fix.
I think the basic idea is good; I just have some nit-level code style comments.

int haIndex = 0;
int activeRMIndex = 0;
int rmCount = 1;

if (HAUtil.isHAEnabled(conf)) {

What if we try an early-return pattern?

if (!HAUtil.isHAEnabled(conf)) {
  // If I understand correctly, there is only one RM here, so we can just try that one
  String rmAddress = getRMWebAppURLWithScheme(conf, 0);
  return func.apply(rmAddress, arg);
}
// try RMHAUtils.findActiveRMHAId(conf)
// if null do the iteration
// finally apply and return

That might be a bit easier to read, because right now we have the

int activeRMIndex = 0;
int rmCount = 1;

code, which relates to the non-HA case, then the HA code, then a mix of both.

if (HAUtil.isHAEnabled(conf)) {
ArrayList<String> rmIds = (ArrayList<String>) HAUtil.getRMHAIds(conf);

Can we cast to the plain List interface here? That might be a bit more robust.
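A small illustration of why the narrower cast is safer, assuming the method is only declared to return a Collection<String>. `RmIdListSketch`, `rmIdsFromConfig`, and `asIndexedList` are hypothetical names for this sketch; the copy in the non-List case is an extra safety net beyond the suggestion itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class RmIdListSketch {

  // Hypothetical stand-in for a method declared to return Collection<String>.
  // The runtime class here is a List, but not java.util.ArrayList.
  static Collection<String> rmIdsFromConfig() {
    return Collections.unmodifiableList(Arrays.asList("rm1", "rm2"));
  }

  // Depend only on the List interface: cast when possible, copy otherwise.
  static List<String> asIndexedList(Collection<String> ids) {
    return ids instanceof List ? (List<String>) ids : new ArrayList<>(ids);
  }

  public static void main(String[] args) {
    Collection<String> ids = rmIdsFromConfig();
    try {
      // This cast depends on the concrete class chosen by the callee.
      ArrayList<String> concrete = (ArrayList<String>) ids;
      System.out.println("ArrayList cast worked: " + concrete);
    } catch (ClassCastException e) {
      System.out.println("ArrayList cast failed");
    }
    List<String> rmIds = asIndexedList(ids);
    System.out.println(rmIds.indexOf("rm2")); // prints 1
  }
}
```

The ArrayList cast only works as long as the callee happens to return that exact class, whereas indexOf and get(i) are available on any List.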

String rmAddress = getRMWebAppURLWithScheme(conf, i);
return func.apply(rmAddress, arg);
} catch (Exception e) {
// Ignore and try next RM if there are any

Maybe a trace-level log would be useful here.
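A rough sketch of what that could look like. To stay self-contained it uses java.util.logging with Level.FINEST as the closest equivalent of a trace level; Hadoop code typically logs through SLF4J, so the real change would use that logger instead. `RmRetryLoggingSketch`, `firstSuccessful`, and the addresses are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

public class RmRetryLoggingSketch {
  private static final Logger LOG =
      Logger.getLogger(RmRetryLoggingSketch.class.getName());

  /** Returns the first successful result, or null if every RM failed. */
  static <R> R firstSuccessful(List<String> rmAddresses,
                               Function<String, R> call) {
    for (String address : rmAddresses) {
      try {
        return call.apply(address);
      } catch (RuntimeException e) {
        // Record why this RM was skipped instead of silently ignoring it.
        LOG.log(Level.FINEST,
            "Request to RM at " + address + " failed, trying the next one", e);
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Placeholder addresses; rm1 is treated as the standby RM.
    List<String> addresses =
        Arrays.asList("http://rm1:8088", "http://rm2:8088");
    String result = firstSuccessful(addresses, address -> {
      if (address.contains("rm1")) {
        throw new IllegalStateException("standby RM");
      }
      return "ok from " + address;
    });
    System.out.println(result); // prints "ok from http://rm2:8088"
  }
}
```

At the default log level nothing extra is printed, so the message only shows up when someone is actively debugging the failover path.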

3 participants