[diracx] Jobs stuck in the Matched status #7346

Closed
aldbr opened this issue Dec 6, 2023 · 1 comment

aldbr commented Dec 6, 2023

You may have noticed that all the jobs we submitted in the certification environment have been stuck in a Matched status.
By taking a look at a few pilot outputs, I noticed that the JobWrapper was unable to report the status of the jobs:

2023-12-06T12:10:20,425314Z None/[7879]JobWrapper WARN: Failed setting job status ValueError('No dirax token in the proxy file /var/lib/condor/execute/dir_20392/tmpjubbjrey')

As you can see, the JobWrapper tries to interact with diracx and fails because the diracx token is not embedded in the user proxy. But should it be?

  • The user proxy used (/var/lib/condor/execute/dir_20392/tmpjubbjrey) is not supposed to interact with diracx:
subject      : ...
issuer       : ...
identity     : ...
timeleft     : 23:59:58
DIRAC group  : dteam_user
DiracX       : False
path         : /var/lib/condor/execute/dir_20392/tmpjubbjrey
username     : ...
properties   : NormalUser
VOMS         : True
VOMS fqan    : ['/dteam']
  • The configuration confirms that the VO dteam should not interact with diracx:
DiracX
{
  DisabledVOs = dteam, ...
}

So why does DIRAC try to interact with diracx anyway?
Because the ClientSelector, which chooses the service to use, relies on useLegacyAdapter. That function only checks whether the target service is enabled in the configuration (which is the case here); it never looks at the VO of the proxy:

def useLegacyAdapter(system, service=None) -> bool:
    """Should DiracX be used for this service via the legacy adapter mechanism

    :param str system: system name or full name e.g.: Framework/ProxyManager
    :param str service: service name, like 'ProxyManager'.
    :return: bool -- True if DiracX should be used
    """
    system, service = divideFullName(system, service)
    value = gConfigurationData.extractOptionFromCFG(f"/DiracX/LegacyClientEnabled/{system}/{service}")
    return (value or "no").lower() in ("y", "yes", "true", "1")

DiracX
{
  LegacyClientEnabled
  {
    WorkloadManagement
    {
      JobStateUpdate = True
    }
  }
}

An easy fix would be to extract the VO from the proxy and check whether it is one of the disabled VOs.
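
The VO check proposed above could look roughly like the sketch below. It is a standalone illustration, not DIRAC code: `should_use_diracx`, its parameters, and the choice to skip DiracX when the proxy carries no VO are all assumptions made for the example. The option value and the list of disabled VOs are passed in directly instead of being read from the configuration or from a proxy.

```python
from typing import Iterable, Optional

TRUTHY = ("y", "yes", "true", "1")


def legacy_adapter_enabled(option_value: Optional[str]) -> bool:
    """Same truthiness test as the one at the end of useLegacyAdapter."""
    return (option_value or "no").lower() in TRUTHY


def should_use_diracx(
    option_value: Optional[str],
    vo: Optional[str],
    disabled_vos: Iterable[str],
) -> bool:
    """Hypothetical helper: use DiracX only if the service is enabled
    AND the VO of the proxy is not listed in DisabledVOs."""
    if not legacy_adapter_enabled(option_value):
        return False
    # Assumption: a proxy without a VO (e.g. a host certificate)
    # falls back to the legacy DIRAC service.
    if vo is None:
        return False
    return vo not in set(disabled_vos)


# JobStateUpdate is enabled, but dteam is disabled: stay on legacy DIRAC.
print(should_use_diracx("True", "dteam", ["dteam"]))
# Same service, a VO that is not disabled: DiracX may be used.
print(should_use_diracx("True", "othervo", ["dteam"]))
```
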
How to reproduce the issue:

from DIRAC.WorkloadManagementSystem.Client.JobStateUpdateClient import JobStateUpdateClient

from DIRAC.Core.Base.Script import Script
Script.parseCommandLine()

result = JobStateUpdateClient().setJobSite(123, "LCG.LOL")
print(result)

chrisburr commented

After discussing this, I don't think it's feasible:

  • loading the proxy during the client selector will be slow
  • there are edge cases with delegation, host certificates, groups without a VO, ...

Instead we should just require that legacy clients cannot be enabled for as long as DisabledVOs is set. See DIRACGrid/diracx#191
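
As a rough illustration of that constraint (not the actual change made in DIRACGrid/diracx#191), a configuration check could reject any setup where DisabledVOs is non-empty while a legacy client is enabled. The function name `check_legacy_clients_allowed` and the plain nested-dict representation of the DiracX section are assumptions made for this sketch.

```python
TRUTHY = ("y", "yes", "true", "1")


def enabled_legacy_clients(diracx_cfg: dict) -> list:
    """Return 'System/Service' names whose LegacyClientEnabled option is truthy."""
    enabled = []
    for system, services in diracx_cfg.get("LegacyClientEnabled", {}).items():
        for service, value in services.items():
            if str(value).lower() in TRUTHY:
                enabled.append(f"{system}/{service}")
    return enabled


def check_legacy_clients_allowed(diracx_cfg: dict) -> None:
    """Hypothetical validation: raise if legacy clients are enabled
    while at least one VO is listed in DisabledVOs."""
    raw = diracx_cfg.get("DisabledVOs", "")
    disabled = [vo.strip() for vo in raw.split(",") if vo.strip()]
    enabled = enabled_legacy_clients(diracx_cfg)
    if disabled and enabled:
        raise ValueError(
            f"Legacy clients {enabled} cannot be enabled while "
            f"DisabledVOs is set ({disabled})"
        )


# Dict form of the configuration shown in this issue (illustrative only).
cfg = {
    "DisabledVOs": "dteam",
    "LegacyClientEnabled": {"WorkloadManagement": {"JobStateUpdate": "True"}},
}
try:
    check_legacy_clients_allowed(cfg)
except ValueError as exc:
    print(f"Invalid configuration: {exc}")
```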
