Refresh test json templates due to removal of input data #11860

Open
amaltaro opened this issue Jan 14, 2024 · 4 comments · May be fixed by #11949
Assignees
Labels
Technical Debt Used to track issues that address technical needs internal to WM team Testing

Comments

@amaltaro
Contributor

Impact of the bug
WMCore validation in general

Describe the bug
As we get started with the HG2401 / WMAgent 2.2.6 validation, there are many workflows getting stuck in assigned status. Checking MSTransferor logs, one can see that many calls to Rucio are not yielding any results, meaning that data has been completely removed from the grid [1].

How to reproduce it
Inject the relevant test json templates

Expected behavior
Matching those datasets against our test JSON templates suggests that the following templates need to be remade/refactored because the RelVal data is no longer available:
test/data/ReqMgr/requests/Integration/SC_ReDigi_Harvest_Prod.json
test/data/ReqMgr/requests/Integration/SC_PY3_PURecyc.json
test/data/ReqMgr/requests/Integration/TaskChain_PUMCRecyc.json

and for the non-relval data that has been removed (e.g. DQMIO), the following needs to be remade:
test/data/ReqMgr/requests/DMWM/DQMHarvest_RunWhitelist.json
test/data/ReqMgr/requests/Integration/DQMHarvesting_MultiRun.json
test/data/ReqMgr/requests/Integration/DQMHarvesting.json
test/data/ReqMgr/requests/Integration/DQMHarvesting_LumiMask.json
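
The matching of removed datasets against templates could be scripted roughly as follows. This is only a sketch: it assumes the templates are plain JSON files and finds dataset references by their `/Primary/Processed/TIER` shape rather than by specific key names. The function names and the key names used in the example (`InputDataset`, `MCPileup`) are illustrative assumptions, not taken from this issue.

```python
import json
import re
from pathlib import Path

# Assumption: a dataset name has three slash-separated parts, /Primary/Processed/TIER.
DATASET_RE = re.compile(r"^/[^/\s]+/[^/\s]+/[A-Z0-9-]+$")


def collect_datasets(node):
    """Recursively yield every string in parsed JSON that looks like a dataset name."""
    if isinstance(node, dict):
        for value in node.values():
            yield from collect_datasets(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_datasets(item)
    elif isinstance(node, str) and DATASET_RE.match(node):
        yield node


def templates_needing_refresh(template_dir, missing):
    """Map template file name -> sorted datasets it references that are in `missing`."""
    stale = {}
    for path in sorted(Path(template_dir).glob("*.json")):
        with open(path) as handle:
            document = json.load(handle)
        hits = sorted(set(collect_datasets(document)) & set(missing))
        if hits:
            stale[path.name] = hits
    return stale


# Hypothetical template fragment; the key names here are illustrative only.
EXAMPLE_TEMPLATE = {
    "createRequest": {
        "RequestType": "DQMHarvest",
        "InputDataset": "/NoBPTX/Run2016F-23Sep2016-v1/DQMIO",
        "Task1": {"MCPileup": "/Some/Other-v1/GEN-SIM"},
    }
}
```

Calling `templates_needing_refresh("test/data/ReqMgr/requests/Integration", missing)` with the set of removed containers would then list the files that still point at them.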

Additional context and error message
[1] Relevant log from MSTransferor

2024-01-14 00:01:55,786:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2024-01-14 00:01:55,786:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /RelValZMM_14/CMSSW_12_0_0_pre6-120X_mcRun3_2021_realistic_v4-v1/GEN-SIM. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/RelValZMM_14/CMSSW_12_0_0_pre6-120X_mcRun3_2021_realistic_v4-v1/GEN-SIM%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2024-01-14 00:01:55,789:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /RelValQCD_Pt_600_800_14/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/RelValQCD_Pt_600_800_14/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2024-01-14 00:01:55,798:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /NoBPTX/Run2016F-23Sep2016-v1/DQMIO. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/NoBPTX/Run2016F-23Sep2016-v1/DQMIO%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2024-01-14 00:01:55,799:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /BTagMu/Run2022D-10Dec2022-v1/DQMIO. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/BTagMu/Run2022D-10Dec2022-v1/DQMIO%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2024-01-14 00:01:55,801:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Sun, 14 Jan 2024 00:01:55 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
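
To pull the affected container names out of a log like the one above, a small parser along these lines might help. It is a sketch that assumes every failure line follows the format shown here (a container name followed by a Python dict repr of the response); the function name and the shortened sample line are illustrative, not part of WMCore.

```python
import ast
import re

# Pattern inferred from the log lines quoted in this issue; the real format may vary.
LOG_PATTERN = re.compile(
    r"Failure in getBlocksAndSizeRucio function for container (?P<container>\S+)\. "
    r"Response: (?P<response>\{.*\})\s*$"
)


def find_empty_containers(log_lines):
    """Return containers whose Rucio response was HTTP 200 with an empty body."""
    missing = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue
        # The response is logged as a Python dict repr, so literal_eval can parse it.
        try:
            response = ast.literal_eval(match.group("response"))
        except (ValueError, SyntaxError):
            continue
        status_line = response.get("headers", "").split("\r\n", 1)[0]
        is_ok = status_line.split()[1:2] == ["200"]
        if is_ok and response.get("data") == "":
            missing.append(match.group("container"))
    return missing


# Shortened copy of one log line from this issue (most headers trimmed for brevity).
SAMPLE_LINE = (
    r"2024-01-14 00:01:55,798:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio "
    r"function for container /NoBPTX/Run2016F-23Sep2016-v1/DQMIO. Response: "
    r"{'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True"
    r"&name=/NoBPTX/Run2016F-23Sep2016-v1/DQMIO%23%2A', 'data': '', "
    r"'headers': 'HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n'}"
)
```

An HTTP 200 with `Content-Length: 0` is what distinguishes "no blocks left" from a transport failure, which is why the sketch checks both the status line and the empty `data` field.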
@amaltaro amaltaro added Testing Technical Debt Used to track issues that address technical needs internal to WM team labels Jan 14, 2024
@amaltaro amaltaro linked a pull request Mar 26, 2024 that will close this issue
4 tasks
@amaltaro amaltaro self-assigned this Mar 26, 2024
@amaltaro amaltaro moved this from Todo to In Progress in WMCore quarterly developments Mar 26, 2024
@amaltaro
Contributor Author

As I haven't made any progress on this for the last month, I am setting it back to the ToDo queue.

@amaltaro amaltaro moved this from In Progress to ToDo in WMCore quarterly developments Jul 29, 2024
@amaltaro
Contributor Author

Our templates have degraded even further and perhaps half of them are now broken. The most common issues are:

  • pileup not available at the expected location
  • SL6 workflows failing to find condor_chirp, etc.

Here is a short summary of workflows (templates) and the problems found during Agent 2.3.7 validation:

amaltaro_SC_6Steps_PU_Agent237_Val_241017_144446_265
RootEmbeddedFileSequence no input files specified for secondary input source.

amaltaro_TC_6Tasks_PU_Agent237_Val_241017_144428_6763
RootEmbeddedFileSequence no input files specified for secondary input source.

amaltaro_SC_LHE_Ext_Agent237_Val_241017_144920_9028
amaltaro_TaskChain_LumiMask_multiRun_Agent237_Val_241017_144914_4026
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc6_amd64_gcc481', 'scramv1', 'CMSSW', 'CMSSW_7_2_0', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']}
CMSSW Return code: 7002
locale::facet::_S_create_c_locale name not valid
[Errno 2] No such file or directory: '/srv/.gwms.d/bin/condor_chirp': '/srv/.gwms.d/bin/condor_chirp'
WARNING: There already exists /srv/job/WMTaskSpace/cmsRun1/CMSSW_9_3_7 area for SCRAM_ARCH slc6_amd64_gcc630.

amaltaro_TaskChain_MC_Agent237_Val_241017_144942_8735
SL6 broken workflow

amaltaro_TaskChain_Prod_Agent237_Val_241017_144944_6207
An exception of category 'NoSecondaryFiles' occurred while
RootEmbeddedFileSequence no input files specified for secondary input source.

amaltaro_TC_Drop_Rules_Ext_Agent237_Val_241017_144918_1727
SL6 broken workflow

amaltaro_TC_PY3_Data_LumiList_Agent237_Val_241017_144926_546
An exception of category 'PluginLibraryLoadError' occurred while

amaltaro_TC_PY3_TTbarPU_Agent237_Val_241017_144950_8787
An exception of category 'NoSecondaryFiles' occurred while
RootEmbeddedFileSequence no input files specified for secondary input source.

We will have to work on it ASAP.
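
For future validations, the triage above could be roughly automated by bucketing each workflow by the first recognizable error string. The sketch below uses only messages quoted in this comment; the category labels and function name are made up for illustration, and the SC_LHE_Ext workflow is left out because its error text is not clearly separable here.

```python
from collections import Counter

# Substring -> category rules, built from the error messages quoted in this comment.
# The category labels are illustrative, not an established WMCore taxonomy.
RULES = [
    ("NoSecondaryFiles", "pileup unavailable"),
    ("no input files specified for secondary input source", "pileup unavailable"),
    ("condor_chirp", "SL6 broken"),
    ("SL6 broken workflow", "SL6 broken"),
    ("PluginLibraryLoadError", "plugin load error"),
]


def classify(messages):
    """Return the category of the first message matching any rule, else 'unclassified'."""
    for message in messages:
        for needle, category in RULES:
            if needle in message:
                return category
    return "unclassified"


# Workflow name (shortened) -> error lines reported for it in this comment.
REPORTED = {
    "SC_6Steps_PU": ["RootEmbeddedFileSequence no input files specified for secondary input source."],
    "TC_6Tasks_PU": ["RootEmbeddedFileSequence no input files specified for secondary input source."],
    "TaskChain_LumiMask_multiRun": [
        "CMSSW Return code: 7002",
        "[Errno 2] No such file or directory: '/srv/.gwms.d/bin/condor_chirp'",
    ],
    "TaskChain_MC": ["SL6 broken workflow"],
    "TaskChain_Prod": ["An exception of category 'NoSecondaryFiles' occurred while"],
    "TC_Drop_Rules_Ext": ["SL6 broken workflow"],
    "TC_PY3_Data_LumiList": ["An exception of category 'PluginLibraryLoadError' occurred while"],
    "TC_PY3_TTbarPU": ["An exception of category 'NoSecondaryFiles' occurred while"],
}

summary = Counter(classify(lines) for lines in REPORTED.values())
```

With the messages listed here, `summary` comes out as four pileup failures, three SL6 failures and one plugin load error.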

@amaltaro
Contributor Author

Quick update on the SL6 workflows. They are indeed broken at the moment, but still supported in central production. It will be fixed with a new version of glideinWMS (not yet released). Further details in: https://its.cern.ch/jira/browse/CMSPROD-223

@amaltaro
Contributor Author

amaltaro commented Jan 30, 2025

I went through our validation checklist and we can probably reduce our templates to about 20. Here is a breakdown of the templates and how to organize them:

ReReco

  • Template 1: with block and run whitelist, include parents and multicore,
  • Template 2: with skims, with LumiList, with automatic harvesting
    NOTE: a workflow processing the full dataset would probably be too large.

DQMHarvest

  • Template 1: harvesting a full data DQMIO dataset
  • Template 2: with run whitelist, set to harvest in byRun mode
  • Template 3: with LumiList, set to harvest in multirun mode

TaskChain

  • Template 1: MC reading LHE files
  • Template 2: MC from scratch, MC extension, with transient output
  • Template 3: MC from scratch, with Pileup, MC automatic harvesting, different Multicore/Memory/EventStreams/MaxPSS
  • Template 4: MC recycling, PrimaryDataset override, different AcqEra/ProcStr/ProcVer/PrepID/CMSSW
  • Template 5: MC recycling, with Pileup, with KeepOutput false
  • Template 6: Data workflow, with !IncludeParents, with LumiList input, Data automatic harvesting
  • Template 7: MC recycling MINIAODSIM - hence with Dataset start policy
  • Others: ensure a mix of Python2 and Python3-based scram releases

StepChain (almost a dup of the TaskChain templates - or vice-versa)

  • Template 1: MC reading LHE files
  • Template 2: MC from scratch, MC extension, with transient output
  • Template 3: MC from scratch, with Pileup, MC automatic harvesting, different Multicore/Memory/EventStreams/MaxPSS
  • Template 4: MC recycling, PrimaryDataset override, different AcqEra/ProcStr/ProcVer/PrepID/CMSSW
  • Template 5: MC recycling, with Pileup, with KeepOutput false
  • Template 6: Data workflow, with !IncludeParents, with LumiList input, Data automatic harvesting
  • Template 7: MC recycling MINIAODSIM - hence with Dataset start policy
  • Others: with duplicate output module, with multiple PU datasets
    NOTE: I think PrimaryDataset override is not supported - maybe ProcVersion for different steps is also not supported

In addition, I would suggest having one template of each spec under the DMWM directory (ideally the short/fast ones) and leaving all the others under the Integration directory.
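
As a quick sanity check of the "about 20" estimate and the proposed directory split, the numbered templates can be tallied as below. This is a back-of-the-envelope sketch: the short labels are paraphrases of the bullets above, the "Others" bullets are deliberately not counted, and the function name is illustrative.

```python
# Shorthand labels for the seven chain templates, shared by TaskChain and StepChain.
CHAIN_TEMPLATES = [
    "MC reading LHE",
    "MC from scratch, extension, transient output",
    "MC from scratch, pileup, harvesting, resource overrides",
    "MC recycling, PrimaryDataset override",
    "MC recycling, pileup, KeepOutput false",
    "Data, LumiList, harvesting",
    "MC recycling MINIAODSIM",
]

PLAN = {
    "ReReco": ["block/run whitelist, parents, multicore", "skims, LumiList, harvesting"],
    "DQMHarvest": ["full dataset", "run whitelist, byRun", "LumiList, multirun"],
    "TaskChain": list(CHAIN_TEMPLATES),
    "StepChain": list(CHAIN_TEMPLATES),
}


def split_by_directory(plan):
    """Count templates, placing one per spec under DMWM and the rest under Integration."""
    total = sum(len(templates) for templates in plan.values())
    dmwm = len(plan)  # one short/fast template per spec type
    return {"total": total, "DMWM": dmwm, "Integration": total - dmwm}


counts = split_by_directory(PLAN)
```

That gives 19 numbered templates in total, 4 under DMWM and 15 under Integration, before adding any of the "Others" variants.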

Projects
Status: ToDo