Provide in the FWJR the CPU and wallclock time for CMSSW subprocesses #11665

vkuznet · 2023-07-19T14:40:40Z

Status

In development

Description

This PR provides a new FWJR section called WMCMSSWSubprocess, which contains the job start and end time (local time). In addition, it also provides wallclocktime, user and system time.

Note that this information is not yet propagated through WMArchive, which will be provided with this issue: #11657

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

#11656 , #11657, #11663

External dependencies / deployment changes

None

cmsdmwmbot · 2023-07-19T14:53:47Z

Jenkins results:

Python3 Unit tests: failed
- 5 new failures
- 1 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 11 warnings and errors that must be fixed
- 4 warnings
- 176 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14353/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-07-19T15:02:23Z

@amaltaro I made necessary changes to provide CPU/Mem and time info for each cmsRun job and it would be useful if you'll provide initial review. I still need to adjust unit tests though.

cmsdmwmbot · 2023-07-19T15:11:49Z

Jenkins results:

Python3 Unit tests: failed
- 3 new failures
- 1 tests added
Python3 Pylint check: failed
- 9 warnings and errors that must be fixed
- 4 warnings
- 176 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14354/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-07-19T15:44:01Z

Jenkins results:

Python3 Unit tests: failed
- 7 new failures
- 1 tests added
Python3 Pylint check: failed
- 9 warnings and errors that must be fixed
- 4 warnings
- 176 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14355/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-07-19T16:46:11Z

Jenkins results:

Python3 Unit tests: failed
- 4 new failures
- 1 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 9 warnings and errors that must be fixed
- 4 warnings
- 176 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14356/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-07-19T17:58:31Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 1 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 9 warnings and errors that must be fixed
- 4 warnings
- 176 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14357/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-07-19T18:51:05Z

Jenkins results:

Python3 Unit tests: failed
- 3 new failures
- 1 tests added
Python3 Pylint check: failed
- 9 warnings and errors that must be fixed
- 4 warnings
- 176 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14358/artifact/artifacts/PullRequestReport.html

amaltaro

@vkuznet I left a brief review along the code.

src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py

cmsdmwmbot · 2023-07-20T15:22:00Z

Jenkins results:

Python3 Unit tests: failed
- 4 new failures
- 1 tests added
Python3 Pylint check: failed
- 14 warnings and errors that must be fixed
- 4 warnings
- 175 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14364/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-07-20T15:33:10Z

@amaltaro , I spent more time with this PR and end-up with the following observation and changes:

the output of time UNIX command is platform, and, therefore, requires parsing based on specific version of architecture, and may be Linux distribution
usage of psutil or python resource module is desired to get unified codebase
the usage of psutil is cumbersome as we need to obtain stats from running process
and, python resource module seems to be best solution

Here I show how I address all aforementioned issues using pure python solution:

# cat /tmp/subproc.py

import subprocess
import os
from resource import getrusage, RUSAGE_SELF, RUSAGE_CHILDREN

args = ['tree', '/Users/vk/tmp']
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
#ps = psutil.Process(proc.pid)
childPid, exitStatus, resource = os.wait4(proc.pid, os.P_PID)
stdout, stderr = proc.communicate()
print("results: child pid=%s, exit code=%s len of results=%s" % (childPid, exitStatus, len(str(stdout))))
print("resource", resource)
print("system", resource.ru_stime)
print("self", getrusage(RUSAGE_SELF))
print("children", getrusage(RUSAGE_CHILDREN))
print("child syetem CPU", getrusage(RUSAGE_CHILDREN).ru_stime)

Running this script on darwin platform I get the following output:

# time python3 /tmp/subproc.py

results: child pid=0, exit code=0 len of results=4251138
resource resource.struct_rusage(ru_utime=4489086869.692368, ru_stime=140732728986175.6, ru_maxrss=4489086672, ru_ixrss=4493059392, ru_idrss=4492681392, ru_isrss=4493057408, ru_minflt=140732728986544, ru_majflt=4487049074, ru_nswap=4489086672, ru_inblock=4489231440, ru_oublock=4492681392, ru_msgsnd=4494048752, ru_msgrcv=140732728986640, ru_nsignals=4487103223, ru_nvcsw=4493097328, ru_nivcsw=0)
system 140732728986175.6
self resource.struct_rusage(ru_utime=0.047643, ru_stime=0.029029, ru_maxrss=14102528, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=4460, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=1, ru_nivcsw=237)
children resource.struct_rusage(ru_utime=0.218493, ru_stime=0.296701, ru_maxrss=1343488, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=475, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=0, ru_nivcsw=298)
child syetem CPU 0.296701
python3.8 ~/tmp/subproc.py  0.27s user 0.33s system 97% cpu 0.614 total

While running the same command on Debian linux I get the following (I only changed directory for tree command, here I changed args value to ['tree', '/'] when I run it in my pod):

# time python3 /tmp/subproc.py

results: child pid=0, exit code=0 len of results=14688299
resource resource.struct_rusage(ru_utime=139785193971520.0, ru_stime=139785333122281.34, ru_maxrss=139785193854192, ru_ixrss=0, ru_idrss=139785201466048, ru_isrss=139785199893978, ru_minflt=139785201733120, ru_majflt=5408159348949986048, ru_nswap=139785201619392, ru_inblock=5408159348949986048, ru_oublock=139785194010608, ru_msgsnd=139785199911345, ru_msgrcv=139785193787456, ru_nsignals=0, ru_nvcsw=94911867207552, ru_nivcsw=139785193854192)
system 139785333122281.34
self resource.struct_rusage(ru_utime=0.08166999999999999, ru_stime=0.041305, ru_maxrss=32768, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=7460, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=1212, ru_nivcsw=58)
children resource.struct_rusage(ru_utime=0.22930599999999998, ru_stime=0.5738869999999999, ru_maxrss=4644, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=277, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=1, ru_nivcsw=180)
child syetem CPU 0.5738869999999999

real    0m0.932s
user    0m0.314s
sys     0m0.620s

Few observations:

the output of time command on Darwin and Linux is different, and I have not clue how it will be represented on variety of Linux'es, etc. (it may be the same or may not, it may change in future version too).
the sys and system output are close enough to represent child subprocess time. I bet their difference comes from overhead (and internal implementation) of time and python resource module. But, I think we can rely on python resource output
the code does not depend on third party library anymore (psutil) and therefore has advantage of being using pure python solution

The changes now are committed and I'll wait for unit tests to finish. It also allowed me to address your second issue with changing functionality as I returned to subprocess.Popen call the file descriptors.

Please review again and ask your question or provide your feedback.

cmsdmwmbot · 2023-07-20T15:47:54Z

Jenkins results:

Python3 Unit tests: failed
- 3 new failures
- 1 tests added
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 4 warnings
- 172 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14365/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-07-20T17:23:00Z

Jenkins results:

Python3 Unit tests: failed
- 1 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 4 warnings
- 172 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14366/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-07-20T17:41:10Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests added
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 4 warnings
- 172 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14367/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-07-20T19:22:49Z

Alan, I was able to successfully run unit tests with new PR implementation, but I step down from using stdoutHandle and stderrHandle (open file descriptors). Instead I used subprocess.PIPE and write its content into stdoutHandle and stderrHandle file descriptors. The reason for that is that subprocess.call relies on Popen.wait call internally, see
https://github.com/python/cpython/blob/3.11/Lib/subprocess.py#L389-L391
and, it blocks usage of resources. While using Popen with PIPE allows to redirect stdout/stderr to these PIPE objects and read it later, therefore it does not block subprocess execution and allows to use wait4 call (as I did) which provides resource object. Anyway, now unit tests are find (and it means running CMSSW job unit test subprocess is also fine) and you can review working version of new implementation.

todor-ivanov · 2023-07-25T11:24:38Z

hi @vkuznet
I am interested in this development here in the context of the following issue: #11667

So reading your comment above:

the code does not depend on third party library anymore (psutil) and therefore has advantage of being using pure python solution

May I ask, if you and @amaltaro have already converged on using psutils or resources libraries already.
I can see they both are still used in your PR, while to me when it comes to the portion of information related to resource consumption by a specific process, they are supposed to be interchangeable. But in your PR, they are some kind of complementary - one used to fetch the subprocess information the other for fetching the resource utilization information.

I am asking because in the issue I am trying to resolve, we will have to support/test a large range of linux kernel versions: 3.* to 6.* And since there have been a change in the /proc/<pid>/smap* resource reports, somewhere at around kernel version 4.*, it means, whichever library we happen to chose here, it should be confirmed working in all of those kernels. One thing which is also important - IIUC, resources comes as a standard library package, while psutils happens to be an external library, hence it brings additional need for making sure we correctly fetch it at runtime - either by a wrapper module inside the WMCore code or through the equivalently provided psutils library with cmssw

todor-ivanov · 2023-07-25T11:55:46Z

Sorry, it seems, there is a detail worth mentioning here.
About:

when it comes to the portion of information related to resource consumption by a specific process, they are supposed to be interchangeable

It seems in some cases psutils and resources may give different answers. Reading this post in stackoverflow: https://stackoverflow.com/questions/63983222/psutil-gives-other-result-than-resource

cmsdmwmbot · 2023-07-25T12:09:39Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 4 warnings
- 172 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14369/artifact/artifacts/PullRequestReport.html

makortel · 2023-08-01T21:00:21Z

I just want to note that in #11604 I requested the elapsed wall clock time to be accompanied with a timestamp of the moment the cmsRun was started.

amaltaro · 2023-08-02T02:32:23Z

@makortel Matti, that was an overlook from my side. Valentin did start this implementation by providing start/end timestamps, while I thought we only wanted to know the wallclock time (elapsed time).

I guess the wallclocktime-only has been requested for the whole job wrapper, while for each cmsRun you want to have:

start timestamp (epoch? utc?)
end timestamp
wallclock time (end - start timestamp)

Could you please confirm this?

makortel · 2023-08-02T14:57:05Z

I guess the wallclocktime-only has been requested for the whole job wrapper,

Actually we'd like to have the start timestamp also for the job wrapper (e.g. to see the time difference from job wrapper startup to starting up cmsRun).

while for each cmsRun you want to have:

start timestamp (epoch? utc?)

end timestamp

wallclock time (end - start timestamp)

Either 1+2 or 1+3 would suffice. I don't think all three are needed as (2 or 3) can be calculated from 1 and (3 or 2).

As for the start timestamp format, I'd think the Unix time to be most convenient.

makortel · 2023-08-15T13:33:51Z

src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py

+        userTime = getrusage(RUSAGE_CHILDREN).ru_utime
+        sysTime = getrusage(RUSAGE_CHILDREN).ru_stime


As far as I've understood, the getrusage() returns the resources the process (tree) has consumed up to the call point. Therefore, in order to record only the resources consumed by the subprocess launched by the Popen() call, the user and system time would have to be recorded both before and after the call to Popen(), and their difference would be the ones stored to the userTime and sysTime member variables.

Matti, good catch, I think you are right here. And I adjusted the code accordingly, see this commit c70cedd

Looks good by eye, thanks!

I'll try to add appropriate unit test soon to measure the difference of times of main code flaw and with subprocess call

Matti, have a look at new unit test in this commit: 9216dae where I tested logic of getting subprocess time based on your observation.

cmsdmwmbot · 2023-08-15T13:57:56Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 tests no longer failing
- 1 tests added
- 2 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 4 warnings
- 172 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14418/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-08-15T14:20:23Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests no longer failing
- 2 tests added
- 2 changes in unstable tests
Python3 Pylint check: failed
- 17 warnings and errors that must be fixed
- 8 warnings
- 194 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 26 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14419/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-08-15T14:31:13Z

Alan, this PR is mostly complete and I need your initial review to understand if all necessary pieces are in place. Please have a look at your convenience.

amaltaro

Valentin, this code is looking good in general. There are just some minor adjustments for you to consider.

src/python/WMCore/FwkJobReport/Report.py

src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py

amaltaro · 2023-08-18T21:39:06Z

src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py

@@ -273,7 +275,27 @@ def execute(self, emulator=None):

        os.environ.update(envOverride)

-        returnCode = subprocess.call(args, stdout=stdoutHandle, stderr=stderrHandle)
+        startTime = time.time()


It would be great to move all of this under a dedicated method, but given that we have to pass in many parameters and retrieve many out of the method, it might be uglier than keeping this code in this method. Anyhow, if you see I way to isolate all these juggling, please go ahead!

Alan, I made code better and safer (by using with instead of manual close which can never happen due to try/except block) but I think I didn't find way to make it as independent function though and probably better to keep this way.

cmsdmwmbot · 2023-08-21T14:22:10Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests no longer failing
- 2 tests added
- 2 changes in unstable tests
Python3 Pylint check: failed
- 17 warnings and errors that must be fixed
- 8 warnings
- 192 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 26 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14431/artifact/artifacts/PullRequestReport.html

amaltaro

@vkuznet Valentin, these changes are looking good to me. Before merging it though, I wanted to ask the following:
a) should we convert all these timing metrics to integers of seconds instead of float? I don't think a fraction of a second hurts these metrics in any shape and form.
b) could you please run another test workflow with this patch in and point me to one of the job logs? I wanted to make sure that we don't messed up with the worker node logging.
c) where is your development for parsing the FJR and pushing it to WMArchive (through the ArchiveDataReporter component)? I assume you are adding it in the PR meant to close this issue #11657?

vkuznet · 2023-08-22T13:24:41Z

Alan, regarding your questions:
a) it depends on use-case and in concurrent environment extra precision is valuable, therefore I do not mind to convert to ints, but keeping in floats seems reasonable too for me
b) I submitted new WM (from vocms0291), see wmagent_TaskChain_MC_Agent0291_forceComplete_230822_131344_5472
c) unless I'm mistaken there is no need for extra development to parse FJR to WMArchive doc since XMLParser.py will parse all available sections in FJR and create appropriate JSON document which will be send to WMArchvie. On WMArchive side we also do not need special handling since it reads JSON as is to send it over to MONIT. Said that, I only adjusted WMArchive schema, see PR$362 to include the metrics. But after the FJR report will be generated via step (b) I'll regenerate JSON file from FJR pkl one and post appropriate link over here.

amaltaro · 2023-08-22T13:33:07Z

@makortel Matti, for the timing measurements, do you have any preference of data type/precision? Right now it's float, but I think seconds in integer would be more than enough. Please let us know your opinion on this.

Valentin, for the point c) above, you need to check the createArchiverDoc method in this module:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WMArchive/DataMap.py#L397

this is how the WMArchive documents gets constructed and it's not clear to me that a new section in the FJR will automatically make it to the final WMArchiver document.

vkuznet · 2023-08-22T14:06:17Z

@amaltaro , I added appropriate PR (#11692) which provides metrics to WMArchive service but I was under impression that we separate creation of metrics and their propagation to WMArchive anyway because it would be much easier to test rather then having so many PRs at the same time (I start preparing doc about CPU/Memory, Timing, CMSSW metrics and it has quite a lot of PRs).

makortel · 2023-08-22T14:14:31Z

Matti, for the timing measurements, do you have any preference of data type/precision? Right now it's float, but I think seconds in integer would be more than enough. Please let us know your opinion on this.

We would prefer float, so that if a less-than-second measurement occurs, a value different from zero would be visible (for exact 0 one could wonder if the value was actually measured or if the 0 is just a default value). But technically integer would work too (time durations of interest are longer than a second).

vkuznet · 2023-08-22T16:42:34Z

@amaltaro , I just checked the job from wmagent_TaskChain_MC_Agent0291_forceComplete_230822_131344_5472 (on cmsweb-testbed) and extracted one of the job pickle files, and decoded it back. Here is relevant subsection for metrics:

FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.sysTime = 94.256681
FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.endTime = 1692718964.7409143
FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.userTime = 6053.7735170000005
FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.startTime = 1692712699.8846245
FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.wallClockTime = 6264.856289863586

and, it is transformed into JSON as following (only showing relevant section):

 'cmsRun1': {'WMCMSSWSubprocess': {'endTime': 1692718964.7409143,
                                   'startTime': 1692712699.8846245,
                                   'sysTime': 94.256681,
                                   'userTime': 6053.7735170000005,
                                   'wallClockTime': 6264.856289863586},
             'analysis': {'files': {'fileCount': 0}},
...

For completeness, here are the links to full report in different formats:

vkuznet · 2023-08-22T16:45:05Z

For completeness here is a simple script to transform pkl report to dot (txt) and JSON representations

#!/usr/bin/env python

import sys
import json
import pprint
import pickle

# WMCore modules
from WMCore.FwkJobReport.Report import Report

fname = sys.argv[1]
fout = fname + '.json'
print("Generating report for", fname)
report  = pickle.load(open(fname, 'rb'))
jsonData = report.dictionary_whole_tree_()
with open(fname + '.txt', 'w') as ostream:
    ostream.write(report.__str__())
with open(fname + '.json', 'w') as ostream:
    ostream.write(json.dumps(jsonData))
with open(fname + '_formated.json', 'w') as ostream:
    data = pprint.pformat(jsonData).replace("'", '"')
    ostream.write(data)

cmsdmwmbot · 2023-08-23T18:04:41Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 tests added
- 2 changes in unstable tests
Python3 Pylint check: failed
- 17 warnings and errors that must be fixed
- 8 warnings
- 192 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 26 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14439/artifact/artifacts/PullRequestReport.html

amaltaro

These changes are looking good to me, thanks!
To avoid another iteration, I updated the PR title and will have to manually fix the commit message when constructing the release notes. Just FYI

UPDATE: PR description was outdated, I am updating it now.

amaltaro · 2023-09-08T02:26:46Z

As we keep discussing this and try to have the whole chain functional, I have some updates that are worth sharing.

That the actual FJR used by the ArchiveDataReporter - to upload a document to WMArchive - does not have this new section added in this pull request, called WMCMSSWSubprocess. This can be verified by pulling any job FJR from CouchDB (provided that the agent version is >=2.2.4rc3), e.g.:

curl ***:***@localhost:5984/wmagent_jobdump%2Ffwjrs/537-0 | jq

Each job step does have a start and stop metric though, but that already existed in the FJR previously to this development, e.g.:

        "start": 1694121397,
        "stop": 1694121805,

Job report pickle file does contain this section. Those report pickle files get transferred back to the agent once the job finishes and one example location to find it in a tarball is install/wmagentpy3/JobArchiver/logDir/a/amaltaro_TaskChain_ProdMinBias_Sept2023_Val_230907_205436_2542/JobCluster_0/Job_537.tar.bz2 (vocms0192). Upon decompressing it and loading the report pickle file

>>> with open("/data/srv/wmagent/current/Job_537/Report.0.pkl", "rb") as po:
...   data = pickle.load(po)
>>> pprint(data.dictionary_whole_tree_())
{'WMAgentJobID': 537,
 'WMAgentJobName': '893188c8-0783-4db2-8f55-ce060a3bdd4e-71',
 'ceName': 'b7s02p8471.cern.ch',
 'cmsRun1': {'WMCMSSWSubprocess': {'endTime': 1694121805.631746,
                                   'startTime': 1694121432.9059343,
                                   'sysTime': 17.675720000000002,
                                   'userTime': 320.763982,
                                   'wallClockTime': 372.7258117198944},

we can properly find it. Another alternative to see the whole structure of this object is with:

>>> pprint(data.pythonise_())
['FrameworkJobReport.WMAgentJobName = '
...
 "FrameworkJobReport.cmsRun1.section_('WMCMSSWSubprocess')",
 'FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.startTime = 1694121432.9059343',
 'FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.userTime = 320.763982',

Having said that, the place to debug this further would be:
https://github.com/dmwm/WMCore/blob/4170e4a/src/python/WMCore/JobStateMachine/ChangeState.py#L346-L404

which generates the relevant information stored in the wmagent_jobdump/fwjrs CouchDB database. FYI @vkuznet

PS.: Note that the overall timing section defined at the wrapper and called WMTiming is nowhere to be found in these documents.

vkuznet · 2023-09-08T18:30:33Z

@amaltaro , thanks for looking into this but now I doubt that ChangeState.py is a right place to dig in. Instead, as I expected I found that there is internal MariaDB database on WMAgent which has wmbs_job and within this table there is a fwjr_path column which contains path to FJR pkl file. With this information in hands now I can see how we can move forward:

modify ArchiveDataPoller and add access to MariaDB
for every job ID load corresponding FJR path from wmbs_job table
open and parse FJR pkl file which will give us content of FJR stored in this file during job creation and processing
enrich existing fwjr document (which comes from local CouchDB) with information from FJR pkl file
send fwjr to WMArchive server

I would appreciate if you'll review this logic and let me know your feedback. To implement it I'll need to know how to pass MariaDB URI to ArchiveDataPoller class. Please let me know how to do it. Once I have that I think I can manage to load FJR pkl file via aforementioned procedure and properly add necessary sections to fwjr (which comes from local CouchDB) and pass it to WMArchive.

amaltaro · 2023-09-08T19:54:05Z

Changing the data source is definitely NOT the right way to proceed. If ArchiveDataReporter is fed with the FJRs stored in CouchDB, what we need to do is to ensure that the CouchDB documents have the proper content.

Please investigate the code I referenced above in ChangeState to ensure that the fwjrDocument object in that code ends with the new metrics.

vkuznet · 2023-09-09T19:21:44Z

@amaltaro , I do not understand what do you mean by saying "changing data source", which data source? In my view there two different FJR in a system. One is generated within job, this is the one where we put metrics. It is written to the pkl file when job is done. And, another which is create by some component in WMA which writes and update it in local CouchDB. If this is the case in order to get CMSSW, or any other job related metrics we need to read job FJR from pkl file because this is where metrics about job are stored. We can read it in ArchiveDataReporter or in ChangeState to complement the FJR which is circulating in WMA and not associated with a job. We verified already that WMCMSSWSubprocess are present in job FJR and are not present in local couchDB one. Therefore, it just confirms that those are two different FJR reports circulating in WMA, and in order to unify them we need to read pkl file elsewhere. In my view ArchiveDataPoller is proper place as it runs after jobs are done, but the same logic can be put in ChangeState module but I doubt that pkl file will exist at every change of the workflow change, and therefore it reference in MariaDB will exist by the time we change state from processing job to job archive for instance. So fart I see pointless to dig further since I have no clear idea how entire workflow and its transition in different system works. I rather prefer to have dedicated discussion/chat about the strategy. At the moment I see that we can simply dump fwjr object in ChangeState but as I pointed out already it does not create it in a first place, instead it is given jobs FJR objects which are created someplace else.

vkuznet added PR: Do not merge yet PR: Work in progress labels Jul 19, 2023

vkuznet requested a review from amaltaro July 19, 2023 14:40

vkuznet self-assigned this Jul 19, 2023

vkuznet force-pushed the fix-issue-11660 branch from 9d34e29 to 4244bfb Compare July 19, 2023 15:03

vkuznet force-pushed the fix-issue-11660 branch from 9cc77d4 to eaa4b8f Compare July 19, 2023 18:41

amaltaro reviewed Jul 19, 2023

View reviewed changes

src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py Outdated Show resolved Hide resolved

src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py Outdated Show resolved Hide resolved

vkuznet force-pushed the fix-issue-11660 branch from f8ea0c5 to f5e1f82 Compare July 20, 2023 15:39

vkuznet force-pushed the fix-issue-11660 branch from 8cf7cf0 to dad8c3f Compare July 20, 2023 17:30

makortel mentioned this pull request Aug 1, 2023

WM measurements of wallclock and cpu time for each cmsRun step #11660

Closed

vkuznet requested a review from amaltaro August 14, 2023 13:13

makortel reviewed Aug 15, 2023

View reviewed changes

makortel mentioned this pull request Aug 15, 2023

Communicate with WM about CMSSW I/O metric reporting cms-sw/framework-team#527

Closed

vkuznet requested a review from amaltaro August 15, 2023 14:31

amaltaro requested changes Aug 18, 2023

View reviewed changes

vkuznet requested a review from amaltaro August 21, 2023 14:33

amaltaro approved these changes Aug 21, 2023

View reviewed changes

vkuznet mentioned this pull request Aug 23, 2023

Add WMCMSSWSubprocess and WMTiming metrics to WMArchvie document #11692

Merged

vkuznet added 2 commits August 23, 2023 13:49

Add CPU/Mem/Time info for CMSSW subprocess

2e6a63d

Adjust unit tests

429078f

vkuznet force-pushed the fix-issue-11660 branch from 663b68b to 429078f Compare August 23, 2023 17:51

amaltaro changed the title ~~Provide CPU/Memory and Elapsed time of CMSSW subprocess based on psutil module~~ Provide in the FWJR the CPU and wallclock time for CMSSW subprocesses Aug 23, 2023

amaltaro approved these changes Aug 23, 2023

View reviewed changes

amaltaro merged commit 10482ee into dmwm:master Aug 23, 2023

This was referenced Sep 12, 2023

Introduce DEBUG workflow with minimum set of parameters and fake job to simplify debugging process #11714

Open

Adds WMCMSSWSubprocess metrics to FJR document #11716

Merged

		userTime = getrusage(RUSAGE_CHILDREN).ru_utime
		sysTime = getrusage(RUSAGE_CHILDREN).ru_stime

Provide in the FWJR the CPU and wallclock time for CMSSW subprocesses #11665

Provide in the FWJR the CPU and wallclock time for CMSSW subprocesses #11665

Conversation

vkuznet commented Jul 19, 2023 • edited by amaltaro Loading

Status

Description

Is it backward compatible (if not, which system it affects?)

Related PRs

External dependencies / deployment changes

cmsdmwmbot commented Jul 19, 2023

vkuznet commented Jul 19, 2023

cmsdmwmbot commented Jul 19, 2023

cmsdmwmbot commented Jul 19, 2023

cmsdmwmbot commented Jul 19, 2023

cmsdmwmbot commented Jul 19, 2023

cmsdmwmbot commented Jul 19, 2023

amaltaro left a comment

Choose a reason for hiding this comment

cmsdmwmbot commented Jul 20, 2023

vkuznet commented Jul 20, 2023

cmsdmwmbot commented Jul 20, 2023

cmsdmwmbot commented Jul 20, 2023

cmsdmwmbot commented Jul 20, 2023

vkuznet commented Jul 20, 2023

todor-ivanov commented Jul 25, 2023 • edited Loading

todor-ivanov commented Jul 25, 2023 • edited Loading

cmsdmwmbot commented Jul 25, 2023

makortel commented Aug 1, 2023

amaltaro commented Aug 2, 2023

makortel commented Aug 2, 2023

makortel Aug 15, 2023

Choose a reason for hiding this comment

vkuznet Aug 15, 2023

Choose a reason for hiding this comment

makortel Aug 15, 2023

Choose a reason for hiding this comment

vkuznet Aug 15, 2023

Choose a reason for hiding this comment

vkuznet Aug 15, 2023

Choose a reason for hiding this comment

cmsdmwmbot commented Aug 15, 2023

cmsdmwmbot commented Aug 15, 2023

vkuznet commented Aug 15, 2023

amaltaro left a comment

Choose a reason for hiding this comment

amaltaro Aug 18, 2023

Choose a reason for hiding this comment

vkuznet Aug 21, 2023

Choose a reason for hiding this comment

cmsdmwmbot commented Aug 21, 2023

amaltaro left a comment

Choose a reason for hiding this comment

vkuznet commented Aug 22, 2023

amaltaro commented Aug 22, 2023

vkuznet commented Aug 22, 2023

makortel commented Aug 22, 2023

vkuznet commented Aug 22, 2023

vkuznet commented Aug 22, 2023

cmsdmwmbot commented Aug 23, 2023

amaltaro left a comment • edited Loading

Choose a reason for hiding this comment

amaltaro commented Sep 8, 2023

vkuznet commented Sep 8, 2023

amaltaro commented Sep 8, 2023

vkuznet commented Sep 9, 2023

vkuznet commented Jul 19, 2023 •

edited by amaltaro

Loading

todor-ivanov commented Jul 25, 2023 •

edited

Loading

todor-ivanov commented Jul 25, 2023 •

edited

Loading

amaltaro left a comment •

edited

Loading