Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide in the FWJR the CPU and wallclock time for CMSSW subprocesses #11665

Merged
merged 2 commits into from
Aug 23, 2023

Conversation

vkuznet
Copy link
Contributor

@vkuznet vkuznet commented Jul 19, 2023

Fixes #11660

Status

In development

Description

This PR provides a new FWJR section called WMCMSSWSubprocess, which contains the job start and end time (local time). In addition, it also provides wallclocktime, user and system time.

Note that this information is not yet propagated through WMArchive, which will be provided with this issue: #11657

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

#11656 , #11657, #11663

External dependencies / deployment changes

None

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 4 warnings
    • 176 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14353/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Jul 19, 2023

@amaltaro I made necessary changes to provide CPU/Mem and time info for each cmsRun job and it would be useful if you'll provide initial review. I still need to adjust unit tests though.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 1 tests added
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 4 warnings
    • 176 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14354/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 7 new failures
    • 1 tests added
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 4 warnings
    • 176 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14355/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 4 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 4 warnings
    • 176 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14356/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 4 warnings
    • 176 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14357/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 1 tests added
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 4 warnings
    • 176 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14358/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkuznet I left a brief review along the code.

src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py Outdated Show resolved Hide resolved
src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py Outdated Show resolved Hide resolved
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 4 new failures
    • 1 tests added
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 4 warnings
    • 175 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14364/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Jul 20, 2023

@amaltaro , I spent more time with this PR and end-up with the following observation and changes:

  • the output of time UNIX command is platform, and, therefore, requires parsing based on specific version of architecture, and may be Linux distribution
  • usage of psutil or python resource module is desired to get unified codebase
  • the usage of psutil is cumbersome as we need to obtain stats from running process
  • and, python resource module seems to be best solution

Here I show how I address all aforementioned issues using pure python solution:

# cat /tmp/subproc.py

import subprocess
import os
from resource import getrusage, RUSAGE_SELF, RUSAGE_CHILDREN

args = ['tree', '/Users/vk/tmp']
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
#ps = psutil.Process(proc.pid)
childPid, exitStatus, resource = os.wait4(proc.pid, os.P_PID)
stdout, stderr = proc.communicate()
print("results: child pid=%s, exit code=%s len of results=%s" % (childPid, exitStatus, len(str(stdout))))
print("resource", resource)
print("system", resource.ru_stime)
print("self", getrusage(RUSAGE_SELF))
print("children", getrusage(RUSAGE_CHILDREN))
print("child syetem CPU", getrusage(RUSAGE_CHILDREN).ru_stime)

Running this script on darwin platform I get the following output:

# time python3 /tmp/subproc.py

results: child pid=0, exit code=0 len of results=4251138
resource resource.struct_rusage(ru_utime=4489086869.692368, ru_stime=140732728986175.6, ru_maxrss=4489086672, ru_ixrss=4493059392, ru_idrss=4492681392, ru_isrss=4493057408, ru_minflt=140732728986544, ru_majflt=4487049074, ru_nswap=4489086672, ru_inblock=4489231440, ru_oublock=4492681392, ru_msgsnd=4494048752, ru_msgrcv=140732728986640, ru_nsignals=4487103223, ru_nvcsw=4493097328, ru_nivcsw=0)
system 140732728986175.6
self resource.struct_rusage(ru_utime=0.047643, ru_stime=0.029029, ru_maxrss=14102528, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=4460, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=1, ru_nivcsw=237)
children resource.struct_rusage(ru_utime=0.218493, ru_stime=0.296701, ru_maxrss=1343488, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=475, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=0, ru_nivcsw=298)
child syetem CPU 0.296701
python3.8 ~/tmp/subproc.py  0.27s user 0.33s system 97% cpu 0.614 total

While running the same command on Debian linux I get the following (I only changed directory for tree command, here I changed args value to ['tree', '/'] when I run it in my pod):

# time python3 /tmp/subproc.py

results: child pid=0, exit code=0 len of results=14688299
resource resource.struct_rusage(ru_utime=139785193971520.0, ru_stime=139785333122281.34, ru_maxrss=139785193854192, ru_ixrss=0, ru_idrss=139785201466048, ru_isrss=139785199893978, ru_minflt=139785201733120, ru_majflt=5408159348949986048, ru_nswap=139785201619392, ru_inblock=5408159348949986048, ru_oublock=139785194010608, ru_msgsnd=139785199911345, ru_msgrcv=139785193787456, ru_nsignals=0, ru_nvcsw=94911867207552, ru_nivcsw=139785193854192)
system 139785333122281.34
self resource.struct_rusage(ru_utime=0.08166999999999999, ru_stime=0.041305, ru_maxrss=32768, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=7460, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=1212, ru_nivcsw=58)
children resource.struct_rusage(ru_utime=0.22930599999999998, ru_stime=0.5738869999999999, ru_maxrss=4644, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=277, ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=1, ru_nivcsw=180)
child syetem CPU 0.5738869999999999

real    0m0.932s
user    0m0.314s
sys     0m0.620s

Few observations:

  • the output of time command on Darwin and Linux is different, and I have not clue how it will be represented on variety of Linux'es, etc. (it may be the same or may not, it may change in future version too).
  • the sys and system output are close enough to represent child subprocess time. I bet their difference comes from overhead (and internal implementation) of time and python resource module. But, I think we can rely on python resource output
  • the code does not depend on third party library anymore (psutil) and therefore has advantage of being using pure python solution

The changes now are committed and I'll wait for unit tests to finish. It also allowed me to address your second issue with changing functionality as I returned to subprocess.Popen call the file descriptors.

Please review again and ask your question or provide your feedback.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 1 tests added
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 4 warnings
    • 172 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14365/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 4 warnings
    • 172 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14366/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 4 warnings
    • 172 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14367/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Jul 20, 2023

Alan, I was able to successfully run unit tests with new PR implementation, but I step down from using stdoutHandle and stderrHandle (open file descriptors). Instead I used subprocess.PIPE and write its content into stdoutHandle and stderrHandle file descriptors. The reason for that is that subprocess.call relies on Popen.wait call internally, see
https://github.com/python/cpython/blob/3.11/Lib/subprocess.py#L389-L391
and, it blocks usage of resources. While using Popen with PIPE allows to redirect stdout/stderr to these PIPE objects and read it later, therefore it does not block subprocess execution and allows to use wait4 call (as I did) which provides resource object. Anyway, now unit tests are find (and it means running CMSSW job unit test subprocess is also fine) and you can review working version of new implementation.

@todor-ivanov
Copy link
Contributor

todor-ivanov commented Jul 25, 2023

hi @vkuznet
I am interested in this development here in the context of the following issue: #11667

So reading your comment above:

the code does not depend on third party library anymore (psutil) and therefore has advantage of being using pure python solution

May I ask, if you and @amaltaro have already converged on using psutils or resources libraries already.
I can see they both are still used in your PR, while to me when it comes to the portion of information related to resource consumption by a specific process, they are supposed to be interchangeable. But in your PR, they are some kind of complementary - one used to fetch the subprocess information the other for fetching the resource utilization information.

I am asking because in the issue I am trying to resolve, we will have to support/test a large range of linux kernel versions: 3.* to 6.* And since there have been a change in the /proc/<pid>/smap* resource reports, somewhere at around kernel version 4.*, it means, whichever library we happen to chose here, it should be confirmed working in all of those kernels. One thing which is also important - IIUC, resources comes as a standard library package, while psutils happens to be an external library, hence it brings additional need for making sure we correctly fetch it at runtime - either by a wrapper module inside the WMCore code or through the equivalently provided psutils library with cmssw

@todor-ivanov
Copy link
Contributor

todor-ivanov commented Jul 25, 2023

Sorry, it seems, there is a detail worth mentioning here.
About:

when it comes to the portion of information related to resource consumption by a specific process, they are supposed to be interchangeable

It seems in some cases psutils and resources may give different answers. Reading this post in stackoverflow: https://stackoverflow.com/questions/63983222/psutil-gives-other-result-than-resource

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 4 warnings
    • 172 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14369/artifact/artifacts/PullRequestReport.html

@makortel
Copy link

makortel commented Aug 1, 2023

I just want to note that in #11604 I requested the elapsed wall clock time to be accompanied with a timestamp of the moment the cmsRun was started.

@amaltaro
Copy link
Contributor

amaltaro commented Aug 2, 2023

@makortel Matti, that was an overlook from my side. Valentin did start this implementation by providing start/end timestamps, while I thought we only wanted to know the wallclock time (elapsed time).

I guess the wallclocktime-only has been requested for the whole job wrapper, while for each cmsRun you want to have:

  1. start timestamp (epoch? utc?)
  2. end timestamp
  3. wallclock time (end - start timestamp)

Could you please confirm this?

@makortel
Copy link

makortel commented Aug 2, 2023

I guess the wallclocktime-only has been requested for the whole job wrapper,

Actually we'd like to have the start timestamp also for the job wrapper (e.g. to see the time difference from job wrapper startup to starting up cmsRun).

while for each cmsRun you want to have:

  1. start timestamp (epoch? utc?)
  2. end timestamp
  3. wallclock time (end - start timestamp)

Either 1+2 or 1+3 would suffice. I don't think all three are needed as (2 or 3) can be calculated from 1 and (3 or 2).

As for the start timestamp format, I'd think the Unix time to be most convenient.

@vkuznet vkuznet requested a review from amaltaro August 14, 2023 13:13
Comment on lines 287 to 288
userTime = getrusage(RUSAGE_CHILDREN).ru_utime
sysTime = getrusage(RUSAGE_CHILDREN).ru_stime

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I've understood, the getrusage() returns the resources the process (tree) has consumed up to the call point. Therefore, in order to record only the resources consumed by the subprocess launched by the Popen() call, the user and system time would have to be recorded both before and after the call to Popen(), and their difference would be the ones stored to the userTime and sysTime member variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matti, good catch, I think you are right here. And I adjusted the code accordingly, see this commit c70cedd

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good by eye, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to add appropriate unit test soon to measure the difference of times of main code flaw and with subprocess call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matti, have a look at new unit test in this commit: 9216dae where I tested logic of getting subprocess time based on your observation.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 4 warnings
    • 172 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 24 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14418/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 17 warnings and errors that must be fixed
    • 8 warnings
    • 194 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 26 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14419/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor Author

vkuznet commented Aug 15, 2023

Alan, this PR is mostly complete and I need your initial review to understand if all necessary pieces are in place. Please have a look at your convenience.

@vkuznet vkuznet requested a review from amaltaro August 15, 2023 14:31
Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valentin, this code is looking good in general. There are just some minor adjustments for you to consider.

src/python/WMCore/FwkJobReport/Report.py Show resolved Hide resolved
src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py Outdated Show resolved Hide resolved
@@ -273,7 +275,27 @@ def execute(self, emulator=None):

os.environ.update(envOverride)

returnCode = subprocess.call(args, stdout=stdoutHandle, stderr=stderrHandle)
startTime = time.time()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to move all of this under a dedicated method, but given that we have to pass in many parameters and retrieve many out of the method, it might be uglier than keeping this code in this method. Anyhow, if you see I way to isolate all these juggling, please go ahead!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alan, I made code better and safer (by using with instead of manual close which can never happen due to try/except block) but I think I didn't find way to make it as independent function though and probably better to keep this way.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 17 warnings and errors that must be fixed
    • 8 warnings
    • 192 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 26 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14431/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested a review from amaltaro August 21, 2023 14:33
Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkuznet Valentin, these changes are looking good to me. Before merging it though, I wanted to ask the following:
a) should we convert all these timing metrics to integers of seconds instead of float? I don't think a fraction of a second hurts these metrics in any shape and form.
b) could you please run another test workflow with this patch in and point me to one of the job logs? I wanted to make sure that we don't messed up with the worker node logging.
c) where is your development for parsing the FJR and pushing it to WMArchive (through the ArchiveDataReporter component)? I assume you are adding it in the PR meant to close this issue #11657?

@vkuznet
Copy link
Contributor Author

vkuznet commented Aug 22, 2023

Alan, regarding your questions:
a) it depends on use-case and in concurrent environment extra precision is valuable, therefore I do not mind to convert to ints, but keeping in floats seems reasonable too for me
b) I submitted new WM (from vocms0291), see wmagent_TaskChain_MC_Agent0291_forceComplete_230822_131344_5472
c) unless I'm mistaken there is no need for extra development to parse FJR to WMArchive doc since XMLParser.py will parse all available sections in FJR and create appropriate JSON document which will be send to WMArchvie. On WMArchive side we also do not need special handling since it reads JSON as is to send it over to MONIT. Said that, I only adjusted WMArchive schema, see PR$362 to include the metrics. But after the FJR report will be generated via step (b) I'll regenerate JSON file from FJR pkl one and post appropriate link over here.

@amaltaro
Copy link
Contributor

@makortel Matti, for the timing measurements, do you have any preference of data type/precision? Right now it's float, but I think seconds in integer would be more than enough. Please let us know your opinion on this.

Valentin, for the point c) above, you need to check the createArchiverDoc method in this module:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WMArchive/DataMap.py#L397

this is how the WMArchive documents gets constructed and it's not clear to me that a new section in the FJR will automatically make it to the final WMArchiver document.

@vkuznet
Copy link
Contributor Author

vkuznet commented Aug 22, 2023

@amaltaro , I added appropriate PR (#11692) which provides metrics to WMArchive service but I was under impression that we separate creation of metrics and their propagation to WMArchive anyway because it would be much easier to test rather then having so many PRs at the same time (I start preparing doc about CPU/Memory, Timing, CMSSW metrics and it has quite a lot of PRs).

@makortel
Copy link

Matti, for the timing measurements, do you have any preference of data type/precision? Right now it's float, but I think seconds in integer would be more than enough. Please let us know your opinion on this.

We would prefer float, so that if a less-than-second measurement occurs, a value different from zero would be visible (for exact 0 one could wonder if the value was actually measured or if the 0 is just a default value). But technically integer would work too (time durations of interest are longer than a second).

@vkuznet
Copy link
Contributor Author

vkuznet commented Aug 22, 2023

@amaltaro , I just checked the job from wmagent_TaskChain_MC_Agent0291_forceComplete_230822_131344_5472 (on cmsweb-testbed) and extracted one of the job pickle files, and decoded it back. Here is relevant subsection for metrics:

FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.sysTime = 94.256681
FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.endTime = 1692718964.7409143
FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.userTime = 6053.7735170000005
FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.startTime = 1692712699.8846245
FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.wallClockTime = 6264.856289863586

and, it is transformed into JSON as following (only showing relevant section):

 'cmsRun1': {'WMCMSSWSubprocess': {'endTime': 1692718964.7409143,
                                   'startTime': 1692712699.8846245,
                                   'sysTime': 94.256681,
                                   'userTime': 6053.7735170000005,
                                   'wallClockTime': 6264.856289863586},
             'analysis': {'files': {'fileCount': 0}},
...

For completeness, here are the links to full report in different formats:

@vkuznet
Copy link
Contributor Author

vkuznet commented Aug 22, 2023

For completeness here is a simple script to transform pkl report to dot (txt) and JSON representations

#!/usr/bin/env python

import sys
import json
import pprint
import pickle

# WMCore modules
from WMCore.FwkJobReport.Report import Report

fname = sys.argv[1]
fout = fname + '.json'
print("Generating report for", fname)
report  = pickle.load(open(fname, 'rb'))
jsonData = report.dictionary_whole_tree_()
with open(fname + '.txt', 'w') as ostream:
    ostream.write(report.__str__())
with open(fname + '.json', 'w') as ostream:
    ostream.write(json.dumps(jsonData))
with open(fname + '_formated.json', 'w') as ostream:
    data = pprint.pformat(jsonData).replace("'", '"')
    ostream.write(data)

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 17 warnings and errors that must be fixed
    • 8 warnings
    • 192 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 26 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14439/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro changed the title Provide CPU/Memory and Elapsed time of CMSSW subprocess based on psutil module Provide in the FWJR the CPU and wallclock time for CMSSW subprocesses Aug 23, 2023
Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are looking good to me, thanks!
To avoid another iteration, I updated the PR title and will have to manually fix the commit message when constructing the release notes. Just FYI

UPDATE: PR description was outdated, I am updating it now.

@amaltaro amaltaro merged commit 10482ee into dmwm:master Aug 23, 2023
@amaltaro
Copy link
Contributor

amaltaro commented Sep 8, 2023

As we keep discussing this and try to have the whole chain functional, I have some updates that are worth sharing.

  1. That the actual FJR used by the ArchiveDataReporter - to upload a document to WMArchive - does not have this new section added in this pull request, called WMCMSSWSubprocess. This can be verified by pulling any job FJR from CouchDB (provided that the agent version is >=2.2.4rc3), e.g.:
curl ***:***@localhost:5984/wmagent_jobdump%2Ffwjrs/537-0 | jq
  1. Each job step does have a start and stop metric though, but that already existed in the FJR previously to this development, e.g.:
        "start": 1694121397,
        "stop": 1694121805,
  1. Job report pickle file does contain this section. Those report pickle files get transferred back to the agent once the job finishes and one example location to find it in a tarball is install/wmagentpy3/JobArchiver/logDir/a/amaltaro_TaskChain_ProdMinBias_Sept2023_Val_230907_205436_2542/JobCluster_0/Job_537.tar.bz2 (vocms0192). Upon decompressing it and loading the report pickle file
>>> with open("/data/srv/wmagent/current/Job_537/Report.0.pkl", "rb") as po:
...   data = pickle.load(po)
>>> pprint(data.dictionary_whole_tree_())
{'WMAgentJobID': 537,
 'WMAgentJobName': '893188c8-0783-4db2-8f55-ce060a3bdd4e-71',
 'ceName': 'b7s02p8471.cern.ch',
 'cmsRun1': {'WMCMSSWSubprocess': {'endTime': 1694121805.631746,
                                   'startTime': 1694121432.9059343,
                                   'sysTime': 17.675720000000002,
                                   'userTime': 320.763982,
                                   'wallClockTime': 372.7258117198944},

we can properly find it. Another alternative to see the whole structure of this object is with:

>>> pprint(data.pythonise_())
['FrameworkJobReport.WMAgentJobName = '
...
 "FrameworkJobReport.cmsRun1.section_('WMCMSSWSubprocess')",
 'FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.startTime = 1694121432.9059343',
 'FrameworkJobReport.cmsRun1.WMCMSSWSubprocess.userTime = 320.763982',

Having said that, the place to debug this further would be:
https://github.com/dmwm/WMCore/blob/4170e4a/src/python/WMCore/JobStateMachine/ChangeState.py#L346-L404

which generates the relevant information stored in the wmagent_jobdump/fwjrs CouchDB database. FYI @vkuznet

PS.: Note that the overall timing section defined at the wrapper and called WMTiming is nowhere to be found in these documents.

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 8, 2023

@amaltaro , thanks for looking into this but now I doubt that ChangeState.py is a right place to dig in. Instead, as I expected I found that there is internal MariaDB database on WMAgent which has wmbs_job and within this table there is a fwjr_path column which contains path to FJR pkl file. With this information in hands now I can see how we can move forward:

  • modify ArchiveDataPoller and add access to MariaDB
  • for every job ID load corresponding FJR path from wmbs_job table
  • open and parse FJR pkl file which will give us content of FJR stored in this file during job creation and processing
  • enrich existing fwjr document (which comes from local CouchDB) with information from FJR pkl file
  • send fwjr to WMArchive server

I would appreciate if you'll review this logic and let me know your feedback. To implement it I'll need to know how to pass MariaDB URI to ArchiveDataPoller class. Please let me know how to do it. Once I have that I think I can manage to load FJR pkl file via aforementioned procedure and properly add necessary sections to fwjr (which comes from local CouchDB) and pass it to WMArchive.

@amaltaro
Copy link
Contributor

amaltaro commented Sep 8, 2023

Changing the data source is definitely NOT the right way to proceed. If ArchiveDataReporter is fed with the FJRs stored in CouchDB, what we need to do is to ensure that the CouchDB documents have the proper content.

Please investigate the code I referenced above in ChangeState to ensure that the fwjrDocument object in that code ends with the new metrics.

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 9, 2023

@amaltaro , I do not understand what do you mean by saying "changing data source", which data source? In my view there two different FJR in a system. One is generated within job, this is the one where we put metrics. It is written to the pkl file when job is done. And, another which is create by some component in WMA which writes and update it in local CouchDB. If this is the case in order to get CMSSW, or any other job related metrics we need to read job FJR from pkl file because this is where metrics about job are stored. We can read it in ArchiveDataReporter or in ChangeState to complement the FJR which is circulating in WMA and not associated with a job. We verified already that WMCMSSWSubprocess are present in job FJR and are not present in local couchDB one. Therefore, it just confirms that those are two different FJR reports circulating in WMA, and in order to unify them we need to read pkl file elsewhere. In my view ArchiveDataPoller is proper place as it runs after jobs are done, but the same logic can be put in ChangeState module but I doubt that pkl file will exist at every change of the workflow change, and therefore it reference in MariaDB will exist by the time we change state from processing job to job archive for instance. So fart I see pointless to dig further since I have no clear idea how entire workflow and its transition in different system works. I rather prefer to have dedicated discussion/chat about the strategy. At the moment I see that we can simply dump fwjr object in ChangeState but as I pointed out already it does not create it in a first place, instead it is given jobs FJR objects which are created someplace else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WM measurements of wallclock and cpu time for each cmsRun step
5 participants