Paola's usecase #251
Hi, the code which describes Paola's use case has been committed here: #273. Here is a recipe to run a job:
Here you put your desired prep_id and an approximate time range to scan in WMArchive. The dates in the timerange list are lower/upper bounds in YYYYMMDD format.
Once the job is done, the output will be in prepid.records. It is a JSON file with the following structure:
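The concrete pieces of the recipe were elided above; the following sketch reconstructs them from the examples later in this thread (the prep_id, dates, and file names here are illustrative):

```
# example .spec file (prep_id and timerange are illustrative)
{"spec":{"prep_id":"SOME-PREP-ID","timerange":[20161201,20161213]}, "fields":[]}

# submit the job on the analytix cluster
myspark --spec=myprepid.spec --yarn --records-output=prepid.records --script=RecordFinderFailures

# prepid.records then contains a JSON list of failure records, e.g.
[{"site": "T1_US_FNAL_Disk", "exitCode": 99109, "exitStep": "stageOut1",
  "jobtype": "Processing", "workflow": "..."}]
```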

---
Hi, I am having a problem running myspark: it seems no records are retrieved, even though I know the task had failures in the time range. I don't know if I am doing something wrong.

---
Paola,
you provided a spec with a task name, but originally you requested support for
prep_id. That is the reason the script didn't find the record. I updated the
RecordFinderFailures.py code and included a task-name look-up.
Just update your GitHub code, i.e.
```
cd WMArchive
git pull
```
and you should be able to get your records. I just ran the code in my local
area and I saw your records. Please try it out.
Thanks,
Valentin.
…On 0, Paola Katherine Rozo ***@***.***> wrote:
Hi, I am having a problem running myspark, it seems no records are retrieved, even though I know the task had failures in the time range. I don't know if I am doing something wrong.
I am attaching the file .spec I am using, and the output of myspark.
Thanks.
[archive.zip](https://github.com/dmwm/WMArchive/files/652535/archive.zip)
--
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
#251 (comment)
|
I guess I misunderstood the instructions. I updated the GitHub code, and it worked for the specific task.

---
Paola,
I think you made an error either in the spec file, in the name of the prep_id, or in
the timerange. I used the following spec:
```
{"spec":{"prep_id":"B2G-RunIISummer16MiniAOD","timerange":[20161201,20161213]}, "fields":[]}
```
where I loosened the prep_id pattern and increased the timerange, and I found 208 records.
When I ran this spec with the RecordFinderCMSRun script
```
myspark --spec=paola2.spec --script=RecordFinderCMSRun --yarn --records-output=paola.records
```
I was able to see that the prep_id values which match your pattern
B2G-RunIISummer16MiniAODv2-0012
are the following:
B2G-RunIISummer16MiniAODv2-00129
B2G-RunIISummer16MiniAODv2-00120
so maybe you mistyped the prep_id. And probably I should add prep_id to the output
of FindRecordFailures, since so far it is not there. This would allow you
to see all matches.
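To illustrate why both prep_ids above match, here is a minimal Python sketch of prefix matching (the records list is illustrative, not the actual WMArchive lookup code):

```python
# Illustrative records; in WMArchive these would come from FWJR documents.
records = [
    {"prep_id": "B2G-RunIISummer16MiniAODv2-00129"},
    {"prep_id": "B2G-RunIISummer16MiniAODv2-00120"},
    {"prep_id": "ReReco-Run2016H-v1-09Nov2016-0009"},
]

# The spec's prep_id acts as a prefix pattern: both *-00129 and *-00120
# start with *-0012, so both match.
pattern = "B2G-RunIISummer16MiniAODv2-0012"
matches = [r["prep_id"] for r in records if r["prep_id"].startswith(pattern)]
print(matches)
```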
The bottom line: the code works and it can find data, but you must be precise
with your constraints.
Please let me know if you want to have prep_id values in the output of the
FindRecordFailures script. And feel free to run the script with the proper prep_id
and/or timerange.
Best,
Valentin.
…On 0, Paola Katherine Rozo ***@***.***> wrote:
I guess I misunderstood the instructions. I updated the github code, and it worked for the specific task.
But of course, I need the records by prep_id. So, in this case the .spec file will contain this line:
`{"spec":{"prep_id":"B2G-RunIISummer16MiniAODv2-0012","timerange":[20161203,20161205]}, "fields":[]}`
I used the bd12469 commit, but it did not work.
What else do I need to do?
I'm sorry for the inconveniences.
|
Thanks Valentin.
I executed the myspark job (the spec, command, and output are quoted in the reply below).
According to WMStats, the only workflow with this prep_id for that period of time had these failures (please check "jobfailed": { ). The results are not coherent. What do you think is the problem?

---
Paola,
before going into a discussion about the problem, I would rather figure out how
to compare these results. The URL you provided is hard to read; the JSON
structure there is different from the WMArchive docs. For instance, there are
265 jobfailed entries, while only 44 with the "state": "jobfailed" key-value pair.
The prep_id ReReco-Run2016H-v1-09Nov2016-0009 does not exist in the document
you pointed out.
The output I provided is based on your requirements, and it would be nice
if you could write code which produces a similar structure from your wmstats
URL. Then we can compare the results and discuss any outstanding issues.
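One way to make such a comparison concrete (a minimal sketch, not the project's actual tooling) is to aggregate the myspark failure records by (site, exitCode) and compare the counts against the WMStats jobfailed numbers:

```python
from collections import Counter

# A few records in the shape produced by RecordFinderFailures
# (values taken from the sample output in this thread).
records = [
    {"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1"},
    {"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1"},
    {"site": "T2_CH_CERN_HLT", "exitCode": 50664, "exitStep": "PerformanceError"},
]

# Count failures per (site, exitCode); the same aggregation applied to the
# WMStats jobfailed entries would give directly comparable numbers.
counts = Counter((r["site"], r["exitCode"]) for r in records)
for (site, code), n in sorted(counts.items()):
    print(site, code, n)
```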
Maybe we need to provide more info from WMArchive, since I see that the output
dicts you got from the myspark job have a similar structure; e.g. there are a bunch
of docs with
"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1"
which means that in WMArchive the FWJRs have these attributes but may differ
in other dimensions, e.g. wmagent name, input files, etc.
If you need to look at the full records from WMArchive, you can run the same myspark
job with the RecordFinder script (please use the latest GitHub code though, since I
made a few modifications to support different spec conditions).
Best,
Valentin.
…On 0, Paola Katherine Rozo ***@***.***> wrote:
Thanks Valentin.
I am doing a couple of tests with the prep_id 'ReReco-Run2016H-v1-09Nov2016-0009':
`{"spec":{"prep_id":"ReReco-Run2016H-v1-09Nov2016-0009","timerange":[20161109,20161121]}, "fields":[]}`
I executed:
`myspark --spec=rereco.spec --yarn --records-output=rereco.records --script=RecordFinderFailures`
The result was:
```
[
{"site": "T1_US_FNAL_Disk", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Processing", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_DE_RWTH", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Processing", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_FR_GRIF_IRFU", "exitCode": 50664, "exitStep": "PerformanceError", "jobtype": "Processing", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_CH_CERN_HLT", "exitCode": 50664, "exitStep": "PerformanceError", "jobtype": "Processing", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_CH_CERN_HLT", "exitCode": 50664, "exitStep": "PerformanceError", "jobtype": "Processing", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"}]
```
According to WMStats the only workflow with this prep_id for that period of time, had these failures (please check "jobfailed": { )
https://cmsweb.cern.ch/wmstatsserver/data/jobdetail/fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208
The results are not coherent. What do you think is the problem?
|
Do you need to have an account on an analytix cluster machine? If so, how do I get one? Can I do testing on the agent machines? Jen

---
Jen,
yes, you need an account on the analytix cluster; just send a request to
Luca Menichetti <[email protected]>
And no, you can't do testing on the agent machines, because WMArchive requires
submitting jobs to the HDFS cluster, and the agent machines do not have a Hadoop/HDFS
environment for that.
Best,
Valentin.
…On 0, jenimal ***@***.***> wrote:
Do you need to have an account on an analytix cluster machine? if so how do I get one? Can I do testing on the agent machines?
Jen

---
Thanks.

---
Paola requested the following data-searching task:
Given a PrepID and a time frame (a week), she needs to find the failed jobs and their attributes (site, exit code, exit step, job type, workflow, as in the script output above).
The definition of a failed job is that the job failed after all the retries; usually agents will try three times. If a job runs successfully after the 3rd try, we will not collect the information for the previous failed tries, and this job is considered a successful job.
We will only collect the final failed try for the above information; the previous tries will not be collected.
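The failed-job definition above can be sketched in Python (a hypothetical helper, not WMArchive code; `attempt_states` is an assumed ordered list of per-retry outcomes):

```python
def job_failed(attempt_states, max_retries=3):
    """Return True if the job counts as failed: its final try failed.

    attempt_states: ordered per-try outcomes, e.g. ["jobfailed", "success"].
    Agents usually retry up to max_retries times; a success on a later try
    makes the whole job successful, and earlier failed tries are not kept.
    """
    attempts = attempt_states[:max_retries]
    return bool(attempts) and attempts[-1] != "success"

print(job_failed(["jobfailed", "jobfailed", "success"]))   # succeeded on 3rd try
print(job_failed(["jobfailed", "jobfailed", "jobfailed"])) # failed all tries
```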
Currently, PrepID info is at the file level. Seangchan and his group are working on getting PrepID and Campaign info into the top level of the FWJR. We will need to update the WMArchive schema after they put these in production.
Valentin is working on the scripts using the PrepID at the file level. He will provide Paola with a working script for her initial use.