Paola's usecase #251
Hi, the code which describes Paola's use case has been committed here: #273. Here is a recipe to run a job:
Here you put your desired prep_id and an approximate time range to scan in WMArchive. The dates in the timerange list are lower/upper bounds in YYYYMMDD format.
Once the job is done, the output will be in prepid.records. It is a JSON file with the following structure:
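The concrete pieces of the recipe were elided above; the following sketch reconstructs them from the examples later in this thread (the prep_id, dates, and file names here are illustrative):

```
# example .spec file (prep_id and timerange are illustrative)
{"spec":{"prep_id":"SOME-PREP-ID","timerange":[20161201,20161213]}, "fields":[]}

# submit the job on the analytix cluster
myspark --spec=myprepid.spec --yarn --records-output=prepid.records --script=RecordFinderFailures

# prepid.records then contains a JSON list of failure records, e.g.
[{"site": "T1_US_FNAL_Disk", "exitCode": 99109, "exitStep": "stageOut1",
  "jobtype": "Processing", "workflow": "..."}]
```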

---
Hi, I am having a problem running myspark: it seems no records are retrieved, even though I know the task had failures in the time range. I don't know if I am doing something wrong.

---
Paola,
you provided a spec with a task name, but originally you requested support for
prep_id. That is the reason the script didn't find the record. I updated the
RecordFinderFailures.py code and included a task-name look-up.
Just update your GitHub code, i.e.
```
cd WMArchive
git pull
```
and you should be able to get your records. I just ran the code in my local
area and I saw your records. Please try it out.
Thanks,
Valentin.
…On 0, Paola Katherine Rozo ***@***.***> wrote:
Hi, I am having a problem running myspark, it seems no records are retrieved, even though I know the task had failures in the time range. I don't know if I am doing something wrong.
I am attaching the file .spec I am using, and the output of myspark.
Thanks.
[archive.zip](https://github.com/dmwm/WMArchive/files/652535/archive.zip)
--
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
#251 (comment)
|
I guess I misunderstood the instructions. I updated the GitHub code, and it worked for the specific task.

---
Paola,
I think you made an error either in the spec file, in the name of the prep_id, or in
the timerange. I used the following spec:
```
{"spec":{"prep_id":"B2G-RunIISummer16MiniAOD","timerange":[20161201,20161213]}, "fields":[]}
```
where I loosened the prep_id pattern and increased the timerange, and I found 208 records.
When I ran this spec with the RecordFinderCMSRun script
```
myspark --spec=paola2.spec --script=RecordFinderCMSRun --yarn --records-output=paola.records
```
I was able to see that the prep_id values which match your pattern
B2G-RunIISummer16MiniAODv2-0012
are the following:
B2G-RunIISummer16MiniAODv2-00129
B2G-RunIISummer16MiniAODv2-00120
so maybe you mistyped the prep_id. And probably I should add prep_id to the output
of FindRecordFailures, since so far it is not there. This would allow you
to see all matches.
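To illustrate why both prep_ids above match, here is a minimal Python sketch of prefix matching (the records list is illustrative, not the actual WMArchive lookup code):

```python
# Illustrative records; in WMArchive these would come from FWJR documents.
records = [
    {"prep_id": "B2G-RunIISummer16MiniAODv2-00129"},
    {"prep_id": "B2G-RunIISummer16MiniAODv2-00120"},
    {"prep_id": "ReReco-Run2016H-v1-09Nov2016-0009"},
]

# The spec's prep_id acts as a prefix pattern: both *-00129 and *-00120
# start with *-0012, so both match.
pattern = "B2G-RunIISummer16MiniAODv2-0012"
matches = [r["prep_id"] for r in records if r["prep_id"].startswith(pattern)]
print(matches)
```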
The bottom line: the code works and it can find data, but you must be precise
with your constraints.
Please let me know if you want to have prep_id values in the output of the
FindRecordFailures script. And feel free to run the script with the proper prep_id
and/or timerange.
Best,
Valentin.
…On 0, Paola Katherine Rozo ***@***.***> wrote:
I guess I misunderstood the instructions. I updated the github code, and it worked for the specific task.
But of course, I need the records by prep_id. So, in this case the .spec file will contain this line:
`{"spec":{"prep_id":"B2G-RunIISummer16MiniAODv2-0012","timerange":[20161203,20161205]}, "fields":[]}`
I used the bd12469 commit, but it did not work.
What else do I need to do?
I'm sorry for the inconveniences.
|
Thanks Valentin.
I executed the myspark job (the spec, command, and output are quoted in the reply below).
According to WMStats, the only workflow with this prep_id for that period of time had these failures (please check "jobfailed": { ). The results are not coherent. What do you think is the problem?

---
Paola,
before going into a discussion about the problem, I would rather figure out how
to compare these results. The URL you provided is hard to read; the JSON
structure there is different from the WMArchive docs. For instance, there are
265 jobfailed entries, while only 44 with the "state": "jobfailed" key-value pair.
The prep_id ReReco-Run2016H-v1-09Nov2016-0009 does not exist in the document
you pointed out.
The output I provided is based on your requirements, and it would be nice
if you could write code which produces a similar structure from your wmstats
URL. Then we can compare the results and discuss any outstanding issues.
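One way to make such a comparison concrete (a minimal sketch, not the project's actual tooling) is to aggregate the myspark failure records by (site, exitCode) and compare the counts against the WMStats jobfailed numbers:

```python
from collections import Counter

# A few records in the shape produced by RecordFinderFailures
# (values taken from the sample output in this thread).
records = [
    {"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1"},
    {"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1"},
    {"site": "T2_CH_CERN_HLT", "exitCode": 50664, "exitStep": "PerformanceError"},
]

# Count failures per (site, exitCode); the same aggregation applied to the
# WMStats jobfailed entries would give directly comparable numbers.
counts = Counter((r["site"], r["exitCode"]) for r in records)
for (site, code), n in sorted(counts.items()):
    print(site, code, n)
```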
Maybe we need to provide more info from WMArchive, since I see that the output
dicts you got from the myspark job have a similar structure; e.g. there are a bunch
of docs with
"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1"
which means that in WMArchive the FWJRs have these attributes but may differ
in other dimensions, e.g. wmagent name, input files, etc.
If you need to look at the full records from WMArchive, you can run the same myspark
job with the RecordFinder script (please use the latest GitHub code though, since I
made a few modifications to support different spec conditions).
Best,
Valentin.
…On 0, Paola Katherine Rozo ***@***.***> wrote:
Thanks Valentin.
I am doing a couple of tests with the prep_id 'ReReco-Run2016H-v1-09Nov2016-0009':
`{"spec":{"prep_id":"ReReco-Run2016H-v1-09Nov2016-0009","timerange":[20161109,20161121]}, "fields":[]}`
I executed:
`myspark --spec=rereco.spec --yarn --records-output=rereco.records --script=RecordFinderFailures`
The result was:
```
[
{"site": "T1_US_FNAL_Disk", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Processing", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_DE_RWTH", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Processing", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_FR_GRIF_IRFU", "exitCode": 50664, "exitStep": "PerformanceError", "jobtype": "Processing", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_CH_CERN_HLT", "exitCode": 50664, "exitStep": "PerformanceError", "jobtype": "Processing", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_US_MIT", "exitCode": 99109, "exitStep": "stageOut1", "jobtype": "Merge", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"},
{"site": "T2_CH_CERN_HLT", "exitCode": 50664, "exitStep": "PerformanceError", "jobtype": "Processing", "workflow": "fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208"}]
```
According to WMStats the only workflow with this prep_id for that period of time, had these failures (please check "jobfailed": { )
https://cmsweb.cern.ch/wmstatsserver/data/jobdetail/fabozzi_Run2016H-v1-ZeroBiasIsolatedBunch4-09Nov2016_8023_161109_181753_3208
The results are not coherent. What do you think is the problem?
|
Do you need to have an account on an analytix cluster machine? If so, how do I get one? Can I do testing on the agent machines? Jen

---
Jen,
yes, you need an account on the analytix cluster; just send a request to
Luca Menichetti <[email protected]>
And no, you can't do testing on the agent machines, because WMArchive requires
submitting jobs to the HDFS cluster, and the agent machines do not have a Hadoop/HDFS
environment for that.
Best,
Valentin.
…On 0, jenimal ***@***.***> wrote:
Do you need to have an account on an analytix cluster machine? if so how do I get one? Can I do testing on the agent machines?
Jen

---
Thanks.

---
Paola requested the following data-searching task:
Given a PrepID and a time frame (a week), she needs to find the failed jobs and their attributes (site, exit code, exit step, job type, workflow, as in the script output above).
The definition of a failed job is that the job failed after all the retries; usually agents will try three times. If a job runs successfully after the 3rd try, we will not collect the information for the previous failed tries, and this job is considered a successful job.
We will only collect the final failed try for the above information; the previous tries will not be collected.
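The failed-job definition above can be sketched in Python (a hypothetical helper, not WMArchive code; `attempt_states` is an assumed ordered list of per-retry outcomes):

```python
def job_failed(attempt_states, max_retries=3):
    """Return True if the job counts as failed: its final try failed.

    attempt_states: ordered per-try outcomes, e.g. ["jobfailed", "success"].
    Agents usually retry up to max_retries times; a success on a later try
    makes the whole job successful, and earlier failed tries are not kept.
    """
    attempts = attempt_states[:max_retries]
    return bool(attempts) and attempts[-1] != "success"

print(job_failed(["jobfailed", "jobfailed", "success"]))   # succeeded on 3rd try
print(job_failed(["jobfailed", "jobfailed", "jobfailed"])) # failed all tries
```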
Currently, PrepID info is at the file level. Seangchan and his group are working on getting PrepID and Campaign info into the top level of the FWJR. We will need to update the WMArchive schema after they put these in production.
Valentin is working on the scripts using the PrepID at the file level. He will provide Paola with a working script for her initial use.