Exception instead of LogError in DeepTauId #40733
A new Issue was created by @VinInn Vincenzo Innocente. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar: can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here.
assign reconstruction
New categories assigned: reconstruction. @mandrenguyen, @clacaputo: you have been requested to review this Pull request/Issue and eventually sign. Thanks.
type tau
Value of …
Yes, I have commented in #28358: if the condition is not satisfied, it means that there has been some memory corruption, so any computations after that point shouldn't be trusted. Therefore, simply reporting in LogError is not sufficient. It was confirmed in some private productions that after the first event with …

If you consider aborting the whole CMSSW job for that an overkill, a possible solution is to disable the modules that can potentially be affected by this memory corruption. E.g., after the condition above has occurred, all tau-related modules are disabled for all consecutive events, and instead of the tau collection a flag indicating that the tau sequences are disabled is stored in the output ROOT file. However, I don't know if such a mechanism is implemented within CMSSW.
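For illustration only, here is a minimal sketch of the "store a validity flag instead of aborting" idea described above. The class name (TauNaNGuard), the branch label ("tausValid"), and the scores_ member are all made up for this sketch; this is not existing DeepTauId code, and whether the framework supports disabling modules mid-job is a separate question.

```cpp
// Hypothetical guard producer: record a per-event validity flag instead of
// aborting the job when a non-finite DeepTau score is seen.
#include <cmath>
#include <memory>
#include <vector>

#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/MessageLogger/interface/MessageLogger.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

class TauNaNGuard : public edm::stream::EDProducer<> {
public:
  explicit TauNaNGuard(const edm::ParameterSet&) { produces<bool>("tausValid"); }

  void produce(edm::Event& event, const edm::EventSetup&) override {
    // scores_ stands in for the DeepTau discriminator outputs of this event.
    bool valid = true;
    for (float s : scores_) {
      if (!std::isfinite(s)) {  // NaN or inf: downstream results cannot be trusted
        valid = false;
        edm::LogError("TauNaNGuard") << "non-finite DeepTau score detected";
        break;
      }
    }
    // Downstream tau modules (or the analyzer) can check this flag
    // instead of the whole job being aborted.
    event.put(std::make_unique<bool>(valid), "tausValid");
  }

private:
  std::vector<float> scores_;  // placeholder for the per-event outputs
};
```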
It would be useful to have full details of the hardware where the error occurred. We may have to develop and deploy some sort of ad-hoc test, or just ask sites to regularly run memory checks.
My understanding is that TensorFlow is used in DeepTauId.
@VinInn hm.. I just got one of these errors in some of my private nano production jobs. It seems that the NaN condition is now occurring more frequently with respect to the past. Here is an example log file. Do you think we can extract something useful from it about the hardware? One interesting observation: it was a crab job with a custom executable that consecutively runs CMSSW instances for each input file. This particular job had two input files, and for both, cmssw crashed on the first event due to the NaN problem. So the behaviour seems to be persistent on that worker node.
HERE WE GO.
I bet we never tested on SSE-only machines...
The log also happens to contain the exact processor model, in case that would be useful.
I failed to find an SSE-only machine at CERN.
I found one node with an Intel(R) Xeon(R) CPU X5650 @ 2.67GHz. It should also be SSE-only.
@kandrosov: would it be easy for you to check that the results of deepTauId are identical on this machine and on an AVX2/AVX512 one?
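Not from the thread, but as a quick way to confirm what instruction sets a given node actually supports before comparing outputs, one can query the x86 feature flags with the GCC/Clang builtins (or simply inspect /proc/cpuinfo):

```cpp
// Standalone check of the host's x86 vector extensions (GCC/Clang builtins),
// e.g. to confirm a worker node is SSE-only before comparing deepTauId outputs.
#include <cstdio>

int main() {
  __builtin_cpu_init();
  std::printf("sse4.2 : %d\n", __builtin_cpu_supports("sse4.2"));
  std::printf("avx    : %d\n", __builtin_cpu_supports("avx"));
  std::printf("avx2   : %d\n", __builtin_cpu_supports("avx2"));
  std::printf("avx512f: %d\n", __builtin_cpu_supports("avx512f"));
  return 0;
}
```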
It's plenty of …
Is it a false positive? The rest is sometimes scary, but clearly in Python or cling...
It’s a false positive. In the ROOT release area they have a filter file one can use with valgrind to ignore those messages.
I'm running …
and btw I also see many …
some improper deletes …
and a large number of memory leaks in TensorFlow.
IIUC, this was fixed in #40105 in 12_6_X (merged in pre5, apparently).
Thanks @slava77, I was suspecting something like that (and sorry if I'm still running an old benchmark (the HEPiX benchmark)).
The NaN in deepTau could be blocking the rereco of 2022.
Here
https://cmssdt.cern.ch/dxr/CMSSW/source/RecoTauTag/RecoTau/plugins/DeepTauId.cc#1296
a LogError should be issued, not what is effectively a job abort.
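For context, the check being discussed throws a cms::Exception when a DeepTau output is non-finite, which ends the job; the issue asks for an edm::LogError instead. Below is a minimal sketch of the two alternatives, with names (pred, tau_index, checkPrediction) assumed for illustration rather than copied from DeepTauId.cc.

```cpp
// Sketch of the point under discussion: throwing ends the job, LogError does not.
#include "FWCore/MessageLogger/interface/MessageLogger.h"
#include "FWCore/Utilities/interface/Exception.h"
#include "FWCore/Utilities/interface/isFinite.h"

void checkPrediction(float pred, unsigned tau_index) {
  if (edm::isNotFinite(pred)) {
    // Current behaviour: the exception propagates and effectively aborts the job.
    throw cms::Exception("DeepTauId") << "invalid prediction = " << pred
                                      << " for tau_index = " << tau_index;
    // Requested alternative: report the problem and let the job continue.
    // edm::LogError("DeepTauId") << "invalid prediction = " << pred
    //                            << " for tau_index = " << tau_index;
  }
}
```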