Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FileReadError exception leading to 8001 exit code instead of 8021 #42179

Closed
makortel opened this issue Jul 3, 2023 · 27 comments
Closed

FileReadError exception leading to 8001 exit code instead of 8021 #42179

makortel opened this issue Jul 3, 2023 · 27 comments

Comments

@makortel
Copy link
Contributor

makortel commented Jul 3, 2023

A log file analyzed in #40437 (comment) shows a case where FileReadError exception lead to 8001 exit code instead of 8021. The job used CMSSW_12_4_14_patch1. Based on the exception message, it seems to me the exception in question is thrown here

try {
TTreeCache* cache = selectCache(branch, entryNumber);
filePtr_->SetCacheRead(cache);
branch->GetEntry(entryNumber);
filePtr_->SetCacheRead(nullptr);
} catch (cms::Exception const& e) {
// We make sure the treeCache_ is detached from the file,
// so that ROOT does not also delete it.
filePtr_->SetCacheRead(nullptr);
Exception t(errors::FileReadError, "", e);
t.addContext(std::string("Reading branch ") + branch->GetName());
throw t;

@makortel
Copy link
Contributor Author

makortel commented Jul 3, 2023

assign core

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 3, 2023

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 3, 2023

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

makortel commented Dec 14, 2023

Another case was reported in https://cms-talk.web.cern.ch/t/corrupted-files/32728, in CMSSW_13_0_15. Snippet from the log

== CMSSW: Begin processing the 317796th record. Run 360459, Event 544665755, LumiSection 259 on stream 0 at 14-Dec-2023 01:11:47.953 EET
== CMSSW: R__unzipLZMA: error 9 in lzma_code
== CMSSW: ----- Begin Fatal Exception 14-Dec-2023 01:11:50 EET-----------------------
== CMSSW: An exception of category 'FileReadError' occurred while
== CMSSW:    [0] Processing  Event run: 360459 lumi: 259 event: 544665755 stream: 0
== CMSSW:    [1] Running path 'MySEARCH'
== CMSSW:    [2] Prefetching for module BadParticleFilter/'BadPFMuonFilterUpdateDz'
== CMSSW:    [3] While reading from source std::vector<pat::PackedCandidate> packedPFCandidates '' PAT
== CMSSW:    [4] Reading branch patPackedCandidates_packedPFCandidates__PAT.
== CMSSW:    Additional Info:
== CMSSW:       [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
== CMSSW: fNbytes = 236064, fKeylen = 143, fObjlen = 4399508, noutot = 0, nout=0, nin=235921, nbuf=4399508
== CMSSW:
== CMSSW: ----- End Fatal Exception -------------------------------------------------

@makortel
Copy link
Contributor Author

Could we have a catch (cms::Exception e) somewhere that slices the caught exception? That could explain the message to be passed, but the return code to become the generic 8001 that cms::Exception returns.

@makortel
Copy link
Contributor Author

I didn't find any slicing catch () in the framework with git grep. I also didn't find anything suspicious by visual inspection in ProductResolver.cc and Path.cc (looking places that added context to the exception message).

@wddgit Could you look into this? I'd hope the problem would be reproducible by just throwing the edm::Exception in place of the try-catch I pointed to in the issue description.

@belforte
Copy link

having the correct exit code will facilitate automated reporting, checking and fixing of corrupted files a topic which has now reached near the top of the queue for DataTransfer Ops https://indico.cern.ch/event/1356295/

@belforte
Copy link

We surely have plenty of examples of

== CMSSW:       [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
== CMSSW: fNbytes = 236064, fKeylen = 143, fObjlen = 4399508, noutot = 0, nout=0, nin=235921, nbuf=4399508

which exit with 8021. I did not check in those cases if there's a lzma report as well, it seemed enough to trigger a file checksum verification to me.
Let me know if you like some examples.

@makortel
Copy link
Contributor Author

We surely have plenty of examples of

== CMSSW:       [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
== CMSSW: fNbytes = 236064, fKeylen = 143, fObjlen = 4399508, noutot = 0, nout=0, nin=235921, nbuf=4399508

which exit with 8021.

Could you give a pointer to the log of this case, or paste the full CMSSW exception message here?

@wddgit
Copy link
Contributor

wddgit commented Dec 14, 2023

I'll start looking at this after lunch.

@belforte
Copy link

belforte commented Dec 14, 2023

here's one such job:
https://cmsweb.cern.ch:8443/scheddmon/0119/jgomespi/231205_134936:jgomespi_crab_DYJetsToLL_M-50_postVFP/job_out.808.16.txt

exception seems very similar

== CMSSW: ----- Begin Fatal Exception 07-Dec-2023 09:22:35 UTC-----------------------
== CMSSW: An exception of category 'FileReadError' occurred while
== CMSSW:    [0] Rethrowing an exception that happened on a different thread.
== CMSSW:    [1] Reading branch recoGenParticles_prunedGenParticles__PAT.
== CMSSW:    Additional Info:
== CMSSW:       [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
== CMSSW: basket: has fNevBuf=0 but fEntryOffset=0, pos=2326698653, len=18666, fNbytes=0, fObjlen=0, trying to repair
== CMSSW:
== CMSSW: ----- End Fatal Exception -------------------------------------------------

@makortel
Copy link
Contributor Author

here's one such job:
https://cmsweb.cern.ch:8443/scheddmon/0119/jgomespi/231205_134936:jgomespi_crab_DYJetsToLL_M-50_postVFP/job_out.808.16.txt

Thanks. This job used CMSSW_10_6_26. The context line 0 is added in

try {
//Run and Lumi only have 1 entry number, which is index 0
tree_.getEntry(br, tree_.entryNumberForIndex(tree_.branchType() == InEvent ? ep->transitionIndex() : 0));
} catch (edm::Exception& exception) {
exception.addContext("Rethrowing an exception that happened on a different thread.");

The exception handling in this class was changed in #36743, that was merged in 12_3_X.

@belforte I'm becoming more interested in various examples in either way (correct vs. incorrect exit code).

@wddgit
Copy link
Contributor

wddgit commented Dec 14, 2023

This looks suspicious. I have not verified this or looked carefully yet but this looks suspicious like it might slice. I just noticed it. Possibly unrelated...

https://cmssdt.cern.ch/dxr/CMSSW/source/FWCore/Framework/src/ProductResolvers.cc#302

I'll continue working on this.

@wddgit
Copy link
Contributor

wddgit commented Dec 14, 2023

Only one match for this:

Singularity> git grep --cached "reading from source"
FWCore/Framework/src/ProductResolvers.cc:      e.addContext(std::string("While reading from source ") + bd.className() + " " + bd.moduleLabel() + " '" +

And this points to ProductResolvers.cc in extendException. There are four places where that is used. One is line 302 and there are 3 other similar spots in the same file. I have not succeeded in reproducing this or that proving my fix gets rid of the problem, but the fix is obvious. I could just submit a PR with the fix and then if the problem does not go away we revisit this.

Change:

throw extendException(e, branchDescription(), mcc);

to:

extendException(e, branchDescription(), mcc);
throw;

@makortel
Copy link
Contributor Author

Good catch! The problem is that e.g. in

try {
  ...
} catch(cms::Exception& e) {
  throw e;
}

the e is sliced to cms::Exception, while plain throw; would re-throw the actual object, right?

This is how I interpret https://en.cppreference.com/w/cpp/language/throw

First, copy-initializes the exception object from expression

vs

Rethrows the currently handled exception.

@wddgit
Copy link
Contributor

wddgit commented Dec 14, 2023

For the moment, I'll continue to try to figure out how to reproduce the problem. Simply throwing at the location you suggested yields the correct error code so there is more to it than that...

@wddgit
Copy link
Contributor

wddgit commented Dec 14, 2023

Yes to your comment.

@makortel
Copy link
Contributor Author

The extendException() function, and throwing its return value, was added in #36739, that was merged in 12_3_X. This is consistent with 10_6_X job giving the correct exit code, but 12_4_X and 13_0_X jobs not (although the sample size is way too small to draw real conclusions from there).

@wddgit
Copy link
Contributor

wddgit commented Dec 14, 2023

I reproduced the error and verified the fix works. At first I was throwing on every getEntry, but the first one is for EventAuxiliary which is not a delayed read. If you wait for some product read by a module then the problem reproduces and the fix makes it go away. I submit a PR soon.

@makortel
Copy link
Contributor Author

I'm becoming more interested in various examples in either way (correct vs. incorrect exit code).

@belforte I think I won't need more examples anymore.

@wddgit
Copy link
Contributor

wddgit commented Dec 15, 2023

Just submitted #43579 which fixes the problem in 14_0_X and adds a unit test.

@makortel
Copy link
Contributor Author

@wddgit The master PR was merged. Could you look into backporting it to the release cycles listed in #42179 (comment) ?

@wddgit
Copy link
Contributor

wddgit commented Dec 18, 2023

I'll start on that this afternoon.

@wddgit
Copy link
Contributor

wddgit commented Dec 19, 2023

All the backport pull requests are submitted now.

@makortel
Copy link
Contributor Author

+core

All backport PRs have been merged

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants