Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StTpcHitMaker::maxHits array over-written #690

Open
genevb opened this issue Jun 25, 2024 · 7 comments
Open

StTpcHitMaker::maxHits array over-written #690

genevb opened this issue Jun 25, 2024 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@genevb
Copy link
Contributor

genevb commented Jun 25, 2024

We have memory corruption when running the TPC offline cluster finder on the 130th event (eventID=2663503) of this file:
/star/data03/daq/2024/175/25175022/st_physics_adc_25175022_raw_2300009.daq

The issue is reproducible in DEV by executing:

root4star -b -q -l 'bfc.C(130,130,"pp2024a,TpxRaw,TpxClu","star/data03/daq/2024/175/25175022/st_physics_adc_25175022_raw_2300009.daq")'

The issue shows up in the log file beginning with this line:

StTpcHitMaker:ERROR - Too many hits (0) in one sector (22). Skipping event.

This message is written because of something that happens at line 1137 of StTpcHitMaker...


...which is...

    IDTs[tb] = 65535;

...where we write into the IDTs array past its end. It is sized for 512 unsigned shorts, but in this event, we are seeing tb surpass 512 when reading through the DAQ data for TPX sector 22. The value of tb is assigned just a few lines earlier from DAQ data:

for (;iword != DaqDta()->end();++iword) {
  daq_adc_tb &daqadc = (*(daq_adc_tb *)*iword);
  Int_t tb   = daqadc.tb;

And in StTpcHitMaker.h, the relevant data members are...

#ifdef __TFG__VERSION__
  Int_t    IDTs[512];
#else
  UShort_t IDTs[512];
#endif
  UShort_t fId; // current cluster Id
  Int_t    maxHits[24];

So as tb reaches 512, the assignment first overwrites fId, and at 513, it begins to overwrite maxHits, turning it into negative values, which then trigger the above message of "Too many hits".

I ran the job in valgrind, and found another data member in the DAQ TPC code introduced last year ("DAQ5k") that is evaluated before it is initialized (see PR #688 for the previous one)...

int *store_track_id ;

...and
if(store_track_id==0) {

...however, testing that fix locally did not resolve the "Too many hits" issue of this ticket. And nothing else in valgrind is obvious to me as a cause.

I think it remains an open question whether the tb value greater than 511 is an issue within the raw data, or a consequence of some other memory issue within the DAQ5k code. My impression is that we do not want to simply increase the size of the IDTs array to resolve this, as tb probably should never exceed 511. I will keep trying to understand, but I welcome anyone else taking a look at this too.

I assigned several names of people to this issue, with the intent that they be at the very least observers, if not helpers in resolving the issue.

-Gene

p.s. Credit to the Offline QA folks (Wenyun Bo and Lanny Ray) for spotting the impacts of this and reporting it.

@genevb genevb added the bug Something isn't working label Jun 25, 2024
@genevb
Copy link
Contributor Author

genevb commented Jun 25, 2024

A scan through FastOffline logs reveals that the issue was seen in 11 files over the past 2 weeks (older log files have been erased):

st_physics_adc_25169006_raw_0900009
st_physics_adc_25174042_raw_1000008
st_physics_adc_25174042_raw_2300008
st_physics_adc_25175022_raw_0300009
st_physics_adc_25175022_raw_2300009
st_physics_adc_25175027_raw_2100011
st_physics_adc_25175070_raw_1400006
st_physics_adc_25175073_raw_0500010
st_physics_adc_25175073_raw_1600009
st_physics_adc_25176012_raw_1600009
st_physics_adc_25176014_raw_1500009

DEV has not been modified since June 19th (day 171), which doesn't match with either the single occurrence on day 169, nor the increased rate of occurrences beginning on day 174.

-Gene

@genevb
Copy link
Contributor Author

genevb commented Jun 25, 2024 via email

@fvidebaek
Copy link
Contributor

fvidebaek commented Jun 25, 2024 via email

@tonko-lj
Copy link
Contributor

tonko-lj commented Jun 26, 2024 via email

@genevb
Copy link
Contributor Author

genevb commented Jun 27, 2024

Jeff has included Tonko's fixes for this issue in PR #688 , so I expect this issue will be solved when that PR is merged.

@plexoos
Copy link
Member

plexoos commented Jun 27, 2024

Are you suggesting merging #688 now?

Is Akio fine with reverting his previous modification (https://github.com/star-bnl/star-sw/pull/688/files#r1651407708)?

Is the only change needed to fix this issue the one in StRoot/RTS/src/DAQ_TPC23/itpc23.cxx in this commit (314f90a)?

@genevb
Copy link
Contributor Author

genevb commented Jun 27, 2024

Are you suggesting merging #688 now?

No....I think Akio's concern is appropriate.

Is the only change needed to fix this issue the one in StRoot/RTS/src/DAQ_TPC23/itpc23.cxx in this commit

I think we should move forward with all of the DAQ_TPC23 changes in that PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants