StTpcHitMaker::maxHits array over-written #690
A scan through FastOffline logs reveals that the issue was seen in 11 files over the past 2 weeks (older log files have been erased), one of which is st_physics_adc_25169006_raw_0900009.

DEV has not been modified since June 19th (day 171), which doesn't match either the single occurrence on day 169 or the increased rate of occurrences beginning on day 174.

-Gene
I see 3 more files to add to the list today, bringing my total to 14.
For 8 of the files, I grabbed the DAQ files and processed them to generate the following table...
The table lists the file, the event-in-file, the sector, the row, and the largest time bin (TB) of the DAQ TPC pixel (ADC) data, for cases where TB exceeds 511 (event, sector, and row all start at 1, not 0).
(NB: inner sectors normally end at TB 420, and outer sectors at TB 414)

| File | Event in file | Sector | Row | Largest TB |
| --- | --- | --- | --- | --- |
| st_physics_adc_25175022_raw_2300009 | 130 | 22 | 20 | 559 |
| st_physics_adc_25175073_raw_0500010 | 248 | 7 | 37 | 544 |
| st_physics_adc_25175073_raw_1600009 | 242 | 7 | 40 | 574 |
| st_physics_adc_25176012_raw_1600009 | 9 | 9 | 27 | 524 |
| st_physics_adc_25176014_raw_1500009 | 37 | 14 | 11 | 523 |
| st_physics_adc_25176048_raw_0000008 | 1390 | 24 | 29 | 548 |
| st_physics_adc_25177010_raw_0800009 | 45 | 3 | 1 | 541 |
| st_physics_adc_25177010_raw_1100008 | 97 | 3 | 2 | 523 |
(NB: these are all iTPC rows <= 40)
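For context, a minimal sketch of the kind of scan that could produce such a table, assuming the pixel data have already been unpacked into plain (sector, row, pad, tb, adc) tuples; the struct and loop below are illustrative and are not the actual STAR DAQ reader API:

```cpp
#include <cstdio>
#include <vector>

// Illustrative stand-in for one decoded TPC pixel; the real data come from
// the STAR DAQ reader, which is not reproduced here.
struct TpcPixel {
    int sector;  // 1..24
    int row;     // iTPC rows are <= 40
    int pad;
    int tb;      // time bin, expected 0..511
    int adc;
};

// Report the largest out-of-range time bin (TB > 511) seen in one event.
void reportOutOfRangeTb(int eventInFile, const std::vector<TpcPixel>& pixels) {
    const TpcPixel* worst = nullptr;
    for (const TpcPixel& px : pixels) {
        if (px.tb > 511 && (!worst || px.tb > worst->tb)) worst = &px;
    }
    if (worst) {
        std::printf("event %d  sector %d  row %d  max TB %d\n",
                    eventInFile, worst->sector, worst->row, worst->tb);
    }
}
```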
There is a pair in run 25175073, sector 7, that may come from the same FEE; I'm not sure, since they are in rows 37 and 40.
There is another pair in run 25177010, sector 3, from padrows 1 and 2, perhaps also from the same FEE.
Otherwise, there isn't an obvious pattern to the sectors and rows.
But these pairings provide a second bit of non-definitive evidence that this is something _in_ the TPC pixel data. The other such evidence is the 100% reproducibility (we saw reproducibility below 20% when dealing with an uninitialized variable in the software a few weeks ago).
-Gene
It is important to understand that it is reproducible.
It would be useful to understand:
-- to see if it comes from a similar FEE, we need to see the pad # too.
-- is this largest timebin isolated, i.e. is the previous tb much earlier, or are they consecutive? I am thinking bit error (a sketch of such a check follows below).
-- we should involve Tonko to understand whether his code expects 0 <= tb <= 511, and whether there is any special handling for values outside that range.
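A minimal sketch of how the isolated-versus-consecutive question could be checked, assuming the time bins that fired on the offending pad are available as a sorted list (hypothetical helper, not existing STAR code):

```cpp
#include <algorithm>
#include <vector>

// Given the sorted time bins that fired on one pad, decide whether a
// suspicious time bin (tb > 511) is isolated or part of a consecutive run.
// Returns true if the previously fired time bin on that pad is tb - 1.
bool isConsecutiveWithPrevious(const std::vector<int>& firedTbs, int suspectTb) {
    auto it = std::lower_bound(firedTbs.begin(), firedTbs.end(), suspectTb);
    if (it == firedTbs.begin()) return false;  // nothing fired earlier on this pad
    return *(it - 1) == suspectTb - 1;
}
```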
Should be fixed. It's committed to CVS.
What happens is that an occasional ASIC channel gives nonsense.
I normally check for it but I guess I didn't protect it well enough
when running in Offline.
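A minimal sketch of the kind of guard being described, assuming the decoded time bin is available as an `int` in the decoding loop; the names and the commented usage are hypothetical, not the actual DAQ_TPC23 code:

```cpp
// Hypothetical guard at DAQ decode time: an occasional ASIC channel can emit
// nonsense, so reject any decoded time bin outside the valid 0..511 window
// before handing the pixel to downstream (offline) code.
constexpr int kMaxValidTb = 511;

inline bool timebinIsSane(int tb) {
    return tb >= 0 && tb <= kMaxValidTb;
}

// In the decoding loop (illustrative only):
//   if (!timebinIsSane(tb)) continue;   // drop the bad channel's data
```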
Jeff has included Tonko's fixes for this issue in PR #688, so I expect this issue will be solved when that PR is merged.
Are you suggesting merging #688 now? Is Akio fine with reverting his previous modification (https://github.com/star-bnl/star-sw/pull/688/files#r1651407708)? Is the only change needed to fix this issue the one in StRoot/RTS/src/DAQ_TPC23/itpc23.cxx in this commit (314f90a)?
No... I think Akio's concern is appropriate.
I think we should move forward with all of the DAQ_TPC23 changes in that PR.
We have memory corruption when running the TPC offline cluster finder on the 130th event (eventID=2663503) of this file:
/star/data03/daq/2024/175/25175022/st_physics_adc_25175022_raw_2300009.daq
The issue is reproducible in DEV by executing:
The issue shows up in the log file beginning with this line:
This message is written because of something that happens at line 1137 of StTpcHitMaker (star-sw/StRoot/StTpcHitMaker/StTpcHitMaker.cxx, line 1137 at commit 9811528), which is...
...where we write into the `IDTs` array past its end. It is sized for 512 unsigned shorts, but in this event we are seeing `tb` surpass 512 when reading through the DAQ data for TPX sector 22. The value of `tb` is assigned just a few lines earlier from DAQ data, and in StTpcHitMaker.h the relevant data members are declared consecutively. So as `tb` reaches 512, the assignment first overwrites `fId`, and at 513 it begins to overwrite `maxHits`, turning it into negative values, which then trigger the above "Too many hits" message.
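To make the failure mode concrete, here is a minimal sketch, not the actual StTpcHitMaker code, of a 512-slot array followed by other members that get clobbered when the time-bin index runs past the end, plus the bounds check that would prevent it (the struct and the function name `storeIdTruth` are hypothetical; exact overwrite offsets depend on the real member types and padding):

```cpp
#include <cstdio>

// Simplified stand-in for the relevant StTpcHitMaker members: a 512-slot
// array indexed by time bin, with further members declared after it.
struct HitScratch {
    unsigned short IDTs[512];  // valid indices are 0..511
    unsigned short fId;        // per the issue, clobbered once tb reaches 512
    int            maxHits;    // clobbered next, which can make it negative
};

// Defensive write: refuse out-of-range time bins instead of indexing blindly.
bool storeIdTruth(HitScratch& s, int tb, unsigned short idTruth) {
    if (tb < 0 || tb >= 512) {
        std::fprintf(stderr, "tb=%d outside 0..511 -- skipping pixel\n", tb);
        return false;
    }
    s.IDTs[tb] = idTruth;
    return true;
}
```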
I ran the job in valgrind and found another data member in the DAQ TPC code introduced last year ("DAQ5k") that is evaluated before it is initialized (see PR #688 for the previous one): star-sw/StRoot/RTS/src/DAQ_TPC23/tpc23_base.h, line 167 at commit 9811528, and star-sw/StRoot/RTS/src/DAQ_TPC23/tpc23_base.cxx, line 86 at commit 9811528. However, testing that fix locally did not resolve the "Too many hits" issue of this ticket, and nothing else in valgrind is obvious to me as a cause.
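For reference, the usual fix for a member read before initialization is a constructor-list (or in-class) initializer; a hedged sketch with a hypothetical member name, since the actual tpc23_base field is not reproduced here:

```cpp
// Hypothetical illustration of the fix pattern for a member that was read
// before being assigned: give it a deterministic default at construction.
class tpc23_like {
public:
    tpc23_like() : use_daq5k_path(0) {}   // constructor-list initialization
private:
    int use_daq5k_path;   // hypothetical name; previously left uninitialized
};
```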
I think it remains an open question whether the `tb` value greater than 511 is an issue within the raw data, or a consequence of some other memory issue within the DAQ5k code. My impression is that we do not want to simply increase the size of the `IDTs` array to resolve this, as `tb` probably should never exceed 511. I will keep trying to understand, but I welcome anyone else taking a look at this too.

I assigned several people to this issue with the intent that they be at the very least observers, if not helpers in resolving it.
-Gene
p.s. Credit to the Offline QA folks (Wenyun Bo and Lanny Ray) for spotting the impacts of this and reporting it.