Probably NaN in conv_water.F90 with ERS_D_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_gnu.allactive-nlmaps #6179

Closed
ndkeen opened this issue Jan 25, 2024 · 6 comments · Fixed by #6858

Comments

Contributor

ndkeen commented Jan 25, 2024

Using master as of Jan 23, this test fails with the error below:
ERS_D_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_gnu.allactive-nlmaps
The older variant fails as well: ERS_D_Ld3.ne30pg2_r05_EC30to60E2r2.WCYCL1850.pm-cpu_gnu.allactive-nlmaps

256: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
256:
256: Backtrace for this error:
256: #0  0x148089453dbf in ???
256: #1  0x6ac0b09 in __shr_infnan_mod_MOD_shr_infnan_isnan_double
256:    at /global/cfs/cdirs/e3sm/ndk/repos/me37-jan23/share/util/shr_infnan_mod.F90.in:235
256: #2  0x1fc2bdf in __conv_water_MOD_conv_water_4rad
256:    at /global/cfs/cdirs/e3sm/ndk/repos/me37-jan23/components/eam/src/physics/cam/conv_water.F90:370
256: #3  0x1f7e3db in __cloud_diagnostics_MOD_cloud_diagnostics_calc
256:    at /global/cfs/cdirs/e3sm/ndk/repos/me37-jan23/components/eam/src/physics/cam/cloud_diagnostics.F90:371
256: #4  0x149e051 in tphysbc
256:    at /global/cfs/cdirs/e3sm/ndk/repos/me37-jan23/components/eam/src/physics/cam/physpkg.F90:3046
256: #5  0x14b76d5 in __physpkg_MOD_phys_run1
256:    at /global/cfs/cdirs/e3sm/ndk/repos/me37-jan23/components/eam/src/physics/cam/physpkg.F90:1175
256: #6  0x653d00 in __cam_comp_MOD_cam_run1
256:    at /global/cfs/cdirs/e3sm/ndk/repos/me37-jan23/components/eam/src/control/cam_comp.F90:268
256: #7  0x643747 in __atm_comp_mct_MOD_atm_init_mct
256:    at /global/cfs/cdirs/e3sm/ndk/repos/me37-jan23/components/eam/src/cpl/atm_comp_mct.F90:523
256: #8  0x48d2a9 in __component_mod_MOD_component_init_cc
256:    at /global/cfs/cdirs/e3sm/ndk/repos/me37-jan23/driver-mct/main/component_mod.F90:248
256: #9  0x47dff3 in __cime_comp_mod_MOD_cime_init
256:    at /global/cfs/cdirs/e3sm/ndk/repos/me37-jan23/driver-mct/main/cime_comp_mod.F90:2331
256: #10  0x485d4c in cime_driver
256:    at /global/cfs/cdirs/e3sm/ndk/repos/me37-jan23/driver-mct/main/cime_driver.F90:122

Also fails with DEBUG and intel

ndkeen changed the title from "Probably NaN in ERS_D_Ld3.ne30pg2_r05_EC30to60E2r2.WCYCL1850.pm-cpu_gnu.allactive-nlmaps" to "Probably NaN in conv_water.F90 with ERS_D_Ld3.ne30pg2_r05_EC30to60E2r2.WCYCL1850.pm-cpu_gnu.allactive-nlmaps" on Jan 25, 2024
Contributor Author

ndkeen commented Feb 27, 2024

Same issue with master of Feb 26th, even using the updated resolution:
ERS_D_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_gnu.allactive-nlmaps

Tested again with master of April 22. Same issue as above.

ndkeen changed the title from "Probably NaN in conv_water.F90 with ERS_D_Ld3.ne30pg2_r05_EC30to60E2r2.WCYCL1850.pm-cpu_gnu.allactive-nlmaps" to "Probably NaN in conv_water.F90 with ERS_D_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_gnu.allactive-nlmaps" on Mar 1, 2024
Contributor Author

ndkeen commented May 5, 2024

Same issue with master of May 4th

Contributor Author

ndkeen commented Oct 17, 2024

With master of Oct 16th, I still see the issue. However, with my proposed compiler version upgrade (to gcc 12.3; PR coming), the test passes. Hmm, maybe I was wrong about that...

Contributor Author

ndkeen commented Oct 31, 2024

With a checkout of Oct 28th, which includes the compiler version updates, I tried ERS_D_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_gnu.allactive-nlmaps and I still see the same SIGFPE as noted above.

So that's with gcc-native/12.3

Contributor Author

ndkeen commented Dec 16, 2024

I see the same error with a Dec 5th checkout, this time using a test without the testmod. The failure is on the second run of ERS:
ERS_D_P512x1.ne30pg2_r05_IcoswISC30E3r5.F2010.muller-cpu_gnu

Note that ERS_D_P512x1.ne30pg2_r05_IcoswISC30E3r5.F2010.muller-cpu_intel passes, as it did before. That is, I am only seeing this error with GNU DEBUG so far.

Contributor Author

ndkeen commented Dec 18, 2024

With #6858, Walter may have found the issue here: an array was not being initialized before use. The fix required adding:

call pbuf_set_field(pbuf2d, fice_idx, 0.0_r8)
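
As a rough illustration of the failure mode described above (a field being inspected before anything has written to it), here is a minimal standalone Fortran sketch. It is not E3SM code: the program name, the values chosen for `pcols` and `pver`, and the plain allocatable array standing in for the "fice" physics-buffer field are all assumptions made for this example. The whole-array assignment only plays the role that `pbuf_set_field` plays in the actual fix.

```fortran
! Standalone sketch, NOT E3SM code: a plain array stands in for the
! "fice" physics-buffer field, and the sizes below are hypothetical.
program fice_init_sketch
   use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
   implicit none

   integer, parameter :: r8 = selected_real_kind(12)
   integer, parameter :: pcols = 4, pver = 3   ! hypothetical dimensions
   real(r8), allocatable :: fice(:,:)

   allocate(fice(pcols, pver))

   ! Give every element a defined value before anything reads it.
   ! This stands in for: call pbuf_set_field(pbuf2d, fice_idx, 0.0_r8)
   fice = 0.0_r8

   ! A NaN check, loosely analogous to the shr_infnan_isnan call in the
   ! backtrace, is now operating on defined data.
   if (any(ieee_is_nan(fice))) then
      print *, 'fice contains NaN'
   else
      print *, 'fice is initialized and NaN-free'
   end if

   deallocate(fice)
end program fice_init_sketch
```

Without the assignment, the contents of the array are undefined, so the NaN check may see arbitrary bit patterns. In a build that traps floating-point exceptions, that kind of access can plausibly end in a SIGFPE rather than a clean true or false, which would be consistent with the failure showing up only in some compiler and DEBUG combinations.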

singhbalwinder added a commit that referenced this issue Feb 7, 2025
Adds fice initialization to conv_water_init

Add initialization of the "fice" pbuf variable within conv_water_init
to avoid debug mode failure when shr_infnan_isnan is called.

Fixes #6179.

[BFB]

* whannah/atm/fix_conv_water_init:
  add fice initialization to conv_water_init