Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNPhenology (and other?) unit tests not being run #3015

Open
samsrabin opened this issue Mar 15, 2025 · 6 comments · May be fixed by #3017
Open

CNPhenology (and other?) unit tests not being run #3015

samsrabin opened this issue Mar 15, 2025 · 6 comments · May be fixed by #3017
Labels
bug something is working incorrectly priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations testing additions or changes to tests

Comments

@samsrabin
Copy link
Collaborator

samsrabin commented Mar 15, 2025

Brief summary of bug

I noticed that I couldn't get tests in test_CNPhenology.pf to fail, even if I did something like @assertTrue(.false.). @ekluzek figured out that this is because it's failing in setUp() at call clm_varcon_init( is_simple_buildtemp = .true.). Specifically, it seems like something is not getting correctly deallocated between tests.

Steps to resolution:

  1. Set up pfunit so that a failure in setUp marks the whole test file as failed.
  2. Fix whatever's wrong in test_CNPhenology.pf's setUp().
  3. Fix any other test files that had been silently failing in setUp().

General bug information

CTSM version you are using: ctsm5.3.031

Does this bug cause significantly incorrect results in the model's science? No, but it might be hiding bugs that would.

Important details of your setup / configuration so we can reproduce the bug

Testing on Derecho.

If I comment out all the tests except check_doonset_normal (the first one), then add @assertTrue( .false. ) at the end of that test, it fails successfully.

If I uncomment the next test (check_doonset_vegdepend_buttemperate), neither test fails.

If I re-comment the latter and then uncomment the last test (test_was_sown_in_this_window_sameday), neither test fails.

If I add a call of endrun() at the end of setUp(), I get the following:

 Start: <test_CNPhenology_suite.check_doonset_normal>
.  ******** CLM Time Manager Configuration ********
   Calendar type:            NO_LEAP
   Timestep size (seconds):          1800
   Start date (yr mon day tod):                1           1           1
           0
   Reference date (yr mon day tod):            1           1           1
           0
   Current step number:                 1
   Current date (yr mon day tod):              1           1           1
        1800
  ************************************************
 ENDRUN:
 Failure in <test_CNPhenology_suite.check_doonset_normal>
F   end: <test_CNPhenology_suite.check_doonset_normal>


 Start: <test_CNPhenology_suite.test_was_sown_in_this_window_sameday>
.  ******** CLM Time Manager Configuration ********
   Calendar type:            NO_LEAP
   Timestep size (seconds):          1800
   Start date (yr mon day tod):                1           1           1
           0
   Reference date (yr mon day tod):            1           1           1
           0
   Current step number:                 1
   Current date (yr mon day tod):              1           1           1
        1800
  ************************************************
forrtl: severe (151): allocatable array is already allocated
Image              PC                Routine            Line        Source
CNPhenology        00000000008113A0  clm_varcon_mp_clm         291  clm_varcon.F90
CNPhenology        00000000004151F1  test_cnphenology_          72  test_CNPhenology.F90
CNPhenology        0000000000A9A44B  pf_testcase_mp_ru         134  TestCase.F90
CNPhenology        0000000000A9A855  pf_testcase_mp_ru         146  TestCase.F90
CNPhenology        0000000000B11B44  pf_testresult_mp_         237  TestResult.F90
CNPhenology        0000000000A9A0E3  pf_testcaserun_mp         121  TestCase.F90
CNPhenology        0000000000A99A9C  pf_testcase_mp_ru         108  TestCase.F90
CNPhenology        0000000000A9F7DA  pf_testsuite_mp_r         108  TestSuite.F90
CNPhenology        0000000000A9F7DA  pf_testsuite_mp_r         108  TestSuite.F90
CNPhenology        0000000000B183FB  pf_testrunner_mp_         139  TestRunner.F90
CNPhenology        0000000000B17528  pf_testrunner_mp_         117  TestRunner.F90
CNPhenology        0000000000AB7201  funit_mp_generic_         118  FUnit.F90
CNPhenology        0000000000AB0C47  funit_mp_run_              33  FUnit.F90
CNPhenology        0000000000ABA1D4  funit_main_                16  funit_main.F90
CNPhenology        000000000041B119  MAIN__                     82  CNPhenology_driver.F90
CNPhenology        000000000041497D  Unknown               Unknown  Unknown
libc-2.31.so       00007FF51417C29D  __libc_start_main     Unknown  Unknown
CNPhenology        00000000004148AA  Unknown               Unknown  Unknown

I thought maybe tearDown() wasn't getting called, but when I paste the contents of tearDown() at the end of the first test, it doesn't make a difference.

@samsrabin samsrabin added bug something is working incorrectly priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations testing additions or changes to tests next this should get some attention in the next week or two. Normally each Thursday SE meeting. labels Mar 15, 2025
@samsrabin
Copy link
Collaborator Author

Commenting out lines in various CMakeLists.txts files so that only the CNPhenology tests are run, then manually calling cmake like so:

cd src/unit_tests.temp/__command_line_test__/__command_line_test__
ctest -VV --output-on-failure

gives:

UpdateCTestConfiguration  from :/glade/work/samrabin/ctsm_fix-cnphenology-unittests/src/unit_tests.temp/__command_line_test__/__command_line_test__/DartConfiguration.tcl
UpdateCTestConfiguration  from :/glade/work/samrabin/ctsm_fix-cnphenology-unittests/src/unit_tests.temp/__command_line_test__/__command_line_test__/DartConfiguration.tcl
Test project /glade/work/samrabin/ctsm_fix-cnphenology-unittests/src/unit_tests.temp/__command_line_test__/__command_line_test__
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 1
    Start 1: CNPhenology

1: Test command: /glade/work/samrabin/ctsm_fix-cnphenology-unittests/src/unit_tests.temp/__command_line_test__/__command_line_test__/clm_biogeochem_test/CNPhenology_test/CNPhenology "--verbose"
1: Working Directory: /glade/work/samrabin/ctsm_fix-cnphenology-unittests/src/unit_tests.temp/__command_line_test__/__command_line_test__/clm_biogeochem_test/CNPhenology_test
1: Test timeout computed to be: 10000000
1:
1:
1:  Start: <test_CNPhenology_suite.check_doonset_normal>
1: .  ******** CLM Time Manager Configuration ********
1:    Calendar type:            NO_LEAP
1:    Timestep size (seconds):          1800
1:    Start date (yr mon day tod):                1           1           1
1:            0
1:    Reference date (yr mon day tod):            1           1           1
1:            0
1:    Current step number:                 1
1:    Current date (yr mon day tod):              1           1           1
1:         1800
1:   ************************************************
1:    end: <test_CNPhenology_suite.check_doonset_normal>
1:
1:
1:  Start: <test_CNPhenology_suite.check_doonset_vegdepend_buttemperate>
1:  ESMF_LogWrite(): ESMF_Log not open -- cannot ESMF_LogWrite().  Log message =
1:  Object being used after deletion  - Bad Object
1:  ESMF_LogWrite(): ESMF_Log not open -- cannot ESMF_LogWrite().  Log message =
1:  Object being used after deletion  - Internal subroutine call returned Error
1: Error in ESMF_Initialize
1: .
1/1 Test #1: CNPhenology ......................   Passed    0.05 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   0.06 sec

with exit code 0. So something seems to be wrong with the building of the tests.

@samsrabin
Copy link
Collaborator Author

If I do src/unit_tests.temp/__command_line_test__/__command_line_test__/clm_biogeochem_test/CNPhenology_test/CNPhenology, it seems to fail:

.  ******** CLM Time Manager Configuration ********
   Calendar type:            NO_LEAP
   Timestep size (seconds):          1800
   Start date (yr mon day tod):                1           1           1
           0
   Reference date (yr mon day tod):            1           1           1
           0
   Current step number:                 1
   Current date (yr mon day tod):              1           1           1
        1800
  ************************************************
. ESMF_LogWrite(): ESMF_Log not open -- cannot ESMF_LogWrite().  Log message =
 Object being used after deletion  - Bad Object
 ESMF_LogWrite(): ESMF_Log not open -- cannot ESMF_LogWrite().  Log message =
 Object being used after deletion  - Internal subroutine call returned Error
Error in ESMF_Initialize

but it returns exit code 0.

A different such executable—src/__command_line_test__/__command_line_test__/clm_biogeochem_test/CNVegComputeSeed_test/CNVegComputeSeed—passes:

.............
Time:         0.002 seconds

 OK
 (13 tests)

@billsacks
Copy link
Member

I'm looking at this....

@billsacks
Copy link
Member

Okay, I think I partially understand why this is failing, and have at least a partial solution.

The clue is in the ESMF errors you saw in a few places. By looking through output of various tests (in __command_line_test__/__command_line_test__/Testing/Temporary/LastTest.log), I saw similar errors in some other tests. I noticed that these errors appear when you have multiple tests in the same executable that do an initialization / teardown of the time manager. I tracked this down to having repeated calls of ESMF_Initialize / ESMF_Finalize. As noted in the ESMF documentation, it is an error to re-call ESMF_Initialize after an ESMF_Finalize. I think the history in the CTSM unit test code was that I put in these calls when we were using the "fake" ESMF time manager code, and I think it was fine to re-initialize ESMF with that version of the code. But now that we're using the real ESMF, it no longer works.

I don't understand why this leads to a silent crash of the CTest / PFUnit code, but given that ESMF probably does various low-level things in its initialization and finalization, I guess I'm not too surprised. My feeling is that trying to track down the cause of that might be a long exercise. However, I will reach out to Tom Clune - the lead developer of PFUnit, who also uses ESMF extensively - to see if he has any insight.

There's a simple fix that solves your issue:

diff --git a/src/unit_test_shr/unittestTimeManagerMod.F90 b/src/unit_test_shr/unittestTimeManagerMod.F90
index 8e1cdb3f3..bb2e1beeb 100644
--- a/src/unit_test_shr/unittestTimeManagerMod.F90
+++ b/src/unit_test_shr/unittestTimeManagerMod.F90
@@ -48,7 +48,7 @@ contains
     ! Should be called once for every test that uses the time manager.
     !
     ! !USES:
-    use ESMF, only : ESMF_Initialize, ESMF_SUCCESS
+    use ESMF, only : ESMF_Initialize, ESMF_IsInitialized, ESMF_SUCCESS
     use clm_time_manager, only : set_timemgr_init, timemgr_init, NO_LEAP_C, GREGORIAN_C
     !
     ! !ARGUMENTS:
@@ -59,6 +59,7 @@ contains
     integer :: l_dtime  ! local version of dtime
     logical :: l_use_gregorian_calendar  ! local version of use_gregorian_calendar
     character(len=:), allocatable :: calendar
+    logical :: esmf_is_initialized
     integer :: rc ! return code

     integer, parameter :: dtime_default = 1800  ! time step (seconds)
@@ -89,9 +90,15 @@ contains
        l_use_gregorian_calendar = .false.
     end if

-    call ESMF_Initialize(rc=rc)
+    esmf_is_initialized = ESMF_IsInitialized(rc=rc)
     if (rc /= ESMF_SUCCESS) then
-       stop 'Error in ESMF_Initialize'
+       stop 'Error in ESMF_IsInitialized'
+    end if
+    if (.not. esmf_is_initialized) then
+       call ESMF_Initialize(rc=rc)
+       if (rc /= ESMF_SUCCESS) then
+          stop 'Error in ESMF_Initialize'
+       end if
     end if

     if (l_use_gregorian_calendar) then
@@ -219,10 +226,12 @@ contains

     call timemgr_reset()

-    call ESMF_Finalize(rc=rc)
-    if (rc /= ESMF_SUCCESS) then
-       stop 'Error in ESMF_Finalize'
-    end if
+    ! If this is the end of the executable, we should call
+    ! ESMF_Finalize. But the timemgr setup and teardown routines can
+    ! be called multiple times within a single unit test executable,
+    ! and it's an error to re-call ESMF_Initialize after calling
+    ! ESMF_Finalize. So for now we just won't attempt to do an
+    ! ESMF_Finalize.

   end subroutine unittest_timemgr_teardown

With that diff, errors are reported as expected in CNPhenologyMod (when adding assertTrue(.false.), for example). I'm a little worried that skipping any ESMF_Finalize could cause problems in some circumstances... I'm not sure if there's a way to do an ESMF_Finalize at the end of the whole test executable, which would be ideal... I'll also ask Tom if he has any ideas about that.

However, as you thought, @samsrabin , there are now some failures in other unit test modules that had been going unreported before. I'll look into this, since I think they're tests I wrote.

I'll open a draft CTSM PR with these changes for further discussion / work.

(As a side note: I see that, under some circumstances, a failure in the setUp method does cause a test failure. For example, I commented out all but the first test, and then inserted a second call to clm_varcon_init right after the first and this leads to a failure as expected. I was worried that there might be something generally wrong with pFUnit's catching/reporting of errors in the setup routine, but that doesn't seem to be the case. Maybe you already understood this, but I wasn't clear on this from the various tests you showed and wanted to confirm it.)

@samsrabin
Copy link
Collaborator Author

Oh excellent, thanks for tracking this down @billsacks!

@billsacks
Copy link
Member

billsacks commented Mar 17, 2025

Draft PR here: #3017

I'll leave it as draft until I fix the unit tests that are now failing more visibly than before.

billsacks added a commit to billsacks/ctsm that referenced this issue Mar 18, 2025
ESMF was being repeatedly initialized in unittestTimeManagerMod.F90,
which is an error. This led to unit tests failing silently. This commit
fixes the issue by checking if ESMF is already initialized before
calling ESMF_Initialize, and not calling ESMF_Finalize at the end of
each unit test.

Currently this skips any calls to ESMF_Finalize, as it is an error to
re-call ESMF_Initialize after calling ESMF_Finalize. A better solution
would be to only call ESMF_Initialize and ESMF_Finalize once per unit
test executable. I am looking into whether that's possible. I'm not sure
if it will cause any problems to not do the ESMF_Finalize.

See ESCOMP#3015 (comment)
for details.

Resolves ESCOMP#3015
@ekluzek ekluzek removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Mar 20, 2025
ekluzek added a commit to ekluzek/CTSM that referenced this issue Mar 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something is working incorrectly priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations testing additions or changes to tests
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants