Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid repeated ESMF initialization in unittestTimeManagerMod.F90 #3017

Open
wants to merge 7 commits into
base: b4b-dev
Choose a base branch
from

Conversation

billsacks
Copy link
Member

@billsacks billsacks commented Mar 17, 2025

Description of changes

ESMF was being repeatedly initialized in unittestTimeManagerMod.F90, which is an error. This led to unit tests failing silently. This PR fixes the issue by checking if ESMF is already initialized before calling ESMF_Initialize, and not calling ESMF_Finalize at the end of each unit test. Instead, this adds code to finalize ESMF at the end of a given pFUnit executable. (See #3015 (comment) for some additional details.)

This also fixes some time-related unit tests that were failing with the real ESMF library; these failures were hidden/silent before the above fix.

Specific notes

Contributors other than yourself, if any: @samsrabin

CTSM Issues Fixed (include github issue)

Are answers expected to change (and if so in what way)? No

Any User Interface Changes (namelist or namelist defaults changes)? No

Does this create a need to change or add documentation? Did you do so? No

Testing performed, if any:

  • Did some runs of the unit tests on derecho

@billsacks billsacks requested a review from samsrabin March 17, 2025 20:53
@billsacks
Copy link
Member Author

I'm marking this as a draft because now there are some failing unit tests. I'll work on those....

@ekluzek ekluzek added testing additions or changes to tests blocker another issue/PR depends on this one bfb bit-for-bit devops Development Operations to improve development throughput, E.g., adding GitHub Workflows labels Mar 17, 2025
@samsrabin samsrabin self-assigned this Mar 17, 2025
@ekluzek
Copy link
Collaborator

ekluzek commented Mar 17, 2025

@billsacks I think we should rebase this to go onto b4b-dev. I think this also might be holding up work that @samsrabin wants to do, so it might be good to get this in as quickly as can be done. Although you are right we need to get all the tests to work. Do you have time to work on this? And/or do you need some help here?

From looking at PFUnit documentation and code, I didn't see a way, to do only one Initialize/Finalize per test suite. The method you have here is fine for ESMF_Initialize. The finalize would be the hard part as the teardown would somehow need to know that this is the last test.

Note, I add some ESMF_Initialization in my PR: #2844. I just have an ESMF_Initialize there, which does happen multiple times. So I'll add the same check you have here so it's only done once. Thanks for figuring this out.

@samsrabin
Copy link
Collaborator

Oh I don't think this is holding up my work. The commit Bill has on this branch is all I need for test_CNphenology.pf to start working again!

@ekluzek
Copy link
Collaborator

ekluzek commented Mar 17, 2025

Ahh awesome. Glad to hear it.

@samsrabin samsrabin changed the base branch from master to b4b-dev March 17, 2025 23:53
@samsrabin
Copy link
Collaborator

samsrabin commented Mar 17, 2025

I've retargeted this so that it'll merge to b4b-dev. Bill, could you rebase it onto the tip of b4b-dev and then force-push?

ESMF was being repeatedly initialized in unittestTimeManagerMod.F90,
which is an error. This led to unit tests failing silently. This commit
fixes the issue by checking if ESMF is already initialized before
calling ESMF_Initialize, and not calling ESMF_Finalize at the end of
each unit test.

Currently this skips any calls to ESMF_Finalize, as it is an error to
re-call ESMF_Initialize after calling ESMF_Finalize. A better solution
would be to only call ESMF_Initialize and ESMF_Finalize once per unit
test executable. I am looking into whether that's possible. I'm not sure
if it will cause any problems to not do the ESMF_Finalize.

See ESCOMP#3015 (comment)
for details.

Resolves ESCOMP#3015
@billsacks billsacks force-pushed the fix_pfunit_multiple_esmf_inits branch from 50c4eef to 16b7ed5 Compare March 18, 2025 00:08
@billsacks
Copy link
Member Author

I just rebased and force pushed.

I'm going to look at the failing unit tests tonight and see if I can fix them... I'll keep you posted.

@samsrabin samsrabin added bug something is working incorrectly and removed blocker another issue/PR depends on this one labels Mar 18, 2025
@billsacks
Copy link
Member Author

Sorry for the delay on this: I got caught up in trying to get the unit tests working again on my Mac (leading to #3023) and then some other troubleshooting this week on an unrelated issue. But I'm getting back to this now.

@billsacks
Copy link
Member Author

Here is information from Tom Clune with guidance on how to do initialization or finalization once per executable. I'll try to put something like this in place for ESMF Finalization. (I think our current mechanism for ESMF Initialization is sufficient, and in our case, probably better than doing it this way, since the way we do it now, we're only doing ESMF Initialization for tests that really need it, and it seems like that might be harder to do with an approach like the one Tom is using. But this approach seems good for ESMF finalization; I'll just make the finalization call conditional on whether ESMF has been initialized.)

We’ll probably have to iterate a bit on this as more details are exposed. Usually I use pFUnit within a CTest using the macro add_pfunit_ctest. An example from MAPL:

add_pfunit_ctest(MAPL.base.tests
                TEST_SOURCES ${TEST_SRCS}
                OTHER_SOURCES ${SRCS}
#                LINK_LIBRARIES MAPL.base MAPL.shared MAPL.pfio base_extras MAPL.pfunit
                LINK_LIBRARIES MAPL.base MAPL.shared MAPL.constants MAPL.generic MAPL.pfio MAPL.pfunit
                EXTRA_INITIALIZE Initialize
                EXTRA_USE MAPL_pFUnit_Initialize
                MAX_PES 8
                )

Note the EXTRA_INITIALIZE and EXTRA_USE options. (There are corresponding finalize ones.) That module looks like this:

module MAPL_pFUnit_Initialize
contains
   subroutine Initialize()
      use ESMF
      use MAPL_ThrowMod, only: MAPL_set_throw_method
      use MAPL_pFUnit_ThrowMod
      use pflogger, only: pfl_initialize => initialize
      use udunits2f, only: UDUNITS_Initialize => Initialize
  
      call ESMF_Initialize(logKindFlag=ESMF_LOGKIND_MULTI,defaultCalKind=ESMF_CALKIND_GREGORIAN)
      call MAPL_set_throw_method(throw)
      call pfl_initialize()
      call UDUNITS_Initialize()
 
   end subroutine Initialize
end module MAPL_pFUnit_Initialize

You see that we initialize several non-reentrant layers in there. Apparently none really require a finalize to exit cleanly.

The bit I’m a bit foggy about just now is that there is FUnit (serial) and pFUnit (parallel via MPI). The latter is a simple extension of the former, but I’m not certain how the driver exactly controls which logical path is used. I had thought the MAX_PES option above triggered the use of pFUnit’s main initialization but the generated code indicates otherwise. If it is an issue let me know. I know it can work both ways as I have other non-ESMF layers that do MPI testing without any special initialization.

And he added in a follow-up:

Basically, there will be an analogous FINALIZE option on the cmake macro for each INITIALIZE option. We generally have not needed that in MAPL2 tests, though I’m not sure why. MPI usually complains without a clean finalize and I could not immediately spot the chain of calls that triggers that automatically in our cases.

Without this change, the unit tests of clm_time_manager were aborting
due to trying to destroy an ESMF clock that hadn't been initialized.
Other pieces of this reset routine don't appear to be causing problems,
but it seems most robust to just skip the whole thing if the time
manager hasn't been initialized.
@billsacks
Copy link
Member Author

Down to one failure, in a dynTimeInfo test... working on this... this one seems like it might be more problematic....

@billsacks
Copy link
Member Author

This last issue is an interesting one, and one where the now-failing unit test is picking up a possibly problematic difference in behavior between the esmf_wrf_timemgr code we used to use and the real ESMF time manager code. In particular, there's a difference in the meaning of prevTime in these two: in the esmf_wrf_timemgr code, this was defined as currTime - timeStep, and that seems to be the definition we assume in the clm time manager. However, in the real ESMF time manager code, prevTime is based on storing the previous value of currTime. Generally, I think these would only differ if you're using a variable time step size (which I imagine is the reason for the implementation in the actual ESMF library). But I want to do a bit of digging to see if this might be causing problems in initialization or the first time step. (And it will require some tweaks to the unit test that relies on this.)

@billsacks
Copy link
Member Author

I want to do a bit of digging to see if this might be causing problems in initialization or the first time step.

As far as I can tell, there aren't any issues in the production code due to this change, but I'm tearing my hair out trying to get all of the unit tests working with this behavior change. I'll keep working on it.... Also, I do think we should make some behaviors more consistent than they are now to avoid problems down the road... I'll follow up on that with a separate issue.

There is a behavior difference between the real ESMF library and the
ESMF WRF Timemgr we had been using previously, in how prevTime is
defined: the ESMF WRF Timemgr defined prevTime using a -dtime offset
from currTime; the real ESMF library instead defines prevTime by
tracking the previous value of currTime. Therefore, now that we are
using the real ESMF library, if a unit test relies on prevTime (as is
the case, for example, with test_year_position_start_of_timestep in
test_dynTimeInfo.pf), it is now important that any setting of the time
be done in a way such that there is a call to ESMF_ClockAdvance, since
that is the only way that prevTime will get set correctly. Thus, this
commit updates the implementation of for_test_set_curr_date so that it
involves a call to ESMF_ClockAdvance.

Making this change in the implementation of for_test_set_curr_date also
required some other changes in some unit test code:
- I needed to remove the call to unittest_timemgr_set_curr_date at the
  end of unittest_timemgr_setup. This call now leads to an increment of
  the time step count, which was causing failures in some time manager
  tests that check the time step count. And it turns out that (at least
  with the other changes here), the call to
  unittest_timemgr_set_curr_date was unnecessary: it wasn't actually
  leading to any change in time from what was set in the initialization
  of the time manager.
- I needed to change some set_date calls to use a year of 2 instead of
  1. This is needed for calls setting the date to January 1 and the time
  to 0: It is no longer allowed to set the date to yr,mon,day,tod =
  1,1,1,0, because the new method sets the date/time to one time step
  earlier and then advances the clock, and it is illegal to set the
  date/time to one time step before 1,1,1,0.
@billsacks
Copy link
Member Author

Okay, I have fixed the time-related unit tests in the commit I just pushed. I have also updated the top-level PR description.

I may make a few other small changes, then will take this out of draft.

'stop' calls were leading to pFUnit passes instead of fails. Need to
either have shr_sys_abort (which raises a pFUnit exception) or stop with
a non-zero integer return value (which causes the program to stop with
an error code).
@billsacks billsacks marked this pull request as ready for review March 22, 2025 16:32
@billsacks
Copy link
Member Author

Okay, this is ready for review now.

@samsrabin - I noticed that there had been some calls to 'stop' (possibly with an error message); these were leading to premature exits followed by pass results from unit tests. I fixed some of these in the most recent commit. I see at least one more - in shr_mpi_abort in src/unit_test_stubs/csm_share/shr_mpi_mod_stub.F90. I'm not sure what the right thing is to do there, because I think that may be called in cases where we want a 0 (success) exit status, but it's possible that it's also called in cases where we want a non-zero exit status. Maybe we want to use the rcode here? If I get time, I may look into this a little more... it's possible this was responsible for the issue you originally ran into.

@billsacks
Copy link
Member Author

I see at least one more - in shr_mpi_abort in src/unit_test_stubs/csm_share/shr_mpi_mod_stub.F90. I'm not sure what the right thing is to do there, because I think that may be called in cases where we want a 0 (success) exit status, but it's possible that it's also called in cases where we want a non-zero exit status. Maybe we want to use the rcode here? If I get time, I may look into this a little more... it's possible this was responsible for the issue you originally ran into.

I spent a few more minutes looking at this, and I think this is fine: I don't see any code paths that would lead to calling shr_mpi_abort with an error condition in unit testing (since, in unit testing, we use stub replacements for both shr_mpi_mod and shr_abort_mod, which are the two main sources of calls to shr_mpi_abort). So, while I'm not totally sure about this, my thinking is that this is okay.

@billsacks
Copy link
Member Author

Also, I do think we should make some behaviors more consistent than they are now to avoid problems down the road... I'll follow up on that with a separate issue.

See #3026

@billsacks
Copy link
Member Author

@ekluzek @samsrabin @adrifoster - I have added some notes on pFUnit-based unit testing connected with this PR: https://github.com/billsacks/ctsm/blob/fix_pfunit_multiple_esmf_inits/doc/design/pfunit_testing.rst . I'm not tied to anything in there, so if you want to make changes, I welcome you to change it directly - no need to go back and forth on it via review.

One thing I noticed today is that I think it would actually be okay to initialize ESMF multiple times: the problem just comes from trying to initialize it after finalizing. For now I'm leaving the ESMF_Initialize in a conditional on whether it has already been initialized, but if that starts to feel too cumbersome, I think it would be safe to take out that conditional: from examining the ESMF code, it looks like ESMF_Initialize returns without doing anything if ESMF has already been initialized. (I have addressed this a bit in the above design document.)

@billsacks
Copy link
Member Author

Now I see why the unit tests were looking like they passed even though they were aborting early: It was because of the way I had originally written the stop here (referencing unittestTimeManagerMod.F90 before the changes in this PR):

    call ESMF_Initialize(rc=rc)
    if (rc /= ESMF_SUCCESS) then
       stop 'Error in ESMF_Initialize'
    end if

This change (without the other changes in this PR) would result in a reported failure instead of pass:

diff --git a/src/unit_test_shr/unittestTimeManagerMod.F90 b/src/unit_test_shr/unittestTimeManagerMod.F90
index 8e1cdb3f3..0db91fb35 100644
--- a/src/unit_test_shr/unittestTimeManagerMod.F90
+++ b/src/unit_test_shr/unittestTimeManagerMod.F90
@@ -91,7 +91,8 @@ subroutine unittest_timemgr_setup(dtime, use_gregorian_calendar)
 
     call ESMF_Initialize(rc=rc)
     if (rc /= ESMF_SUCCESS) then
-       stop 'Error in ESMF_Initialize'
+       print *, 'Error in ESMF_Initialize'
+       stop 1
     end if
 
     if (l_use_gregorian_calendar) then

@billsacks
Copy link
Member Author

Now I see why the unit tests were looking like they passed even though they were aborting early: It was because of the way I had originally written the stop here (referencing unittestTimeManagerMod.F90 before the changes in this PR):

which is to say: this was not really a problem in ESMF or pFUnit, but rather a problem with how I was doing the error handling.

@billsacks
Copy link
Member Author

Here is some more information from Tom Clune... I don't think this changes anything, but I'm just noting it to keep others in the loop:

Yes STOP (and the more modern ERROR STOP) do not play well with unit testing frameworks. Nor do outright crashes or hangs.

There is a layer in pFUnit that attempts to run the test suite in a separate process and can then in theory identify tests that crash or hang. I called that TestRunner subclass “RobustTestRunner” or something like that. Unfortunately, the unix pipe mechanism I use for inter process communication in that mode is, er, not very robust. It passes my tests that exercise the robust layer, but in practice it was too fragile to be useful.

With pFunit the next best thing is to use the “-d” (debug) flag on the command line. Then you at least can see which test crashed or is hanging.

I should add that ERROR STOP might be sufficient for your purposes in the snippet of code you sent. It will generally cause the Fortran executable to have a nonzero status. (Fortran standard does not admit the existence of *nix, so this cannot be specified there, but in practice.) Of course there is also nothing in the MPI standard that guarantees that the error status of the Fortran executable is passed along as the error status of mpirun …

@samsrabin
Copy link
Collaborator

samsrabin commented Mar 27, 2025

Thanks so much for this work, Bill. I'm going to try and review in the next couple of days. For now I want to add this, since #3023 was merged (I'll handle doing this):

@samsrabin samsrabin linked an issue Mar 27, 2025 that may be closed by this pull request
@samsrabin samsrabin added the test: PFUNIT Pass FUNITCTSM tests before merging label Mar 27, 2025
@billsacks
Copy link
Member Author

Just FYI, I have changed ESMF to give a better error message in this init-final-init case (esmf-org/esmf#368), but as I noted above, it wouldn't have helped in the case that caused the problems leading to this PR.

ekluzek added a commit to ekluzek/CTSM that referenced this pull request Mar 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bfb bit-for-bit bug something is working incorrectly devops Development Operations to improve development throughput, E.g., adding GitHub Workflows test: PFUNIT Pass FUNITCTSM tests before merging testing additions or changes to tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CNPhenology (and other?) unit tests not being run FUNIT test not working on izumi
3 participants