Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix clm2 support #602

Merged
merged 1 commit into from
Jul 19, 2024
Merged

Fix clm2 support #602

merged 1 commit into from
Jul 19, 2024

Conversation

forsyth2
Copy link
Collaborator

Fix clm2 support. Follow-up fix to #421, addressing an accidental reversion in #424. (See #421 (comment) for details)

Comment on lines 112 to 116
{%- if input_files.split(".")[0] == 'cam' or input_files.split(".")[0] == 'eam' or input_files.split(".")[0] == 'elm' or input_files.split(".")[0] == 'clm2' %}
--prc_typ={{ input_files.split(".")[0][:3] }}
{%- else %}
--prc_typ={{ prc_typ }}
{%- endif %}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BunnyVon @chengzhuzhang Does this still accomplish what #421 was trying to do? For reference, that change was:

--prc_typ={{ input_files.split(".")[0] }}

to

--prc_typ={{ input_files.split(".")[0][:3] }}

@golaz Does this still accomplish what #424 was trying to do? For reference, that change was:

{%- if input_files.split(".")[0] == 'cam' or input_files.split(".")[0] == 'eam' or input_files.split(".")[0] == 'elm' or input_files.split(".")[0] == 'clm2'  %}
--prc_typ={{ input_files.split(".")[0][:3] }}
{%- else %}
--prc_typ=sgs
{%- endif %}

to:

--prc_typ={{ prc_typ }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. I was able to run the bash files by changing --prc_typ=clm

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@forsyth2

I think for following the logic for #424,
We can just modify the block here:

zppy/zppy/utils.py

Lines 159 to 161 in 0883805

elif tmp in ("clm2", "elm"):
component = "lnd"
prc_typ = tmp
to make sure the prc_typ for clm2 should be clm.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chengzhuzhang I added a second commit with that change:

    elif tmp in ("clm2",):
        component = "lnd"
        prc_typ = "clm"
    elif tmp in ("elm",):

Is that correct? Or did we want it for elm too?

@forsyth2 forsyth2 marked this pull request as ready for review June 17, 2024 19:34
@chengzhuzhang
Copy link
Collaborator

chengzhuzhang commented Jun 18, 2024

@czender Hi Charlie, could you help to review if the logic here for get prc_typ still good for latest nco?

zppy/zppy/utils.py

Lines 143 to 169 in 0883805

def getComponent(input_component, input_files):
if input_component != "":
tmp = input_component
else:
tmp = input_files.split(".")[0]
# Default ncclim procedure type is "sgs"
prc_typ = "sgs"
# Output component (for directory structure) and ncclimo procedure type
if tmp in ("cam", "eam", "eamxx"):
component = "atm"
prc_typ = tmp
elif tmp in ("cpl",):
component = "cpl"
elif tmp in ("clm2", "elm"):
component = "lnd"
prc_typ = tmp
elif tmp in ("mosart",):
component = "rof"
else:
raise ValueError(
f"Cannot extract output component name from {input_component} or {input_files}."
)
return component, prc_typ

The needed change is to make prc_typ = clm for clm2 output.

Copy link

@czender czender left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I expect this will work fine.

@chengzhuzhang
Copy link
Collaborator

Yes, I expect this will work fine.

Thank you, Charlie!

@chengzhuzhang
Copy link
Collaborator

@forsyth2 The new commit 48988df looks good. I don't think we need d936b85 with this change.

@forsyth2
Copy link
Collaborator Author

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-both-commits/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
# No results => everything worked

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-1st-commit-only/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.status:WAITING 548863
e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.status:WAITING 548862
e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851.status:WAITING 548864
tc_analysis_1850-1851.status:RUNNING 548854
tc_analysis_1852-1853.status:WAITING 548855
$ sq
USER     ACCOUNT NODE PARTITION JOBID   ST     REASON       TIME TIME_LIMIT NAME
ac.forsy    e3sm    1     debug 548855  PD Dependency       0:00      30:00 tc_analysis_1852-1853
ac.forsy    e3sm    1     debug 548862  PD Dependency       0:00      30:00 e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851
ac.forsy    e3sm    1   compute 548863  PD Dependency       0:00    5:00:00 e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851
ac.forsy    e3sm    1   compute 548864  PD Dependency       0:00      30:00 e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851

The other commit is in fact necessary. Without it, 3 e3sm_diags jobs don't run because they are waiting on jobs that don't exist. TC Analysis also seems to fail on 1850-1851, meaning 1852-1853 can't run.

@forsyth2
Copy link
Collaborator Author

It appears the both-commits test was done without the #611 commit, (meaning the global time series plots don't look right), so I need to re-test with that. But it does look like we need both commits.

@forsyth2
Copy link
Collaborator Author

@chengzhuzhang This is my testing summary so far. I'm hoping to get back to this later in the afternoon or tomorrow.

Also, I think there was some confusion over my debugging/testing process. To clarify, when I say "test off latest main" I mean rebasing the code off the latest main branch. This was important because the latest commit on main, #611, changed testing results, so it was important to test off the latest main. The main branch itself is unaffected -- there is no merging from this branch yet.

Furthermore, the issues I describe below specifically come up in the context of the complete-run test, so it is only possible to narrow scope/runtime so much. (The dry-run parameter is helpful, for instance, for checking pre-runtime things like dependencies). At this point, I am confident that the both-commits version is working.

But it seems that what you were getting at in yesterday's meeting was that we really shouldn't need the 1st commit (the ts.bash edits). In that case, see the below testing log for dropping the first commit. Again, I'm planning to do further debugging myself later, but so far all I can see is that something is happening with the TC analysis.

For reference, the respective changes:

1st commit -- ts.bash edits:

--prc_typ={{ prc_typ }}

=>

{%- if input_files.split(".")[0] == 'cam' or input_files.split(".")[0] == 'eam' or input_files.split(".")[0] == 'elm' or input_files.split(".")[0] == 'clm2'  %}
--prc_typ={{ input_files.split(".")[0][:3] }}
{%- else %}
--prc_typ={{ prc_typ }}
{%- endif %}

2nd commit -- utils.py edits:

elif tmp in ("clm2", "elm"):

=>

    elif tmp in ("clm2",):
        component = "lnd"
        prc_typ = "clm"
    elif tmp in ("elm",):

Both commits

test output dir branch base commit conda env Ran pip install . && python tests/integration/utils.py? lessons learned
/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-611-both-commits/v2.LR.historical_0201/post/scripts/ issue-421-fix Add center times (611) zppy_dev_n600 y No errors. Can proceed with bundles run and integration testing.

With both the zppy/templates/ts.bash changes and the zppy/utils.py changes, the output is ready for final testing. (The branch with both commits actually did previously pass the tests aside from the global-time-series plot test -- and I'm almost certain that was failing because I hadn't yet rebased to include #611 -- hence why rebasing to include #611 was important).

Dropping the first commit (ts.bash edits is dropped, leaving only the utils.py edits)

test output dir branch base commit conda env Ran pip install . && python tests/integration/utils.py? lessons learned
/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-1st-commit-only/v2.LR.historical_0201/post/scripts/ issue-421-fix-drop-1st-commit Add center times (611) zppy_dev_n600 y Note the unique_id is a misnomer. It should really be drop-1st-commit. grep -v "OK" *status shows e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851, e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851, e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851, tc_analysis_1852-1853 waiting, and their jobs were still in the squeue output marked as waiting for a dependency. There's also tc_analysis_1850-1851.status:RUNNING 548854. So we have a couple problems:

1) e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851 and e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851 are waiting on jobs that apparently don't exist.

2) e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851 and tc_analysis_1852-1853 are waiting on tc_analysis_1850-1851, which is apparently silently failing. The .o file indicates /var/spool/slurmd/job548854/slurm_script: line 114: /lcrc/globalscratch/ac.forsyth2//tc-analysis_1850_1851/aew_v2.LR.historical_0201_1850_1851.txt: No such file or directory. This doesn't make any sense though because tc_analysis doesn't depend on the ts task, so dropping that commit should not have affected this.

Going to try re-running.
(Rerunning incomplete tasks) issue-421-fix-drop-1st-commit Add center times (611) zppy_dev_n600 y Nothing actually re-ran because zppy won't re-run tasks that say "WAITING" or "RUNNING". Need to delete those status files.
(Rerunning incomplete tasks) issue-421-fix-drop-1st-commit Add center times (611) zppy_dev_n600 y It doesn't seem like it was a one-time issue. The bad status files still occur. Turning on dry-run to take a look at dependencies.
(Rerunning incomplete tasks) issue-421-fix-drop-1st-commit Add center times (611) zppy_dev_n600 y Dependency file paths are long. Changing print statement to print([os.path.basename(d) for d in dependencies])
(Rerunning incomplete tasks) issue-421-fix-drop-1st-commit Add center times (611) zppy_dev_n600 y e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851 and e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851 have identical dependenices: ['tc_analysis_1850-1851.status', 'climo_atm_monthly_180x360_aave_1852-1853.status', 'climo_atm_monthly_diurnal_8xdaily_180x360_aave_1852-1853.status', 'tc_analysis_1852-1853.status', 'ts_atm_monthly_180x360_aave_1852-1853-0002.status', 'ts_rof_monthly_1852-1853-0002.status']. It looks like everything else is blocked because TC Analysis 1850-1851 is failing. I'm not sure why tc_analysis is showing up as a dependecy for the land model-vs-model when it has sets = "lat_lon_land", only. In the tc_analysis .o file, I see NetCDF: HDF error /var/spool/slurmd/job552637/slurm_script: line 80: 331340 Killed GenerateCSMesh --res $res --alt --file ${result_dir}outCSne$res.g, but that isn't particularly helpful.

@forsyth2
Copy link
Collaborator Author

@chengzhuzhang I just discovered the same TC analysis error testing a completely different PR (#612), so that leads me to believe it's a problem on my end, perhaps with my scratch space (since I know TC Analysis makes use of that). What I don't understand is why the both-commits version didn't run into that issue when running the complete-run test.

@chengzhuzhang
Copy link
Collaborator

@forsyth2 I reviewed code change in first commit, and second commit. If I understand correctly both should take identical effect. I don't know why keeping both commits works, while drop the first commit doesn't...
Also in the /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/issue-421-post-600-1st-commit-only/v2.LR.historical_0201/post/scripts/, i saw all stutus files show okay. Am I in the wrong output directory?

@forsyth2
Copy link
Collaborator Author

i saw all stutus files show okay.

I deleted the WAITING/RUNNING status files so I could re-run. I was re-running with dry-run so they haven't been reproduced yet.

I suspect the error is actually unrelated to this PR, being as I ran into a similar issue elsewhere (#602 (comment)). That will require more debugging to see what's wrong on my end.

@chengzhuzhang
Copy link
Collaborator

chengzhuzhang commented Jul 16, 2024

@forsyth2 Got it. Thank you for clarifying. I agree there might be a separate issue, which is irrelevant to code change in the PR.

@forsyth2
Copy link
Collaborator Author

For reference, code changes summary:

#421 zppy/templates/ts.bash:

  • --prc_typ={{ input_files.split(".")[0] }} => --prc_typ={{ input_files.split(".")[0][:3] }}
  • {% if input_files == 'elm.h0' -%} => {% if input_files.split(".")[0] == 'clm2' or input_files.split(".")[0] == 'elm' -%}
  • {% if input_files == 'eam.h0' -%} => {% if input_files.split(".")[0] == 'cam' or input_files.split(".")[0] == 'eam' -%}

#424 zppy/templates/ts.bash:

{%- if input_files.split(".")[0] == 'cam' or input_files.split(".")[0] == 'eam' or input_files.split(".")[0] == 'elm' or input_files.split(".")[0] == 'clm2'  %}
--prc_typ={{ input_files.split(".")[0][:3] }}
{%- else %}
--prc_typ=sgs
{%- endif %}

=> --prc_typ={{ prc_typ }}
where prc_typ is defined in zppy/ts.py:

c["component"], c["prc_typ"] = getComponent(
            c["input_component"], c["input_files"]
        )

The zppy/utils.py change in this PR applies the following in that function:
elif tmp in ("clm2", "elm"): =>

    elif tmp in ("clm2",):
        component = "lnd"
        prc_typ = "clm"
    elif tmp in ("elm",):

@forsyth2
Copy link
Collaborator Author

Confirmed the issue is #613, doing final checks now.

@forsyth2 forsyth2 merged commit df88dbe into main Jul 19, 2024
4 checks passed
@forsyth2 forsyth2 deleted the issue-421-fix branch July 19, 2024 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants