
[Refactor]: CDAT Migration Phase 3: testing and documentation update #846

Merged -- 24 commits, Sep 27, 2024

Conversation

tomvothecoder
Collaborator

@tomvothecoder tomvothecoder commented Sep 5, 2024

Description

1. Examples

In the root examples/ directory, there are example run scripts that use the legacy way of configuring and running E3SM diagnostics. They should all be refactored to reflect the latest changes.

Tasks

  • Make sure example scripts work and produce results close to or the same as main
    • ex1
    • ex2
    • ex3
    • ex4
    • ex5 -- a few variables failed due to bug FIXED
    • ex6 -- a few variables failed due to bug FIXED
    • ex7
  • Fix incorrect time coordinate for obs dataset affecting ex5 and ex6 variables (comment)
  • Copy fixed obs dataset from NERSC to LCRC -- in progress (@chengzhuzhang)
    • /global/cfs/cdirs/e3sm/e3sm_diags/obs_for_e3sm_diags/climatology/ISCCPCOSP/ISCCPCOSP_ANN_climo.nc

2. Run script with model_vs_model

Tasks

3. Documentation

Tasks

  • Update /tutorials as needed -- none needed
  • Update /tutorials/2024 as needed -- none needed
  • Delete all run_...py scripts (only keep run_v2_9_0_all_sets_E3SM_machines.py)
  • Rename run_v2_9_0_all_sets_E3SM_machines.py to run_all_sets_E3SM_machines.py
  • Update documentation to reflect any changes in the refactored codebase
    • CDP references

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have added tests that prove my fix is effective or that my feature works
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

- Delete outdated run scripts
@tomvothecoder tomvothecoder changed the base branch from main to cdat-migration-fy24 September 5, 2024 22:50
@tomvothecoder
Collaborator Author

@chengzhuzhang The following variables failed in ex5.py and ex6.py:

['/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex5_model_to_obs/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU1.3_9.4_ISCCP-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex5_model_to_obs/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU1.3_9.4_ISCCP-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex5_model_to_obs/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU1.3_ISCCP-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex5_model_to_obs/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU1.3_ISCCP-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex5_model_to_obs/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU9.4_ISCCP-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex5_model_to_obs/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU9.4_ISCCP-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex6_zonal_mean_2d_and_lat_lon_demo/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU1.3_9.4_ISCCP-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex6_zonal_mean_2d_and_lat_lon_demo/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU1.3_9.4_ISCCP-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex6_zonal_mean_2d_and_lat_lon_demo/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU1.3_ISCCP-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex6_zonal_mean_2d_and_lat_lon_demo/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU1.3_ISCCP-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex6_zonal_mean_2d_and_lat_lon_demo/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU9.4_ISCCP-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/examples-dev/ex6_zonal_mean_2d_and_lat_lon_demo/lat_lon/Cloud ISCCP/ISCCPCOSP-CLDTOT_TAU9.4_ISCCP-ANN-global_test.nc']

Root Cause

I found that the time attributes for this file are bad, which breaks decoding (decode_times=True) with xCDAT/Xarray:
/global/cfs/cdirs/e3sm/e3sm_diags/obs_for_e3sm_diags/climatology/ISCCPCOSP/ISCCPCOSP_ANN_climo.nc

Checking time attributes (with decode_times=False)

<xarray.DataArray 'time' (time: 1)> Size: 4B
array([150.5], dtype=float32)
Coordinates:
  * time     (time) float32 4B 150.5
Attributes:
    long_name:  Time
    units:      months since 1983-06

Breaks with decode_times=True

import xarray as xr
import xcdat as xc

filepath = "/global/cfs/cdirs/e3sm/e3sm_diags/obs_for_e3sm_diags/climatology/ISCCPCOSP/ISCCPCOSP_ANN_climo.nc"

ds_xc = xc.open_dataset(filepath)
# ValueError: Non-integer years and months are ambiguous and not currently supported.

ds_xr = xr.open_dataset(filepath)
# ValueError: Failed to decode variable 'time': unable to decode time units
# 'months since 1983-06' with 'the default calendar'. Try opening your dataset
# with decode_times=False or installing cftime if it is not installed.

Workaround

This file can be opened with decode_times=False (which is why it worked in the CDAT codebase), but that means adding logic to our Xarray-based code to accommodate this one bad dataset. I'm not a fan of special-casing bad data.

Unless we can get the time coordinates fixed, this might be the only option. Happy to hear your thoughts.

  • ds = self._open_climo_dataset(filepath)
  • Arguments used when opening climatology datasets:

    # Time coordinates are decoded because there might be cases where
    # a multi-file climatology dataset has different units between files
    # but raw encoded time values overlap. Decoding with Xarray allows
    # concatenation of datasets with this issue (e.g., `area_cycle_zonal_mean`
    # set with the MERRA2_Aerosols climatology datasets).
    # NOTE: This GitHub issue explains why the "coords" and "compat" args
    # are defined as they are below: https://github.com/xCDAT/xcdat/issues/641
    args = {
        "paths": filepath,
        "decode_times": True,
        "add_bounds": ["X", "Y"],
        "coords": "minimal",
        "compat": "override",
    }
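The failure and the decode_times=False workaround can be reproduced without the NERSC file by building an in-memory stand-in with the same time metadata. This is a hedged sketch, not e3sm_diags code; the synthetic dataset below substitutes for the real file:

```python
import numpy as np
import xarray as xr

# In-memory stand-in for the ISCCPCOSP time coordinate (illustrative only;
# the real file also contains data variables).
ds_raw = xr.Dataset(
    coords={
        "time": (
            "time",
            np.array([150.5], dtype=np.float32),
            {"long_name": "Time", "units": "months since 1983-06"},
        )
    }
)

# Decoding fails: fractional "months since" offsets are ambiguous.
try:
    xr.decode_cf(ds_raw)
except ValueError as err:
    print(f"decode failed: {err}")

# Left undecoded (the decode_times=False path), the raw offset survives and
# any code that ignores the time coordinate keeps working, as it did in CDAT.
raw_offset = float(ds_raw["time"].values[0])
print(raw_offset)
```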

@tomvothecoder
Collaborator Author

tomvothecoder commented Sep 9, 2024

To correct myself, xCDAT can handle a missing "calendar" attribute by defaulting it to standard (Example: 2024-09-09 11:55:22,114 [WARNING]: dataset.py(decode_time:360) >> 'time' does not have a calendar attribute set. Defaulting to CF 'standard' calendar.)

# ValueError: Non-integer years and months are ambiguous and not currently supported.

Root cause of this issue:

Related dataset: /global/cfs/cdirs/e3sm/e3sm_diags/obs_for_e3sm_diags/climatology/ISCCPCOSP/ISCCPCOSP_ANN_climo.nc

The issue is that this dataset uses a scalar float time value of 150.5, which represents the middle of the month. xCDAT's decoding logic uses the relativedelta module, which expects whole-month offsets placed at the start of each month (e.g., 150, 151). As a result, relativedelta raises ValueError: Non-integer years and months are ambiguous and not currently supported.

MVCE based on xCDAT logic

from datetime import datetime
from dateutil import parser
from dateutil import relativedelta as rd

import numpy as np

flat_offsets = np.array([150.5])
ref_date = "1983-06"
units_type = "months"

ref_datetime: datetime = parser.parse(ref_date, default=datetime(2000, 1, 1))

# ValueError: Non-integer years and months are ambiguous and not currently supported.
times = np.array(
    [
        ref_datetime + rd.relativedelta(**{units_type: offset})
        for offset in flat_offsets
    ],
    dtype="object",
)

Possible workarounds

  1. Update dataset to replace scalar value 150.5 (representing 12/15/1995 or 12/16/1995 I think) with 150.0 (representing 12/01/1995)
    • Maybe low risk because it is a single time coordinate and it is still within the same month of the climatology. I don't think this time coordinate is used downstream?
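As a sanity check, the proposed snap from 150.5 to 150.0 can be verified with the same relativedelta logic the MVCE uses. A sketch; only the offsets and reference date come from the discussion above:

```python
from datetime import datetime

from dateutil import relativedelta as rd

ref_datetime = datetime(1983, 6, 1)  # "months since 1983-06"

# The original fractional offset is rejected outright by relativedelta.
try:
    rd.relativedelta(months=150.5)
    fractional_supported = True
except ValueError:
    fractional_supported = False

# Snapping 150.5 -> 150.0 stays within the same month and decodes cleanly.
fixed_time = ref_datetime + rd.relativedelta(months=150.0)
print(fixed_time)  # 1995-12-01 00:00:00
```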

@chengzhuzhang
Contributor

Possible workarounds

1. Update dataset to replace scalar value 150.5 (representing 12/15/1995 or 12/16/1995 I think) with 150.0 (representing 12/01/1995)
   
   * Maybe low risk because it is a single time coordinate and it is still within the same month of the climatology. I don't think this time coordinate is used downstream?

Hi Tom, I think a fractional month could be ambiguous, which is why relativedelta does not support it. The workaround you proposed should be fine in this case. I don't think the time coordinate is used later.

@tomvothecoder tomvothecoder self-assigned this Sep 10, 2024
@tomvothecoder tomvothecoder added the cdat-migration-fy24 CDAT Migration FY24 Task label Sep 10, 2024
@tomvothecoder
Collaborator Author

Hey @chengzhuzhang, the file with the fixed time coordinate that needs to be copied from NERSC to LCRC is /global/cfs/cdirs/e3sm/e3sm_diags/obs_for_e3sm_diags/climatology/ISCCPCOSP/ISCCPCOSP_ANN_climo.nc.

I created a backup of this file in the same directory on NERSC just in case. The backup has .bak appended to the filename.

@tomvothecoder
Collaborator Author

tomvothecoder commented Sep 24, 2024

Findings for the "Test run script with model_vs_model run" task

main

  • Viewer link
  • 128 .nc files (not including diffs)
  • 143 .png files (not including diffs)
  • Sets: area_mean_time_series, arm_diags, lat_lon, mp_partition

dev

  • Viewer link
  • 1040 .nc files (not including diffs)
  • 635 .png files (not including diffs)
  • Sets: annual_cycle_zonal_mean, area_mean_time_series, arm_diags, cosp_histogram, lat_lon, meridional_mean_2d, mp_partition, polar, zonal_mean_2d, zonal_mean_2d_stratosphere, zonal_mean_xy

Why is main producing fewer files?

In the main log file, the specific errors that I think are preventing the generation of the remaining files are:

OSError: No file found for  and JJA in /global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology
OSError: No file found for  and SON in /global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology
OSError: No file found for  and MAM in /global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology
OSError: No file found for  and DJF in /global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology

What I think is happening

  • Dev branch -- dataset_xr.py has a get_ref_climo_dataset() method that is used by most (or all) sets. This method has a try/except statement that catches the IOError raised by _get_climo_filepath(), allowing sets to continue as a model-only run by substituting the test data for the reference data.
    • dataset_xr.py code:

      def get_ref_climo_dataset(
          self, var_key: str, season: ClimoFreq, ds_test: xr.Dataset
      ):
          """Get the reference climatology dataset for the variable and season.

          If the reference climatology does not exist or could not be found, it
          will be considered a model-only run. For this case the test dataset
          is returned as a default value and subsequent metrics calculations
          will only be performed on the original test dataset.

          Parameters
          ----------
          var_key : str
              The key of the variable.
          season : CLIMO_FREQ
              The climatology frequency.
          ds_test : xr.Dataset
              The test dataset, which is returned if the reference climatology
              does not exist or could not be found.

          Returns
          -------
          xr.Dataset
              The reference climatology if it exists or a copy of the test
              dataset if it does not exist.

          Raises
          ------
          RuntimeError
              If `self.data_type` is not "ref".
          """
          # TODO: This logic was carried over from the legacy implementation.
          # It can probably be improved on by setting `ds_ref = None` and not
          # performing unnecessary operations on `ds_ref` for model-only runs,
          # since it is the same as `ds_test`. In addition, returning ds_test
          # makes debugging difficult.
          if self.data_type == "ref":
              try:
                  ds_ref = self.get_climo_dataset(var_key, season)
                  self.model_only = False
              except (RuntimeError, IOError):
                  ds_ref = ds_test.copy()
                  self.model_only = True

                  logger.info(
                      "Cannot process reference data, analyzing test data only."
                  )
          else:
              raise RuntimeError(
                  "`Dataset._get_ref_dataset` only works with "
                  f"`self.data_type == 'ref'`, not {self.data_type}."
              )

          return ds_ref
  • Main branch -- dataset.py has get_climo_variable(), and only the lat_lon set wraps it in a try/except for model-only runs. This means all other sets fail outright, producing fewer files than dev.
    • dataset.py code:

      def get_climo_variable(self, var, season, extra_vars=[], *args, **kwargs):
          """
          For a given season, get the variable and any extra variables and run
          the climatology on them.

          These variables can either be from the test data or reference data.
          """
          self.var = var
          self.extra_vars = extra_vars

          if not self.var:
              raise RuntimeError("Variable is invalid.")
          if not season:
              raise RuntimeError("Season is invalid.")

          # We need to make two decisions:
          # 1) Are the files being used reference or test data?
          #    - This is done with self.ref and self.test.
          # 2) Are the files being used climo or timeseries files?
          #    - This is done with the ref_timeseries_input and
          #      test_timeseries_input parameters.
          if self.ref and self.is_timeseries():
              # Get the reference variable from timeseries files.
              data_path = self.parameters.reference_data_path
              timeseries_vars = self._get_timeseries_var(data_path, *args, **kwargs)

              # Run climo on the variables.
              variables = [self.climo_fcn(v, season) for v in timeseries_vars]
          elif self.test and self.is_timeseries():
              # Get the test variable from timeseries files.
              data_path = self.parameters.test_data_path
              timeseries_vars = self._get_timeseries_var(data_path, *args, **kwargs)

              # Run climo on the variables.
              variables = [self.climo_fcn(v, season) for v in timeseries_vars]
          elif self.ref:
              # Get the reference variable from climo files.
              filename = self.get_ref_filename_climo(season)
              variables = self._get_climo_var(filename, *args, **kwargs)
          elif self.test:
              # Get the test variable from climo files.
              filename = self.get_test_filename_climo(season)
              variables = self._get_climo_var(filename, *args, **kwargs)
          else:
              msg = "Error when determining what kind (ref or test) "
              msg += "of variable to get and where to get it from "
              msg += "(climo or timeseries files)."
              raise RuntimeError(msg)

          # Needed so we can do:
          #   v1 = Dataset.get_variable('v1', season)
          # and also:
          #   v1, v2, v3 = Dataset.get_variable('v1', season, extra_vars=['v2', 'v3'])
          return variables[0] if len(variables) == 1 else variables

    • lat_lon code:

      mv1 = test_data.get_climo_variable(var, season)
      try:
          mv2 = ref_data.get_climo_variable(var, season)
      except (RuntimeError, IOError):
          mv2 = mv1

          logger.info("Can not process reference data, analyse test data only")
          parameter.model_only = True

Options

  1. Update all sets to use get_climo_dataset() and only add try and except to lat_lon, then remove get_ref_climo_dataset() -- cleanest solution
  2. Update logic on get_ref_climo_dataset() to perform model-only run for lat_lon if reference climo file is not found -- easiest solution
  3. Keep things as is if we want to perform model-only runs -- note, this might lead to slower overall runtime compared to main since more sets and variables are being processed

@tomvothecoder
Collaborator Author

Hey @chengzhuzhang, can you provide your thoughts to my above comment?

@chengzhuzhang
Contributor

Hey @chengzhuzhang, can you provide your thoughts to my above comment?

Hi @tomvothecoder, I'm voting for option 1 as the best. It could also be a good opportunity to address the related issue #823.

- Refactor lat_lon driver to split up functions used for model-only run
- remove `_get_ref_climo_dataset` from `dataset_xr.py`
@chengzhuzhang
Contributor

chengzhuzhang commented Sep 25, 2024

Hey @chengzhuzhang, the file with the fixed time coordinate that needs to be copied from NERSC to LCRC is /global/cfs/cdirs/e3sm/e3sm_diags/obs_for_e3sm_diags/climatology/ISCCPCOSP/ISCCPCOSP_ANN_climo.nc.

I created a backup of this file in the same directory on NERSC just in case. The backup has .bak appended to the filename.

@tomvothecoder I have copied the data to LCRC input data server. We can run mache to sync data to all machines once new e3sm_unified version is deployed.

@tomvothecoder
Collaborator Author

Hey @chengzhuzhang, the file with the fixed time coordinate that needs to be copied from NERSC to LCRC is /global/cfs/cdirs/e3sm/e3sm_diags/obs_for_e3sm_diags/climatology/ISCCPCOSP/ISCCPCOSP_ANN_climo.nc.
I created a backup of this file in the same directory on NERSC just in case. The backup has .bak appended to the filename.

@tomvothecoder I have copied the data to LCRC input data server. We can run mache to sync data to all machines once new e3sm_unified version is deployed.

I found a bug with that file where the variable values were replaced with 0.0 for some reason. I recreated the file on NERSC.

Can you point me to the path and I will copy it over again?

@chengzhuzhang
Contributor

Can you point me to the path and I will copy it over again?

The LCRC path is /lcrc/group/e3sm/public_html/diagnostics/observations/Atm/climatology/ISCCPCOSP. Thank you for taking care of this @tomvothecoder

@tomvothecoder
Collaborator Author

Can you point me to the path and I will copy it over again?

The LCRC path is /lcrc/group/e3sm/public_html/diagnostics/observations/Atm/climatology/ISCCPCOSP. Thank you for taking care of this @tomvothecoder

I can't overwrite the file due to permissions and different group access.

-rw-r--r--  1 ac.zhang40 cels    2238458 Sep  9 17:12 ISCCPCOSP_ANN_climo.nc
-rwxr-xr-x  1 ac.zhang40 E3SM    2162972 Nov  5  2019 ISCCPCOSP_ANN_climo.nc.bak
-rwxr-xr-x  1 ac.zhang40 E3SM    2162972 Nov  5  2019 ISCCPCOSP_DJF_climo.nc
-rwxr-xr-x  1 ac.zhang40 E3SM    2162972 Nov  5  2019 ISCCPCOSP_JJA_climo.nc
-rwxr-xr-x  1 ac.zhang40 E3SM    2162972 Nov  5  2019 ISCCPCOSP_MAM_climo.nc
-rwxr-xr-x  1 ac.zhang40 E3SM    2162972 Nov  5  2019 ISCCPCOSP_SON_climo.nc

Whenever you have time, can you move the latest fixed version of this file from NERSC? Thanks!

Collaborator Author

@tomvothecoder tomvothecoder left a comment


Regression test results

  • example scripts -- all run, within rtol
  • model_vs_model -- 124/128 variables within rtol; 4/128 not, but acceptable due to a bug on main (refer to the PR description)
  • model_vs_obs -- 1213/1215 variables within rtol; 2/1215 outside rtol due to <=8/64800 elements mismatching (0.04% and 1.2% max relative diff) -- not a concern
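For context, the per-variable check amounts to an element-wise comparison within a relative tolerance. This is an illustrative sketch with synthetic arrays, not the actual regression-test code; the rtol value is assumed:

```python
import numpy as np

# Assumed tolerance for illustration; not necessarily the value used in the
# regression scripts.
rtol = 1e-5

# Stand-ins for a variable read from the main and dev .nc outputs
# (64800 elements matches a 180x360 lat-lon grid).
var_main = np.linspace(0.0, 300.0, 64800)
var_dev = var_main * (1.0 + 1e-7)  # tiny numerical drift in the dev run

matches = np.isclose(var_dev, var_main, rtol=rtol)
print(f"{int(matches.sum())}/{matches.size} elements within rtol")
```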

Summary of Changes

e3sm_diags/driver/

  • annual_cycle_zonal_mean_driver.py, cosp_histogram_driver.py, meridional_mean_2d_driver.py, polar_driver.py, zonal_mean_2d_driver.py, zonal_mean_xy_driver.py -- replaced get_ref_climo_dataset() with get_climo_dataset()
  • lat_lon_driver.py
    • Add _run_diags_2d_model_only() and _run_diags_3d_model_only() functions (used when the ds_ref reference dataset is None)
    • Add _get_ref_climo_dataset() -- now returns None if the reference dataset is not found, instead of setting ds_ref = ds_test
    • Refactor _create_metrics_dict()
      • Use DEFAULT_METRICS_VALUE (999.99) as the default value for missing metrics to support the viewer, which otherwise breaks on None
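A minimal sketch of that fallback; the sentinel value comes from the PR, but the helper function and metric names below are hypothetical:

```python
# Only the 999.99 sentinel comes from the PR; the function and dict below
# are illustrative, not the actual e3sm_diags API.
DEFAULT_METRICS_VALUE = 999.99


def metric_or_default(value):
    """Return the metric value, substituting the sentinel when missing."""
    return DEFAULT_METRICS_VALUE if value is None else value


metrics = {"rmse": None, "corr": 0.98}
safe_metrics = {key: metric_or_default(val) for key, val in metrics.items()}
print(safe_metrics)  # {'rmse': 999.99, 'corr': 0.98}
```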

driver/utils/dataset_xr.py

  • Remove get_ref_climo_dataset() since it is only used by lat_lon_driver.py
  • Fix _subset_vars_and_load() to keep "area" variable

Run scripts

  • Delete all outdated run_....py scripts
  • Rename run_v2_9_0_all_sets_E3SM_machines.py to run_all_sets_E3SM_machines.py -- the filename can be version-agnostic and updated based on the latest version

Update complete_run.py

  • Port over functions from run_v2_6_0_all_sets.py needed in this module

@@ -1,12 +1,15 @@
from __future__ import annotations
Collaborator Author


A file of interest for review

@tomvothecoder tomvothecoder marked this pull request as ready for review September 27, 2024 19:07
Collaborator Author

@tomvothecoder tomvothecoder left a comment


Misc. findings

Outdated review threads (resolved):

  • e3sm_diags/driver/zonal_mean_xy_driver.py
  • e3sm_diags/driver/lat_lon_driver.py (two threads)
@chengzhuzhang
Contributor

Whenever you have time, can you move the latest fixed version of this file from NERSC? Thanks!

This one is done!

@tomvothecoder tomvothecoder changed the title [Refactor]: CDAT Migration Phase 3: testing, documentation update, prepare for new release [Refactor]: CDAT Migration Phase 3: testing, performance benchmark, documentation update Sep 27, 2024
Outdated review threads (resolved):

  • docs/source/dev_guide/using-output-viewer.rst (two threads)
Comment on lines -54 to -56
- interacts effectively with the PCMDI's metrics package and the ARM
diagnostics package through a unifying framework: `Community
Diagnostics Package (CDP) <https://github.com/CDAT/cdp>`_.
Collaborator Author


e3sm_diags no longer uses CDP

@tomvothecoder tomvothecoder changed the title [Refactor]: CDAT Migration Phase 3: testing, performance benchmark, documentation update [Refactor]: CDAT Migration Phase 3: testing and documentation update Sep 27, 2024
@tomvothecoder
Collaborator Author

Merging this PR based on my review comment above.

Labels
cdat-migration-fy24 CDAT Migration FY24 Task

Successfully merging this pull request may close these issues.

[Refactor]: CDAT Migration Phase 3: testing, documentation update
2 participants