U/jrbogart/config reorg #112

Merged
merged 23 commits into from
Aug 2, 2024
Conversation

Collaborator

@JoanneBogart JoanneBogart commented Jul 15, 2024

Reorganize config file so that almost everything is stored per object type. The top level file only includes schema version and the information necessary to locate the config and data directory and a mapping of object types to the yaml files containing details about them. Since catalog files for different object types are in general made at different times, using different run options and possibly different versions of the code, provenance is stored per object type. Cosmology (for galaxy-like types) and SED information is also now stored within the per-object block.

The old object types used to represent components of galaxies (e.g. bulge_basic) are no longer full-fledged object types. Their information is in the parent file.

The creation code handles writing of the top-level config file and the file for the object type data being generated. If the top-level file already exists because other object type data have already been created, the creation code will update the top file with an entry for the new object type. Entries for data not handled by the creation code (e.g. gaia stars; snana) must be added by hand. Examples of suitable per-object configs may be found under skycatalogs/data/cfg_templates
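A minimal sketch of the layout described above; all key names except schema_version and object_types are illustrative, and the real templates live under skycatalogs/data/cfg_templates:

```python
# Hedged model of a schema-1.3.0 top-level config: only the schema version,
# directory information, and a mapping from object type to its per-type yaml
# file live here. Provenance, Cosmology, and SED info go in the per-type files.
top_level = {
    'schema_version': '1.3.0',
    'catalog_dir': '/path/to/catalog',   # hypothetical
    'object_types': {
        'galaxy': 'galaxy.yaml',
        'star': 'star.yaml',
        'snana': 'snana.yaml',           # added by hand, per the description
    },
}

def per_type_config_file(config, object_type):
    # Locate the yaml file holding the details for one object type.
    return config['object_types'][object_type]
```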

The SkyCatalog class can handle old-style configs as well as those in the new (schema version 1.3.0) format.

Some notes on details of and motivation for this update can be found here.

@JoanneBogart JoanneBogart marked this pull request as draft July 15, 2024 22:36
@JoanneBogart JoanneBogart marked this pull request as ready for review July 15, 2024 23:17
Collaborator

@jchiang87 jchiang87 left a comment


I've made a few suggestions, but after getting part way through the changes, it might be better to just make a couple of general comments:

  • There's a lot of code to make things backwards compatible. For both near and long term maintainability, it might be preferable not to support backwards compatibility and instead provide scripts to convert existing catalogs to the new format.
  • Instead of having separate utility functions for writing the provenance for each object type, I'm wondering if each BaseObject subclass could implement its own member function to write provenance. I think that approach would avoid the code complexity that comes with the different if-elif branches in config_utils.py and would make it easier to add new object types in the future.
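The second suggestion could look something like this minimal sketch; the class and method names here are illustrative, not the actual skyCatalogs API:

```python
# Each BaseObject subclass supplies its own provenance, replacing the
# if-elif dispatch on object type in config_utils.py.
class BaseObject:
    object_type = 'base'

    def write_provenance(self):
        # Default provenance shared by all object types.
        return {'object_type': self.object_type}

class StarObject(BaseObject):
    object_type = 'star'

class GalaxyObject(BaseObject):
    object_type = 'galaxy'

    def write_provenance(self):
        # Extend the common provenance with type-specific entries.
        prov = super().write_provenance()
        prov['cosmology'] = {'H0': 71.0}   # illustrative value
        return prov
```

Adding a new object type then means adding one subclass rather than another branch in a central function.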

if self._schema_version:
    # cmps = self._cmps
    # if (cmps[0] > 1) or (cmps[0] == 1 and cmps[1] > 2):
    old_style = self._schema_pre_130
Collaborator

old_style is already set on line 281. Also, is there any reason to keep the commented out lines?

Collaborator Author

Nope

Comment on lines 295 to 296
if old_style:
    return self._cfg['Cosmology']
Collaborator

If you move this if-block to the top of this function, it looks like you can omit any further references to old_style and thereby simplify the logic for non-old_style cases.

Collaborator Author

Originally self._schema_pre_130 wasn't pre-computed, so more complicated logic was needed here. I neglected to simplify here after that was changed. Will fix.

Comment on lines 288 to 291
if 'galaxy' in self._cfg['object_types']:
    object_type = 'galaxy'
elif 'diffsky_galaxy' in self._cfg['object_types']:
    object_type = 'diffsky_galaxy'
Collaborator

This code assumes that 'galaxy' and 'diffsky_galaxy' types can never be used at the same time. Is that enforced somewhere? Also, there must be some cosmology assumed for SNe and AGNs. Are we not keeping track of cosmology for those object types? How, if at all, do we ensure the cosmologies are consistent?

Collaborator Author

@JoanneBogart JoanneBogart Jul 17, 2024

There are no AGN currently, of course. For snana-style SNe I don't see any mention of cosmology in the code. If it's needed it should probably be included in the snana section of the config (ideally to be supplied by the time domain people along with the data). Or else there should be something connecting snana to the galaxy type.

I agree something needs to be done about multiple galaxy types - either support them or consistently disallow. The strategy here is

  • use caller-specified object_type if there is one
  • else if galaxy is present, use it
  • else if diffsky_galaxy is present use it
  • else throw up our hands and return None since no other object_types currently have an associated cosmology
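That strategy can be sketched as follows; this is a hedged illustration using the key names from the snippet under review, not the actual implementation:

```python
# Lookup order: caller-specified type, then 'galaxy', then 'diffsky_galaxy',
# else None since no other object types have an associated cosmology.
def get_cosmology_section(cfg, object_type=None):
    object_types = cfg['object_types']
    if object_type is None:
        if 'galaxy' in object_types:
            object_type = 'galaxy'
        elif 'diffsky_galaxy' in object_types:
            object_type = 'diffsky_galaxy'
        else:
            return None
    return object_types[object_type].get('Cosmology')
```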



def get_file_metadata(fpath, key='provenance'):
class ConfigWriter():
Collaborator

I don't think the () is needed here.

Collaborator Author

I've moved the code to create fragments to little per-object-type classes which subclass BaseConfigFragment. The subclasses are within the xxx_object.py files.

Comment on lines 521 to 531
if not self._overwrite:
    try:
        with open(outpath, mode='x') as f:
            yaml.dump(input_dict, f)
    except FileExistsError:
        txt = 'write_yaml: Will not overwrite pre-existing config'
        self._logger.warning(txt + outpath)
        return
else:
    return self.update_yaml(input_dict, outpath)

Collaborator

@jchiang87 jchiang87 Jul 16, 2024

I think it would be clearer if this function were written like this:

if self._overwrite:
    return self.update_yaml(input_dict, outpath)

try:
    with open(outpath, mode='x') as f:
        yaml.dump(input_dict, f)
except FileExistsError:
    txt = 'write_yaml: Will not overwrite pre-existing config'
    self._logger.warning(txt + outpath)
    return

return outpath

Collaborator Author

Agreed.

except FileExistsError:
    txt = 'write_yaml: Will not overwrite pre-existing config'
    self._logger.warning(txt + outpath)
    return
Collaborator

As far as I can tell, the return value of write_yaml isn't used anywhere, so it's confusing to see the exception caught here and the function returning None when the other cases are returning non-None values.

Collaborator Author

It isn't used currently, but returning the path written seemed like it might be useful. In case of the exception nothing is written so None is returned.

Collaborator

Depending on how a return value would be used, I would nominally expect that

try:
    with open(outpath, mode='x') as f:
        yaml.dump(input_dict, f)
except FileExistsError:
    txt = 'write_yaml: Will not overwrite pre-existing config'
    self._logger.warning(txt + outpath)

return outpath

has the desired behavior, since in both cases (i.e., with and without the exception), the path to the yaml file with the specified config is returned.

Collaborator Author

Returning None is an indication (available programmatically, not just in a log message) that nothing got written.

Collaborator

Sure, but since nothing is using the return values, it's not clear that is or is not the right thing.

-------
A dict

The resulting dict looks much like the file galaxy_template.yaml under
Collaborator

It would be good to indent these lines to match the rest of the docstring.

@JoanneBogart
Collaborator Author

Concerning your general comments -
Providing scripts to convert older configs to the current schema would no doubt be useful. It would also take some work, significantly more than went into the backward-compatibility code, since there are plenty of differences in form which don't affect the operation of the API. Also, when there is a change in structure like the one proposed here, there could be some awkwardness for people running an older version of the code and trying to access a catalog with a new-style config: the old code can't warn them what the problem is. After a while presumably everyone would know to use a newer version of the code, but during the transition it could be frustrating.

I propose that the code for this release should have backwards compatibility more or less as currently implemented (taking your suggestions into account). Since I don't anticipate ever again changing the location of schema_version, now a top-level key, I can put something in the code to check that version against what's in the code and put out a message if it's newer, so in the future there will be a warning if there is any kind of mismatch.
There should also be a script handling conversion when the schema changes, but I'd like to deal with that in a separate PR.
Moving the provenance-generating code to BaseObject and subclasses seems promising. I'll try to get that in this PR if I don't run into any serious roadblocks.
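The version check proposed above could look roughly like the following sketch, assuming schema_version stays a top-level 'major.minor.patch' string; the helper names are hypothetical:

```python
import warnings

CODE_SCHEMA_VERSION = (1, 3, 0)   # newest schema this code understands

def parse_version(version_string):
    # '1.3.0' -> (1, 3, 0); tuples compare componentwise.
    return tuple(int(part) for part in version_string.split('.'))

def check_schema_version(config):
    # Warn if the catalog config is newer than this code supports.
    file_version = parse_version(config['schema_version'])
    if file_version > CODE_SCHEMA_VERSION:
        warnings.warn(
            f"Config schema {config['schema_version']} is newer than this "
            "code supports; please update skyCatalogs.")
    return file_version
```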

@jchiang87
Collaborator

An alternative to conversion scripts would be simply to make available modern versions of the catalogs in use now. I doubt anybody aside from you and me have ever run create_sc.py. My worry about the backwards compatibility code in there now is that it will be technical debt that never gets retired. Also, because I had hoped it would go away, I didn't look as carefully as I would have had otherwise.

@JoanneBogart
Collaborator Author

I don't see how I can make use of BaseObject and subclasses as hoped. There are no such objects lying around at the time the config fragments are being written. I could make another class hierarchy for object types, but there would still need to be creation of an instance for each (or alternatively make everything static) and a mapping of name to corresponding object-type object. It wouldn't be any less code than is there now but maybe more maintainable.

@jchiang87
Collaborator

One thing that I wanted to note is that the version info for the throughputs package is recorded in the baseline subfolder, e.g., for the descssim location at NERSC, there is

[login33 w_2024_22] pwd -P
/global/cfs/cdirs/descssim/imSim/lsst/data/throughputs_2023_09_07
[login33 w_2024_22] cat baseline/version_info 
1.9
[login33 w_2024_22] 

So that info could be read from that location and added to the provenance output, presumably associated with the flux files.

@JoanneBogart
Collaborator Author

Thanks for the tip about throughputs version. As I mentioned at the Thursday zoom, the plan was to include this information if possible. I had seen those version numbers in the package but didn't know which, if any, was suitable for this purpose.

@JoanneBogart
Collaborator Author

I've added the throughputs version to the per-file metadata as suggested. I've also added a little script to dump the provenance and updated sso file creation to include metadata (I forgot about sso in the first round).
The PR is ready for re-review.

Collaborator

@jchiang87 jchiang87 left a comment


Thanks, the ConfigFragment implementation looks much cleaner. I do have some suggestions for simplifying/clarifying further.

Regarding making sure either only diffsky or cosmoDC2 galaxies are used, but not both, or alternatively, adding a mechanism to ensure cosmologies are consistent, can we add an issue to address this so that it doesn't get dropped entirely? An issue to add cosmology info for SNe would also be good.

@@ -70,7 +74,7 @@ def load_lsst_bandpasses():
         bp = bp.withZeropoint('AB')
         lsst_bandpasses[band] = bp

-    return lsst_bandpasses
+    return lsst_bandpasses, version
Collaborator

I'll note that adding the version to the return value like this is an interface change and will break client code that calls load_lsst_bandpasses.

Collaborator Author

@JoanneBogart JoanneBogart Jul 31, 2024

True. load_lsst_bandpasses could instead store the version as well as the bandpasses in a separate global variable and there could be a separate public get_lsst_bandpasses_version routine.
Or I could add a keyword argument "return_version", default False, and when False keep to old interface.

Collaborator

I suggest an internal version of the function in addition to the public interface. So change the name of the current load_lsst_bandpasses to _load_lsst_bandpasses, and make a new public function that just returns the dict, so something along the lines of

def load_lsst_bandpasses():
    return _load_lsst_bandpasses()[0]
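Fleshed out slightly, that split might look like the following sketch; the cached tuple, the faked loader body, and the get_lsst_bandpasses_version name are all illustrative:

```python
_bandpass_cache = None

def _load_lsst_bandpasses():
    # Internal version: returns (bandpass dict, throughputs version).
    # The real loader reads the throughput files and baseline/version_info;
    # here the data are faked for illustration.
    global _bandpass_cache
    if _bandpass_cache is None:
        _bandpass_cache = ({'u': 'bp_u', 'g': 'bp_g'}, '1.9')
    return _bandpass_cache

def load_lsst_bandpasses():
    # Public interface unchanged: clients still get just the dict.
    return _load_lsst_bandpasses()[0]

def get_lsst_bandpasses_version():
    # Separate accessor for the version, so no client code breaks.
    return _load_lsst_bandpasses()[1]
```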

    with open(os.path.join(bp_dir, 'version_info')) as f:
        version = f.read().strip()
else:
    version = 'galsim_builtin'
Collaborator

@jchiang87 jchiang87 Jul 31, 2024

The logic in this function seems to have gotten a little unwieldy. I'd suggest simplifying it like this:

    lsst_bandpasses = dict()
    rubin_sim_dir = os.getenv('RUBIN_SIM_DATA_DIR', None)
    bp_dir = os.path.join(rubin_sim_dir, 'throughputs', 'baseline')

    if os.path.isdir(bp_dir):
        with open(os.path.join(bp_dir, 'version_info')) as f:
            version = f.read().strip()
    else:
        bp_dir = None
        version = 'galsim_builtin'
<...everything below the same...>

For the special branch looking for throughputs in $HOME, I'd suggest setting RUBIN_SIM_DATA_DIR=${HOME}/rubin_sim_data in the user environment since this doesn't seem like a use case we need to support in general. Also, I don't see BaseObject._bp_path used anywhere so we can just omit that line. More generally, I think there should be a warning message that the builtin galsim throughputs are being used, since silently falling over to them is probably not what users would want to happen by default. Personally, I would prefer not to have that as a option at all since those bandpasses are not kept up-to-date.

Collaborator Author

Concerning throughputs in $HOME - that code has been there forever (well, at least 2 years). I don't mind getting rid of it once the installation procedure is well-documented, which is a work in progress. See https://github.com/LSSTDESC/skyCatalogs/tree/u/jrbogart/skycat_doc.
I would rather not get rid of the galsim_builtin option (all there is currently for Roman) but there should be a warning; I'll add that.

Collaborator

Does anybody use the $HOME code? I've always wondered why it's there. afaik, that option isn't documented at all so I don't see why it even exists.

Collaborator Author

I have no idea whether anyone uses it. It's not documented in skyCatalogs documentation, but neither is anything else about finding throughputs or SEDs.

Collaborator

But we definitely know that the de facto standard functionality associated with throughputs and SEDs as implemented is used. So, if it's possible to clean up this special case that no one uses, which has a preferred alternative, and which makes the code more complicated than it ultimately needs to be, I think we should do that.

Collaborator Author

Although I don't know of anyone who uses the $HOME code, I can't be sure that no one does. I'll get rid of it after the preferred method is documented.

Comment on lines 51 to 58
if self._opt_dict:
    opt = self._opt_dict
    other = dict()
    for key in opt:
        if opt[key] is not None:
            other[key] = opt[key]
    if len(other.keys()) > 0:
        data.update(other)
Collaborator

I'm wondering why all this is necessary. First off, I think you should set self._opt_dict = {} in the base class __init__, since that if self._opt_dict test will raise if the subclasses don't set that attribute. Also, self._opt_dict is a private attribute, and you have control over what goes into it, so testing for a None value seems unnecessary. If it's empty, i.e., if len(self._opt_dict.keys()) == 0, then the data.update(...) will have no effect, so that test seems unneeded. All in all, this block of code could be replaced by

data.update(self._opt_dict)

assuming you set self._opt_dict = {} in the base class __init__.

Collaborator Author

I don't think I can just use update, at least not as the rest of the code stands. self._opt_dict will usually include keys like
{'area_partition': None, 'data_file_type': None}
The None should not override what's already there. So each of the subclass __init__ routines would need to exclude those things from self._opt_dict. I chose to do it in generic_create instead.
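The None-filtering described here amounts to something like the following; the helper name is illustrative:

```python
# Merge option values into data, skipping None entries so they do not
# override values already present in the config.
def merge_options(data, opt_dict):
    data.update({k: v for k, v in opt_dict.items() if v is not None})
    return data
```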

Collaborator

From what I can tell, all of the subclasses have area_partition and data_file_type as keyword options in their __init__ functions, so it seems those could just be moved to the base class and the logic to exclude None put there.

    for the object type.
    Must be implemented by subclass
    '''
    raise NotImplementedError("Must be implemented by subclass")
Collaborator

It seems like you could put a default implementation here instead of requiring subclasses to implement, i.e.,

def make_fragment(self):
    return self.generic_create(self)

or even put the body of generic_create here and have subclasses that re-implement use the base class version before adding entries.


self._opt_dict = {'area_partition': area_partition,
                  'data_file_type': data_file_type}
self._cosmology = cosmology
Collaborator

Why not include a Cosmology entry in self._opt_dict? i.e.,

self._opt_dict = {'area_partition': area_partition,
                  'data_file_type': data_file_type,
                  'Cosmology': cosmology}

I don't see self._cosmology from this class used anywhere else. Then this class could use the default implementation of make_fragment that I suggested be implemented in the base class.

Comment on lines 172 to 176
self._opt_dict = {'area_partition': area_partition,
                  'data_file_type': data_file_type}
self._cosmology = cosmology
self._tophat_bins = tophat_bins

Collaborator

@jchiang87 jchiang87 Jul 31, 2024

Similar to my comment in DiffskyConfigFragment, why not just add Cosmology and tophat entries to self._opt_dict:

self._opt_dict = {'area_partition': area_partition,
                  'data_file_type': data_file_type,
                  'Cosmology': cosmology,
                  'tophat': {'bins': tophat_bins}}

Along with the suggestion for DiffskyConfigFragment, this would enable all subclasses to use the same make_fragment implementation in the base class.

Collaborator Author

I can add the 'Cosmology' entry to self._opt_dict but not 'tophat', because its value in the template is a dict with a couple of keys other than 'bins' which would get lost with the update.
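A small illustration of the pitfall being described: dict.update is shallow, so replacing the whole 'tophat' value drops sibling keys (the 'sed_prefix' key here is hypothetical):

```python
# Template with a nested 'tophat' section holding more than just 'bins'.
template = {'tophat': {'bins': [(100, 50)], 'sed_prefix': 'sed_'}}

# A shallow update replaces the entire nested dict wholesale...
template.update({'tophat': {'bins': [(200, 60)]}})

# ...so 'sed_prefix' is silently lost.
```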

@@ -1,10 +1,11 @@
 import os
-import numpy as np
+# import numpy as np
Collaborator

Delete?

@@ -25,6 +24,9 @@
 from .utils.creator_utils import make_MW_extinction_av, make_MW_extinction_rv
 from .objects.base_object import LSST_BANDS
 from .objects.base_object import ROMAN_BANDS
+from .objects.star_object import StarConfigFragment
+from .objects.galaxy_object import GalaxyConfigFragment
+from .objects.diffsky_object import DiffskyConfigFragment
Collaborator

SnanaConfigFragment and GaiaConfigFragment aren't used anywhere?

Collaborator Author

No, that's because skyCatalogs is not responsible for making those catalogs. Those routines are provided in case someone (or some thing) that does produce them wants to also produce a suitable yaml fragment.

object_types   iterable of (pointsource) object type names
inpath         string    path to file
silent         boolean   if file not found and silent is true,
                         return None. Else raise exception
Collaborator

Please add a docstring entry for resolve_include.

Collaborator

@jchiang87 jchiang87 left a comment

Looks ok, but there is one item from previous comments that I suggested could be cleaned up, aside from the $HOME/bp_dir code.

Comment on lines 63 to 64
if len(other.keys()) > 0:
    data.update(other)
Collaborator

I still think this could just be

data.update(other)

i.e., there is no need to test that other is not empty. fwiw, I think a more standard way to do that would be

if other:

@JoanneBogart JoanneBogart merged commit 192da85 into main Aug 2, 2024
1 check passed
@JoanneBogart JoanneBogart deleted the u/jrbogart/config_reorg branch August 2, 2024 18:25