Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gaia: change the signature of the method load_data #3014

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

cosmoJFH
Copy link

@cosmoJFH cosmoJFH commented May 29, 2024

The present implementation of the method GaiaClass.load_data makes use of the parameter output_file that defines a user defined zipped file were the data is saved.

We would like to change this parameter by the boolean parameter dump_to_file to save the results in the hard-code file name "datalink_output.zip" that will be saved in the current working directory.

This PR also contains 2 bug fixes in the present implementation:

  1. if the value of the parameter output_file is a simple name, i.e., without a path, the code will collect all the files with extensions '.fits', '.xml', '.csv', or '.ecsv' from the current working directory. This must be changed, since this implies that any file with any of those extensions will be included in zipped file, i.e., independently if they come from the execution of the method load_data, or from previous executions.

  2. If the parameter output_file is used, the built zip file is used to save the files coming from by the datalink service. Then the file is unzipped and the files that it contains are read, so that a list of Tables is built and returned to the user. But the the unzipped files are never removed. We think that this is a bug and it would be good to remove the unzipped files.

New units tests were developed.

cc @esdc-esac-esa-int

jira: GAIAMNGT-1700

@pep8speaks
Copy link

pep8speaks commented May 29, 2024

Hello @cosmoJFH! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-06-21 06:25:01 UTC

@bsipocz bsipocz added this to the v0.4.8 milestone Jun 11, 2024
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to understand the motivation behind removing the ability from the users to define the output name. At the minimum, it should go through a deprecation cycle, but I would rather keep the functionality.

Besides that it looks fine, minus a minor bug about having it but not using the overwrite parameter.

@@ -325,21 +323,31 @@ def load_data(self, ids, *, data_release=None, data_structure='INDIVIDUAL', retr

return files

def build_general_output_filename(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we try to slowly simplify the API whereever we can, so I would suggest to make this private from the beginning.
Or just as well, not have a separate method, but add the 4 lines at the place of usage (as I see it is only used in one place)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have removed the function build_general_output_filename.

@@ -170,7 +169,7 @@ def logout(self, *, verbose=False):

def load_data(self, ids, *, data_release=None, data_structure='INDIVIDUAL', retrieval_type="ALL",
linking_parameter='SOURCE_ID', valid_data=False, band=None, avoid_datatype_check=False,
format="votable", output_file=None, overwrite_output_file=False, verbose=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing a parameter from the public APIs should be done through a deprecation period.

But also, I do wonder why is this necessary, to remove the ability from the users to define a filename/location that better suits them?

So, my suggestion would be to change the default None to datalink_output.zip, and add the new boolean parameter. That way the default will be what you proposed here, but the users still have the ability to select a better location.

if not os.path.isdir(path):
try:
os.mkdir(path)
except FileExistsError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is the overwrite_output_file parameter, it should be used here isn't it?

@bsipocz
Copy link
Member

bsipocz commented Jun 18, 2024

I would also raise the idea of maybe having a separate method for the download instead of the dump_to_file parameter? That is what we have for most of the other modules, though I am not against the idea of having parameters instead. (But then we can also bike-shed on whether the dump_to_file is the best phrase to use.)

Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions.

@andamian
Copy link

I would also raise the idea of maybe having a separate method for the download instead of the dump_to_file parameter? That is what we have for most of the other modules, though I am not against the idea of having parameters instead. (But then we can also bike-shed on whether the dump_to_file is the best phrase to use.)

Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions.

The proposed change is a bit counterintuitive - it's hard to tell what drives it (maybe subsequent processing on the resulting file?). If the user has no control on the name of the file I would expect it to be at least timestamped. Also, current working directory is a bit vague. What is current working directory when a script calls the astroquery library, or a test is run with pytest or astroquery is invoked in a Jupyter notebook. At least make the location of the file available of the file available in the class.

@keflavich
Copy link
Contributor

Sorry I'm a bit late to this, but I agree that removing users' ability to specify the output location does not seem like a good idea.

However, I had a hard time parsing the original intent: is the idea that datalink_output.zip is supposed to be a temporary file that is always removed after the files are extracted? It seems instead what is implemented is the opposite: the datalink_output.zip is kept, but all the files extracted from it are deleted. To me, this seems backward; what's the use case for preserving a generically-named zip file?

@hectorcanovas
Copy link
Contributor

Dear all,

Thanks for your valuable feedback. Let me give you an overview of the way this functionality is currently implemented and the reasons behind our proposal.
The load_data() method retrieves ancillary products from the Gaia Archive that are served via the DataLink protocol. These products are stored inside a (Python) dictionary, whose keys are named according to the values of the following arguments:

  • data_structure : options are: “INDIVIDUAL”, “RAW”, and “COMBINED” – this last one is going to be deprecated)
  • format: options are: VOTable and VOTable_plain (translates to “xml”), FITS, CSV, and ECSV
  • retrieval_type: options are: 'ALL' , 'EPOCH_PHOTOMETRY', 'RVS', 'XP_CONTINUOUS', 'XP_SAMPLED', 'MCMC_GSPPHOT' and 'MCMC_MSC'

Examples of key names are: “EPOCH_PHOTOMETRY-Gaia DR3 2263166706630078848.xml” or “EPOCH_PHOTOMETRY_RAW.xml”.

This file-naming procedure reproduces the behavior of the web interface of the Gaia ESA Archive (which does not allow to specify the names of the files to be downloaded). That said, it is possible to retrieve:

  • One product for one source, or
  • One product for many sources, or
  • All the products available for one source, or
  • All the products available for many sources.

From the web interface, the data is always downloaded as a single compressed (.zip) file that is named as: <user_name><job_id>. or <job_id>. (depending on the request type: registered or anonymous).
To the best of my knowledge, it is not possible to retrieve the <job_id> value to associated to the request launched by the load_data() method, and hence it is not possible to add this attribute to the name of the compressed file retrieved via Astroquery.Gaia. We therefore decided to assign a fixed value to the name of the (compressed) file that stores the data: “datalink_output.zip”. We considered the following reasons to support this idea:

  1. It mimics the behavior of the Gaia ESA Archive web GUI: the name of the retrieved file is set by the Archive.
  2. It avoids problems if users specify a name with or without adding the “.zip” suffix (it is possible to update the method to check this issue and correct the name as necessary, but that requires extra work), and
  3. It is very easy to modify the (fixed) file name in a Python script by e.g., appending a timestamp suffix.

We could consider the possibility of adding a separated method to download the ancillary files, but there are few reasons that make me reluctant to implement it (although I am not totally against this):

  1. The output of the load_data method is not a TAP job, but a Python dictionary whose elements (keys) contain ancillary products retrieved via DataLink.
  2. In DR4 the Gaia Archive will serve images in FITS format. Our tests show that these products can be written in a .fits file with the proposed implementation. However, retrieving them without using the “dump_to_file = True” option results in an incomplete download, as only the tabular HDUs from the fits file are retrieved, and the HDU that contains the image part is missed.

All in all, we considered that the proposed change would benefit the users by simplifying the way that the DataLink products can be stored using Astroquery.Gaia). But if you think that this change can be improved (by e.g. automatically adding a timestamp to the currently fixed file name), please let us know.

Kind regards,

==================================
Dr. Héctor Cánovas Cabrera (ORCID)

Telespazio for ESA - European Space Agency
SCO-09 – Support Archive Scientist for the Gaia mission
European Space Astronomy Centre (ESAC)
Camino Bajo del Castillo s/n
28692 Villanueva de la Cañada (Madrid), Spain

@andamian
Copy link

andamian commented Jul 2, 2024

Thank you for your great description @hectorcanovas. Yes, trying to emulate the Web Interface functionality is a great goal as it will make it easier for user to use both mechanisms to access their data. I'm not familiar with the Gaia Archive, but in general, the Web interface is used for exploration while the library is used for processing.
Even so, the Web browser can prompt for the location where the file to be dowloaded or, in case it goes into the Downloads directory, it adds a number suffix each time the file with the same name is downloaded so versions are not overridden - maybe this is different in Gaia.
The astroquery library is used for processing in which multiple downloads are likely to be required and users will need a way to distinguish between these files.
Again, I'm not familiar with Gaia and please ignore if these general comments do not apply to your case. It is a design decision that people most familiar with the service/user base must take.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants