Gaia: change the signature of the method load_data #3014

cosmoJFH · 2024-05-29T16:01:39Z

The present implementation of the method GaiaClass.load_data uses the parameter output_file, which names a user-defined zip file where the data is saved.

We would like to replace this parameter with the boolean parameter dump_to_file, which saves the results under the hard-coded file name "datalink_output.zip" in the current working directory.

This PR also contains 2 bug fixes in the present implementation:

If the value of the parameter output_file is a simple name, i.e., without a path, the code collects all the files with extensions '.fits', '.xml', '.csv', or '.ecsv' from the current working directory. This must be changed, since any file with one of those extensions is then included in the zip file, regardless of whether it comes from the current execution of load_data or from previous executions.
If the parameter output_file is used, the zip file is used to save the files coming from the datalink service. The file is then unzipped and the files it contains are read, so that a list of Tables is built and returned to the user. However, the unzipped files are never removed. We think that this is a bug and that the unzipped files should be removed.
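
Both fixes can be illustrated with a minimal sketch, which is not the code in this PR: the archive is extracted into a dedicated temporary directory, so only files that came out of the zip are read, and that directory is deleted afterwards. The helper name `read_datalink_zip` and the choice to return raw bytes instead of Tables are assumptions made to keep the example self-contained.

```python
import tempfile
import zipfile
from pathlib import Path

# Extensions the PR description lists for DataLink products.
PRODUCT_SUFFIXES = {".fits", ".xml", ".csv", ".ecsv"}


def read_datalink_zip(zip_path):
    """Return {file name: raw bytes} for the products inside ``zip_path``.

    Hypothetical helper: the real method parses the files into Tables.
    """
    products = {}
    # The temporary directory is removed when the ``with`` block exits,
    # so no unzipped files are left behind.
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp_dir)
        # Only files extracted from this archive are visited; unrelated
        # files in the current working directory are never touched.
        for path in sorted(Path(tmp_dir).rglob("*")):
            if path.is_file() and path.suffix.lower() in PRODUCT_SUFFIXES:
                products[path.name] = path.read_bytes()
    return products
```

Because the extraction directory is private to the call, a stray '.csv' left in the working directory by an earlier run cannot leak into the result.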

New unit tests were developed.

cc @esdc-esac-esa-int

jira: GAIAMNGT-1700

pep8speaks · 2024-05-29T16:01:47Z

Hello @cosmoJFH! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-10-16 22:42:55 UTC

bsipocz

I would like to understand the motivation behind removing the ability from the users to define the output name. At the minimum, it should go through a deprecation cycle, but I would rather keep the functionality.

Besides that it looks fine, apart from a minor bug: the overwrite parameter is present but not used.

bsipocz · 2024-06-11T03:01:48Z

astroquery/gaia/core.py

@@ -325,21 +323,31 @@ def load_data(self, ids, *, data_release=None, data_structure='INDIVIDUAL', retr

        return files

+    def build_general_output_filename(self):


We try to slowly simplify the API wherever we can, so I would suggest making this private from the beginning.
Or, just as well, do not have a separate method but add the 4 lines at the place of usage (as far as I can see, it is only used in one place).

cosmoJFH (2024-06-21): We have removed the function build_general_output_filename.

bsipocz · 2024-06-18T19:43:37Z

astroquery/gaia/core.py

@@ -170,7 +169,7 @@ def logout(self, *, verbose=False):

    def load_data(self, ids, *, data_release=None, data_structure='INDIVIDUAL', retrieval_type="ALL",
                  linking_parameter='SOURCE_ID', valid_data=False, band=None, avoid_datatype_check=False,
-                  format="votable", output_file=None, overwrite_output_file=False, verbose=False):


Removing a parameter from the public APIs should be done through a deprecation period.

I also wonder why it is necessary to remove users' ability to define a filename/location that better suits them.

So, my suggestion would be to change the default None to datalink_output.zip, and add the new boolean parameter. That way the default will be what you proposed here, but the users still have the ability to select a better location.

bsipocz (2024-10-16): This is still a problem; we should not break people's code without notifying them first.

bsipocz · 2024-06-18T19:50:12Z

astroquery/gaia/core.py

+            if not os.path.isdir(path):
+                try:
+                    os.mkdir(path)
+                except FileExistsError:


There is the overwrite_output_file parameter; shouldn't it be used here?

bsipocz · 2024-06-18T20:01:53Z

I would also raise the idea of maybe having a separate method for the download instead of the dump_to_file parameter? That is what we have for most of the other modules, though I am not against the idea of having parameters instead. (But then we can also bike-shed on whether the dump_to_file is the best phrase to use.)

Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions.

andamian · 2024-06-27T22:43:28Z

The proposed change is a bit counterintuitive: it's hard to tell what drives it (maybe subsequent processing of the resulting file?). If the user has no control over the name of the file, I would expect it to be at least timestamped. Also, "current working directory" is a bit vague. What is the current working directory when a script calls the astroquery library, when a test is run with pytest, or when astroquery is invoked in a Jupyter notebook? At least make the location of the file available in the class.

keflavich · 2024-06-27T23:18:41Z

Sorry I'm a bit late to this, but I agree that removing users' ability to specify the output location does not seem like a good idea.

However, I had a hard time parsing the original intent: is the idea that datalink_output.zip is supposed to be a temporary file that is always removed after the files are extracted? It seems instead what is implemented is the opposite: the datalink_output.zip is kept, but all the files extracted from it are deleted. To me, this seems backward; what's the use case for preserving a generically-named zip file?

hectorcanovas · 2024-07-01T20:01:46Z

Dear all,

Thanks for your valuable feedback. Let me give you an overview of the way this functionality is currently implemented and the reasons behind our proposal.
The load_data() method retrieves ancillary products from the Gaia Archive that are served via the DataLink protocol. These products are stored inside a (Python) dictionary, whose keys are named according to the values of the following arguments:

data_structure: options are “INDIVIDUAL”, “RAW”, and “COMBINED” (this last one is going to be deprecated)
format: options are VOTable and VOTable_plain (translates to “xml”), FITS, CSV, and ECSV
retrieval_type: options are 'ALL', 'EPOCH_PHOTOMETRY', 'RVS', 'XP_CONTINUOUS', 'XP_SAMPLED', 'MCMC_GSPPHOT' and 'MCMC_MSC'

Examples of key names are: “EPOCH_PHOTOMETRY-Gaia DR3 2263166706630078848.xml” or “EPOCH_PHOTOMETRY_RAW.xml”.

This file-naming procedure reproduces the behavior of the web interface of the Gaia ESA Archive (which does not allow users to specify the names of the files to be downloaded). That said, it is possible to retrieve:

One product for one source, or
One product for many sources, or
All the products available for one source, or
All the products available for many sources.

From the web interface, the data is always downloaded as a single compressed (.zip) file that is named as: <user_name><job_id>. or <job_id>. (depending on the request type: registered or anonymous).
To the best of my knowledge, it is not possible to retrieve the <job_id> value associated with the request launched by the load_data() method, and hence it is not possible to add this attribute to the name of the compressed file retrieved via Astroquery.Gaia. We therefore decided to assign a fixed value to the name of the (compressed) file that stores the data: “datalink_output.zip”. We considered the following reasons to support this idea:

It mimics the behavior of the Gaia ESA Archive web GUI: the name of the retrieved file is set by the Archive.
It avoids problems if users specify a name with or without adding the “.zip” suffix (it is possible to update the method to check this issue and correct the name as necessary, but that requires extra work), and
It is very easy to modify the (fixed) file name in a Python script by e.g., appending a timestamp suffix.
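
The last two points can be sketched in a few lines. This is only an illustration of the kind of name handling being discussed, not code from the PR; the helper names `ensure_zip_suffix` and `with_suffix_tag` are hypothetical.

```python
from pathlib import Path


def ensure_zip_suffix(name):
    """Return ``name`` as a Path that ends in '.zip' (hypothetical helper)."""
    path = Path(name)
    if path.suffix.lower() == ".zip":
        return path
    # Append rather than replace, so 'run.v2' becomes 'run.v2.zip'.
    return path.with_name(path.name + ".zip")


def with_suffix_tag(path, tag):
    """Insert ``tag`` before the extension: 'a.zip' -> 'a_<tag>.zip'."""
    path = Path(path)
    return path.with_name(f"{path.stem}_{tag}{path.suffix}")
```

A script could, for example, call `with_suffix_tag("datalink_output.zip", "20240701T200146")` and rename the downloaded file to the result.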

We could consider the possibility of adding a separate method to download the ancillary files, but there are a few reasons that make me reluctant to implement it (although I am not totally against this):

The output of the load_data method is not a TAP job, but a Python dictionary whose elements (keys) contain ancillary products retrieved via DataLink.
In DR4 the Gaia Archive will serve images in FITS format. Our tests show that these products can be written in a .fits file with the proposed implementation. However, retrieving them without using the “dump_to_file = True” option results in an incomplete download, as only the tabular HDUs from the fits file are retrieved, and the HDU that contains the image part is missed.

All in all, we considered that the proposed change would benefit the users by simplifying the way that the DataLink products can be stored using Astroquery.Gaia. But if you think that this change can be improved (e.g., by automatically adding a timestamp to the currently fixed file name), please let us know.

Kind regards,

==================================
Dr. Héctor Cánovas Cabrera (ORCID)

Telespazio for ESA - European Space Agency
SCO-09 – Support Archive Scientist for the Gaia mission
European Space Astronomy Centre (ESAC)
Camino Bajo del Castillo s/n
28692 Villanueva de la Cañada (Madrid), Spain

andamian · 2024-07-02T19:33:43Z

Thank you for your great description @hectorcanovas. Yes, trying to emulate the Web Interface functionality is a great goal, as it will make it easier for users to use both mechanisms to access their data. I'm not familiar with the Gaia Archive, but in general, the Web interface is used for exploration while the library is used for processing.
Even so, the Web browser can prompt for the location where the file is to be downloaded or, in case it goes into the Downloads directory, it adds a number suffix each time a file with the same name is downloaded, so versions are not overwritten. Maybe this is different in Gaia.
The astroquery library is used for processing in which multiple downloads are likely to be required and users will need a way to distinguish between these files.
Again, I'm not familiar with Gaia and please ignore if these general comments do not apply to your case. It is a design decision that people most familiar with the service/user base must take.
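
The browser-style numbering mentioned above could look like the following minimal sketch. It is an illustration only; `next_free_name` is a hypothetical helper, not part of astroquery.

```python
from pathlib import Path


def next_free_name(path):
    """Return ``path`` if unused, else 'stem (1).ext', 'stem (2).ext', ...

    Hypothetical helper; not part of astroquery.
    """
    path = Path(path)
    candidate = path
    counter = 1
    # Keep trying numbered variants until one does not exist yet.
    while candidate.exists():
        candidate = path.with_name(f"{path.stem} ({counter}){path.suffix}")
        counter += 1
    return candidate
```

Note that checking for existence and then writing is not atomic, so two processes downloading at the same moment could still pick the same name.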

hectorcanovas · 2024-07-22T13:50:22Z

Dear all, following your feedback I have proposed to our team to add a timestamp suffix to the name of the file generated by the load_data method. With this change, the output filename would be:
"datalink_output_<time_stamp>.zip"
where <time_stamp> corresponds to the UTC time associated with the request received at the Gaia Archive DataLink server. The <time_stamp> format should follow the ISO 8601 standard: "yyyymmddThhmmss". Based on our internal records, a 1-second granularity should be enough to avoid duplicated filenames (but this format can be updated as needed). If you agree with this proposal, we will do our best to prepare a new, dedicated Pull Request right after the summer break.


cosmoJFH · 2024-09-30T11:04:42Z

Changes committed to the branch: the output file name now follows the pattern "datalink_output_<time_stamp>.zip", where the time stamp is obtained as

    now = datetime.now(timezone.utc)
    output_file = 'datalink_output_' + now.strftime("%Y%m%dT%H%M%S") + '.zip'
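
As a self-contained version of the two lines above, the following sketch wraps them in a function that accepts the time as an optional argument, which makes the name reproducible in tests without mocking the clock. The function name `build_output_name` and its `now` parameter are assumptions for illustration, not names from the PR.

```python
from datetime import datetime, timezone


def build_output_name(now=None):
    """Return 'datalink_output_<time_stamp>.zip' for the given UTC time."""
    if now is None:
        now = datetime.now(timezone.utc)
    return "datalink_output_" + now.strftime("%Y%m%dT%H%M%S") + ".zip"


# A fixed, timezone-aware datetime gives a predictable file name.
fixed = datetime(2024, 9, 30, 11, 4, 42, tzinfo=timezone.utc)
print(build_output_name(fixed))  # datalink_output_20240930T110442.zip
```

With 1-second granularity, two requests in the same second would still map to the same name.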

bsipocz · 2024-09-30T20:59:41Z

Test failures are real, while the RTD docs build failure is not. Could you please rebase? That would take care of both the conflict and RTD.

codecov · 2024-10-01T04:19:38Z

Codecov Report

Attention: Patch coverage is 70.96774% with 9 lines in your changes missing coverage. Please review.

Project coverage is 67.49%. Comparing base (6959406) to head (58f8692).
Report is 158 commits behind head on main.

Files with missing lines	Patch %	Lines
astroquery/gaia/core.py	70.96%	9 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3014      +/-   ##
==========================================
+ Coverage   67.39%   67.49%   +0.10%     
==========================================
  Files         233      233              
  Lines       18405    18413       +8     
==========================================
+ Hits        12404    12428      +24     
+ Misses       6001     5985      -16


cosmoJFH · 2024-10-01T04:19:52Z

Test failures are real, while the RTD docs build failure is not. Could you please rebase? That would take care of both the conflict and RTD.

The tests and conflicts were fixed.

cosmoJFH · 2024-10-07T08:54:44Z

Hi @bsipocz, what pending tasks do we need to address on our end?

bsipocz · 2024-10-07T19:00:13Z

Hi @bsipocz, what pending tasks do we need to address on our end?

Nothing; it is waiting on my to-do list for a final review. I'll try to get to it this week.

bsipocz

The only remaining point is to add the deprecation for the removed kwargs, so that, while specifying them will have no effect, this PR won't break existing user code but will just raise a warning first.

I'll add this deprecation and then go ahead with the merge.

Thanks!
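
The deprecation behaviour described here (accept the old keyword, ignore it, and warn) can be sketched with the standard library alone. This is not the commit that was merged: `load_data_sketch` is a stand-in function, and its return value is invented for the example. astropy also provides a `deprecated_renamed_argument` decorator in `astropy.utils` for this purpose, which a real implementation may prefer.

```python
import warnings


def load_data_sketch(ids, *, dump_to_file=False, **kwargs):
    """Stand-in for the real method; only the deprecation logic is shown."""
    if "output_file" in kwargs:
        # The old keyword is accepted but ignored, so existing calls
        # keep working and users see a warning instead of a TypeError.
        kwargs.pop("output_file")
        warnings.warn(
            "'output_file' is deprecated and has no effect; "
            "use 'dump_to_file' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")
    return {"ids": list(ids), "dump_to_file": dump_to_file}
```

Calling `load_data_sketch([1, 2], output_file="x.zip")` emits one `DeprecationWarning` and otherwise behaves as if the keyword had not been given.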

bsipocz · 2024-10-16T22:26:46Z

astroquery/gaia/core.py

            if not overwrite_output_file and os.path.exists(output_file):
                raise ValueError(f"{output_file} file already exists. Please use overwrite_output_file='True' to "
                                 f"overwrite output file.")

        path = os.path.dirname(output_file)

+        log.debug(f"Directory where the data will be saved: {path}")
+
+        if path != '':


This may be cleaner, as '' is falsy here:

Suggested change

if path != '':

if path:
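
A short, self-contained illustration of why the truthiness check is enough: `os.path.dirname` returns an empty string for a bare file name, and an empty string is falsy. The helper name `ensure_parent_dir` is hypothetical, and the use of `os.makedirs(..., exist_ok=True)` is a choice made for this sketch rather than the code under review.

```python
import os


def ensure_parent_dir(output_file):
    """Create the parent directory of ``output_file`` if it names one."""
    path = os.path.dirname(output_file)
    # os.path.dirname returns '' for a bare file name, and '' is falsy,
    # so ``if path:`` skips directory creation in that case.
    if path:
        # exist_ok=True avoids a FileExistsError if the directory exists.
        os.makedirs(path, exist_ok=True)
    return path


print(repr(ensure_parent_dir("datalink_output.zip")))  # ''
```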

bsipocz · 2024-10-16T22:43:49Z

CHANGES.rst

@@ -415,6 +420,8 @@ gaia
  epoch photometry service to return all data associated to a given source.
  [#2376]

+- New retrieval types for datalink (Gaia DR4 release). [#3110]


This is at the wrong location, but I'm cleaning up the changelog for release time anyway, so it doesn't matter here.

bsipocz · 2024-10-16T22:56:05Z

Note to the other maintainers: I should have cleaned up the commit history here with a rebase, sometimes mistakes happen 🤷‍♀️

hectorcanovas · 2024-10-17T21:27:06Z

Dear @bsipocz,

Thank you very much for your help and feedback in implementing this PR. This update will support our ongoing preparations for Gaia DR4.

Regards,

Dr. Héctor Cánovas Cabrera (ORCID)

Telespazio for ESA - European Space Agency
SCO-09 – Support Archive Scientist for the Gaia mission

European Space Astronomy Centre (ESAC)
Camino Bajo del Castillo s/n
28692 Villanueva de la Cañada (Madrid), Spain

cosmoJFH · 2024-10-18T07:27:24Z

I would also raise the idea of maybe having a separate method for the download instead of the dump_to_file parameter? That is what we have for most of the other modules, though I am not against the idea of having parameters instead. (But then we can also bike-shed on whether the dump_to_file is the best phrase to use.)

Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions.

@bsipocz, could you give us an example from another module of a dedicated download method?

keflavich · 2024-10-18T14:42:21Z

https://github.com/astropy/astroquery/blob/main/astroquery/alma/core.py#L896 is a download example from alma

Jorge Fernandez Hernandez added 15 commits May 28, 2024 14:56

GAIAMNGT-1700 change signature of the method Gaia.load_data

bd8f332

GAIAMNGT-1700 change signature of the method Gaia.load_data

a304b13

GAIAMNGT-1700 change signature of the method Gaia.load_data

54bd20e

GAIAMNGT-1700 change signature of the method Gaia.load_data

040b1dc

GAIAMNGT-1700 New tests

80275c2

GAIAMNGT-1700 New tests

53233b8

GAIAMNGT-1700 New tests

095483a

GAIAMNGT-1700 New tests

181e858

GAIAMNGT-1700 New tests

71ad0ae

GAIAMNGT-1700 New function build_general_output_filename

eddf96a

GAIAMNGT-1700 New function test

72617e6

GAIAMNGT-1700 Remove test

94b8099

GAIAMNGT-1700 Remove wrong line

536f39c

GAIAMNGT-1700 New test to check the exception

45b0461

GAIAMNGT-1700 New test to check the exception

0f8b0cf

Jorge Fernandez Hernandez added 3 commits May 29, 2024 18:03

GAIAMNGT-1700 Update PR number

6aa4f6b

GAIAMNGT-1700 Fix code style issues

6845bc8

GAIAMNGT-1700 Fix code style issues

12efad1

bsipocz added gaia utils.tap labels Jun 11, 2024

bsipocz added this to the v0.4.8 milestone Jun 11, 2024

bsipocz requested changes Jun 18, 2024

GAIAMNGT-1700 The function build_general_output_filename is removed.

ea2c414

Jorge Fernandez Hernandez added 2 commits September 30, 2024 12:57

GAIAMNGT-1700 The name of the output file contains the timestamp with the format "%Y%m%dT%H%M%S"

154d5e7

GAIAMNGT-1700 Change format

510fe02

GAIAMNGT-1700 Update test

db87123

Jorge Fernandez Hernandez and others added 3 commits October 1, 2024 06:12

GAIAMNGT-1700 Fix test to make use of a mocked datetime

443f422

GAIAMNGT-1700 Resolve conflicts

220b885

Merge branch 'main' into ESA_gaia_GAIAMNGT-1700_load_data

75b7124

Jorge Fernandez Hernandez added 4 commits October 1, 2024 06:37

GAIAMNGT-1700 Make use of Path instead of os.path

1b8aaed

GAIAMNGT-1700 Updte information for the parameter dump_to_file

f96d621

GAIAMNGT-1700 Fix unexpected indentation

775e8e0

GAIAMNGT-1700 Fix unexpected indentation

e24b67c

bsipocz requested changes Oct 16, 2024

MAINT: adding deprecation for the removed argument

58f8692

bsipocz reviewed Oct 16, 2024

bsipocz approved these changes Oct 16, 2024

bsipocz merged commit aa35035 into astropy:main Oct 16, 2024
11 checks passed
