Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Particle IDs of Gadget HDF5-style data are converted from integers to floats #4995

Open
arkordt opened this issue Sep 19, 2024 · 8 comments
Open

Comments

@arkordt
Copy link

arkordt commented Sep 19, 2024

Bug report

Bug summary

An arepo snapshot whose on-disk ParticleIDs data type is uint32 or uint64 is converted to float32 or float64, respectively.

The conversion can be traced back to two locations. The first conversion is in yt/utilities/io_handler.py:209, the second conversion occurs in unyt/array.py:712 that is called from yt/data_objects/selection_objects/data_selection_objects.py:217. If both conversions are commented out, the expected output is produced.

I also submitted an issue to unyt because I am not sure where it would be better to avoid the type conversion.

Code for reproduction

yt-test-data should be the path where the YT test data is stored.

>>> import yt
>>> ds = yt.load('yt-test-data/ArepoBullet/snapshot_150.hdf5')
>>> ds.all_data()['PartType0', 'ParticleIDs']

Actual outcome

unyt_array([1710352., 1710346., 1710350., ..., 5721056., 5843308., 5843259.],
      dtype=float32, units='(dimensionless)')

Expected outcome

unyt_array([1710352, 1710346, 1710350, ..., 5721056, 5843308, 5843259],
      dtype=uint32, units='(dimensionless)')

Version Information

  • Operating System: macOS
  • Python Version: 3.12.6
  • yt version: 4.3.1 or 4.4.dev0 (current main branch)
  • Other Libraries (if applicable): unyt version: 3.0.2

yt was installed via conda (version 4.3.1) or from source (main branch) for testing and bug tracing purposes.

Copy link

welcome bot commented Sep 19, 2024

Hi, and welcome to yt! Thanks for opening your first issue. We have an issue template that helps us to gather relevant information to help diagnosing and fixing the issue.

@chrishavlin
Copy link
Contributor

I commented over in the unyt issue, but I do think that this is more of a yt issue than a unyt issue. IMO it'd make sense to handle integer, dimensionless quantities (like particleIDs) with a special case that avoids the type and unit conversion that is applied to physical fields.

@neutrinoceros
Copy link
Member

The issue goes much deeper than this: many Cython routines in yt expect float64 as the dtype. Changing that would be a massive undertaking I think (with the possible exception of adding support for other float types, first and foremost float32). In my honest opinion this isn't a bug unless you go above the larger int that can be represented exactly as a float64 (which I think is 9_007_199_254_740_992). Is this actually creating an issue at runtime or is it only surprising ?

@chrishavlin
Copy link
Contributor

I don't think it'd be too bad to preserve types just for read/selection operations though. Was just playing around with a possible implementation and it seems to work (can push up a draft in a bit) -- any operations beyond read/selection operations have implicit or explicit type conversions on the call to the cython routines.

@arkordt
Copy link
Author

arkordt commented Sep 20, 2024

I had a look into the draft and at least from a user perspective, this seems like what I was expecting. Maybe I can add a bit more context how/why I encountered this issue and answering

... Is this actually creating an issue at runtime or is it only surprising ?

My simulation contains tracer particles and I was thinking about how to writing routines for/extending yt to work with them. The tracer particles are a separate particle type in the hdf5 file. The group has two datasets ParentID and TracerID. To get the number of tracers within a cell, I would need to count how many multiples of the same ID are in ParentID and then join these information to the gas cells (actually arepo writes NumTracers, an uint32 field, to the gas cells but initially, I somehow missed that). Retrieving the tracer coordinates, one would need to join the gas cell coordinates with the tracers with ParticleIDs==ParentID -- speaking with relational database operations. Currently, tracer particles of this storage type are ignored but this was a rather easy to change.

In this context, the matching operations on floats that represent integers would probably work because the number of digits in the largest particle id is less than the accuracy limit of floats. Additionally, there are no other mathematical operations on the ids, so rounding should not occur and an equality test of two float ids probably works, too. However, this is not guaranteed.

@neutrinoceros
Copy link
Member

there are no other mathematical operations on the ids, so rounding should not occur and an equality test of two float ids probably works, too. However, this is not guaranteed.

Indeed it's not.

any operations beyond read/selection operations have implicit or explicit type conversions on the call to the cython routines.

I missed that. Then I agree it makes sense to at least explore a simple patch such as #4996. Let us know when it's ready for review !

@chrishavlin
Copy link
Contributor

added a bunch of tests and I'm not seeing any issues over in #4996 , just marked it as ready to review

@matthewturk
Copy link
Member

I'd like to add a thanks to @arkordt for such a detailed dig in to this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants