Inconsistency in crossmatch column data types #273

Closed
3 tasks done
camposandro opened this issue Apr 11, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@camposandro
Collaborator

Bug report

If we decide to keep the non-matches, NaN values can appear in our crossmatch dataframe. For every point in the left partitions we will have a row with the left point's information and the information of its match on the right (which, being nonexistent, will be set to NaN).

When a row with NaN values is assigned to a dataframe, pandas seems to automatically cast the whole column to "float", since NumPy integer dtypes cannot represent NaN. Columns such as Norder_{}_xmatch, Dir_{}_xmatch and Npix_{}_xmatch therefore end up with an incorrect type.

Screenshot 2024-04-11 at 10 53 55 AM
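
The upcast can be reproduced with plain pandas, independent of the crossmatch code. This is only a minimal sketch: the frames and the column names (`id`, `Norder_right`) are made up for illustration, and a left merge stands in for a crossmatch that keeps non-matches.

```python
import pandas as pd

# Hypothetical stand-ins for a left and a right partition.
left = pd.DataFrame({"id": [1, 2, 3]})
right = pd.DataFrame({"id": [1, 2], "Norder_right": [5, 6]})  # integer column

# A left merge keeps the unmatched left row and fills the right side with NaN.
merged = left.merge(right, on="id", how="left")

print(right["Norder_right"].dtype)   # int64
print(merged["Norder_right"].dtype)  # float64
```

The row with `id == 3` has no match, so `Norder_right` gets a NaN and the whole column is stored as float.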

We should create an end-to-end test to verify that the column data types of the original catalogs remain unchanged.
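
One possible shape for such a check, sketched with plain pandas. The helper name `dtype_mismatches`, the `_right` suffix and the merge used as a stand-in for the crossmatch are all assumptions for illustration, not the project's actual API.

```python
import pandas as pd


def dtype_mismatches(original: pd.DataFrame, result: pd.DataFrame, suffix: str = "") -> dict:
    """Return {result column: (expected dtype, actual dtype)} for changed dtypes."""
    mismatches = {}
    for col, expected in original.dtypes.items():
        out_col = f"{col}{suffix}"
        # Only compare columns that made it into the result.
        if out_col in result.columns and result[out_col].dtype != expected:
            mismatches[out_col] = (expected, result[out_col].dtype)
    return mismatches


# Hypothetical catalogs; a left merge stands in for the real crossmatch.
left = pd.DataFrame({"id": [1, 2, 3]})
right = pd.DataFrame({"id": [1, 2], "Npix": [10, 11]})
result = left.merge(
    right.add_suffix("_right"), left_on="id", right_on="id_right", how="left"
)

mismatches = dtype_mismatches(right, result, suffix="_right")
print(mismatches)  # reports id_right and Npix_right as int64 -> float64
```

In an actual end-to-end test the last lines would become something like `assert not dtype_mismatches(...)`, run against both the left and the right catalog.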

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
@delucchi-cmu
Contributor

Has this been addressed (or a little bit improved) by the pyarrow dtype changes?

@camposandro
Collaborator Author

@delucchi-cmu yes, supporting None values by default using pyarrow should fix the column types. We're holding off on the merge of #271 this week but I might try to build some end-to-end tests in the meantime to make sure the output columns of the crossmatch indeed remain the same!

@delucchi-cmu
Contributor

This has been addressed by the recent switch to pyarrow types and by holding on to the pyarrow schema throughout operations.

@camposandro
Collaborator Author

We should make sure the Dask DataFrame meta and the pyarrow schema are consistent whenever we address #390.

Projects
Status: Done