Header detection: default behavior and handling sparse columns #76

Closed
sbesson opened this issue May 31, 2022 · 2 comments · Fixed by #77
Labels
bug Something isn't working

Comments

@sbesson
Member

sbesson commented May 31, 2022

Discovered while testing the HEAD of omero-metadata, including the new header detection feature introduced in #67, against an IDR high-content screening dataset.

The annotation CSV contains a combination of biomolecular annotations (Organism, compound name, identifiers) and analytical metadata (features). The feature columns are densely populated, but some of the biomolecular annotations are sparse, e.g. Compound Concentration (microMolar). This is expected since several rows correspond to control wells, where there is no compound and this metadata is irrelevant.

The current version of the header detection code leads to an issue in this case: these columns are detected as Double/Float and the table population subsequently fails unless --allow-nan is passed. With the current code, the workarounds for completing the table population are:

  • either to disable the automatic header detection with --manual_header
  • and/or to manually specify the behavior of the columns using the #header row

Ideally, the plugin would "do the right thing" and handle these scenarios while retaining the automatic header detection to map each column to the most appropriate type. This raises the question of whether there should be a single behavior or whether this should be another option left to the user.

In the IDR use case above, the expectation is that we want to preserve the sparsity rather than populating NaN values. Some downstream processes like the tables -> key/value conversion currently have logic that relies on the emptiness of the values in the table and I expect NaN might cause issues with the current implementation.

There are likely other use cases where the user would like empty values to be stored as NaN. It should also be possible to update the transformation of tables into maps to handle NaN in the same way we handle empty strings.
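As a rough illustration of treating NaN and empty strings alike during a table -> key/value conversion, here is a minimal sketch. The names is_missing and row_to_key_values are hypothetical and are not taken from the omero-metadata code base; the actual conversion logic may be structured quite differently.

```python
import math


def is_missing(value):
    """Return True for None, empty or whitespace-only strings, and float NaN.

    Hypothetical helper, not part of omero-metadata.
    """
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, float):
        return math.isnan(value)
    return False


def row_to_key_values(column_names, row):
    """Build (key, value) pairs for one table row, skipping missing cells.

    Hypothetical helper: empty strings and NaN are both dropped, so the
    sparsity of the original annotation is preserved in the resulting map.
    """
    return [
        (name, str(value))
        for name, value in zip(column_names, row)
        if not is_missing(value)
    ]
```

With such a check, a control well row would produce no key/value pair for its concentration, whether the table stored that cell as an empty string or as NaN.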

Code-wise, it should be possible to use the keep_default_na option of pandas.read_csv to map such columns to object/StringColumn rather than float:

(base) sbesson@Sebastiens-MacBook-Pro /tmp % cat test.csv 
Column1,Column2,Column3
A,1,2
B,,3
C,2,5%                                                                          
(base) sbesson@Sebastiens-MacBook-Pro /tmp % venv/bin/python
Python 3.8.11 (default, Jul 29 2021, 14:57:32) 
[Clang 12.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> df=pandas.read_csv('test.csv')
>>> df
  Column1  Column2  Column3
0       A      1.0        2
1       B      NaN        3
2       C      2.0        5
>>> df.dtypes
Column1     object
Column2    float64
Column3      int64
dtype: object
>>> df=pandas.read_csv('test.csv',keep_default_na=False)
>>> df
  Column1 Column2  Column3
0       A       1        2
1       B                3
2       C       2        5
>>> df.dtypes
Column1    object
Column2    object
Column3     int64
dtype: object

Possibly, this is something that could be coupled with the existing --allow-nan flag?
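One way the coupling could look, as a sketch only: a single boolean decides whether empty cells become NaN or stay as empty strings. The read_annotations function and its allow_nan parameter are hypothetical stand-ins for however the plugin would wire up the --allow-nan flag; only pandas.read_csv and its keep_default_na option are real API.

```python
import io

import pandas as pd

# Same content as the test.csv file used in the session above
CSV_TEXT = "Column1,Column2,Column3\nA,1,2\nB,,3\nC,2,5\n"


def read_annotations(source, allow_nan=False):
    """Read an annotation CSV (hypothetical helper, not part of omero-metadata).

    allow_nan=True keeps the pandas default: empty cells become NaN and a
    sparse numeric column is read as float64.
    allow_nan=False keeps empty cells as empty strings, so a sparse numeric
    column is read as strings and the sparsity is preserved.
    """
    return pd.read_csv(source, keep_default_na=allow_nan)


sparse = read_annotations(io.StringIO(CSV_TEXT), allow_nan=False)
with_nan = read_annotations(io.StringIO(CSV_TEXT), allow_nan=True)
```

Note that keep_default_na=False applies to the whole file, so every column with an empty cell falls back to strings, not only the ones the user cares about.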

cc @muhanadz @pwalczysko @will-moore

@sbesson
Member Author

sbesson commented May 31, 2022

Briefly mentioned as part of today's group meeting. @will-moore mentioned that if a column is truly numeric, it is certainly lossy to turn it back into a StringColumn.
This possibly raises the question of how NaN values appear in the UI, e.g. in the omero_table endpoint and/or the Tables menu.

@sbesson
Member Author

sbesson commented May 31, 2022

Also for background reading, see https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions regarding the pandas decision to use NaN as the representation of missing data.
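A small sketch of the type promotion described on that page, reusing the sample CSV from above: an integer column with a single gap is promoted to float64 by default. The second read uses the pandas nullable Int64 extension dtype as one possible alternative; whether that dtype would fit the plugin's column mapping is an open assumption, not something established in this issue.

```python
import io

import pandas as pd

# Same content as the test.csv file used earlier in this issue
CSV_TEXT = "Column1,Column2,Column3\nA,1,2\nB,,3\nC,2,5\n"

# Default behavior: the empty cell becomes NaN, which forces the
# otherwise integer Column2 to be promoted to float64.
default = pd.read_csv(io.StringIO(CSV_TEXT))
promoted_dtype = str(default["Column2"].dtype)

# Requesting the nullable integer dtype keeps integer values and
# represents the gap as pandas.NA instead of NaN.
nullable = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"Column2": "Int64"})
nullable_dtype = str(nullable["Column2"].dtype)
missing_count = int(nullable["Column2"].isna().sum())
```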
