Discovered while testing the HEAD of omero-metadata, including the new header detection feature introduced in #67, against an IDR high-content screening dataset.
The annotation CSV contains a combination of biomolecular annotations (organism, compound name, identifiers) and analytical metadata (features). The feature columns are densely populated, but some of the biomolecular annotations are sparse, e.g. Compound Concentration (microMolar). This is expected, since several rows correspond to control wells where there is no compound and this metadata is irrelevant.
The current version of the header detection code causes a problem in this case: these columns are detected as Double/Float, and the table population subsequently fails unless --allow-nan is passed. With the current code, the workarounds for completing the table population are:
either to disable the automatic header detection with --manual_header
and/or to manually specify the type of each column using the #header row
Ideally, the plugin would "do the right thing" and handle these scenarios while retaining the automatic header detection to map each column to the most appropriate type. This raises the question of whether there should be a single behavior or whether this should be another option left to the user.
In the IDR use case above, the expectation is that we want to preserve the sparsity rather than populating NaN values. Some downstream processes like the tables -> key/value conversion currently have logic that relies on the emptiness of the values in the table and I expect NaN might cause issues with the current implementation.
There are likely other use cases where the user would like empty values to be stored as NaN. It should also be possible to update the transformation of tables into maps to handle NaN in the same way we handle empty strings.
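As a rough illustration of treating NaN like an empty string during the tables -> key/value conversion, here is a minimal sketch. The helpers `is_empty` and `row_to_key_values` are hypothetical names made up for this example; they are not the existing omero-metadata implementation, which may be structured quite differently.

```python
import math


def is_empty(value):
    """Return True for values that should not produce a key/value pair.

    Hypothetical helper: treats None, NaN and blank strings alike.
    """
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def row_to_key_values(columns, row):
    """Convert one table row into (key, value) pairs, skipping empty cells."""
    return [
        (name, str(value))
        for name, value in zip(columns, row)
        if not is_empty(value)
    ]


columns = ["Well", "Compound Concentration (microMolar)", "Organism"]
control_well = ["A1", float("nan"), ""]
treated_well = ["A2", 0.5, "Homo sapiens"]

control_pairs = row_to_key_values(columns, control_well)
treated_pairs = row_to_key_values(columns, treated_well)
```

With a check like this, a control well would yield the same map whether its missing concentration was stored as an empty string or as NaN.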
Code-wise, it should be possible to use the keep_default_na option of pandas.read_csv to map such columns as object/StringColumn rather than float:
(base) sbesson@Sebastiens-MacBook-Pro /tmp % cat test.csv
Column1,Column2,Column3
A,1,2
B,,3
C,2,5
(base) sbesson@Sebastiens-MacBook-Pro /tmp % venv/bin/python
Python 3.8.11 (default, Jul 29 2021, 14:57:32)
[Clang 12.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> df=pandas.read_csv('test.csv')
>>> df
  Column1  Column2  Column3
0       A      1.0        2
1       B      NaN        3
2       C      2.0        5
>>> df.dtypes
Column1 object
Column2 float64
Column3 int64
dtype: object
>>> df=pandas.read_csv('test.csv',keep_default_na=False)
>>> df
  Column1 Column2  Column3
0       A       1        2
1       B                3
2       C       2        5
>>> df.dtypes
Column1 object
Column2 object
Column3 int64
dtype: object
Possibly, this is something that could be coupled with the existing --allow-nan flag?
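One way this coupling might look is sketched below, purely as an assumption about a possible design. The function `read_annotations` and its `allow_nan` parameter are hypothetical and only mirror the idea of the --allow-nan flag; they are not part of omero-metadata. The sketch reads the CSV once to find float columns containing missing values, then re-reads only those columns as strings so the empty cells are preserved, while dense numeric columns keep their numeric type.

```python
import io

import pandas as pd


def read_annotations(csv_text, allow_nan=False):
    """Read CSV text, optionally preserving sparsity in numeric columns.

    Hypothetical sketch: with allow_nan=True, keep the pandas defaults
    (sparse numeric columns become float64 with NaN). Otherwise, re-read
    the sparse numeric columns as strings so empty cells stay empty.
    """
    df = pd.read_csv(io.StringIO(csv_text))
    if allow_nan:
        return df
    # Float columns holding at least one missing value are the sparse ones.
    sparse = [
        col for col in df.columns
        if df[col].dtype.kind == "f" and df[col].isna().any()
    ]
    if not sparse:
        return df
    return pd.read_csv(
        io.StringIO(csv_text),
        dtype={col: str for col in sparse},
        keep_default_na=False,
    )


csv_text = "Column1,Column2,Column3\nA,1,2\nB,,3\nC,2,5\n"
sparse_df = read_annotations(csv_text)
nan_df = read_annotations(csv_text, allow_nan=True)
```

Note that this still turns a genuinely numeric column into strings whenever it has gaps, so it trades type fidelity for preserved sparsity.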
Briefly mentioned as part of today's group meeting. @will-moore mentioned that if a column is truly numeric, it's certainly lossy to turn it back into StringColumn.
This possibly raises the question of how NaN values appear in the UI, e.g. in the omero_table endpoint and/or the Tables menu.
cc @muhanadz @pwalczysko @will-moore