-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add --detect_header and DatasetColumn support #67
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions around new dependencies, client/server requirements as well as the scope of the header detection utility.
I have not carried out any functional testing but as the metadata population is part of the IDR workflow, I suspect we should be able to test it in the context of upcoming submissions. This plugin is also heavily used in the context of our training/guides so this might be another context for testing and providing feedback.
@@ -533,7 +582,7 @@ def populate(self, args): | |||
cfg=args.cfg, cfgid=cfgid, attach=args.attach, | |||
options=localcfg, batch_size=args.batch, | |||
loops=loops, ms=ms, dry_run=args.dry_run, | |||
allow_nan=args.allow_nan) | |||
allow_nan=args.allow_nan, column_types=header_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally missed this, I was under the impression that column_types is already a present key-value argument, here in the populate
option, and I'm just now passing in a non-None value instead of the default None.
But currently, omero metadata populate Screen:51 -n --file example.csv --detect_header --context bulkmap
does fail with a similar exception. I will discuss with @emilroz to see if we could brainstorm a more scalable solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. The absence of integration tests for the other contexts does not help detecting this use case.
As mentioned in #59 (comment), using **kwargs
might be an option to handle undefined keywords arguments in a lenient manner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's okay with you I think cleaning up unused arguments in a follow-up PR would likely simplify the review process
@@ -825,7 +821,10 @@ def get_image_name_by_id(self, iid, did=None): | |||
|
|||
def resolve_dataset(self, column, row, value): | |||
try: | |||
return self.datasets_by_name[value].id.val | |||
if column.name.lower() == 'dataset': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume these improvements build on top of the DatasetColumn
implementation in ome/omero-py#309. Is the requirement for the new API only client-side or will a server need to be upgraded with a version of omero-py
containing these changes to use this new functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. During testing, I needed to upgrade my server omero-py with the new DatasetColumn implementation to use the new --detect_header
functionality. Otherwise, I'd get an exception about DatasetColum not implemented when using a dataset ID column with 'dataset'
as a header type .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you remember which call led to the server exception and the exception type thrown by the server? As a general rule, we don't want this plugin to not make particular assumptions. It should be possible to handle both scenarios and fall back to the former logic if DatasetColumn
is not implemented server-side. Catching the appropriate exception with a try/except
block is the immediate solution that comes to mind unless @joshmoore can think of an alternate way to introspect the server-side OMERO.py version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other versions are inspectable from the config service, but we'd need to teach the Java process how to inspect the Python process to get that info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. In terms of functional review, I think this means we should probably test this tool against two types of servers: with and without upgraded OMERO.py. The population should succeed in both cases but only the server with an upgraded OMERO.py should populate a DatasetColumn
.
@muhanadz can you add some minimal handling to distinguish between upgraded and non-upgraded servers?
With csv:
Created a table - viewed with ome/omero-web#355 The column types are:
👍 |
I noticed in the screenshot above that |
4fd6190
to
0f76448
Compare
…e/image_name columns and only dataset_id/image_id
… Ensured DatasetColumn is named 'Dataset'
src/omero_metadata/cli.py
Outdated
@@ -242,6 +244,9 @@ def _configure(self, parser): | |||
populate.add_argument("--allow_nan", action="store_true", help=( | |||
"Allow empty values to become Nan in Long or Double columns")) | |||
|
|||
populate.add_argument("--detect_header", action="store_true", help=( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Echoing my enthusiastic comment in https://github.com/ome/omero-metadata/pull/67/files#r816998882, assuming the functionality works as expected, is there any reason not to make this behavior the default and allow to disable the auto-detection via CLI instead?
In terms of the legacy behavior, we might also need to clarify the expectation for users when both the header auto-detection and the # headers
row are present. Based on my reading of the code, the header auto-detection should set column_types
and take precedence over the content of the CSV, is that correct?
Tested:
With CSV from README, with no header:
Got column types: With CSV from README, with no header, using
Got column types (with new cols Well Name & Plate Name) With CSV from the README, with no header, using
Got new column "Roi Name" and column types: Looking good. |
Discussing the I guess you could do this (not tested yet):
but a |
Or have some value e.g.
Agreed, that echoes my https://github.com/ome/omero-metadata/pull/67/files#r817003444 and I am still for making this the default assuming the detection is robust (and there is a CLI way to disable it). I still think we need to establish the order of precedence if header detection is selected and the |
See also IDR/idr0062-blin-nuclearsegmentation#12 for an ongoing IDR annotation investigation using the new logic proposed in this PR as a library |
@sbesson I don't see that IDR/idr0062-blin-nuclearsegmentation#12 is using this PR (yet)? |
Not formally, but the logic introduced currently as |
I think if a |
👍 I was never a fan of this custom format but I can appreciate that if someone had made the effort of storing the expected column types into a CSV, it would feel unexpected to ignore it by default (unless some option along the lines of |
I was going to try to say something to this effect as well. CSV "specifications" are a nightmare, making this feel a bit like the "comments in JSON" conversation. |
@sbesson @will-moore While working on the default behavior, an idea would be to change |
Sounds good! |
@will-moore: What I had in mind was that Of course, this would come with some automation, for example, if a user passes in a |
…method. User can now either pass '--manual_header' or a csv with '# header' header to bypass the auto-detect header method.
…lemented later and caused a bug
Looking to use
Then, in that script you can do:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested again as above without needing any extra arguments this time, and default behaviour created columns of the correct types:
$ omero metadata populate Project:801 --file no-header-imagename.csv
Looks good to me.
So - "good to merge" with the @staticmethod
tweak above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some functional testing the latest version of this PR using a simple P/D/I structure with one dataset and 2 images and a variation of CSV files with/without the header rows and with/without
(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat headers_datasetid.csv
# header dataset,s,s,s
Dataset ID,Image Name,string,float
7459,test.fake,foo,0.1
7459,test2.fake,bar,0.2
(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat headers_datasetname.csv
# header s,s,s,s
Dataset Name,Image Name,string,float
test,test.fake,foo,0.1
test,test2.fake,bar,0.2
(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat noheaders_datasetid.csv
dataset id,image name,string,float
7459,test.fake,foo,0.1
7459,test2.fake,bar,0.2
(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat noheaders_datasetname.csv
Dataset Name,Image Name,string,float
test,test.fake,foo,0.1
test,test2.fake,bar,0.2
The behavior of the various implementations (with/without --manual_header
) matches my expectation from the last commit.
I also briefly tested the scenario discussed in https://github.com/ome/omero-metadata/pull/67/files#r799518087. Trying to create a DatasetColumn
on a server without the minimal OMERO.py version results in an omero::InternalException
with AttributeError: 'DatasetColumn' object has no attribute 'descriptor'
. However, the same error can be thrown without this PR when using # header dataset,...
so it's better to capture this ass a separate issue to handle this scenario more gracefully.
I spent a bit of time reviewing the different strategies used for choosing the column types. As of the current state of this PR:
- a
# header
row can be stored as the first row of the CSV using the keys defined inomero-metadata/src/omero_metadata/populate.py
Line 120 in 2c1b269
COLUMN_TYPES = { HeaderResolver.get_column_types
contains most of the logic for mapping these keys into OMERO.tables columns - a
column_types
argument can be passed manually toParsingContext
using the same keys as above - see also Support --columns in CLI #6 for a RFE to expose this - prior to this PR, in the absence of a specified
column_types
and header row, theHeaderResolver
logic maps column types using column names first and defaulting to StringColumn -omero-metadata/src/omero_metadata/populate.py
Lines 280 to 293 in 99383af
try: keys = getattr(self, "%s_keys" % klass) log.debug("Adding keys %r" % keys) if keys[header_as_lower] is StringColumn: column = keys[header_as_lower]( name, description, self.DEFAULT_COLUMN_SIZE, list()) else: column = keys[header_as_lower]( name, description, list()) except KeyError: log.debug("Adding string column %r" % name) column = StringColumn( name, description, self.DEFAULT_COLUMN_SIZE, list()) - this PR introduces a new logic to detect column types based on column names first and then defaulting to the column type detecting by the
pandas
library. This is currently stored as acolumn_types
argument and passed to theParsingContext
as above
The last two strategies are very similar in the intent but there are a few key divergences:
- the
HeaderResolver
detection logic is target dependent i.e. the column names are specific to the target class and defined in the<class>_keys
variables of theHeaderResolver
instance. On the other side, the new detection logic is agnostic to the target class i.e. awell
column will always be mapped as aWellColumn
even if the population is run on a P/D/I hierarchy - there is not a perfect overlap between former column names and new column names. Also keys like "plate name" are mapped into
PlateColumn
while e.g.image_name
is mapped into aStringColumn
in the current logic
In summary, the new logic is functional and matches the description of the PR. The support for DatasetColumn support and pandas based type detection are very valuable additions.
My outstanding concern is about maintaining and supporting two separate code paths for the detection of column types using different sets of rules. Several open issues and conversations already point at the unclarity of the behavior of the current OMERO.tables population logic depending on the target type - see #66 (comment).
Before merging this my preferred option would be to collect and review the different behaviors and unify the name-based column types detection changes. From then we can decide whether we still want to keep the default StringColumn
mapping or use exclusively the new pandas
based detection.
@muhanadz are there some new name-based mappings that are essential to introduce for your workflows?
Happy to schedule a call to review the different options to move forward with this if it makes things easier /cc @pwalczysko
@sbesson Thank you for the in-depth review and analysis! Some of the issues you raised were definitely a concern of mine. I think a call to review the options would really help. The workflow that pushed for the old |
Re Seb |
Are you talking about the current behavior of the released
Same as above, it would be great to capture the different scenarios for the added column and their types so that we can identify the places where we ca unify the behavior across OMERO objects. |
Sorry, for IDR/idr0062-blin-nuclearsegmentation#12 adding a few lines myself was just simpler than trying to build an omero-metadata with this PR and then using it, and also keeping it in line with this PR afterwards, etc. |
I tried to capture my understanding of the different scenarios for the released version of the plugin in https://hackmd.io/EahBQovyR_OyY2PgxewlIg. Feel free to review/amend ahead of the discussion |
Summarizing the conversation with @will-moore and @muhanadz. There was an extensive review of the current understanding of the support (https://hackmd.io/EahBQovyR_OyY2PgxewlIg#fn7) as well as the different locations in the code (header resolution, value resolution) which control how the OMERO.tables is populated and the various edge cases/limitations. Summarizing the discussion:
I will merge this PR and let @muhanadz open follow-up PRs with unit tests as well as documentation updates capturing and describing the new plugin behavior. We can then review this behavior in a wider context e.g. including in various scenarios like IDR or the training sessions for the open-source team, have a general sign-off on the new strategy, deprecate/remove the former code, prioritize the outstanding issues and then move towards a release of the plugin ( |
Changes
This PR adds support to
DatasetColumn
, specifically support to a Dataset ID columnThis PR adds an option CLI
--detect_header
parameter toomero metatdata populate
that automatically detectscolumn_type
. This aims to reduce the workload and remove the need for users to predefine the header types through automation.Dataset ID column was necessary so that the
--detect_header
is easier and better implemented.Tests passed locally with no failures.
Edit: Changed the default behavior to use the detect header method unless the user supplies a CSV with a custom header with
# header
or the new--manual_header
flag.To do:
--detect_header
--detect_header