Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --detect_header and DatasetColumn support #67

Merged
merged 12 commits into from
Apr 5, 2022

Conversation

muhanadz
Copy link
Member

@muhanadz muhanadz commented Feb 3, 2022

Changes

  • This PR adds support to DatasetColumn, specifically support to a Dataset ID column

  • This PR adds an option CLI --detect_header parameter to omero metatdata populate that automatically detects column_type. This aims to reduce the workload and remove the need for users to predefine the header types through automation.

Dataset ID column was necessary so that the --detect_header is easier and better implemented.

Tests passed locally with no failures.

Edit: Changed the default behavior to use the detect header method unless the user supplies a CSV with a custom header with # header or the new --manual_header flag.

To do:

  • Update README.md to reflect proper use of --detect_header
  • Fix other contexts when using --detect_header

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions around new dependencies, client/server requirements as well as the scope of the header detection utility.

I have not carried out any functional testing but as the metadata population is part of the IDR workflow, I suspect we should be able to test it in the context of upcoming submissions. This plugin is also heavily used in the context of our training/guides so this might be another context for testing and providing feedback.

setup.py Show resolved Hide resolved
src/omero_metadata/cli.py Show resolved Hide resolved
@@ -533,7 +582,7 @@ def populate(self, args):
cfg=args.cfg, cfgid=cfgid, attach=args.attach,
options=localcfg, batch_size=args.batch,
loops=loops, ms=ms, dry_run=args.dry_run,
allow_nan=args.allow_nan)
allow_nan=args.allow_nan, column_types=header_type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to #59, adding a new key-value argument will likely break other contexts e.g. BulkToMapAnnotationContext.
#60 is one way to fix this issue but as indicated in the PR, it would be good to find a more scalable strategy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally missed this, I was under the impression that column_types is already a present key-value argument, here in the populate option, and I'm just now passing in a non-None value instead of the default None.

But currently, omero metadata populate Screen:51 -n --file example.csv --detect_header --context bulkmap does fail with a similar exception. I will discuss with @emilroz to see if we could brainstorm a more scalable solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. The absence of integration tests for the other contexts does not help detecting this use case.
As mentioned in #59 (comment), using **kwargs might be an option to handle undefined keywords arguments in a lenient manner.

Copy link
Member Author

@muhanadz muhanadz Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9b158f5
@sbesson Now that **kwargs is used for DeleteMapAnnotationContext and BulkToMapAnnotationContext, there are some keyword arguments that are not used and could be cleaned up. Do you think it's appropriate to clean up here or later in a different PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's okay with you I think cleaning up unused arguments in a follow-up PR would likely simplify the review process

@@ -825,7 +821,10 @@ def get_image_name_by_id(self, iid, did=None):

def resolve_dataset(self, column, row, value):
try:
return self.datasets_by_name[value].id.val
if column.name.lower() == 'dataset':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume these improvements build on top of the DatasetColumn implementation in ome/omero-py#309. Is the requirement for the new API only client-side or will a server need to be upgraded with a version of omero-py containing these changes to use this new functionality?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. During testing, I needed to upgrade my server omero-py with the new DatasetColumn implementation to use the new --detect_header functionality. Otherwise, I'd get an exception about DatasetColum not implemented when using a dataset ID column with 'dataset' as a header type .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you remember which call led to the server exception and the exception type thrown by the server? As a general rule, we don't want this plugin to not make particular assumptions. It should be possible to handle both scenarios and fall back to the former logic if DatasetColumn is not implemented server-side. Catching the appropriate exception with a try/except block is the immediate solution that comes to mind unless @joshmoore can think of an alternate way to introspect the server-side OMERO.py version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other versions are inspectable from the config service, but we'd need to teach the Java process how to inspect the Python process to get that info.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. In terms of functional review, I think this means we should probably test this tool against two types of servers: with and without upgraded OMERO.py. The population should succeed in both cases but only the server with an upgraded OMERO.py should populate a DatasetColumn.
@muhanadz can you add some minimal handling to distinguish between upgraded and non-upgraded servers?

@will-moore
Copy link
Member

With csv:

image,dataset,ROI_Area,Channel_Index,Channel_Name
3659,1451,0.0469,1,DAPI
3686,1451,0.142,2,GFP-AuroraB
3746,1451,0.093,3,TRITC-edit
$ omero metadata populate --file /Users/wmoore/Desktop/METADATA/test-project-dataset-noheader.csv Project:1001 --detect_header
Created session for will@localhost:4064. Idle timeout: 10 min. Current group: Lab Data
INFO:omero_metadata.populate:Adding dataset:1451 image:3746
INFO:omero_metadata.populate:Adding dataset:1451 image:3686
INFO:omero_metadata.populate:Adding dataset:1451 image:3659
INFO:omero_metadata.populate:Created new table OriginalFile:75983

Created a table - viewed with ome/omero-web#355

Screenshot 2022-02-15 at 11 06 45

The column types are:

"ImageColumn",
"DatasetColumn",
"DoubleColumn",
"LongColumn",
"StringColumn",
"StringColumn"

👍

@will-moore
Copy link
Member

I noticed in the screenshot above that dataset column name isn't capitalised like image -> Image etc, and other column types as discussed at #56
Would be good to ensure 'Dataset' is the name for any 'DatasetColumn' to be consistent.

@muhanadz muhanadz force-pushed the detect_header_datasetColumn branch from 4fd6190 to 0f76448 Compare February 23, 2022 09:31
@@ -242,6 +244,9 @@ def _configure(self, parser):
populate.add_argument("--allow_nan", action="store_true", help=(
"Allow empty values to become Nan in Long or Double columns"))

populate.add_argument("--detect_header", action="store_true", help=(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Echoing my enthusiastic comment in https://github.com/ome/omero-metadata/pull/67/files#r816998882, assuming the functionality works as expected, is there any reason not to make this behavior the default and allow to disable the auto-detection via CLI instead?

In terms of the legacy behavior, we might also need to clarify the expectation for users when both the header auto-detection and the # headers row are present. Based on my reading of the code, the header auto-detection should set column_types and take precedence over the content of the CSV, is that correct?

@will-moore
Copy link
Member

Tested:

$ omero metadata populate Project:1851 --file no-header-imagename.csv --detect_header

With CSV from README, with no header:

Image Name,Dataset Name,ROI_Area,Channel_Index,Channel_Name
img-01.png,dataset01,0.0469,1,DAPI
img-02.png,dataset01,0.142,2,GFP

Got column types: "StringColumn", "StringColumn", "DoubleColumn", "LongColumn", "StringColumn", "ImageColumn"

With CSV from README, with no header, using --detect_header:

Well,Plate,Drug,Concentration,Cell_Count,Percent_Mitotic
A1,Index.idx.xml,DMSO,10.1,10,25.4
A2,Index.idx.xml,DMSO,0.1,1000,2.54

Got column types (with new cols Well Name & Plate Name)
"WellColumn","PlateColumn","StringColumn","DoubleColumn","LongColumn","DoubleColumn","StringColumn","StringColumn"

With CSV from the README, with no header, using --detect_header on target Image:

Roi,object,probability,area
230422,1,0.8,250
230423,1,0.9,500

Got new column "Roi Name" and column types:
"RoiColumn","LongColumn","DoubleColumn","LongColumn","StringColumn"

Looking good.
In all these cases, it seems that the default behaviour should be --detect_header = True so that users don't need to remember it every time.

@will-moore
Copy link
Member

Discussing the detect_headers feature with @dominikl
It would be handy to have this option available directly in the ParsingContext() class, which is our entry-point from various python scripts (e.g. https://github.com/IDR/idr0101-payne-insitugenomeseq/blob/560e389dc248976693ae3e647858fdb247c41a72/scripts/csv_to_points.py#L151)

I guess you could do this (not tested yet):

from omero_metadata.cli import detect_headers

ParsingContext(....column_types=detect_headers(file))

but a detect_headers option for ParsingContext() would be nicer?
Best of all would be if detect_headers=True by default.

@sbesson
Copy link
Member

sbesson commented Mar 24, 2022

but a detect_headers option for ParsingContext() would be nicer?

Or have some value e.g. ParsingContext(..., column_types='autodetect') or similar that would internally trigger the header detection logic?

Best of all would be if detect_headers=True by default.

Agreed, that echoes my https://github.com/ome/omero-metadata/pull/67/files#r817003444 and I am still for making this the default assuming the detection is robust (and there is a CLI way to disable it). I still think we need to establish the order of precedence if header detection is selected and the # header row are provided.

@sbesson
Copy link
Member

sbesson commented Mar 25, 2022

See also IDR/idr0062-blin-nuclearsegmentation#12 for an ongoing IDR annotation investigation using the new logic proposed in this PR as a library

@will-moore
Copy link
Member

@sbesson I don't see that IDR/idr0062-blin-nuclearsegmentation#12 is using this PR (yet)?

@sbesson
Copy link
Member

sbesson commented Mar 25, 2022

Not formally, but the logic introduced currently as detect_headers is very similar to https://github.com/IDR/idr0062-blin-nuclearsegmentation/pull/12/files#diff-4e4de1329ccfc37299b279819006a581e50b7e8a449f2da8c3c009d17063839cR45-R56 so I assume that if this API was migrated to omero_metadata.library or even handled internally as discussed above, IDR/idr0062-blin-nuclearsegmentation#12 could be simplified to consume the new API?

@will-moore
Copy link
Member

I think if a #header row is provided on the CSV, then it should take precedence over the auto-detection, since that gives you more control, and reduces the chance of a breaking change.

@sbesson
Copy link
Member

sbesson commented Mar 25, 2022

I think if a #header row is provided on the CSV, then it should take precedence over the auto-detection, since that gives you more control, and reduces the chance of a breaking change.

👍 I was never a fan of this custom format but I can appreciate that if someone had made the effort of storing the expected column types into a CSV, it would feel unexpected to ignore it by default (unless some option along the lines of --detect-headers force was passed)

@joshmoore
Copy link
Member

I was never a fan of this custom format

I was going to try to say something to this effect as well. CSV "specifications" are a nightmare, making this feel a bit like the "comments in JSON" conversation.

@muhanadz
Copy link
Member Author

muhanadz commented Mar 28, 2022

@sbesson @will-moore While working on the default behavior, an idea would be to change--detect_header to --manual_header which would disable the new automatic header detection. If # header is present in the CSV (and automatically detected), we can warn the user and auto-trigger --manual_header so that populate uses the# header row. Thoughts?

@will-moore
Copy link
Member

Sounds good!
So, the only time a user would need the --manual_header would be if the csv didn't have #header and they wanted e.g. number columns to be treated as strings, or if there was some bug in the auto-detect?
👍

@muhanadz
Copy link
Member Author

muhanadz commented Mar 28, 2022

Sounds good! So, the only time a user would need the --manual_header would be if the csv didn't have #header and they wanted e.g. number columns to be treated as strings, or if there was some bug in the auto-detect? 👍

@will-moore: What I had in mind was that --manual_header would disable the automatic header detection (currently enabled via --detect_header which will be enabled by default). The user would use --manual_header when they want to explicitly define the header types via # header and pass in a CSV file with # header as the first row. Or as you mentioned, the auto-detect has some bug or is misbehaving.

Of course, this would come with some automation, for example, if a user passes in a # header CSV without using the --manual_header flag, the script would automatically switch to the previous behavior and use # header.

…method. User can now either pass '--manual_header' or a csv with '# header' header to bypass the auto-detect header method.
@will-moore
Copy link
Member

Looking to use detect_headers() from IDR/idr0062-blin-nuclearsegmentation#12
I didn't want to create an instance of MetadataControl class, so it would be helpful to make it a @staticmethod:

-    def detect_headers(self, csv_path):
+    @staticmethod
+    def detect_headers(csv_path):

Then, in that script you can do:

from omero_metadata.cli import MetadataControl

column_types = MetadataControl.detect_headers(file_path)
ctx = ParsingContext(..... column_types)

Copy link
Member

@will-moore will-moore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested again as above without needing any extra arguments this time, and default behaviour created columns of the correct types:

$ omero metadata populate Project:801 --file no-header-imagename.csv

Looks good to me.
So - "good to merge" with the @staticmethod tweak above

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some functional testing the latest version of this PR using a simple P/D/I structure with one dataset and 2 images and a variation of CSV files with/without the header rows and with/without

(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat headers_datasetid.csv 
# header dataset,s,s,s
Dataset ID,Image Name,string,float
7459,test.fake,foo,0.1
7459,test2.fake,bar,0.2
(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat headers_datasetname.csv 
# header s,s,s,s
Dataset Name,Image Name,string,float
test,test.fake,foo,0.1
test,test2.fake,bar,0.2
(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat noheaders_datasetid.csv 
dataset id,image name,string,float
7459,test.fake,foo,0.1
7459,test2.fake,bar,0.2
(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat noheaders_datasetname.csv 
Dataset Name,Image Name,string,float
test,test.fake,foo,0.1
test,test2.fake,bar,0.2

The behavior of the various implementations (with/without --manual_header) matches my expectation from the last commit.

I also briefly tested the scenario discussed in https://github.com/ome/omero-metadata/pull/67/files#r799518087. Trying to create a DatasetColumn on a server without the minimal OMERO.py version results in an omero::InternalException with AttributeError: 'DatasetColumn' object has no attribute 'descriptor'. However, the same error can be thrown without this PR when using # header dataset,... so it's better to capture this ass a separate issue to handle this scenario more gracefully.

I spent a bit of time reviewing the different strategies used for choosing the column types. As of the current state of this PR:

  • a # header row can be stored as the first row of the CSV using the keys defined in . HeaderResolver.get_column_types contains most of the logic for mapping these keys into OMERO.tables columns
  • a column_types argument can be passed manually to ParsingContext using the same keys as above - see also Support --columns in CLI #6 for a RFE to expose this
  • prior to this PR, in the absence of a specified column_types and header row, the HeaderResolver logic maps column types using column names first and defaulting to StringColumn -
    try:
    keys = getattr(self, "%s_keys" % klass)
    log.debug("Adding keys %r" % keys)
    if keys[header_as_lower] is StringColumn:
    column = keys[header_as_lower](
    name, description,
    self.DEFAULT_COLUMN_SIZE, list())
    else:
    column = keys[header_as_lower](
    name, description, list())
    except KeyError:
    log.debug("Adding string column %r" % name)
    column = StringColumn(
    name, description, self.DEFAULT_COLUMN_SIZE, list())
    contains the relevant logic
  • this PR introduces a new logic to detect column types based on column names first and then defaulting to the column type detecting by the pandas library. This is currently stored as a column_types argument and passed to the ParsingContext as above

The last two strategies are very similar in the intent but there are a few key divergences:

  • the HeaderResolver detection logic is target dependent i.e. the column names are specific to the target class and defined in the <class>_keys variables of the HeaderResolver instance. On the other side, the new detection logic is agnostic to the target class i.e. a well column will always be mapped as a WellColumn even if the population is run on a P/D/I hierarchy
  • there is not a perfect overlap between former column names and new column names. Also keys like "plate name" are mapped into PlateColumn while e.g. image_name is mapped into a StringColumn in the current logic

In summary, the new logic is functional and matches the description of the PR. The support for DatasetColumn support and pandas based type detection are very valuable additions.

My outstanding concern is about maintaining and supporting two separate code paths for the detection of column types using different sets of rules. Several open issues and conversations already point at the unclarity of the behavior of the current OMERO.tables population logic depending on the target type - see #66 (comment).

Before merging this my preferred option would be to collect and review the different behaviors and unify the name-based column types detection changes. From then we can decide whether we still want to keep the default StringColumn mapping or use exclusively the new pandas based detection.

@muhanadz are there some new name-based mappings that are essential to introduce for your workflows?

Happy to schedule a call to review the different options to move forward with this if it makes things easier /cc @pwalczysko

@muhanadz
Copy link
Member Author

muhanadz commented Apr 4, 2022

@sbesson Thank you for the in-depth review and analysis! Some of the issues you raised were definitely a concern of mine. I think a call to review the options would really help.

The workflow that pushed for the old --detect_header implementation was one that couldn't provide # header in the csv and the default StringColumn wasn't enough. In other words, a user would like to have set headers (eg dataset id) which would automatically be detected as the relevant column type (eg a dataset column in this case) without the # header row.

@will-moore
Copy link
Member

Re Seb Also keys like "plate name" are mapped into PlateColumn while e.g. image_name is mapped into a StringColumn in the current logic...
Confusingly, you don't create a csv with plate_name but rather with Plate column which is changed from string to PlateColumn and a new Plate Name column gets added.
This is in contrast to the more logical behaviour of Dataset Name column which stays as a StringColumn although a Dataset column doesn't get added to the table.

@sbesson
Copy link
Member

sbesson commented Apr 4, 2022

Confusingly, you don't create a csv with plate_name but rather with Plate column which is changed from string to PlateColumn and a new Plate Name column gets added.

Are you talking about the current behavior of the released omero-metadata plugin or the new behavior of this PR?

This is in contrast to the more logical behaviour of Dataset Name column which stays as a StringColumn although a Dataset column doesn't get added to the table.

Same as above, it would be great to capture the different scenarios for the added column and their types so that we can identify the places where we ca unify the behavior across OMERO objects.

@dominikl
Copy link
Member

dominikl commented Apr 4, 2022

Sorry, for IDR/idr0062-blin-nuclearsegmentation#12 adding a few lines myself was just simpler than trying to build an omero-metadata with this PR and then using it, and also keeping it in line with this PR afterwards, etc.

@sbesson
Copy link
Member

sbesson commented Apr 4, 2022

I tried to capture my understanding of the different scenarios for the released version of the plugin in https://hackmd.io/EahBQovyR_OyY2PgxewlIg. Feel free to review/amend ahead of the discussion

@sbesson
Copy link
Member

sbesson commented Apr 5, 2022

Summarizing the conversation with @will-moore and @muhanadz. There was an extensive review of the current understanding of the support (https://hackmd.io/EahBQovyR_OyY2PgxewlIg#fn7) as well as the different locations in the code (header resolution, value resolution) which control how the OMERO.tables is populated and the various edge cases/limitations.

Summarizing the discussion:

  • overall, there was a global agreement on keeping the detection target agnostic as proposed in this PR i.e. a well column should always map into a WellColumn moving forward. Also we want to support the suffixed <object> id and <object> name columns (with all versions of space, no space, underscore) for all objects
  • assuming the new auto-detection rules break some use cases, the manual specification via a # header row should allow to manual input the expected behavior
  • the new behavior will bring a few usability issues that will need to be collected and triaged

I will merge this PR and let @muhanadz open follow-up PRs with unit tests as well as documentation updates capturing and describing the new plugin behavior. We can then review this behavior in a wider context e.g. including in various scenarios like IDR or the training sessions for the open-source team, have a general sign-off on the new strategy, deprecate/remove the former code, prioritize the outstanding issues and then move towards a release of the plugin (0.11.0 or maybe simply 1.0.0??)

@sbesson sbesson merged commit 07a33f3 into ome:master Apr 5, 2022
@muhanadz muhanadz deleted the detect_header_datasetColumn branch April 7, 2022 10:54
This was referenced Apr 7, 2022
@sbesson sbesson added this to the 0.11.0 milestone Jun 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants