Add --detect_header and DatasetColumn support #67

muhanadz · 2022-02-03T14:11:50Z

Changes

This PR adds support to DatasetColumn, specifically support to a Dataset ID column
This PR adds an option CLI --detect_header parameter to omero metatdata populate that automatically detects column_type. This aims to reduce the workload and remove the need for users to predefine the header types through automation.

Dataset ID column was necessary so that the --detect_header is easier and better implemented.

Tests passed locally with no failures.

Edit: Changed the default behavior to use the detect header method unless the user supplies a CSV with a custom header with # header or the new --manual_header flag.

To do:

Update README.md to reflect proper use of --detect_header
Fix other contexts when using --detect_header

sbesson

A few questions around new dependencies, client/server requirements as well as the scope of the header detection utility.

I have not carried out any functional testing but as the metadata population is part of the IDR workflow, I suspect we should be able to test it in the context of upcoming submissions. This plugin is also heavily used in the context of our training/guides so this might be another context for testing and providing feedback.

setup.py

src/omero_metadata/cli.py

sbesson · 2022-02-04T14:33:31Z

src/omero_metadata/cli.py

@@ -533,7 +582,7 @@ def populate(self, args):
                            cfg=args.cfg, cfgid=cfgid, attach=args.attach,
                            options=localcfg, batch_size=args.batch,
                            loops=loops, ms=ms, dry_run=args.dry_run,
-                            allow_nan=args.allow_nan)
+                            allow_nan=args.allow_nan, column_types=header_type)


Similarly to #59, adding a new key-value argument will likely break other contexts e.g. BulkToMapAnnotationContext.
#60 is one way to fix this issue but as indicated in the PR, it would be good to find a more scalable strategy

I totally missed this, I was under the impression that column_types is already a present key-value argument, here in the populate option, and I'm just now passing in a non-None value instead of the default None.

But currently, omero metadata populate Screen:51 -n --file example.csv --detect_header --context bulkmap does fail with a similar exception. I will discuss with @emilroz to see if we could brainstorm a more scalable solution.

No worries. The absence of integration tests for the other contexts does not help detecting this use case.
As mentioned in #59 (comment), using **kwargs might be an option to handle undefined keywords arguments in a lenient manner.

Addressed in 9b158f5
@sbesson Now that **kwargs is used for DeleteMapAnnotationContext and BulkToMapAnnotationContext, there are some keyword arguments that are not used and could be cleaned up. Do you think it's appropriate to clean up here or later in a different PR?

If that's okay with you I think cleaning up unused arguments in a follow-up PR would likely simplify the review process

sbesson · 2022-02-04T14:34:57Z

src/omero_metadata/populate.py

@@ -825,7 +821,10 @@ def get_image_name_by_id(self, iid, did=None):

    def resolve_dataset(self, column, row, value):
        try:
-            return self.datasets_by_name[value].id.val
+            if column.name.lower() == 'dataset':


I assume these improvements build on top of the DatasetColumn implementation in ome/omero-py#309. Is the requirement for the new API only client-side or will a server need to be upgraded with a version of omero-py containing these changes to use this new functionality?

Correct. During testing, I needed to upgrade my server omero-py with the new DatasetColumn implementation to use the new --detect_header functionality. Otherwise, I'd get an exception about DatasetColum not implemented when using a dataset ID column with 'dataset' as a header type .

Do you remember which call led to the server exception and the exception type thrown by the server? As a general rule, we don't want this plugin to not make particular assumptions. It should be possible to handle both scenarios and fall back to the former logic if DatasetColumn is not implemented server-side. Catching the appropriate exception with a try/except block is the immediate solution that comes to mind unless @joshmoore can think of an alternate way to introspect the server-side OMERO.py version.

Other versions are inspectable from the config service, but we'd need to teach the Java process how to inspect the Python process to get that info.

Makes sense. In terms of functional review, I think this means we should probably test this tool against two types of servers: with and without upgraded OMERO.py. The population should succeed in both cases but only the server with an upgraded OMERO.py should populate a DatasetColumn.
@muhanadz can you add some minimal handling to distinguish between upgraded and non-upgraded servers?

will-moore · 2022-02-15T11:09:51Z

With csv:

image,dataset,ROI_Area,Channel_Index,Channel_Name
3659,1451,0.0469,1,DAPI
3686,1451,0.142,2,GFP-AuroraB
3746,1451,0.093,3,TRITC-edit

$ omero metadata populate --file /Users/wmoore/Desktop/METADATA/test-project-dataset-noheader.csv Project:1001 --detect_header
Created session for will@localhost:4064. Idle timeout: 10 min. Current group: Lab Data
INFO:omero_metadata.populate:Adding dataset:1451 image:3746
INFO:omero_metadata.populate:Adding dataset:1451 image:3686
INFO:omero_metadata.populate:Adding dataset:1451 image:3659
INFO:omero_metadata.populate:Created new table OriginalFile:75983

Created a table - viewed with ome/omero-web#355

The column types are:

"ImageColumn",
"DatasetColumn",
"DoubleColumn",
"LongColumn",
"StringColumn",
"StringColumn"

👍

will-moore · 2022-02-15T11:17:23Z

I noticed in the screenshot above that dataset column name isn't capitalised like image -> Image etc, and other column types as discussed at #56
Would be good to ensure 'Dataset' is the name for any 'DatasetColumn' to be consistent.

src/omero_metadata/populate.py

…e/image_name columns and only dataset_id/image_id

… Ensured DatasetColumn is named 'Dataset'

sbesson · 2022-03-01T17:45:07Z

src/omero_metadata/cli.py

@@ -242,6 +244,9 @@ def _configure(self, parser):
        populate.add_argument("--allow_nan", action="store_true", help=(
            "Allow empty values to become Nan in Long or Double columns"))

+        populate.add_argument("--detect_header", action="store_true", help=(


Echoing my enthusiastic comment in https://github.com/ome/omero-metadata/pull/67/files#r816998882, assuming the functionality works as expected, is there any reason not to make this behavior the default and allow to disable the auto-detection via CLI instead?

In terms of the legacy behavior, we might also need to clarify the expectation for users when both the header auto-detection and the # headers row are present. Based on my reading of the code, the header auto-detection should set column_types and take precedence over the content of the CSV, is that correct?

will-moore · 2022-03-01T22:55:33Z

Tested:

$ omero metadata populate Project:1851 --file no-header-imagename.csv --detect_header

With CSV from README, with no header:

Image Name,Dataset Name,ROI_Area,Channel_Index,Channel_Name
img-01.png,dataset01,0.0469,1,DAPI
img-02.png,dataset01,0.142,2,GFP

Got column types: "StringColumn", "StringColumn", "DoubleColumn", "LongColumn", "StringColumn", "ImageColumn"

With CSV from README, with no header, using --detect_header:

Well,Plate,Drug,Concentration,Cell_Count,Percent_Mitotic
A1,Index.idx.xml,DMSO,10.1,10,25.4
A2,Index.idx.xml,DMSO,0.1,1000,2.54

Got column types (with new cols Well Name & Plate Name)
"WellColumn","PlateColumn","StringColumn","DoubleColumn","LongColumn","DoubleColumn","StringColumn","StringColumn"

With CSV from the README, with no header, using --detect_header on target Image:

Roi,object,probability,area
230422,1,0.8,250
230423,1,0.9,500

Got new column "Roi Name" and column types:
"RoiColumn","LongColumn","DoubleColumn","LongColumn","StringColumn"

Looking good.
In all these cases, it seems that the default behaviour should be --detect_header = True so that users don't need to remember it every time.

will-moore · 2022-03-24T14:44:41Z

Discussing the detect_headers feature with @dominikl
It would be handy to have this option available directly in the ParsingContext() class, which is our entry-point from various python scripts (e.g. https://github.com/IDR/idr0101-payne-insitugenomeseq/blob/560e389dc248976693ae3e647858fdb247c41a72/scripts/csv_to_points.py#L151)

I guess you could do this (not tested yet):

from omero_metadata.cli import detect_headers

ParsingContext(....column_types=detect_headers(file))

but a detect_headers option for ParsingContext() would be nicer?
Best of all would be if detect_headers=True by default.

sbesson · 2022-03-24T15:16:58Z

but a detect_headers option for ParsingContext() would be nicer?

Or have some value e.g. ParsingContext(..., column_types='autodetect') or similar that would internally trigger the header detection logic?

Best of all would be if detect_headers=True by default.

Agreed, that echoes my https://github.com/ome/omero-metadata/pull/67/files#r817003444 and I am still for making this the default assuming the detection is robust (and there is a CLI way to disable it). I still think we need to establish the order of precedence if header detection is selected and the # header row are provided.

sbesson · 2022-03-25T12:41:28Z

See also IDR/idr0062-blin-nuclearsegmentation#12 for an ongoing IDR annotation investigation using the new logic proposed in this PR as a library

will-moore · 2022-03-25T13:00:04Z

@sbesson I don't see that IDR/idr0062-blin-nuclearsegmentation#12 is using this PR (yet)?

sbesson · 2022-03-25T13:25:52Z

Not formally, but the logic introduced currently as detect_headers is very similar to https://github.com/IDR/idr0062-blin-nuclearsegmentation/pull/12/files#diff-4e4de1329ccfc37299b279819006a581e50b7e8a449f2da8c3c009d17063839cR45-R56 so I assume that if this API was migrated to omero_metadata.library or even handled internally as discussed above, IDR/idr0062-blin-nuclearsegmentation#12 could be simplified to consume the new API?

will-moore · 2022-03-25T14:21:51Z

I think if a #header row is provided on the CSV, then it should take precedence over the auto-detection, since that gives you more control, and reduces the chance of a breaking change.

sbesson · 2022-03-25T15:52:09Z

I think if a #header row is provided on the CSV, then it should take precedence over the auto-detection, since that gives you more control, and reduces the chance of a breaking change.

👍 I was never a fan of this custom format but I can appreciate that if someone had made the effort of storing the expected column types into a CSV, it would feel unexpected to ignore it by default (unless some option along the lines of --detect-headers force was passed)

joshmoore · 2022-03-25T16:32:02Z

I was never a fan of this custom format

I was going to try to say something to this effect as well. CSV "specifications" are a nightmare, making this feel a bit like the "comments in JSON" conversation.

muhanadz · 2022-03-28T09:34:07Z

@sbesson @will-moore While working on the default behavior, an idea would be to change--detect_header to --manual_header which would disable the new automatic header detection. If # header is present in the CSV (and automatically detected), we can warn the user and auto-trigger --manual_header so that populate uses the# header row. Thoughts?

will-moore · 2022-03-28T09:41:54Z

Sounds good!
So, the only time a user would need the --manual_header would be if the csv didn't have #header and they wanted e.g. number columns to be treated as strings, or if there was some bug in the auto-detect?
👍

muhanadz · 2022-03-28T10:55:42Z

Sounds good! So, the only time a user would need the --manual_header would be if the csv didn't have #header and they wanted e.g. number columns to be treated as strings, or if there was some bug in the auto-detect? 👍

@will-moore: What I had in mind was that --manual_header would disable the automatic header detection (currently enabled via --detect_header which will be enabled by default). The user would use --manual_header when they want to explicitly define the header types via # header and pass in a CSV file with # header as the first row. Or as you mentioned, the auto-detect has some bug or is misbehaving.

Of course, this would come with some automation, for example, if a user passes in a # header CSV without using the --manual_header flag, the script would automatically switch to the previous behavior and use # header.

…method. User can now either pass '--manual_header' or a csv with '# header' header to bypass the auto-detect header method.

…lemented later and caused a bug

…eader' detection

will-moore · 2022-03-29T15:33:45Z

Looking to use detect_headers() from IDR/idr0062-blin-nuclearsegmentation#12
I didn't want to create an instance of MetadataControl class, so it would be helpful to make it a @staticmethod:

-    def detect_headers(self, csv_path):
+    @staticmethod
+    def detect_headers(csv_path):

Then, in that script you can do:

from omero_metadata.cli import MetadataControl

column_types = MetadataControl.detect_headers(file_path)
ctx = ParsingContext(..... column_types)

will-moore

Tested again as above without needing any extra arguments this time, and default behaviour created columns of the correct types:

$ omero metadata populate Project:801 --file no-header-imagename.csv

Looks good to me.
So - "good to merge" with the @staticmethod tweak above

sbesson

I did some functional testing the latest version of this PR using a simple P/D/I structure with one dataset and 2 images and a variation of CSV files with/without the header rows and with/without

(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat headers_datasetid.csv 
# header dataset,s,s,s
Dataset ID,Image Name,string,float
7459,test.fake,foo,0.1
7459,test2.fake,bar,0.2
(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat headers_datasetname.csv 
# header s,s,s,s
Dataset Name,Image Name,string,float
test,test.fake,foo,0.1
test,test2.fake,bar,0.2
(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat noheaders_datasetid.csv 
dataset id,image name,string,float
7459,test.fake,foo,0.1
7459,test2.fake,bar,0.2
(venv3) [sbesson@pilot-idr0124-omeroreadwrite ~]$ cat noheaders_datasetname.csv 
Dataset Name,Image Name,string,float
test,test.fake,foo,0.1
test,test2.fake,bar,0.2

The behavior of the various implementations (with/without --manual_header) matches my expectation from the last commit.

I also briefly tested the scenario discussed in https://github.com/ome/omero-metadata/pull/67/files#r799518087. Trying to create a DatasetColumn on a server without the minimal OMERO.py version results in an omero::InternalException with AttributeError: 'DatasetColumn' object has no attribute 'descriptor'. However, the same error can be thrown without this PR when using # header dataset,... so it's better to capture this ass a separate issue to handle this scenario more gracefully.

I spent a bit of time reviewing the different strategies used for choosing the column types. As of the current state of this PR:

a # header row can be stored as the first row of the CSV using the keys defined in

omero-metadata/src/omero_metadata/populate.py

Line 120 in 2c1b269

COLUMN_TYPES = {

. HeaderResolver.get_column_types contains most of the logic for mapping these keys into OMERO.tables columns
a column_types argument can be passed manually to ParsingContext using the same keys as above - see also Support --columns in CLI #6 for a RFE to expose this

prior to this PR, in the absence of a specified column_types and header row, the HeaderResolver logic maps column types using column names first and defaulting to StringColumn -

omero-metadata/src/omero_metadata/populate.py

Lines 280 to 293 in 99383af

    
           try: 
        
               keys = getattr(self, "%s_keys" % klass) 
        
               log.debug("Adding keys %r" % keys) 
        
               if keys[header_as_lower] is StringColumn: 
        
                   column = keys[header_as_lower]( 
        
                       name, description, 
        
                       self.DEFAULT_COLUMN_SIZE, list()) 
        
               else: 
        
                   column = keys[header_as_lower]( 
        
                       name, description, list()) 
        
           except KeyError: 
        
               log.debug("Adding string column %r" % name) 
        
               column = StringColumn( 
        
                   name, description, self.DEFAULT_COLUMN_SIZE, list())

contains the relevant logic

this PR introduces a new logic to detect column types based on column names first and then defaulting to the column type detecting by the pandas library. This is currently stored as a column_types argument and passed to the ParsingContext as above

The last two strategies are very similar in the intent but there are a few key divergences:

the HeaderResolver detection logic is target dependent i.e. the column names are specific to the target class and defined in the <class>_keys variables of the HeaderResolver instance. On the other side, the new detection logic is agnostic to the target class i.e. a well column will always be mapped as a WellColumn even if the population is run on a P/D/I hierarchy
there is not a perfect overlap between former column names and new column names. Also keys like "plate name" are mapped into PlateColumn while e.g. image_name is mapped into a StringColumn in the current logic

In summary, the new logic is functional and matches the description of the PR. The support for DatasetColumn support and pandas based type detection are very valuable additions.

My outstanding concern is about maintaining and supporting two separate code paths for the detection of column types using different sets of rules. Several open issues and conversations already point at the unclarity of the behavior of the current OMERO.tables population logic depending on the target type - see #66 (comment).

Before merging this my preferred option would be to collect and review the different behaviors and unify the name-based column types detection changes. From then we can decide whether we still want to keep the default StringColumn mapping or use exclusively the new pandas based detection.

@muhanadz are there some new name-based mappings that are essential to introduce for your workflows?

Happy to schedule a call to review the different options to move forward with this if it makes things easier /cc @pwalczysko

muhanadz · 2022-04-04T11:49:03Z

@sbesson Thank you for the in-depth review and analysis! Some of the issues you raised were definitely a concern of mine. I think a call to review the options would really help.

The workflow that pushed for the old --detect_header implementation was one that couldn't provide # header in the csv and the default StringColumn wasn't enough. In other words, a user would like to have set headers (eg dataset id) which would automatically be detected as the relevant column type (eg a dataset column in this case) without the # header row.

will-moore · 2022-04-04T13:16:00Z

Re Seb Also keys like "plate name" are mapped into PlateColumn while e.g. image_name is mapped into a StringColumn in the current logic...
Confusingly, you don't create a csv with plate_name but rather with Plate column which is changed from string to PlateColumn and a new Plate Name column gets added.
This is in contrast to the more logical behaviour of Dataset Name column which stays as a StringColumn although a Dataset column doesn't get added to the table.

sbesson · 2022-04-04T14:02:56Z

Confusingly, you don't create a csv with plate_name but rather with Plate column which is changed from string to PlateColumn and a new Plate Name column gets added.

Are you talking about the current behavior of the released omero-metadata plugin or the new behavior of this PR?

This is in contrast to the more logical behaviour of Dataset Name column which stays as a StringColumn although a Dataset column doesn't get added to the table.

Same as above, it would be great to capture the different scenarios for the added column and their types so that we can identify the places where we ca unify the behavior across OMERO objects.

dominikl · 2022-04-04T14:27:46Z

Sorry, for IDR/idr0062-blin-nuclearsegmentation#12 adding a few lines myself was just simpler than trying to build an omero-metadata with this PR and then using it, and also keeping it in line with this PR afterwards, etc.

sbesson · 2022-04-04T14:49:26Z

I tried to capture my understanding of the different scenarios for the released version of the plugin in https://hackmd.io/EahBQovyR_OyY2PgxewlIg. Feel free to review/amend ahead of the discussion

sbesson · 2022-04-05T11:02:01Z

Summarizing the conversation with @will-moore and @muhanadz. There was an extensive review of the current understanding of the support (https://hackmd.io/EahBQovyR_OyY2PgxewlIg#fn7) as well as the different locations in the code (header resolution, value resolution) which control how the OMERO.tables is populated and the various edge cases/limitations.

Summarizing the discussion:

overall, there was a global agreement on keeping the detection target agnostic as proposed in this PR i.e. a well column should always map into a WellColumn moving forward. Also we want to support the suffixed <object> id and <object> name columns (with all versions of space, no space, underscore) for all objects
assuming the new auto-detection rules break some use cases, the manual specification via a # header row should allow to manual input the expected behavior
the new behavior will bring a few usability issues that will need to be collected and triaged
- what should be the behavior when resolving an object outside the hierarchy e.g. well and Project. At the moment, the code will error but the feeling is that it could be turned to be lenient. This is also the behavior that was suggested in populate command fails if the target is a plate and the CSV contains a plate column #51
- once we have sorted the column mapping, we will likely need to fix the implementation object can be equivalently specified via name and/or ID in all cases - see Support Dataset IDs and Image IDs when target is Project #65 and Inconsistent CSV format required for populate metadata on Plate vv Project #29
- we might need to improve the code handling and the error message for OMERO.server with older OMERO.py without the DatasetColumn implementation

I will merge this PR and let @muhanadz open follow-up PRs with unit tests as well as documentation updates capturing and describing the new plugin behavior. We can then review this behavior in a wider context e.g. including in various scenarios like IDR or the training sessions for the open-source team, have a general sign-off on the new strategy, deprecate/remove the former code, prioritize the outstanding issues and then move towards a release of the plugin (0.11.0 or maybe simply 1.0.0??)

muhanadz added 5 commits February 2, 2022 12:10

Added support for DatasetColumn, specifically dataset ID column

29446de

Added CLI argument and tool to automatically detect header type

30dd5a7

Removed comment.

95d6106

Fix flake8

3f559af

Added pandas module requirment

0f76448

sbesson reviewed Feb 4, 2022

View reviewed changes

muhanadz mentioned this pull request Feb 8, 2022

Add link to datasets for dataset column values ome/omero-web#355

Merged

muhanadz commented Feb 18, 2022

View reviewed changes

src/omero_metadata/populate.py Outdated Show resolved Hide resolved

muhanadz force-pushed the detect_header_datasetColumn branch from 4fd6190 to 0f76448 Compare February 23, 2022 09:31

muhanadz added 3 commits February 23, 2022 09:41

Modified code to not detect dataset/image header type for dataset_nam…

cfe97c4

…e/image_name columns and only dataset_id/image_id

Added 'Dataset Name' column to be consistent with other column types.…

9daa9f3

… Ensured DatasetColumn is named 'Dataset'

Prevent other contexts from breaking when using --detect_header

9b158f5

sbesson reviewed Mar 1, 2022

View reviewed changes

muhanadz added 3 commits March 28, 2022 14:23

Changed the code's default behaviour to use the new header detection …

33de3ba

…method. User can now either pass '--manual_header' or a csv with '# header' header to bypass the auto-detect header method.

Removed the newely added 'Dataset Name' column as it wasn't fully imp…

3b8ff20

…lemented later and caused a bug

Improved '--manual_header' help description and fixed a bug with '# h…

06ad062

…eader' detection

will-moore approved these changes Mar 29, 2022

View reviewed changes

Made detect_headers method a static method

99383af

sbesson reviewed Apr 2, 2022

View reviewed changes

sbesson merged commit 07a33f3 into ome:master Apr 5, 2022

muhanadz deleted the detect_header_datasetColumn branch April 7, 2022 10:54

This was referenced Apr 7, 2022

Updating README #71

Merged

Automatic header detection tests #72

Merged

sbesson mentioned this pull request Apr 19, 2022

Refactor pandas-based header detection #74

Open

sbesson mentioned this pull request May 31, 2022

Header detection: default behavior and handling sparse columns #76

Closed

sbesson added this to the 0.11.0 milestone Jun 10, 2022

sbesson mentioned this pull request Jan 5, 2023

Only perform header detection when --file is specified #80

Merged

	try:
	keys = getattr(self, "%s_keys" % klass)
	log.debug("Adding keys %r" % keys)
	if keys[header_as_lower] is StringColumn:
	column = keys[header_as_lower](
	name, description,
	self.DEFAULT_COLUMN_SIZE, list())
	else:
	column = keys[header_as_lower](
	name, description, list())
	except KeyError:
	log.debug("Adding string column %r" % name)
	column = StringColumn(
	name, description, self.DEFAULT_COLUMN_SIZE, list())

Add --detect_header and DatasetColumn support #67

Add --detect_header and DatasetColumn support #67

Conversation

muhanadz commented Feb 3, 2022 • edited Loading

Changes

sbesson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

muhanadz Mar 1, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

will-moore commented Feb 15, 2022

will-moore commented Feb 15, 2022

Choose a reason for hiding this comment

will-moore commented Mar 1, 2022

will-moore commented Mar 24, 2022

sbesson commented Mar 24, 2022

sbesson commented Mar 25, 2022

will-moore commented Mar 25, 2022

sbesson commented Mar 25, 2022

will-moore commented Mar 25, 2022

sbesson commented Mar 25, 2022

joshmoore commented Mar 25, 2022

muhanadz commented Mar 28, 2022 • edited Loading

will-moore commented Mar 28, 2022

muhanadz commented Mar 28, 2022 • edited Loading

will-moore commented Mar 29, 2022

will-moore left a comment • edited Loading

Choose a reason for hiding this comment

sbesson left a comment

Choose a reason for hiding this comment

muhanadz commented Apr 4, 2022

will-moore commented Apr 4, 2022

sbesson commented Apr 4, 2022

dominikl commented Apr 4, 2022

sbesson commented Apr 4, 2022

sbesson commented Apr 5, 2022

muhanadz commented Feb 3, 2022 •

edited

Loading

muhanadz Mar 1, 2022 •

edited

Loading

muhanadz commented Mar 28, 2022 •

edited

Loading

muhanadz commented Mar 28, 2022 •

edited

Loading

will-moore left a comment •

edited

Loading