Add --detect_header and DatasetColumn support #67

Merged: 12 commits, Apr 5, 2022
3 changes: 2 additions & 1 deletion setup.py
@@ -127,7 +127,8 @@ def read(fname):
'future',
'omero-py>=5.6.0',
'PyYAML',
'jinja2'
'jinja2',
'pandas'
sbesson marked this conversation as resolved.
],
python_requires='>=3',
tests_require=[
Expand Down
51 changes: 50 additions & 1 deletion src/omero_metadata/cli.py
@@ -32,6 +32,8 @@
from omero.grid import LongColumn
from omero.model.enums import UnitsLength

import pandas as pd

HELP = """Metadata utilities

Provides access to and editing of the metadata which
@@ -242,6 +244,9 @@ def _configure(self, parser):
populate.add_argument("--allow_nan", action="store_true", help=(
"Allow empty values to become Nan in Long or Double columns"))

populate.add_argument("--detect_header", action="store_true", help=(
Member:
Echoing my enthusiastic comment in https://github.com/ome/omero-metadata/pull/67/files#r816998882, assuming the functionality works as expected, is there any reason not to make this behavior the default and allow to disable the auto-detection via CLI instead?

In terms of the legacy behavior, we might also need to clarify the expectation for users when both the header auto-detection and the # headers row are present. Based on my reading of the code, the header auto-detection should set column_types and take precedence over the content of the CSV, is that correct?

"Automatically detect header row to populate"))

populateroi.add_argument(
"--measurement", type=int, default=None,
help="Index of the measurement to populate. By default, all")
@@ -483,6 +488,45 @@ def testtables(self, args):
if not initialized:
self.ctx.die(100, "Failed to initialize Table")

def detect_headers(self, csv_path):
'''
Automatically detect headers from a CSV file. This function loads
the table into pandas, detects each column's type and matches it to
a header type.
'''

conserved_headers = ['well', 'plate', 'image', 'dataset', 'roi']
headers = []
table = pd.read_csv(csv_path)
col_types = table.dtypes.values.tolist()
cols = list(table.columns)

for index, col_type in enumerate(col_types):
col = cols[index]
if col.lower() in conserved_headers:
headers.append(col.lower())
elif col.lower() == 'image name' or col.lower() == 'imagename' or \
col.lower() == 'image_name':
headers.append('image')
elif col.lower() == 'dataset name' or \
col.lower() == 'datasetname' or \
col.lower() == 'dataset_name':
headers.append('dataset')
elif col.lower() == 'plate name' or col.lower() == 'platename' or \
col.lower() == 'plate_name':
headers.append('plate')
elif col.lower() == 'well name' or col.lower() == 'wellname' or \
col.lower() == 'well_name':
headers.append('well')
elif col_type.name == 'object':
headers.append('s')
elif col_type.name == 'float64':
headers.append('d')
elif col_type.name == 'int64':
headers.append('l')
elif col_type.name == 'bool':
headers.append('b')
return headers
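As an illustration of what this detection produces, here is a standalone sketch using only pandas. The CSV content is hypothetical and the dtype-to-type mapping mirrors (in simplified form) the function above:

```python
import io
import pandas as pd

# Hypothetical CSV: one name column, one float, one int, one bool, one string.
csv_text = """Image Name,Area,Count,Valid,Comment
img1.tif,1.5,3,True,ok
img2.tif,2.0,5,False,bad
"""

table = pd.read_csv(io.StringIO(csv_text))
dtype_map = {'object': 's', 'float64': 'd', 'int64': 'l', 'bool': 'b'}
headers = []
for col, dtype in zip(table.columns, table.dtypes):
    name = col.lower().replace(' ', '_')
    if name == 'image_name':
        headers.append('image')
    else:
        headers.append(dtype_map[dtype.name])

print(headers)  # ['image', 'd', 'l', 'b', 's']
```

Pandas infers `float64`, `int64`, `bool`, and `object` from the cell values, which is what lets the detection distinguish `d`, `l`, `b`, and `s` columns without a `# header` row.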

# WRITE

def populate(self, args):
@@ -521,6 +565,11 @@ def populate(self, args):
cfgid = cfgann.getFile().getId()
md.linkAnnotation(cfgann)

header_type = None
if args.detect_header:
header_type = self.detect_headers(args.file)
if args.dry_run:
omero_metadata.populate.log.info(f"Header Types:{header_type}")
loops = 0
ms = 0
wait = args.wait
@@ -533,7 +582,7 @@
cfg=args.cfg, cfgid=cfgid, attach=args.attach,
options=localcfg, batch_size=args.batch,
loops=loops, ms=ms, dry_run=args.dry_run,
allow_nan=args.allow_nan)
allow_nan=args.allow_nan, column_types=header_type)
Member:
Similarly to #59, adding a new key-value argument will likely break other contexts e.g. BulkToMapAnnotationContext.
#60 is one way to fix this issue but as indicated in the PR, it would be good to find a more scalable strategy

Member Author:

I totally missed this; I was under the impression that column_types was already an existing keyword argument in the populate option, and that I was just passing in a non-None value instead of the default None.

But currently, omero metadata populate Screen:51 -n --file example.csv --detect_header --context bulkmap does fail with a similar exception. I will discuss with @emilroz to see if we could brainstorm a more scalable solution.

Member:

No worries. The absence of integration tests for the other contexts makes this use case hard to detect.
As mentioned in #59 (comment), using **kwargs might be an option to handle undefined keyword arguments in a lenient manner.
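The lenient **kwargs pattern suggested here can be sketched as follows. The class and argument names are illustrative stand-ins, not the actual omero_metadata context classes:

```python
# Sketch: each context accepts **kwargs and ignores keywords it does not
# use, so passing column_types no longer breaks unrelated contexts.

class ParsingContext:
    def __init__(self, client, target, column_types=None, **kwargs):
        self.column_types = column_types  # consumed by this context

class BulkToMapAnnotationContext:
    def __init__(self, client, target, **kwargs):
        # column_types and any other unknown keywords are ignored here
        self.client = client

# Both calls succeed even though only one context uses column_types:
ctx1 = ParsingContext(None, None, column_types=['image', 'd'])
ctx2 = BulkToMapAnnotationContext(None, None, column_types=['image', 'd'])
print(ctx1.column_types)  # ['image', 'd']
```

The trade-off is that misspelled keywords are silently swallowed instead of raising a TypeError, which is why cleaning up unused arguments later is worthwhile.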

Member Author (@muhanadz, Mar 1, 2022):

Addressed in 9b158f5
@sbesson Now that **kwargs is used for DeleteMapAnnotationContext and BulkToMapAnnotationContext, there are some keyword arguments that are not used and could be cleaned up. Do you think it's appropriate to clean up here or later in a different PR?

Member:

If that's okay with you I think cleaning up unused arguments in a follow-up PR would likely simplify the review process

ctx.parse()

def rois(self, args):
11 changes: 6 additions & 5 deletions src/omero_metadata/populate.py
@@ -416,8 +416,6 @@ def resolve(self, column, value, row):
)
break
elif column.name.lower() == "dataset name":
# DatasetColumn unimplemented at the moment
# We can still access column names though
images_by_id = self.wrapper.images_by_id[
self.wrapper.datasets_by_name[column_value].id.val
]
@@ -427,8 +425,6 @@
)
break
elif column.name.lower() == "dataset":
# DatasetColumn unimplemented at the moment
# We can still access column names though
images_by_id = self.wrapper.images_by_id[
self.wrapper.datasets_by_id[
int(column_value)].id.val
@@ -825,7 +821,10 @@ def get_image_name_by_id(self, iid, did=None):

def resolve_dataset(self, column, row, value):
try:
return self.datasets_by_name[value].id.val
if column.name.lower() == 'dataset':
Member:

I assume these improvements build on top of the DatasetColumn implementation in ome/omero-py#309. Is the requirement for the new API only client-side or will a server need to be upgraded with a version of omero-py containing these changes to use this new functionality?

Member Author:

Correct. During testing, I needed to upgrade my server's omero-py with the new DatasetColumn implementation to use the new --detect_header functionality. Otherwise, I'd get an exception about DatasetColumn not being implemented when using a dataset ID column with 'dataset' as a header type.

Member:

Do you remember which call led to the server exception and the exception type thrown by the server? As a general rule, we don't want this plugin to make particular assumptions about the server. It should be possible to handle both scenarios and fall back to the former logic if DatasetColumn is not implemented server-side. Catching the appropriate exception with a try/except block is the immediate solution that comes to mind, unless @joshmoore can think of an alternate way to introspect the server-side OMERO.py version.

Member:

Other versions are inspectable from the config service, but we'd need to teach the Java process how to inspect the Python process to get that info.

Member:

Makes sense. In terms of functional review, I think this means we should probably test this tool against two types of servers: with and without upgraded OMERO.py. The population should succeed in both cases but only the server with an upgraded OMERO.py should populate a DatasetColumn.
@muhanadz can you add some minimal handling to distinguish between upgraded and non-upgraded servers?

return self.datasets_by_id[int(value)].id.val
else:
return self.datasets_by_name[value].id.val
except KeyError:
log.warn('Project is missing dataset: %s' % value)
return Skip()
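The try/except fallback discussed in the thread above could be sketched like this. ServerError and both helper names are hypothetical stand-ins; the real code would catch whatever exception an older server actually raises for an unimplemented DatasetColumn:

```python
class ServerError(Exception):
    """Stand-in for the exception an older server raises."""

def make_dataset_column(server_has_dataset_column):
    # Stand-in for creating a DatasetColumn via the server.
    if not server_has_dataset_column:
        raise ServerError("DatasetColumn not implemented")
    return "DatasetColumn"

def choose_column(server_has_dataset_column):
    try:
        return make_dataset_column(server_has_dataset_column)
    except ServerError:
        # Fall back to the legacy representation: a LongColumn of IDs.
        return "LongColumn"

print(choose_column(True))   # DatasetColumn
print(choose_column(False))  # LongColumn
```

This keeps population working against both upgraded and non-upgraded servers; only the upgraded one actually gets a DatasetColumn.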
@@ -1160,6 +1159,8 @@ def preprocess_data(self, reader):
column.values.append(value)
elif column.name.lower() == "plate":
column.values.append(value)
elif column.name.lower() == "dataset":
column.values.append(value)
except TypeError:
log.error('Original value "%s" now "%s" of bad type!' % (
original_value, value))