-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Multi model columns #76
base: master
Are you sure you want to change the base?
Conversation
Can one of the admins verify this patch? |
e965cfc
to
5050a43
Compare
Hi can we please not rely on a branch in dependencies. Either a point release is needed or a specific commit checksum is needed |
Yep, the intention is to try things out in this branch until we're happy and then release dask-ms |
I think the changes are more complicated than they need to be because they implement some of the expression parsing behaviour in In general I think the change could be simplified to the following general idea: parts = args.data_column.split("=")
if parts == 1:
flag_data = parts[0]
else:
data_col = parts[0]
nds = data_column_expr(args.data_column, datasets)
flag_data = getattr(nds, data_col) Edit: Corrections |
Instead of reimplementing a parse tree that obeys botmath rules to evaluate this why not use numexpr with a blockwise map on the data? Parse out the column names between the operators using regex as you have now and assign them to variables then map numexpr.evaluate over the arrays. Assumption that the arrays have the same shape can be made
|
Well the parse tree is already implemented in plain python -- no need to add an extra dependency which unfortunately only has one maintainer. |
Also, I should add that the parse tree evaluates bodmas expressions involving dask arrays |
ok but as I see it it evaluates a limited subset of expressions, for example it is assumed that there is one column in front of the division operator, so what if I have corrected residuals that I would want to flag a quotient on? Just an example though - in this case the data was calibrated to MODEL_DATA - DE_MODEL1:DE_MODEL1 as the DI and DD directions. |
Currently yes, the PR needs to change to depend more fully on
data_column_expr should handle this: https://dask-ms.readthedocs.io/en/latest/api.html#daskms.expressions.data_column_expr
|
The regex split is mainly for xds = list(xds_from_ms(args.ms,
columns=tuple(columns),
................... which is faster than reading the entire ms, versus splitting for |
Ah I see, you need to read the MS to check whether the columns in the expression exist?
Just to be clear, because I think there's confusion here. The reason for the split is to distinguish between the two --data-column cases: 1.It contains an expression (detectable by the presence of "=" in the string), in which case args.data_column can be used as input to |
@bennahugo Regarding expressions containing division operations, what should happen in the case of zeroed/inf/nan visibilities? |
There is an explicit flagging step for this |
(Those visibilities are in error) |
2. extract visibility column name to support substract_model_column 3. removed columns arg from xds_from_ms to read entire ms 4. add warning for -smc arg depreciation
tricolour/apps/tricolour/app.py
Outdated
string_columns = None | ||
# extract the name of the visibility column | ||
# to support subtract_model_column | ||
data_column = re.split(r'\+|-|/|\*|\(|\)', args.data_column)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a regex designed to detect the kind of expressions that https://github.com/ska-sa/dask-ms/blob/master/daskms/expressions.py#L103 supports. I think it's probably better to just split on the equals as follows as its easer to separate into the two cases
parts = args.data_column.split("=")
log.info("Flagging on the {0:s} {0:s}".format(parts[0], "column" if len(parts) == 1 else "expression))
Additionally, I don't think the regex contained an equals sign so I think it would have failed on an assignment statement.
Could you please correct and test the two cases?
- The simple case where data_column just references a column name
- The complicated case where data_column contains an assignment.
Is this ready for review? The CI seems to be failing. |
@IanHeywood do you mind trying this PR branch and give your initial feedback? |
I installed the
I tried |
|
Mental note, we mustn't merge this before:
|
Trying this out again. Using quotations marks as per the example and docs doesn't work, e.g.
gives:
If I provide an expression with no spaces and no quotations marks, e.g.:
starts flagging. Although I can't vouch for what's going on under the hood. Does I'm more confident when doing this sort of thing if I can use parentheses to be sure of the ordering, as is also shown in the dask documentation example. But bash doesn't like this unless I use quotes, e.g.:
at which point this happens:
If my slowness to engage means that the "This branch is out-of-date with the base branch" is relevant to the above, and I need to try a newer version, then please let me know and I'll have a go. Cheers. |
This reverts commit 86eeefb.
Thanks for the feedback @IanHeywood. I agree that having an assigment statement (
Would you mind having another go? |
Draft PR to handle multi-column expressions.