-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kylel/2023/hardening codebase #259
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes generally make sense to me; I called out the places I believe to be breaking and have suggested alternatives.
Whenever this is good to go I also suggest we merge/rebase in the latest main
(to get CI to rerun) and bump the MMDA version in pyproject.toml
id: Optional[int] = None, | ||
doc: Optional["Document"] = None, | ||
metadata: Optional[Metadata] = None, | ||
allow_overlap: Optional[bool] = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allow_overlap=False
as the default is a behavioral change and will almost certainly break stuff in practice. Can we switch the default to True
?
id: Optional[int] = None, | ||
doc: Optional["Document"] = None, | ||
metadata: Optional[Metadata] = None, | ||
allow_overlap: Optional[bool] = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same story here as BoxGroup
-- False
as the default is a breaking change in behavior, and there's no telling how much data we already have that violates it.
@property | ||
def xywh(self) -> Tuple[float, float, float, float]: | ||
"""Return a tuple of the (left, top, width, height) format.""" | ||
return self.l, self.t, self.w, self.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave the method in but add a logging.warning() with a deprecation notice. This may break clients of mmda that use .xywh
""" | ||
Whether self overlaps with the other Box object. | ||
x, y distances for padding | ||
center (bool) if True, only consider overlapping if this box's center is contained by other | ||
""" | ||
if self.page != other.page: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤯
f"Detected overlap with existing SpanGroup(s) {matches} for {span_group}" | ||
f"Detected overlap! While processing the Span {span} as part of query SpanGroup {span_group.to_json()}, we found that it overlaps with existing SpanGroup(s):\n" | ||
+ "\n".join( | ||
[f"\t{i}\t{m} " for i, m in zip(match_ids, matches)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this is much more informative output.
if page < 0: | ||
raise ValueError(f"Page must be non-negative, got {page}") | ||
if l < 0 or t < 0: | ||
raise ValueError(f"Left and top must be non-negative, got {l} and {t}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you find any evidence of these negatively positioned or dimensioned boxes when you were running through things? As with the allow_overlap
flags in SpanGroup
and BoxGroup
, I think it won't be possible to audit what existing data already violates these constraints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love for this PR to include or be accompanied by documentation for the current best expectations for how SpanGroups and BoxGroups will be used.
I also support the more conservative approach Chris suggests of leaving the defaults in a state that will allow existing/old data to be loaded from the annotation store without throwing an error. But perhaps the documentation can explain that the defaults are for backwards compatibility.
I think what we'll try to do to handle the possible impact of these model changes is attempt to update the mmda version for the whole Semantic Reader more or less simultaneously, so we can hopefully have a tight response loop if we do discover some data incompatibilities with the updated mmda models.
Big PR. A lot of it is actually just auto linting/formatting. Here are main changes:
Box
, I added some assertions to make sure problematic Boxes are caught at creation time:This has been helpful for debugging issues & the library assumes "nice" Boxes.
Box
, I added another assertion for theis_overlap()
method:For
Span
, I added amerge_boxes: bool = True
option tosmall_spans_to_big_span()
. This should preserve the default behavior. But gives us the option to not do that and just create a bigSpan
that has no largerBox
. I would prefer that Not-Merging is actually the default behavior, but I think that might break too many things, so maybe in the future.For
Box
andSpan
, I added utility methodcluster_boxes()
andcluster_spans()
. It's based on overlap.For
Box
, I also added utility methodshrink()
. It's not used anywhere now, but it's helpful for debugging.For
BoxGroup
andSpanGroup
, I've added a new flag in the constructor:allow_overlap: Optional[bool] = False
. This should preserve default behavior, which is that it disallowsBox
orSpan
within itself to have overlaps. For example, this should catch if a singleBoxGroup
has duplicateBoxes
or a singleSpanGroup
has duplicateSpans
. Otherwise, classes behave as they originally did.For
Indexers
, I added a newBoxGroupIndexer
class that behaves similarly toSpanGroupIndexer
. It's not used, but it was helpful for debugging.Besides the above library changes, almost everything else I added was a missing test.