Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kylel/2023/hardening codebase #259

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

kyleclo
Copy link
Collaborator

@kyleclo kyleclo commented Jul 5, 2023

Big PR. A lot of it is actually just auto linting/formatting. Here are main changes:

  1. For Box, I added some assertions to make sure problematic Boxes are caught at creation time:
        if w < 0 or h < 0:
            raise ValueError(f"Width and height must be non-negative, got {w} and {h}")
        if page < 0:
            raise ValueError(f"Page must be non-negative, got {page}")
        if l < 0 or t < 0:
            raise ValueError(f"Left and top must be non-negative, got {l} and {t}")

This has been helpful for debugging issues & the library assumes "nice" Boxes.

  1. For Box, I added another assertion for the is_overlap() method:
    def is_overlap(
        self, other: "Box", x: float = 0.0, y: float = 0, center: bool = False
    ) -> bool:
         ...
        if self.page != other.page:
            return False
  1. For Span, I added a merge_boxes: bool = True option to small_spans_to_big_span(). This should preserve the default behavior. But gives us the option to not do that and just create a big Span that has no larger Box. I would prefer that Not-Merging is actually the default behavior, but I think that might break too many things, so maybe in the future.

  2. For Box and Span, I added utility method cluster_boxes() and cluster_spans(). It's based on overlap.

  3. For Box, I also added utility method shrink(). It's not used anywhere now, but it's helpful for debugging.

  4. For BoxGroup and SpanGroup, I've added a new flag in the constructor: allow_overlap: Optional[bool] = False. This should preserve default behavior, which is that it disallows Box or Span within itself to have overlaps. For example, this should catch if a single BoxGroup has duplicate Boxes or a single SpanGroup has duplicate Spans. Otherwise, classes behave as they originally did.

  5. For Indexers, I added a new BoxGroupIndexer class that behaves similarly to SpanGroupIndexer. It's not used, but it was helpful for debugging.

  6. Besides the above library changes, almost everything else I added was a missing test.

@kyleclo kyleclo requested review from cmwilhelm and regan-huff and removed request for cmwilhelm and regan-huff July 5, 2023 18:59
@kyleclo kyleclo changed the title Kylel/2023/spangroup overlap bug Kylel/2023/hardening codebase Jul 6, 2023
@kyleclo kyleclo marked this pull request as ready for review July 7, 2023 00:14
Copy link
Contributor

@cmwilhelm cmwilhelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes generally make sense to me; I called out the places I believe to be breaking and have suggested alternatives.

Whenever this is good to go I also suggest we merge/rebase in the latest main (to get CI to rerun) and bump the MMDA version in pyproject.toml

id: Optional[int] = None,
doc: Optional["Document"] = None,
metadata: Optional[Metadata] = None,
allow_overlap: Optional[bool] = False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allow_overlap=False as the default is a behavioral change and will almost certainly break stuff in practice. Can we switch the default to True?

id: Optional[int] = None,
doc: Optional["Document"] = None,
metadata: Optional[Metadata] = None,
allow_overlap: Optional[bool] = False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same story here as BoxGroup -- False as the default is a breaking change in behavior, and there's no telling how much data we already have that violates it.

@property
def xywh(self) -> Tuple[float, float, float, float]:
"""Return a tuple of the (left, top, width, height) format."""
return self.l, self.t, self.w, self.h
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave the method in but add a logging.warning() with a deprecation notice. This may break clients of mmda that use .xywh

"""
Whether self overlaps with the other Box object.
x, y distances for padding
center (bool) if True, only consider overlapping if this box's center is contained by other
"""
if self.page != other.page:
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯

f"Detected overlap with existing SpanGroup(s) {matches} for {span_group}"
f"Detected overlap! While processing the Span {span} as part of query SpanGroup {span_group.to_json()}, we found that it overlaps with existing SpanGroup(s):\n"
+ "\n".join(
[f"\t{i}\t{m} " for i, m in zip(match_ids, matches)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is much more informative output.

if page < 0:
raise ValueError(f"Page must be non-negative, got {page}")
if l < 0 or t < 0:
raise ValueError(f"Left and top must be non-negative, got {l} and {t}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you find any evidence of these negatively positioned or dimensioned boxes when you were running through things? As with the allow_overlap flags in SpanGroup and BoxGroup, I think it won't be possible to audit what existing data already violates these constraints

Copy link
Contributor

@regan-huff regan-huff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love for this PR to include or be accompanied by documentation for the current best expectations for how SpanGroups and BoxGroups will be used.

I also support the more conservative approach Chris suggests of leaving the defaults in a state that will allow existing/old data to be loaded from the annotation store without throwing an error. But perhaps the documentation can explain that the defaults are for backwards compatibility.

I think what we'll try to do to handle the possible impact of these model changes is attempt to update the mmda version for the whole Semantic Reader more or less simultaneously, so we can hopefully have a tight response loop if we do discover some data incompatibilities with the updated mmda models.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants