Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge boxes from citation mention model #273

Merged
merged 7 commits into from
Aug 7, 2023
Merged

merge boxes from citation mention model #273

merged 7 commits into from
Aug 7, 2023

Conversation

gituser768
Copy link
Contributor

No description provided.

pyproject.toml Outdated
@@ -1,6 +1,6 @@
[project]
name = 'mmda'
version = '0.9.9'
version = '0.10.0'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I swear I'm always confused by semver so i might be wrong again but shouldn't this be 0.9.10 (only increase the last number because it is a minor change that doesn't really affect the whole MMDA library as a whole?)

Copy link
Contributor Author

@gituser768 gituser768 Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess I considered it a breaking change, but I think you're right since no one actually depends on the specific behavior that's changing in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think we usually consider the ai2 model updates as minor

Comment on lines 57 to 58
def merge_boxes_of_sg(sg):
sg.box_group.boxes = merge_boxes(sg.box_group.boxes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this work 🤣 I thought that the SpanGroups we're dealing with here have box info only on the Span, so this line would have to be sg.box_group.boxes = merge_boxes([box for box in each sg.spans]) and then delete the boxes from the Spans... but I've seen your images and that the boxes are being merged so... Can you just confirm that the final output only has boxes on sg.box_group and no boxes on sg.spans[0].box? otherwise we'll get an error in SPP

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohhh maybe it's because the tokens are coming from the annotation store where we convert them into SpanGroupWithBoxGroup format?? I'm assuming that's it, and the tokens we start out with here have box info only on BoxGroup. but yeah maybe just confirming we only have box info in that one spot would be a good thing to check...

@@ -39,6 +39,34 @@ class PredictorConfig(BaseSettings):
pass


def group_by_line(boxes):
boxes = sorted(boxes, key=lambda box: box.t)
return [list(line_boxes) for t, line_boxes in groupby(boxes, key=lambda box: box.t)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you seeing that the line grouping is working well? I think it could be better if we had rows passed into this model, so that we can take advantage of pdfplumber's grouping by lines which I think is more robust than requiring exact t coordinates matching up (see

def _simple_line_detection(
) -- or maybe just box.t +/- a little amount calculated from getting average token heights? (see
return np.average([[span.box.w, span.box.h] for token in tokens
) -- just options, but if you're seeing good results with this grouping I wouldn't worry about making this more complicated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it actually seems to work well! if the lines dont have the same t then I think it's safest to just not do anything.

Comment on lines 60 to 61
boxes = merge_boxes([span.box for span in sg.spans])
sg.box_group = api.BoxGroup(boxes=[api.Box.from_mmda(box) for box in boxes]).to_mmda()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I initially expected it would need to look like, but now, won't your local run not work, even if tt verify does? I think it is a good idea to check what exactly SPP passes to the citation_mention model by running something locally. Because I am so unsure of what happens there, and feel like it's possible we need to keep this the way you first had it, and instead update the fixture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I take that back -- we should be able to trust the config file in SPP which says that we are sending tokens with Spans on Boxes: https://github.com/allenai/spp/blob/3a394aa90265f2a852708abe1038f5993d3760e7/src/main/resources/spp/base.conf#L229

Copy link
Contributor

@geli-gel geli-gel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we have to make a decision and change something here... i think we have 2 options:

  1. change SPP config to provide as input to citation_mentions tokens that have SpanGroupWithBoxGroup format, and revert your latest commit, needing to then update the fixture to match, or
  2. add another step in merge_boxes_of_sg to delete the boxes that with the latest commit are still in Span.box (because having boxes in both places will not be allowed by SPP)

-- option 2 is what we do in SPP Grobid: https://github.com/allenai/spp-grobid-api/blob/41571ec88219cda3b65cb02ddc34bc1c424bb944/spp_grobid_api/api/grobid_converter.py#L54, and seems to me the easiest way to get this PR merged


def build_box_group(sg):
boxes = [span.box for span in sg.spans]
return api.BoxGroup(boxes=[api.Box.from_mmda(box) for box in boxes]).to_mmda()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to convert .from_mmda .to_mmda?

@gituser768 gituser768 merged commit 36a538e into main Aug 7, 2023
5 checks passed
@gituser768 gituser768 deleted the dh-merge-boxes branch August 7, 2023 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants