merge boxes from citation mention model #273
Conversation
**pyproject.toml** (Outdated)

```diff
@@ -1,6 +1,6 @@
 [project]
 name = 'mmda'
-version = '0.9.9'
+version = '0.10.0'
```
I swear I'm always confused by semver so I might be wrong again, but shouldn't this be 0.9.10 (only increase the last number, because it's a small change that doesn't really affect the MMDA library as a whole)?
I guess I considered it a breaking change, but I think you're right, since no one actually depends on the specific behavior that's changing in this case?
Yeah, I think we usually consider the ai2 model updates as minor.
```python
def merge_boxes_of_sg(sg):
    sg.box_group.boxes = merge_boxes(sg.box_group.boxes)
```
Why does this work 🤣 I thought that the SpanGroups we're dealing with here have box info only on the Span, so this line would have to be `sg.box_group.boxes = merge_boxes([span.box for span in sg.spans])`, and then delete the boxes from the Spans... but I've seen your images showing that the boxes are being merged, so... Can you just confirm that the final output only has boxes on `sg.box_group` and no boxes on `sg.spans[0].box`? Otherwise we'll get an error in SPP.
Ohhh, maybe it's because the tokens are coming from the annotation store, where we convert them into `SpanGroupWithBoxGroup` format?? I'm assuming that's it, and the tokens we start out with here have box info only on `BoxGroup`. But yeah, confirming that we only have box info in that one spot would be a good thing to check...
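A check along those lines could look roughly like this. This is a sketch with hypothetical names: `boxes_only_on_box_group` is not an mmda function, and the `_Span`/`_BoxGroup`/`_SpanGroup` classes below are minimal stand-ins for the real mmda types, just to make the invariant concrete.

```python
class _Span:
    """Stand-in for mmda's Span; real Spans also carry start/end offsets."""
    def __init__(self, box=None):
        self.box = box

class _BoxGroup:
    """Stand-in for mmda's BoxGroup."""
    def __init__(self, boxes):
        self.boxes = boxes

class _SpanGroup:
    """Stand-in for mmda's SpanGroup."""
    def __init__(self, spans, box_group):
        self.spans = spans
        self.box_group = box_group

def boxes_only_on_box_group(sg):
    """True iff box info lives on sg.box_group and on none of the spans."""
    has_group_boxes = sg.box_group is not None and bool(sg.box_group.boxes)
    no_span_boxes = all(getattr(span, "box", None) is None for span in sg.spans)
    return has_group_boxes and no_span_boxes

# Passes: boxes only on the box group.
ok = _SpanGroup([_Span(), _Span()], _BoxGroup(["box"]))
# Fails: a span still carries its own box.
bad = _SpanGroup([_Span(box="box")], _BoxGroup(["box"]))
```

Running a predicate like this over the predictor's output would surface the "boxes in both places" problem before SPP does.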
```diff
@@ -39,6 +39,34 @@ class PredictorConfig(BaseSettings):
     pass
+
+
+def group_by_line(boxes):
+    boxes = sorted(boxes, key=lambda box: box.t)
+    return [list(line_boxes) for t, line_boxes in groupby(boxes, key=lambda box: box.t)]
```
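For illustration, here is how this exact-`t` grouping behaves on synthetic data. The `Box` namedtuple below is a hypothetical stand-in for mmda's Box class: boxes whose `t` coordinates are identical share a line, while a `t` that differs by even a tiny epsilon lands in its own group.

```python
from itertools import groupby
from collections import namedtuple

# Hypothetical minimal Box with left/top/width/height, standing in for mmda's Box.
Box = namedtuple("Box", ["l", "t", "w", "h"])

def group_by_line(boxes):
    # Sorting first is required: itertools.groupby only merges *consecutive*
    # items with equal keys.
    boxes = sorted(boxes, key=lambda box: box.t)
    return [list(line_boxes) for _, line_boxes in groupby(boxes, key=lambda box: box.t)]

# Two boxes with identical t are grouped onto one line...
same_line = [Box(0.1, 0.50, 0.1, 0.02), Box(0.3, 0.50, 0.1, 0.02)]
# ...but a box whose t differs by a tiny amount becomes its own "line".
off_by_eps = Box(0.5, 0.5001, 0.1, 0.02)

groups = group_by_line(same_line + [off_by_eps])
```

This makes concrete why grouping by rows (or with a tolerance) would be more robust than requiring exact `t` matches.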
Are you seeing that the line grouping is working well? I think it could be better if we had rows passed into this model, so that we can take advantage of pdfplumber's grouping by lines, which I think is more robust than requiring exact `t` coordinates to match up (see `_simple_line_detection` at line 370 of mmda/src/mmda/parsers/pdfplumber_parser.py, commit 5c2fb76).
```python
return np.average([[span.box.w, span.box.h] for token in tokens
```
It actually seems to work well! If the lines don't have the same `t`, then I think it's safest to just not do anything.
```python
boxes = merge_boxes([span.box for span in sg.spans])
sg.box_group = api.BoxGroup(boxes=[api.Box.from_mmda(box) for box in boxes]).to_mmda()
```
This is what I initially expected it would need to look like, but now, won't your local run fail even if `tt verify` passes? I think it's a good idea to check what exactly SPP passes to the citation_mention model by running something locally, because I'm so unsure of what happens there, and it feels possible we need to keep this the way you first had it and instead update the fixture.
Actually, I take that back -- we should be able to trust the config file in SPP, which says that we are sending tokens with Boxes on Spans: https://github.com/allenai/spp/blob/3a394aa90265f2a852708abe1038f5993d3760e7/src/main/resources/spp/base.conf#L229
Unfortunately we have to make a decision and change something here... I think we have 2 options:

1. change the SPP config to provide `tokens` in `SpanGroupWithBoxGroup` format as input to citation_mentions, and revert your latest commit, then update the fixture to match, or
2. add another step in `merge_boxes_of_sg` to delete the boxes that, with the latest commit, are still in `Span.box` (because having boxes in both places will not be allowed by SPP).

Option 2 is what we do in SPP Grobid: https://github.com/allenai/spp-grobid-api/blob/41571ec88219cda3b65cb02ddc34bc1c424bb944/spp_grobid_api/api/grobid_converter.py#L54, and it seems to me the easiest way to get this PR merged.
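Option 2 might look roughly like the sketch below. The classes are minimal stand-ins for mmda's `Span`/`BoxGroup`/`SpanGroup`, and the real implementation would call `merge_boxes` rather than just collecting the boxes; the point is only the shape of the fix: move box info onto the box group, then clear `Span.box` so boxes live in exactly one place.

```python
class Box:
    """Stand-in for mmda's Box (left, top, width, height)."""
    def __init__(self, l, t, w, h):
        self.l, self.t, self.w, self.h = l, t, w, h

class Span:
    """Stand-in for mmda's Span; box is optional."""
    def __init__(self, start, end, box=None):
        self.start, self.end, self.box = start, end, box

class BoxGroup:
    def __init__(self, boxes):
        self.boxes = boxes

class SpanGroup:
    def __init__(self, spans, box_group=None):
        self.spans, self.box_group = spans, box_group

def merge_boxes_of_sg(sg):
    # Collect boxes from the spans onto the box group (the real code would
    # run them through merge_boxes here)...
    sg.box_group = BoxGroup(boxes=[span.box for span in sg.spans if span.box])
    # ...then delete them from the spans, as SPP Grobid does, so SPP never
    # sees box info in both places.
    for span in sg.spans:
        span.box = None

sg = SpanGroup(spans=[Span(0, 4, Box(0.1, 0.2, 0.05, 0.01)),
                      Span(5, 9, Box(0.2, 0.2, 0.05, 0.01))])
merge_boxes_of_sg(sg)
```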
```python
def build_box_group(sg):
    boxes = [span.box for span in sg.spans]
    return api.BoxGroup(boxes=[api.Box.from_mmda(box) for box in boxes]).to_mmda()
```
Is it necessary to round-trip with `.from_mmda` and then `.to_mmda`?