merge boxes from citation mention model #273
Conversation
**pyproject.toml** (Outdated)

```diff
@@ -1,6 +1,6 @@
 [project]
 name = 'mmda'
-version = '0.9.9'
+version = '0.10.0'
```
I swear I'm always confused by semver so I might be wrong again, but shouldn't this be 0.9.10 (only increase the last number, because it's a small change that doesn't really affect the MMDA library as a whole)?
I guess I considered it a breaking change, but I think you're right, since no one actually depends on the specific behavior that's changing in this case?
Yeah, I think we usually consider the ai2 model updates as minor.
```python
def merge_boxes_of_sg(sg):
    sg.box_group.boxes = merge_boxes(sg.box_group.boxes)
```
Why does this work 🤣 I thought that the SpanGroups we're dealing with here have box info only on the Span, so this line would have to be `sg.box_group.boxes = merge_boxes([span.box for span in sg.spans])`, and then delete the boxes from the Spans... but I've seen your images showing that the boxes are being merged, so... Can you just confirm that the final output only has boxes on `sg.box_group` and no boxes on `sg.spans[0].box`? Otherwise we'll get an error in SPP.
Ohhh, maybe it's because the tokens are coming from the annotation store, where we convert them into `SpanGroupWithBoxGroup` format?? I'm assuming that's it, and the tokens we start out with here have box info only on `BoxGroup`. But yeah, confirming that we only have box info in that one spot would be a good thing to check...
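A check along those lines could look roughly like this. This is a sketch with hypothetical names: `boxes_only_on_box_group` is not an mmda function, and the `_Span`/`_BoxGroup`/`_SpanGroup` classes below are minimal stand-ins for the real mmda types, just to make the invariant concrete.

```python
class _Span:
    """Stand-in for mmda's Span; real Spans also carry start/end offsets."""
    def __init__(self, box=None):
        self.box = box

class _BoxGroup:
    """Stand-in for mmda's BoxGroup."""
    def __init__(self, boxes):
        self.boxes = boxes

class _SpanGroup:
    """Stand-in for mmda's SpanGroup."""
    def __init__(self, spans, box_group):
        self.spans = spans
        self.box_group = box_group

def boxes_only_on_box_group(sg):
    """True iff box info lives on sg.box_group and on none of the spans."""
    has_group_boxes = sg.box_group is not None and bool(sg.box_group.boxes)
    no_span_boxes = all(getattr(span, "box", None) is None for span in sg.spans)
    return has_group_boxes and no_span_boxes

# Passes: boxes only on the box group.
ok = _SpanGroup([_Span(), _Span()], _BoxGroup(["box"]))
# Fails: a span still carries its own box.
bad = _SpanGroup([_Span(box="box")], _BoxGroup(["box"]))
```

Running a predicate like this over the predictor's output would surface the "boxes in both places" problem before SPP does.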
```diff
@@ -39,6 +39,34 @@ class PredictorConfig(BaseSettings):
     pass
+
+
+def group_by_line(boxes):
+    boxes = sorted(boxes, key=lambda box: box.t)
+    return [list(line_boxes) for t, line_boxes in groupby(boxes, key=lambda box: box.t)]
```
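For illustration, here is how this exact-`t` grouping behaves on synthetic data. The `Box` namedtuple below is a hypothetical stand-in for mmda's Box class: boxes whose `t` coordinates are identical share a line, while a `t` that differs by even a tiny epsilon lands in its own group.

```python
from itertools import groupby
from collections import namedtuple

# Hypothetical minimal Box with left/top/width/height, standing in for mmda's Box.
Box = namedtuple("Box", ["l", "t", "w", "h"])

def group_by_line(boxes):
    # Sorting first is required: itertools.groupby only merges *consecutive*
    # items with equal keys.
    boxes = sorted(boxes, key=lambda box: box.t)
    return [list(line_boxes) for _, line_boxes in groupby(boxes, key=lambda box: box.t)]

# Two boxes with identical t are grouped onto one line...
same_line = [Box(0.1, 0.50, 0.1, 0.02), Box(0.3, 0.50, 0.1, 0.02)]
# ...but a box whose t differs by a tiny amount becomes its own "line".
off_by_eps = Box(0.5, 0.5001, 0.1, 0.02)

groups = group_by_line(same_line + [off_by_eps])
```

This makes concrete why grouping by rows (or with a tolerance) would be more robust than requiring exact `t` matches.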
Are you seeing that the line grouping is working well? I think it could be better if we had rows passed into this model, so that we can take advantage of pdfplumber's grouping by lines, which I think is more robust than requiring exact `t` coordinates to match up (see `_simple_line_detection` at line 370 of mmda/src/mmda/parsers/pdfplumber_parser.py, commit 5c2fb76).
```python
return np.average([[span.box.w, span.box.h] for token in tokens
```
It actually seems to work well! If the lines don't have the same `t`, then I think it's safest to just not do anything.
```python
boxes = merge_boxes([span.box for span in sg.spans])
sg.box_group = api.BoxGroup(boxes=[api.Box.from_mmda(box) for box in boxes]).to_mmda()
```
This is what I initially expected it would need to look like, but now, won't your local run fail even if `tt verify` passes? I think it's a good idea to check what exactly SPP passes to the citation_mention model by running something locally, because I'm so unsure of what happens there, and it feels possible we need to keep this the way you first had it and instead update the fixture.
Actually, I take that back -- we should be able to trust the config file in SPP, which says that we are sending tokens with Boxes on Spans: https://github.com/allenai/spp/blob/3a394aa90265f2a852708abe1038f5993d3760e7/src/main/resources/spp/base.conf#L229
Unfortunately we have to make a decision and change something here... I think we have 2 options:

1. change the SPP config to provide `tokens` in `SpanGroupWithBoxGroup` format as input to citation_mentions, and revert your latest commit, then update the fixture to match, or
2. add another step in `merge_boxes_of_sg` to delete the boxes that, with the latest commit, are still in `Span.box` (because having boxes in both places will not be allowed by SPP).

Option 2 is what we do in SPP Grobid: https://github.com/allenai/spp-grobid-api/blob/41571ec88219cda3b65cb02ddc34bc1c424bb944/spp_grobid_api/api/grobid_converter.py#L54, and it seems to me the easiest way to get this PR merged.
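Option 2 might look roughly like the sketch below. The classes are minimal stand-ins for mmda's `Span`/`BoxGroup`/`SpanGroup`, and the real implementation would call `merge_boxes` rather than just collecting the boxes; the point is only the shape of the fix: move box info onto the box group, then clear `Span.box` so boxes live in exactly one place.

```python
class Box:
    """Stand-in for mmda's Box (left, top, width, height)."""
    def __init__(self, l, t, w, h):
        self.l, self.t, self.w, self.h = l, t, w, h

class Span:
    """Stand-in for mmda's Span; box is optional."""
    def __init__(self, start, end, box=None):
        self.start, self.end, self.box = start, end, box

class BoxGroup:
    def __init__(self, boxes):
        self.boxes = boxes

class SpanGroup:
    def __init__(self, spans, box_group=None):
        self.spans, self.box_group = spans, box_group

def merge_boxes_of_sg(sg):
    # Collect boxes from the spans onto the box group (the real code would
    # run them through merge_boxes here)...
    sg.box_group = BoxGroup(boxes=[span.box for span in sg.spans if span.box])
    # ...then delete them from the spans, as SPP Grobid does, so SPP never
    # sees box info in both places.
    for span in sg.spans:
        span.box = None

sg = SpanGroup(spans=[Span(0, 4, Box(0.1, 0.2, 0.05, 0.01)),
                      Span(5, 9, Box(0.2, 0.2, 0.05, 0.01))])
merge_boxes_of_sg(sg)
```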
```python
def build_box_group(sg):
    boxes = [span.box for span in sg.spans]
    return api.BoxGroup(boxes=[api.Box.from_mmda(box) for box in boxes]).to_mmda()
```
Is it necessary to round-trip with `.from_mmda` and then `.to_mmda`?